vendredi 29 avril 2022

Keep k random strings from input file of unknown length in C

Problem

I want to get input from a file of unknown length, separating the input to strings at a defined delimiter and saving some of them. After reading the entire file, I want to have in an array of strings, k random strings from the input that was read. Each string must have the same probability of beeing kept in the final array.

Important Part

I don't want to read the file, save all the strings somewhere and then pick k of them at random. I want while reading each string, update the array with k strings (if needed depending on probabilities) and move on to the next string forgeting the previous string. In other words memory usage for how many strings are saved thought-out executing the program must be on the low side.


Code Structure

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define DELIMITER ':'

FILE* getInputSource(char* source);

char** pickStrings(char* source, int k);
void printStrings(char** strings, int k);
void freeStrings(char** strings, int k);

// your extra declerations (start)

// your extra declerations (end)

int main(int argc, char const *argv[]) {
    srand(time(NULL));

    int toKeep = 5;
    char** strings = pickStrings("input.txt",toKeep);
    printStrings(strings,toKeep);
    freeStrings(strings,toKeep);

    return 0;
}

char** pickStrings(char* source, int k) {
    FILE* input = getInputSource(source);

    char** strings = (char**) calloc(k,sizeof(char*));
    // your implementation here

    return strings;
}

// your extra implementations (start)

// your extra implementations (end)

FILE* getInputSource(char* source) {
    FILE* input = fopen(source,"r");
    if( !input ) {
        printf("Unable to open input file.\n");
        exit(0);
    }
    return input;
}

void printStrings(char** strings, int k) {
    for( int i = 0; i<k; i++ ) 
        printf("%d: %s\n", i, strings[i]);
}

void freeStrings(char** strings, int k) {
    for( int i = 0; i < k; i++)
        free(strings[i]);
    free(strings);
}

Contents of input.txt for testing:

ok:maybe:not:something:yes:off:blue:donut:A:B:C:D:E:F:G:

Expected Output

A random sequence of the strings in file separated by the delimiter. One sequence would look something like the following:

0: not
1: A
2: maybe
3: yes
4: blue

Assumptions

  • Each string length is unknown. You can either define a fixed length like 256 chars per string or dynamically allocate as many as needed for each string.
  • Strings can contain anything except the defined delimiter which is used to seperate them.
  • Delimiter is used at the last string as well.
  • If input.txt contains less than k strings, you can fill first slots of the array (at random order) and leave the rest of the array empty.

Preferences

  • Comments point out where to write your code. The rest of the code should stay as it is and should only change after discussion in comments.
  • Your answer should contain only the code you added. But if you believe posting it all is better then go ahead, no problem.

Note for this Question

This question was inspired from many other similar questions that exist here and there in stack overflow, but they do not meet my needs even if they partly give me an answer. This is an attempt to generalize this kind of question so that it can cover different possible variations of the question, and make it easier to search for. I tried to edit the existing questions (without changing the author's original question) so they can be more searchable and more meaningful, but many of the authors are inactive for a long period of time or their question was too specific and changing it would ruin it or there is not an accepted/completed answer.


Extras

Some of the other questions mentioned Reservoir Sampling either on comments or answers. It is something that can be considered on how to achieve equal probability of keeping for each string, but if there is anything else glad to hear it as well.





Aucun commentaire:

Enregistrer un commentaire