Tuesday, February 27, 2018

Python 3: Fastest and most efficient way to iterate randomly over all lines in a big file (1 million+ lines)

Ok, so I have multiple text files, each containing well over 500,000 or even 1,000,000 lines.

Currently I do something like this:

import random

def line_function(line):
    # Do something with the given line
    pass

def random_iteration(filepath):
    with open(filepath) as f:
        lines = f.readlines()
    # random.shuffle() shuffles the list in place and returns None,
    # so iterate over the same list after shuffling it
    random.shuffle(lines)
    for line in lines:
        result = line_function(line)
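
As a quick sanity check on the in-place behavior used above, random.shuffle() returns None and reorders the list it is given:

import random

items = ["a", "b", "c"]
print(random.shuffle(items))  # prints None: shuffle() has no return value
print(items)                  # prints the same three items, now in shuffled order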

The thing is that the Python docs on random.shuffle() clearly state:

Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator.
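
For scale: a file with 1,000,000 lines has 1000000! possible orderings, which dwarfs the Mersenne Twister's period of 2**19937 - 1, so the vast majority of orderings can never come out of random.shuffle(). Separately from that period question, the memory cost of my current approach could be reduced by shuffling byte offsets instead of the lines themselves. A rough sketch of that idea (the function names are just illustrative, the files are assumed to be UTF-8, and the same period caveat still applies):

import random

def shuffled_line_offsets(filepath):
    # Record the byte offset at which each line starts
    offsets = []
    with open(filepath, "rb") as f:  # binary mode, so offsets are plain byte counts
        position = 0
        for line in f:
            offsets.append(position)
            position += len(line)
    random.shuffle(offsets)  # in place, subject to the same period caveat as above
    return offsets

def random_iteration_by_offset(filepath):
    # Visit every line exactly once, in shuffled order, without keeping
    # a second shuffled copy of the file contents in memory
    offsets = shuffled_line_offsets(filepath)
    with open(filepath, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            line = f.readline().decode()  # assumes UTF-8 text
            result = line_function(line)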

So the question is:

What would be the fastest and most efficient way to make my setup work as intended?

Further info:

There is a reason why I want to apply line_function() to the lines in random order rather than in the order they appear in the file. Also note that I strongly prefer to process each line only once.

Finally, shuffling the text file up front, or dividing it into smaller files, unfortunately isn't an option, and isn't what I am asking about.


Any insights are more than welcome! Thanks in advance, guys.



