Ok, so I have multiple text files, each containing well over 500,000 or even 1,000,000 lines.
Currently I do something like this:
import random

def line_function(line):
    # Do something with the given line
    ...

def random_iteration(filepath):
    with open(filepath) as f:
        lines = f.readlines()
    random.shuffle(lines)  # shuffles in place and returns None
    for line in lines:
        result = line_function(line)
The thing is that the Python Docs on random.shuffle()
clearly state (emphasis added by me):
Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator.
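For context, the period of the Mersenne Twister is 2**19937 - 1, so the cutoff quoted above can be checked directly (just a quick sanity-check sketch of the docs' claim, nothing more):

import math

# The Mersenne Twister period is 2**19937 - 1; per the docs, 2080 is the
# longest sequence whose permutation count still fits within that period.
period = 2 ** 19937 - 1
print(math.factorial(2080) <= period)  # expected: True
print(math.factorial(2081) <= period)  # expected: False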
So the question is:
What would be the fastest and most efficient way to make my setup work as intended?
Further info:
There is a reason why I want to apply line_function() to the lines in random order rather than simply iterating over them in the order they appear in the file. Also note that I strongly prefer to process each line only once.
Finally, shuffling the text file up front or dividing it into smaller files unfortunately isn't an option, and it isn't what I am asking about.
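In case it helps clarify what I mean by "process each line once, in random order", here is a minimal sketch of the same loop driven by shuffled indices instead of a shuffled line list (it still relies on random.shuffle, so the period concern from the docs applies exactly the same way):

import random

def line_function(line):
    # Placeholder for the real per-line work
    ...

def random_iteration_by_index(filepath):
    with open(filepath) as f:
        lines = f.readlines()
    # Shuffle a list of indices rather than the line objects themselves;
    # each index (and therefore each line) is visited exactly once.
    indices = list(range(len(lines)))
    random.shuffle(indices)
    for i in indices:
        result = line_function(lines[i])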
Any insights are more than welcome! Thanks in advance, guys.