Friday, July 27, 2018

Randomly rearrange lines in a huge ndjson file (45 GB)

I have an ndjson file (every line is a valid JSON object) containing data that I want to run through a topic model. As it happens, this data is sorted a) by user and b) by time, so the structure of the overall file is far from random. However, for the purpose of pushing this file (which I will later chunk into n smaller ndjson files of 50k lines each) through the topic model, I need every feature that appears in the data to have the same probability of appearing in any given line. My idea for achieving this is to randomly reorder all the lines in the file. The file I'm working with has 11,502,106 lines and is 45 GB in size. I also have a gzipped version of the file, which is approximately 4 GB.
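As for the chunking step mentioned above, a quick sketch assuming GNU coreutils split is available (shuffled.json and chunk_ are placeholder names):

# 50,000 lines per piece, numeric 3-digit suffixes (enough for the ~231 chunks this file yields)
split -l 50000 -d -a 3 --additional-suffix=.json shuffled.json chunk_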

My idea for solving this problem was to use the shuf command (part of GNU coreutils, available out of the box on Ubuntu) to extract the same number of lines as the original file contains and direct the output to a new file. I did this as follows:

nohup shuf -n 11502106 original_file.json > new_file_randomorder.json &

However, this process gets killed by the system after running for approximately 5 minutes. I'm guessing that I'm running out of memory (my machine has 16 GB of RAM). I'm running Ubuntu 16.04.

I realise that this could be a very complicated task, given that the file size exceeds the available memory.
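For reference, a decorate-sort-undecorate pipeline might sidestep the memory limit, since GNU sort does an external merge sort that spills to temporary files on disk rather than holding everything in RAM. A rough sketch, assuming GNU awk, sort, and cut are available (/mnt/scratch stands in for any directory with enough free disk space; the -S buffer size is just an example value):

# 1. prefix every line with a random key and a tab (valid JSON can't contain a raw tab)
# 2. sort numerically on that key; -T points the temp files at a big disk, -S caps the RAM buffer
# 3. strip the key again
awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' original_file.json \
    | sort -t $'\t' -k1,1 -n -S 8G -T /mnt/scratch \
    | cut -f 2- > new_file_randomorder.json

The same pipeline could also be fed from the gzipped copy by starting it with gzip -dc, at the cost of decompression time.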

If anyone has any ideas or potential solutions to this problem, that would be greatly appreciated! Thanks a lot in advance!



