Tuesday, 31 January 2023

How to shuffle a big JSON file?

I have a JSON file with 1,000,000 entries in it (size: 405 MB). It looks like this:

[
  {
     "orderkey": 1,
     "name": "John",
     "age": 23,
     "email": "john@example.com"
  },
  {
     "orderkey": 2,
     "name": "Mark",
     "age": 33,
     "email": "mark@example.com"
  },
...
]

The data is sorted by "orderkey", and I need to shuffle it.

I tried the following Python code. It worked for a smaller JSON file, but not for my 405 MB one.

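import json
import random

with open("sorted.json") as f:
    data = json.load(f)  # reads the whole array into memory

random.shuffle(data)  # shuffles the list in place

# Open in write mode ("w"); the default read mode cannot be written to
with open("sorted.json", "w") as f:
    json.dump(data, f, indent=2)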

How can I do this?

UPDATE:

Initially I got the following error:

~/Desktop/shuffleData$ python3 toShuffle.py 
Traceback (most recent call last):
  File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
    data = json.load(f)
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)

I figured out that the problem was an extra "}" at the end of the JSON file. The file had the form [{...},{...}]}, which is not valid JSON.

Removing the extra "}" fixed the problem.
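
For anyone hitting the same "Extra data" error, here is a minimal sketch of how the stray trailing text could be located without editing a 405 MB file by hand. It relies on json.JSONDecoder.raw_decode from the standard library, which parses the first JSON value in a string and returns the index where it stopped. The helper names load_leading_json and shuffle_json_text, the seed parameter and the sample string are my own, made up for illustration. It still loads everything into memory, just like the code above.

```python
import json
import random


def load_leading_json(text):
    """Parse the first JSON value in text; return (value, trailing_text)."""
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = text.lstrip()
    value, end = decoder.raw_decode(stripped)
    # Whatever follows the first value is what json.load calls "Extra data".
    return value, stripped[end:].strip()


def shuffle_json_text(text, seed=None):
    """Shuffle the top-level array in text and return it as a JSON string."""
    data, trailing = load_leading_json(text)
    if trailing:
        # Report the stray characters instead of failing on them.
        print(f"Ignoring trailing data after the JSON value: {trailing[:20]!r}")
    random.Random(seed).shuffle(data)  # in-place shuffle, reproducible if seeded
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # Hypothetical sample with the same defect: a stray "}" after the array.
    sample = '[{"orderkey": 1}, {"orderkey": 2}, {"orderkey": 3}]}'
    print(shuffle_json_text(sample, seed=0))
```

For the real file, the text would come from open("sorted.json").read(). If the trailing part is anything other than the stray "}", it is probably safer to inspect the file than to ignore it.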



