Wednesday, March 25, 2020

Generating a big random JSON dataset for Elasticsearch

I am currently required to generate a few hundred GB worth of test data for an Elasticsearch index for load testing. This means hundreds of millions of "docs".

The format I need to generate looks like:

{
  "field1" : "<ip>",
  "field2" : "<uuid>",
  "field3" : "<date>",
  "field4" : <json-array of strings>,
  ...
  ...
}

for around 40-50 fields per doc. The generated data needs to match the index template of the specific index I am required to test.
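To make that concrete, a single doc along those lines could be produced in bash roughly like this. The field names, the fixed string array, uuidgen and GNU date are all placeholders on my side, not my actual index template:

# Rough sketch of generating one doc; field names and tools (uuidgen, GNU date)
# are placeholders, not the real template.
random_doc() {
  ip="$((RANDOM % 256)).$((RANDOM % 256)).$((RANDOM % 256)).$((RANDOM % 256))"
  id="$(uuidgen)"                                                  # one fresh UUID per doc
  ts="$(date -u -d "@$((RANDOM * 86400))" '+%Y-%m-%dT%H:%M:%SZ')"  # arbitrary timestamp
  printf '{ "field1" : "%s", "field2" : "%s", "field3" : "%s", "field4" : ["a", "b"] }\n' \
    "$ip" "$id" "$ts"
}
random_doc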

OK, so this sounds straightforward, right? A normal JSON dataset generator that can handle generating a few hundred million JSON docs is the way to go, provided I can find one that supports my format.

The problem is that the ES bulk upload API requires the upload to be supplied in the following way: for EACH doc, first a "command json" containing the metadata for that doc, then the JSON doc itself:

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }

The one free solution that looks like it might support generating huge datasets only supports generating JSON with a uniform format, which means I can't make it generate the command followed by the doc.

So I tried to generate the data with my own bash script, based on some pre-existing data (I randomize the doc-id and some other fields). The problem with that is that I need to run the script in parallel, up to hundreds of instances at once, to generate the data in a timely manner, and the /dev/urandom randomness in bash is "conflicting": it generates the -same- random data across the different scripts when they run in parallel, while I need the doc-id to be unique.
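The kind of fix I am hoping for is something that makes each doc-id unique no matter how many copies of the script run at once, along these lines (a sketch assuming standard Linux tools; the variable names are mine):

# Sketch: build the doc-id from the process PID plus a per-process counter plus
# fresh bytes from /dev/urandom, so parallel copies of the script cannot collide.
seq_no=$((seq_no + 1))
rand_hex="$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')"
doc_id="$$-${seq_no}-${rand_hex}"

# Or simply ask the kernel for a per-doc UUID:
doc_id="$(cat /proc/sys/kernel/random/uuid)"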

This is getting long, but any help for either

1) A free solution which can generate large datasets in JSON and in the format I need

OR

2) A solution for the bash random generation process when run in parallel (the sketch after this list shows the kind of thing I mean)

would be appreciated. Thanks.
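For the record, what I picture as the end result of option 2 looks roughly like this. gen_bulk.sh, its --prefix flag and the localhost endpoint are placeholders of mine, and GNU parallel is assumed to be installed:

# Fan out 100 copies of the (hypothetical) generator, each with a distinct prefix
# baked into its doc-ids, then feed the resulting files to the bulk API.
seq 100 | parallel -j 100 './gen_bulk.sh --prefix "job{}" > bulk-{}.ndjson'

for f in bulk-*.ndjson; do
  # In practice each file would have to be split further so a single bulk
  # request stays within Elasticsearch's request size limit.
  curl -s -H 'Content-Type: application/x-ndjson' \
       -XPOST 'http://localhost:9200/_bulk' --data-binary "@$f" > /dev/null
done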



