I am currently required to generate a few hundred GB worth of test data for an ElasticSearch index for load testing. This means hundreds of millions of "docs".
The format I need to generate looks like:
{
  "field1": <ip>,
  "field2": <uuid>,
  "field3": <date>,
  "field4": <json array of strings>,
  ...
}
for around 40-50 fields per doc. The generated data needs to match the index template of the specific index I have to test.
OK, so that sounds straightforward, right? A normal JSON dataset generator that can handle producing a few hundred million JSON docs is the way to go, provided I can find one that supports the format.
The problem is that the ES Bulk upload API requires the upload to be supplied in the following way: for EACH doc, first a "command JSON" containing the metadata for that doc, then the JSON doc itself:
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
The one free solution that looked like it might support generating huge datasets only supports generating JSON with a uniform format, which means I can't make it emit the command line followed by the doc line.
So I tried generating the data with my own bash script, based on some pre-existing data (I randomize the doc-id and some other fields). The problem with that is that I need to run the script in parallel, up to hundreds of copies at once, to generate the data in a timely manner. And the /dev/urandom-based randomization in bash is "conflicting": it produces the same random data across the different scripts when they run in parallel, while I need every doc-id to be unique.
This is getting long, but any help with either
1) a free solution that can generate large datasets in JSON, in the format I need,
OR
2) a fix for the bash random-generation process when run in parallel
would be appreciated. Thanks.