Wednesday, February 18, 2015

Generate Random Data for Cassandra DB

I have a big data project for school that requires us to build and query an 8-node Cassandra system. The system must contain at least seven terabytes of data, and I have to generate all of it myself. There is no requirement that the data be "relevant" to the assignment, i.e., each column can just be a random int. That said, each value must be random or based on a random sequence.


So I wrote a simple Java program that just generates random ints. It produces ~200 MB of random test data in ~120 s. Unless my math is off, I think I'm in a pickle.
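For reference, here is a minimal sketch of the kind of generator described above. The original program is not shown, so everything here is an assumption: the class name `RandomIntFile`, the CSV layout, the output file name, and the choice of `SplittableRandom` with a buffered writer. Buffering output is one common way to avoid per-write overhead, but whether it helps depends on where the original program actually spends its time.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.SplittableRandom;

public class RandomIntFile {
    // Builds one CSV row of `cols` random ints drawn from `rng`.
    static String row(SplittableRandom rng, int cols) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < cols; c++) {
            if (c > 0) sb.append(',');
            sb.append(rng.nextInt());
        }
        return sb.toString();
    }

    // Writes `rows` random rows to `out` through a buffered writer,
    // so each row does not trigger its own disk write.
    static void write(Path out, long rows, int cols, long seed) throws IOException {
        SplittableRandom rng = new SplittableRandom(seed);
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.US_ASCII)) {
            for (long r = 0; r < rows; r++) {
                w.write(row(rng, cols));
                w.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name and sizes, chosen only for illustration.
        write(Paths.get("random-ints.csv"), 100_000, 10, 42L);
    }
}
```

Timing a run of this on the target hardware, and comparing it with the ~200 MB per ~120 s figure, would show whether the generator itself or something else is the bottleneck.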


There are 35,000 200 MB units in 7 terabytes.

35,000 * 120 = 4,200,000 seconds

4,200,000 / 3600 ≈ 1167 hours

1167 / 24 ≈ 49 days


So it appears that it will take about 49 days to generate all the test data needed, which is obviously impractical. I'm looking for suggestions that will increase the rate at which I can generate data.
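The estimate above can be checked with a few lines of Java. This is only a sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a single generator running at the measured rate; the class and method names are made up for illustration.

```java
public class GenerationEstimate {
    // Days needed to generate totalBytes when each chunk of chunkBytes
    // takes secondsPerChunk to produce on a single machine.
    static double daysToGenerate(long totalBytes, long chunkBytes, double secondsPerChunk) {
        double chunks = (double) totalBytes / chunkBytes; // 35,000 for 7 TB / 200 MB
        double seconds = chunks * secondsPerChunk;        // 4,200,000 s at 120 s per chunk
        return seconds / 86_400.0;                        // 86,400 seconds per day
    }

    public static void main(String[] args) {
        long sevenTerabytes = 7_000_000_000_000L; // assumed decimal terabytes
        long chunk = 200_000_000L;                // assumed decimal megabytes
        System.out.println(daysToGenerate(sevenTerabytes, chunk, 120.0));
    }
}
```

With these inputs the result is about 48.6 days, which matches the rounded figure of 49 days.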


So far I'm considering:

Setting the replication factor to 8 to reduce the amount of data that needs to be generated, and also running the data generation program on all 8 nodes.
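To put a rough number on that idea, here is a sketch that rests on two loud assumptions not confirmed by the assignment: that the seven-terabyte requirement counts replicated bytes on disk (so a replication factor of 8 cuts the unique data to one eighth), and that generation speeds up linearly when all 8 nodes generate in parallel. It also ignores the time Cassandra needs to ingest and replicate the writes. The class and method names are hypothetical.

```java
public class NodePlan {
    // Hours of generation per node, given a replication factor and a number
    // of nodes generating in parallel at the same per-node rate.
    static double hoursPerNode(long totalBytes, int replicationFactor, int nodes,
                               long chunkBytes, double secondsPerChunk) {
        // ASSUMPTION: replicas count toward the total, so only 1/RF must be unique.
        double uniqueBytes = (double) totalBytes / replicationFactor;
        double chunks = uniqueBytes / chunkBytes;
        // ASSUMPTION: perfect linear speedup across nodes.
        double secondsPerNode = chunks * secondsPerChunk / nodes;
        return secondsPerNode / 3600.0;
    }

    public static void main(String[] args) {
        System.out.println(hoursPerNode(7_000_000_000_000L, 8, 8, 200_000_000L, 120.0));
    }
}
```

Under those assumptions the generation time drops from about 49 days to roughly 18 hours per node. If the requirement means seven terabytes of unique data, only the parallel speedup applies and the estimate is about 6 days.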


