mercredi 5 septembre 2018

SQL sample groups

I have a sqlite database that I can read as:

In [42]: df = pd.read_sql("SELECT * FROM all_vs_all", engine)                                                                                                                                                   


In [43]:                                                                                                                                                                                                        
In [43]: df.head()                                                                                                                                                                                              
Out[43]:                                                                                                                                                                                                        
                             user_data                           user_model  \                                                                                                                                  
0  037d05edbbf8ebaf0eca@172.16.199.165  037d05edbbf8ebaf0eca@172.16.199.165                                                                                                                                     
1  037d05edbbf8ebaf0eca@172.16.199.165   060210bf327a3e3b4621@172.16.199.33                                                                                                                                     
2  037d05edbbf8ebaf0eca@172.16.199.165   1141259bd36ba65bef02@172.21.44.180                                                                                                                                     
3  037d05edbbf8ebaf0eca@172.16.199.165  209627747e2af1f6389e@172.16.199.181                                                                                                                                     
4  037d05edbbf8ebaf0eca@172.16.199.165  303a1aff4ab6e3be82ab@172.21.112.182                                                                                                                                     

      score  Class   time_secs model_name  bin_id                                                                                                                                                               
0  0.283141      0  1514764800       Flow       0                                                                                                                                                               
1  0.999300      1  1514764800       Flow       0                                                                                                                                                               
2  1.000000      1  1514764800       Flow       0                                                                                                                                                               
3  0.206360      1  1514764800       Flow       0                                                                                                                                                               
4  1.000000      1  1514764800       Flow       0                                                                                                                                                               

As the table is too big I rather than reading the full table I select a random subset of rows:

This can be done very quckly as:

 random_query = "SELECT * FROM all_vs_all WHERE abs(CAST(random() AS REAL))/9223372036854775808 < %f AND %s" % (ratio, time_condition)                                                                       
    df = pd.read_sql(random_query, engine)                                                                                                                                                                      

The problem is that for each triplet [user_data, user_model, time_secs] I want to get all the rows containing that triplet. Each triplet appears 1 or 2 times.

A possible way to do it is to firstly sample a random set of triplets and then get all the rows that have one of the selected triplets but this seems to be too slow.

Is there an efficient way to do it?

EDIT: If I could load all the data in pandas I would have done something like:

selected_groups = []
for group in df.groupby(['user_data', 'user_model', 'time_secs']):
    if np.random.uniform(0,1) > ratio:
       selected_groups.append(group)
res = pd.concat(selected_groups)




Aucun commentaire:

Enregistrer un commentaire