I have an SQLite database that I can read as:
In [42]: df = pd.read_sql("SELECT * FROM all_vs_all", engine)

In [43]: df.head()
Out[43]:
                             user_data                            user_model  \
0  037d05edbbf8ebaf0eca@172.16.199.165   037d05edbbf8ebaf0eca@172.16.199.165
1  037d05edbbf8ebaf0eca@172.16.199.165    060210bf327a3e3b4621@172.16.199.33
2  037d05edbbf8ebaf0eca@172.16.199.165    1141259bd36ba65bef02@172.21.44.180
3  037d05edbbf8ebaf0eca@172.16.199.165   209627747e2af1f6389e@172.16.199.181
4  037d05edbbf8ebaf0eca@172.16.199.165   303a1aff4ab6e3be82ab@172.21.112.182

      score  Class   time_secs model_name  bin_id
0  0.283141      0  1514764800       Flow       0
1  0.999300      1  1514764800       Flow       0
2  1.000000      1  1514764800       Flow       0
3  0.206360      1  1514764800       Flow       0
4  1.000000      1  1514764800       Flow       0
As the table is too big, rather than reading the full table I select a random subset of rows. This can be done very quickly as:
random_query = "SELECT * FROM all_vs_all WHERE abs(CAST(random() AS REAL))/9223372036854775808 < %f AND %s" % (ratio, time_condition)
df = pd.read_sql(random_query, engine)
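Here 9223372036854775808 is 2**63: SQLite's random() returns a signed 64-bit integer, so abs(CAST(random() AS REAL))/2**63 is roughly uniform on [0, 1) and the comparison keeps about a ratio fraction of the rows. ratio and time_condition are set elsewhere in my code; illustrative values would be:

ratio = 0.001  # keep roughly 0.1% of the rows (illustrative value)
time_condition = "time_secs >= 1514764800"  # illustrative predicate on the time column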
The problem is that, for each triplet [user_data, user_model, time_secs], I want to get all the rows containing that triplet. Each triplet appears 1 or 2 times, so per-row sampling can return only one of a triplet's rows. A possible way to do it is to first sample a random set of triplets and then get all the rows matching one of the selected triplets (see the sketch below), but this seems to be too slow.
Is there an efficient way to do it?
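A minimal sketch of that triplet-first approach (the temporary table sampled_triplets and the to_sql round trip are just one way to express the second step):

# Sample a `ratio` fraction of the distinct triplets (same placeholders as above).
triplet_query = ("SELECT user_data, user_model, time_secs FROM "
                 "(SELECT DISTINCT user_data, user_model, time_secs "
                 " FROM all_vs_all WHERE %s) "
                 "WHERE abs(CAST(random() AS REAL))/9223372036854775808 < %f"
                 % (time_condition, ratio))
triplets = pd.read_sql(triplet_query, engine)

# Materialize the sampled triplets and join back to fetch all of their rows.
triplets.to_sql("sampled_triplets", engine, if_exists="replace", index=False)
df = pd.read_sql(
    "SELECT a.* FROM all_vs_all a "
    "JOIN sampled_triplets s "
    "ON a.user_data = s.user_data "
    "AND a.user_model = s.user_model "
    "AND a.time_secs = s.time_secs",
    engine)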
EDIT: If I could load all the data into pandas, I would have done something like:
import numpy as np

selected_groups = []
for key, group in df.groupby(['user_data', 'user_model', 'time_secs']):
    if np.random.uniform(0, 1) < ratio:  # keep the whole triplet with probability `ratio`
        selected_groups.append(group)
res = pd.concat(selected_groups)
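Equivalently, and a bit more concise, the same group-level sampling can be written with groupby().filter():

res = df.groupby(['user_data', 'user_model', 'time_secs']).filter(
    lambda g: np.random.uniform(0, 1) < ratio)  # keep or drop whole groups at random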