Monday, August 24, 2020

Why does Python not shuffle properly when given a large amount of data?

I have a csv file named KDDTrain+.csv, which has 43 columns and 125973 lines.

Column 42 indicates the label of a certain row.

I wrote a simple script that reads the csv file into a dataframe, shuffles it completely using pandas' sample function, and then goes through each half of the dataframe, counting how many rows are labeled "normal" and how many are not:

import pandas as pd

# Read the dataset and shuffle all rows with pandas' sample().
df = pd.read_csv("KDDTrain+.csv")
df = df.sample(frac=1).reset_index(drop=True)
results = df['42'].tolist()
half = len(results) // 2

# Count "normal" vs. other labels in the first half.
n = 0
a = 0
for res in results[:half]:
    if res == "normal":
        n += 1
    else:
        a += 1
print(f"n:{n}, a:{a}|")

# Count "normal" vs. other labels in the second half.
n = 0
a = 0
for res in results[half:]:
    if res == "normal":
        n += 1
    else:
        a += 1
print(f"n:{n}, a:{a}|")
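As an aside, the same per-half tallies can be computed more compactly with pandas' value_counts. This is only a sketch on a small synthetic label list standing in for the real column from KDDTrain+.csv:

```python
import pandas as pd

# Synthetic stand-in for the label column (the real data comes from KDDTrain+.csv).
labels = pd.Series(["normal", "normal", "attack", "normal", "attack", "attack"])

half = len(labels) // 2
first = labels[:half].value_counts().to_dict()
second = labels[half:].value_counts().to_dict()
print(first)   # label counts in the first half
print(second)  # label counts in the second half
```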

After running the code multiple times, it seems that the results are always very similar between the two halves, even though they are supposed to be shuffled. (Example: n:33627, a:29359| n:33716, a:29271|)

Note that I tried other methods of shuffling the data, such as SystemRandom, NumPy's random module, and multiple implementations of shuffle using the built-in random module, and all gave similar results.
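For concreteness, here is roughly what a NumPy-based shuffle would look like. This is my sketch on a toy DataFrame (the question's actual input is KDDTrain+.csv), not the exact code that was run:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe loaded from KDDTrain+.csv.
df = pd.DataFrame({"42": ["normal"] * 6 + ["anomaly"] * 4})

# Shuffle by permuting the row positions, then reset the index.
perm = np.random.permutation(len(df))
shuffled = df.iloc[perm].reset_index(drop=True)
```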

Why does Python not really shuffle the data?



