I have many huge gzipped Feather dataframes and want to read in a random sample of each of them. My idea was to use the random module together with skiprows to pick and read the sample rows. The code looks like this:
df = pd.DataFrame(columns=column_names)  # create empty df
for filename in os.listdir(path):
    with gzip.open(os.path.join(path, filename)) as f:
        n = sum(1 for row in filename)  # number of rows in dataframe
        s = 10  # desired sample size
        skip = sorted(random.sample(range(n), n - s))  # compute skipped rows
        samples = pd.read_feather(f, skiprows=skip)  # read samples
        df = df.append(samples)  # append to df
Hope someone has an idea for that.
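
One possible direction, offered as a minimal sketch rather than a definitive answer: pd.read_feather does not take a skiprows argument (that option belongs to read_csv), so the sketch below decompresses each file, loads the full frame, and then draws the sample with DataFrame.sample. The helper names read_gzipped_feather, sample_frame and sample_directory are hypothetical, and the sketch assumes pyarrow is installed and that each decompressed file fits in memory, which may not hold for very large files.

```python
import gzip
import io
import os

import pandas as pd


def read_gzipped_feather(file_path):
    """Hypothetical helper: decompress one .gz file and parse it as Feather."""
    with gzip.open(file_path, "rb") as f:
        # Decompress fully into memory so the Feather reader gets a
        # seekable buffer (assumes the file fits in RAM).
        buffer = io.BytesIO(f.read())
    return pd.read_feather(buffer)  # requires pyarrow


def sample_frame(frame, s=10, seed=None):
    """Return up to s random rows; fewer if the frame is shorter than s."""
    return frame.sample(n=min(s, len(frame)), random_state=seed)


def sample_directory(path, s=10, seed=None):
    """Sample every gzipped Feather file in path and stack the results."""
    samples = []
    for filename in sorted(os.listdir(path)):
        frame = read_gzipped_feather(os.path.join(path, filename))
        samples.append(sample_frame(frame, s, seed))
    # Concatenate once at the end instead of appending inside the loop.
    # pd.concat raises ValueError if the directory contained no files.
    return pd.concat(samples, ignore_index=True)
```

Collecting the samples in a list and calling pd.concat once avoids DataFrame.append, which was removed in pandas 2.0. With the path variable from the question, a call would look like df = sample_directory(path, s=10).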