Tuesday, 23 June 2020

Read a sample of a large gzipped Feather dataframe without loading the whole file

I have many huge gzipped Feather dataframes, and I want to read a random sample of rows from each of them. My idea was to use the random module together with a skiprows argument so that only the sampled rows are read. My code looks like this:

import gzip
import io
import os

import pandas as pd

s = 10  # desired sample size per file
samples = []  # one sampled frame per file

for filename in os.listdir(path):
    with gzip.open(os.path.join(path, filename), "rb") as f:
        # pd.read_feather has no skiprows parameter, so decompress,
        # load the whole frame, and sample it afterwards
        frame = pd.read_feather(io.BytesIO(f.read()))
    samples.append(frame.sample(n=min(s, len(frame))))  # keep at most s rows

df = pd.concat(samples, ignore_index=True)  # DataFrame.append was removed in pandas 2.0

I hope someone has an idea for doing this without loading each file in full.
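
For reference, the index arithmetic behind the "skip everything except the sample" idea can be sketched on its own, independent of any file format. This is only a minimal illustration: the helper name `sample_row_indices` and its parameters are made up for this example and are not part of pandas or of the code above. It draws the rows to keep with `random.sample` and derives the complementary skip list from them.

```python
import random


def sample_row_indices(n_rows, sample_size, seed=None):
    """Return (keep, skip): sorted row indices to keep and to skip.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    k = min(sample_size, n_rows)  # cannot keep more rows than exist
    keep = sorted(rng.sample(range(n_rows), k))
    keep_set = set(keep)
    # Every row that was not drawn ends up in the skip list
    skip = [i for i in range(n_rows) if i not in keep_set]
    return keep, skip


keep, skip = sample_row_indices(n_rows=1000, sample_size=10, seed=0)
print(len(keep), len(skip))  # 10 990
```

Drawing the short keep list first avoids sampling n - s indices directly, and the `min` guard avoids the `ValueError` that `random.sample` raises when a file has fewer rows than the requested sample. On a frame that is already loaded, `frame.iloc[keep]` would select the sampled rows.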



