I have the following Python function in a Jupyter notebook:
    import random
    import numpy as np

    def remove_random(df, group_counts):
        df = df.copy()
        blanks = set()
        if isinstance(group_counts, list):
            for groups, count in group_counts:
                for _, row in df.iterrows():
                    current = 0
                    while current < groups:
                        start = random.randint(0, len(row) - count)
                        cut = {x for x in range(start, start + count)}
                        if not cut.issubset(blanks):
                            current += 1
                            for i in range(start, start + count):
                                row[i] = np.nan
                                blanks.add(i)
        else:
            groups = group_counts[0]
            count = group_counts[1]
            for _, row in df.iterrows():
                i = 0
                print(_)
                while i < groups:
                    start = random.randint(0, len(row) - count)
                    cut = {x for x in range(start, start + count)}
                    if not cut.issubset(blanks):
                        for i in range(start, start + count):
                            row[i] = np.nan
                            blanks.add(i)
                        i += 1
        return df
The function removes n groups of j random points from each row; the removed indices may not repeat and must be unique on every iteration. df is a pandas DataFrame, and group_counts is either a tuple or a list of tuples of the form (n, j).
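To make the two accepted shapes of group_counts concrete, here is a minimal sketch (the variable names are illustrative, not from the notebook):

```python
# A single spec: remove n=2 groups of j=3 points each.
single = (2, 3)

# A list of specs, applied one after another.
multiple = [(2, 3), (1, 5)]

# remove_random dispatches on the container type, mirroring
# its isinstance(group_counts, list) check.
assert isinstance(multiple, list)
assert not isinstance(single, list)
```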
When printing the row index as in the code above, the console shows that the rows of df are processed up to roughly row 400, give or take a few; when I kill the notebook, the traceback usually points at random.randint(), though I have seen it land on issubset() as well. I have tried reseeding the random number generator, but that did not fix the issue. Beyond that I cannot find any bug in my code that would cause this. Moreover, judging by the output speed, the function runs quickly until it stalls around row 400, so I don't think issubset() itself is causing a slowdown.
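For what it's worth, one thing to note is that blanks is never reset between rows, so it can eventually contain every index of a row. A hedged, standalone sketch of the same rejection-sampling step (the helper name, row length, and attempt cap below are illustrative, not from the notebook) shows how that would make the while-loop spin forever on random.randint()/issubset():

```python
import random

def try_to_place(row_len, count, blanks, max_attempts=10_000):
    """Attempt the same rejection sampling remove_random uses.

    Returns an accepted start index, or None if every attempt
    produced a window already fully contained in `blanks`.
    """
    for _ in range(max_attempts):
        start = random.randint(0, row_len - count)
        cut = set(range(start, start + count))
        if not cut.issubset(blanks):
            return start
    return None

random.seed(0)
row_len, count = 10, 3

# Fresh set: a window is found almost immediately.
assert try_to_place(row_len, count, set()) is not None

# Saturated set (every index already blanked): no window can
# ever be accepted, so an unbounded while-loop would never exit.
saturated = set(range(row_len))
assert try_to_place(row_len, count, saturated) is None
```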