Monday, June 29, 2020

Python's random.randint() function appears to hang after a few hundred iterations in a for loop?

I have the following Python function in a Jupyter Notebook:

import random

import numpy as np


def remove_random(df, group_counts):
    df = df.copy()
    blanks = set()
    if isinstance(group_counts, list):
        for groups, count in group_counts:
            for _, row in df.iterrows():
                current = 0
                while current < groups:
                    start = random.randint(0, len(row) - count)
                    cut = {x for x in range(start, start + count)}
                    if not cut.issubset(blanks):
                        current += 1
                        for i in range(start, start + count):
                            row[i] = np.nan
                            blanks.add(i)
    else:
        groups = group_counts[0]
        count = group_counts[1]
        
        for _, row in df.iterrows():
            i = 0
            print(_)
            while i < groups:
                start = random.randint(0, len(row) - count)
                cut = {x for x in range(start, start + count)}
                if not cut.issubset(blanks):
                    for i in range(start, start + count):
                        row[i] = np.nan
                        blanks.add(i)
                    i += 1
    return df

The function is meant to remove n groups of j consecutive points, at random positions, from each row. The removed indices must not repeat: they have to be unique on each iteration. df is a pandas DataFrame, and group_counts is either a single tuple of the form (n, j) or a list of such tuples.
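To make the intended behaviour concrete, here is a minimal sketch of the selection step for a single row. It is not the function above, only an illustration under stated assumptions: the helper name pick_blank_positions and its max_attempts cap are hypothetical, "unique" is read as "no position is chosen twice within one row", and the set of chosen positions starts empty on every call.

```python
import random


def pick_blank_positions(row_length, groups, count, rng, max_attempts=10_000):
    """Choose `groups` runs of `count` consecutive positions, with no overlap.

    Hypothetical helper: the name and the `max_attempts` cap are assumptions
    made for this illustration.
    """
    if groups * count > row_length:
        raise ValueError("row is too short for the requested groups")

    blanks = set()  # positions chosen so far, for this row only
    placed = 0
    attempts = 0
    while placed < groups:
        # Give up instead of looping forever when no free run can be found.
        if attempts >= max_attempts:
            raise RuntimeError("could not place all groups")
        attempts += 1

        start = rng.randint(0, row_length - count)  # both bounds inclusive
        cut = set(range(start, start + count))
        if cut.isdisjoint(blanks):  # accept only runs that reuse no position
            blanks |= cut
            placed += 1
    return blanks


# Example: 3 groups of 2 positions in a row of length 20.
positions = pick_blank_positions(20, groups=3, count=2, rng=random.Random(0))
print(sorted(positions))
```

The returned positions could then be set to NaN in the corresponding row. The attempt cap is a design choice for the sketch: rejection sampling like this has no guaranteed end once the free positions run out, so failing loudly is easier to debug than a silent stall.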

When I print the row index as in the code above, the console shows that the rows of df are processed up to around row 400, give or take a few. When I then interrupt the notebook, the traceback usually points to random.randint(), though I have seen it land on issubset() as well. I have tried reseeding the random number generator, but that did not fix the issue. Other than that, I cannot find any bugs in my code that would cause this problem. Moreover, as far as I can tell from the output speed, the function runs quickly until it stalls around row 400, so I don't think issubset() is causing a slowdown.



