Monday, December 24, 2018

Generate a random numpy array from a given list of elements with at least one occurrence of each element

I want to create an array (say output_list) by resampling a given numpy array (say input_list) such that every element of input_list appears in output_list at least once. The length of output_list will always be greater than the length of input_list.

I tried a few approaches and am looking for a faster method. Unfortunately, numpy's random.choice doesn't guarantee that every element appears at least once.

Step 1: Generate Data

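import string
import random
import numpy as np

size = 150000
output_size = size  # assumption: size is the intended length of output_list
# Assumption: the original read this from dict_data[1]['unique_len'], which is
# not shown here; any value smaller than output_size works for this example.
unique_len = 100000
chars = string.digits + string.ascii_lowercase
# Build unique_len random 5-character strings
input_list = [
    "".join(random.choice(chars) for _ in range(5))
    for _ in range(unique_len)
]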

Option 1: Let's try numpy's random.choice with uniform probabilities.

# Sample with replacement; np.random.choice is uniform when p is omitted
output_list = np.random.choice(
    input_list,
    size=output_size,
    replace=True,
)
# Every distinct input element must appear in the output
assert len(set(input_list)) == len(set(output_list)), \
    "Output list has fewer elements than input list"

This raises an AssertionError:

Output list has fewer elements than input list
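
To check that this failure is expected rather than bad luck, here is a rough sketch of the arithmetic. With n distinct inputs and m uniform draws with replacement, a given element is missed with probability (1 - 1/n)^m, so about n(1 - 1/n)^m elements should be absent on average. The function name expected_missing and the sizes below are illustrative assumptions, not values from my actual data.

```python
import numpy as np


def expected_missing(n_unique, n_draws):
    """Expected count of distinct inputs that never appear in
    n_draws uniform draws with replacement from n_unique items."""
    return n_unique * (1.0 - 1.0 / n_unique) ** n_draws


# Illustrative sizes (assumed, much smaller than the real data)
n_unique, n_draws = 1000, 1500

# Simulate the draws on integer indices and count how many never show up
rng = np.random.default_rng(0)
draws = rng.integers(0, n_unique, size=n_draws)
observed_missing = n_unique - np.unique(draws).size

print(expected_missing(n_unique, n_draws), observed_missing)
```

Unless the number of draws is many times larger than the number of distinct inputs, some elements will almost always be missing.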

Option 2: Let's append randomly sampled elements to input_list and then shuffle the result.

# Keep one copy of every input element, then top up with uniform random draws
output_list = np.concatenate((
    np.array(input_list),
    np.random.choice(
        input_list,
        size=output_size - len(input_list),
        replace=True,
    ),
), axis=None)

# Shuffle in place so the guaranteed copies are not all at the front
np.random.shuffle(output_list)
assert len(set(input_list)) == len(set(output_list)), \
    "Output list has fewer elements than input list"

While this doesn't raise an assertion error, I am looking for a faster solution, either algorithmically or using one of numpy's built-in functions.
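
For reference, one variant of the pad-and-shuffle idea is to do all the random work on integer indices and touch the strings only once at the end. This is a minimal sketch, not a benchmarked solution: the helper name resample_with_coverage is hypothetical, and it assumes the input is one-dimensional and that the newer np.random.default_rng Generator API is available.

```python
import numpy as np


def resample_with_coverage(items, output_size, rng=None):
    """Return output_size samples drawn from items such that every
    item appears at least once (hypothetical helper, 1-D input assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    items = np.asarray(items)
    n = items.size
    if output_size < n:
        raise ValueError("output_size must be at least the number of items")
    # One guaranteed index per item, plus uniformly drawn extra indices
    idx = np.concatenate((np.arange(n), rng.integers(0, n, size=output_size - n)))
    # Shuffle the integer indices in place, then gather the items once
    rng.shuffle(idx)
    return items[idx]


sample = resample_with_coverage(["a", "b", "c"], 10, np.random.default_rng(0))
```

Whether this is actually faster than shuffling the string array directly would need to be measured on the real data.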

Thanks for any help.



