Thursday, 16 September 2021

Random Python mask array with a rule: no two (or not too many) adjacent False values

I want to mask some of the words in this text:

['First', 'Citizen:', 'Before', 'we', 'proceed', 'any', 'further,', 'hear', 'me', 'speak.', 'All:', 'Speak,', 'speak.', 'First', 'Citizen:', 'You', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish?', 'All:', 'Resolved.', 'resolved.', 'First', 'Citizen:', 'First,', 'you', 'know', 'Caius', 'Marcius', 'is', 'chief', 'enemy', 'to', 'the', 'people.']

using a boolean mask. I implemented a random mask as follows:

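import numpy as np

LEN = len(words)  # words is the list of tokens shown above
mask = np.ones(LEN, dtype=int)
maskrate = 0.2  # fraction of masked words, must be < 1
nbmask = int(np.floor(LEN * maskrate))
mask[LEN - nbmask:] = 0  # zero the last nbmask entries (safe when nbmask == 0)
np.random.shuffle(mask)
mask = mask.astype(bool)  # True = keep the word, False = mask it
print(mask)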

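masked_words = []

# Keep a word where the mask is True, replace it with '_' where it is False
for word, keep in zip(words, mask):
    if keep:
        masked_words.append(word)
    else:
        masked_words.append('_')

print(' '.join(masked_words))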

However, I would like to avoid masking too many words in a row. If numerically possible, no two contiguous words should be masked. The output below shows the problem (see "than _ _ All:"):

_ Citizen: _ we proceed _ further, _ me speak. All: _ speak. First Citizen: You are all resolved rather to die than _ _ All: _ resolved. First Citizen: First, you know Caius Marcius is chief enemy to the people.
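
To make "too many words in a row" measurable, here is a minimal sketch of a check. The helper name longest_masked_run is hypothetical (it is not a NumPy function); it assumes the same convention as the code above, where False marks a masked word.

```python
import numpy as np

def longest_masked_run(mask):
    """Return the length of the longest run of consecutive False values.

    Hypothetical helper: False is assumed to mean "masked".
    """
    longest = current = 0
    for keep in np.asarray(mask, dtype=bool):
        # Extend the current run on a masked word, reset it on a kept word
        current = 0 if keep else current + 1
        longest = max(longest, current)
    return longest
```

A result of 2 or more means that at least two adjacent words were masked, which is the case I want to avoid.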

I would like the masked positions to stay random, but to be slightly more evenly distributed.
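
One possible way to enforce the strict version of the rule (no two adjacent masked words) is sketched below. This is only an illustration under assumptions, not a known reference solution: the function name nonadjacent_mask and its parameters are hypothetical. The idea is to pick k distinct slots among n - k + 1, sort them, and shift the i-th slot by i, which leaves a gap of at least one kept word between masked positions. It only works when k <= (n + 1) // 2.

```python
import numpy as np

def nonadjacent_mask(n, rate, rng=None):
    """Return a boolean array of length n where False marks a masked word.

    Hypothetical helper: no two False values are adjacent.
    Raises ValueError when the requested rate makes that impossible.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = int(np.floor(n * rate))  # number of words to mask
    if k > (n + 1) // 2:
        raise ValueError("too many masked words to keep them non-adjacent")
    # Choose k distinct slots out of n - k + 1 and sort them
    slots = np.sort(rng.choice(n - k + 1, size=k, replace=False))
    # Shifting the i-th slot by i guarantees a gap of at least 2 between positions
    positions = slots + np.arange(k)
    mask = np.ones(n, dtype=bool)
    mask[positions] = False
    return mask
```

With the text above, nonadjacent_mask(len(words), 0.2) could replace the shuffled mask; the rest of the masking loop stays the same. This does not cover the looser "not too many in a row" variant, which would need a different approach.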



