I want to mask some of the words in this text:
['First', 'Citizen:', 'Before', 'we', 'proceed', 'any', 'further,', 'hear', 'me', 'speak.', 'All:', 'Speak,', 'speak.', 'First', 'Citizen:', 'You', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish?', 'All:', 'Resolved.', 'resolved.', 'First', 'Citizen:', 'First,', 'you', 'know', 'Caius', 'Marcius', 'is', 'chief', 'enemy', 'to', 'the', 'people.']
using a boolean mask in Python. So far I build a random mask like this:
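import numpy as np

LEN = len(words)  # words is the list shown above
mask = np.ones(LEN, dtype=int)
maskrate = 0.2  # fraction of words to mask (must be < 1)
nbmask = int(np.floor(LEN * maskrate))
mask[LEN - nbmask:] = 0  # 0 marks a masked word; safe even when nbmask is 0
np.random.shuffle(mask)
mask = mask.astype(bool)
print(mask)
masked_words = []
for a, b in zip(words, mask):
    print(a)
    if b:
        masked_words.append(a)  # keep the word
    else:
        masked_words.append('_')  # replace a masked word with a placeholder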
However, I would like to avoid masking too many words in a row. If numerically possible, no two adjacent words should be masked, unlike what happens here (see "than _ _ All:"):
_ Citizen: _ we proceed _ further, _ me speak. All: _ speak. First Citizen: You are all resolved rather to die than _ _ All: _ resolved. First Citizen: First, you know Caius Marcius is chief enemy to the people.
I would like the randomness to be spread a bit more evenly ...
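One possible way to meet the "no two adjacent masked words" constraint, as a minimal sketch rather than a definitive answer: draw the masked positions from a shorter range, sort them, and then shift the i-th position right by i so that at least one kept word sits between any two masked ones. The helper name `nonadjacent_mask` and its `rng` parameter are hypothetical and do not appear in the code above. When the requested rate is too high to satisfy the constraint, this sketch lowers the number of masked words to the largest feasible count.

```python
import numpy as np

def nonadjacent_mask(n, maskrate, rng=None):
    """Return a boolean array of length n where False marks a masked word.

    No two False entries are adjacent. Hypothetical helper for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = int(np.floor(n * maskrate))
    # At most ceil(n / 2) words can be masked without two being adjacent.
    k = min(k, (n + 1) // 2)
    # Pick k distinct slots out of n - k + 1 and sort them.
    slots = np.sort(rng.choice(n - k + 1, size=k, replace=False))
    # Shifting the i-th slot by i leaves a gap of at least one kept word.
    masked_idx = slots + np.arange(k)
    mask = np.ones(n, dtype=bool)
    mask[masked_idx] = False
    return mask

words = "First Citizen: Before we proceed any further, hear me speak.".split()
mask = nonadjacent_mask(len(words), 0.2, np.random.default_rng(0))
masked_words = [w if keep else "_" for w, keep in zip(words, mask)]
print(" ".join(masked_words))
```

The returned mask follows the same convention as the original code (True keeps the word, False masks it), so it can be passed straight to the existing zip loop. It only guarantees that masked words are never adjacent; it does not force the gaps to be of similar size.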