I'm working on an assignment.
Given a file that contains one token (word) per line, I'm supposed to compute some entropies of a randomly messed-up text.
For every character in the text, mess it up with a likelihood of 10%. If a character is chosen to be messed up, map it to a randomly chosen character from the set of characters that appear in the text. Since there is some randomness in the outcome of the experiment, run the experiment 10 times...
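Taken literally, that describes character-level corruption, so my reading of the task is something like the sketch below (the name mess_up_characters and treating each token as a list of characters are my own choices, not part of the assignment):

import random

def mess_up_characters(tokens_list, probability, vocabulary):
    # replace each character independently with the given probability
    for i, token in enumerate(tokens_list):
        chars = list(token)
        for j in range(len(chars)):
            if random.random() < probability:
                chars[j] = random.choice(vocabulary)
        tokens_list[i] = ''.join(chars)

My actual code below works at the word level instead (that part of the experiment replaces whole tokens), but the performance problem is the same either way.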
Here's some of my code:
import random

# `file` is an already-open file handle with one token per line
tokens = file.read().splitlines()
character_set = list(set(''.join(tokens)))

entropies_messed = []
p = 0.1
for e in range(10):
    # get a new copy of the tokens and mess it up
    tokens_messed_up = list(tokens)
    mess_up_words(tokens_messed_up, p, character_set)
    entropies_messed.append(bigram_entropy(tokens_messed_up))
And here is the messing function:
def mess_up_words(tokens_list, probability, vocabulary):
    # re-seed the RNG from system entropy
    random.seed()
    for i in range(len(tokens_list)):
        # with the given probability, replace the token with a random character
        if random.random() < probability:
            tokens_list[i] = random.choice(vocabulary)
However, this is incredibly slow, especially because the whole experiment is then repeated for several other probabilities (a sketch of that outer loop is at the end of this post). Is there any way to make it faster? This is my entropy function, just in case:
from collections import Counter, defaultdict
from math import log2

def bigram_entropy(tokens):
    # total number of bigrams
    N = len(tokens) - 1
    # get bigram frequencies, for estimating p(i,j)
    bigram_freq = Counter(zip(tokens[:-1], tokens[1:]))
    # group counts by the first word to allow (faster) lookups, for estimating p(j|i)
    bigram_dictionary = defaultdict(Counter)
    for bigram, freq in bigram_freq.items():
        bigram_dictionary[bigram[0]][bigram[1]] = freq
    # compute entropy
    entropy = 0
    for bigram in bigram_freq.keys():
        # p(i,j) ~ c(i,j) / number of bigrams
        # p(j|i) ~ c(i,j) / number of bigrams whose first word is i
        entropy -= ((bigram_freq[bigram] / N) *
                    log2(bigram_freq[bigram] / sum(bigram_dictionary[bigram[0]].values())))
    return entropy
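For context, the outer loop I mentioned looks roughly like this (only a sketch; the probability values below are placeholders, not the exact list from the assignment):

# repeat the 10-trial experiment for each probability
probabilities = [0.001, 0.01, 0.05, 0.1]  # placeholder values
results = {}
for p in probabilities:
    entropies_messed = []
    for e in range(10):
        tokens_messed_up = list(tokens)
        mess_up_words(tokens_messed_up, p, character_set)
        entropies_messed.append(bigram_entropy(tokens_messed_up))
    results[p] = entropies_messed

So bigram_entropy ends up being called dozens of times on the full token list, which is where most of the time goes.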