Wednesday, February 1, 2017

Python: randomly messing up text is too slow

I'm working on an assignment.

Given a file that contains one token (word) per line, I'm supposed to compute some entropies of a randomly messed-up text.

For every character in the text, mess it up with a likelihood of 10%. If a character is chosen to be messed up, map it into a randomly chosen character from the set of characters that appear in the text. Since there is some randomness to the outcome of the experiment, run the experiment 10 times...
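As I understand the assignment, the per-character corruption step looks roughly like this (a minimal sketch; the helper name `corrupt` and its exact shape are mine, only the 10% rule and the "characters that appear in the text" alphabet come from the assignment):

```python
import random

def corrupt(text, p=0.1):
    # replacement alphabet: the characters that appear in the text itself
    alphabet = list(set(text))
    out = []
    for ch in text:
        # with probability p, replace this character with a random one
        out.append(random.choice(alphabet) if random.random() < p else ch)
    return ''.join(out)
```

Note that corrupting never changes the length of the text, only which characters occupy each position.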

Here's some of my code:

import random

tokens = file.read().splitlines()  # `file` is an already-opened file object
character_set = list(set(''.join(tokens)))

entropies_messed = []
p = 0.1
for e in range(10):
    # get a fresh copy of the tokens and mess it up
    tokens_messed_up = list(tokens)
    mess_up_words(tokens_messed_up, p, character_set)
    entropies_messed.append(bigram_entropy(tokens_messed_up))

The messing function:

def mess_up_words(tokens_list, probability, vocabulary):
    # reset seed
    random.seed()
    for i in range(len(tokens_list)):
        # with the given probability, replace the token with a random character
        if random.random() < probability:
            tokens_list[i] = random.choice(vocabulary)

However, this is incredibly slow, especially since the whole experiment is then repeated for several other probabilities. Is there any way to make it faster? Here is my entropy function, just in case:

from collections import Counter, defaultdict
from math import log2

def bigram_entropy(tokens):
    # total number of bigrams
    N = len(tokens) - 1
    # bigram frequencies, for estimating p(i,j)
    bigram_freq = Counter(zip(tokens[:-1], tokens[1:]))
    # index the bigram counts by first word, for estimating p(j|i)
    bigram_dictionary = defaultdict(Counter)
    for bigram, freq in bigram_freq.items():
        bigram_dictionary[bigram[0]][bigram[1]] = freq
    # compute the conditional entropy
    entropy = 0
    for bigram, freq in bigram_freq.items():
        # p(i,j) ~ c(i,j) / number of bigrams
        joint = freq / N
        # p(j|i) ~ c(i,j) / number of bigrams whose first word is i
        conditional = freq / sum(bigram_dictionary[bigram[0]].values())
        entropy -= joint * log2(conditional)
    return entropy
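One way the messing step itself could be sped up (my sketch, not part of the original code; the name `mess_up_words_fast` is mine and it assumes NumPy is available) is to draw all the random decisions in bulk instead of calling `random.random()` and `random.choice()` once per token:

```python
import numpy as np

def mess_up_words_fast(tokens_list, probability, vocabulary):
    # draw all the per-token coin flips in one vectorized call
    mask = np.random.random(len(tokens_list)) < probability
    positions = np.flatnonzero(mask)
    # draw all the replacement characters in one call as well
    replacements = np.random.choice(vocabulary, size=len(positions))
    for i, repl in zip(positions, replacements):
        tokens_list[i] = repl
```

The function keeps the same in-place interface as `mess_up_words`, so it could be dropped into the experiment loop unchanged; moving `random.seed()` out of the per-run function would also avoid reseeding on every iteration.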



