mercredi 24 juin 2015

Python Custom Zipf Number Generator Performing Poorly

I needed a custom Zipf-like number generator because numpy.random.zipf function doesn't achieve what I need. Firstly, its alpha must be greater than 1.0 and I need an alpha of 0.5. Secondly, its cardinality is directly related to the sample size and I need to make more samples than the cardinality, e.g. make a list of 1000 elements from a Zipfian distribution of only 6 unique values.

@stanga posted a great solution to this.

import random 
import bisect 
import math 

class ZipfGenerator: 

    def __init__(self, n, alpha): 
        # Calculate Zeta values from 1 to n: 
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) 

        # Store the translation map: 
        self.distMap = [x / zeta[-1] for x in zeta] 

    def next(self): 
        # Take a uniform 0-1 pseudo-random value: 
        u = random.random()  

        # Translate the Zipf variable: 
        return bisect.bisect(self.distMap, u) - 1

The alpha can be less than 1.0 and the sampling can be infinite for a fixed cardinality n. The problem is that it runs too slow.

# Calculate Zeta values from 1 to n: 
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

These two lines are the culprits. When I choose n=50000 I can generate my list in ~10 seconds. I need to execute this when n=5000000 but it's not feasible. I don't fully understand why this is performing so slow because (I think) it has linear complexity and the floating point operations seem simple. I am using Python 2.6.6 on a good server.

Is there an optimization I can make or a different solution altogether that meet my requirements?




Aucun commentaire:

Enregistrer un commentaire