mardi 25 mai 2021

Custom numpy (or scipy?) probability distribution for random number generation

The issue

Tl;dr: I would like a function that randomly returns a float (or optionally an ndarray of floats) in an interval following a probability distribution that resembles the sum of a "Gaussian" and a uniform distributions.

The function (or class) - let's say custom_distr() - should have as inputs (with default values already given):

  • the lower and upper bounds of the interval: low=0.0, high=1.0
  • the mean and standard deviation parameters of the "Gaussian": loc=0.5, scale=0.02
  • the size of the output: size=None
  • size can be an integer or a tuple of integers. If so, then loc and scale can either both simultaneously be scalars, or ndarrays whose shape corresponds to size.

The output is a scalar or an ndarray, depending on size.

The output has to be scaled to certify that the cumulative distribution is equal to 1 (I'm uncertain how to do this).

Note that I'm following numpy.random.Generator's naming convention from uniform and normal distributions as reference, but the nomenclature and the utilized packages does not really matter to me.

What I've tried

Since I couldn't find a way to "add" numpy.random.Generator's uniform and Gaussian distributions directly, I've tried using scipy.stats.rv_continuous subclassing, but I'm stuck at how to define the _rvs method, or the _ppf method to make it fast.

From what I've understood of rv_continuous class definition in Github, _rvs uses numpy's random.RandomState (which is out of date in comparison to random.Generator) to make the distributions. This seems to defeat the purpose of using scipy.stats.rv_continuous subclassing.

Another option would be to define _ppf, the percent-point function of my custom distribution, since according to rv_generic class definition in Github, the default function _rvs uses _ppf. But I'm having trouble defining this function by hand.

Following, there is a MWE, tested using low=0.0, high=1.0, loc=0.3 and scale=0.02. The names are different than the "The issue" section, because terminologies of terms are different between numpy and scipy.

import numpy as np
from scipy.stats import rv_continuous
import scipy.special as sc
import matplotlib.pyplot as plt
import time


# The class definition
class custom_distr(rv_continuous):
    def __init__(self, my_loc=0.5, my_scale=0.5, a=0.0, b=1.0, *args, **kwargs):
        super(custom_distr, self).__init__(a, b, *args, **kwargs)
        self.a = a
        self.b = b
        self.my_loc = my_loc
        self.my_scale = my_scale

    def _pdf(self, x):
        # uniform distribution
        aux = 1/(self.b-self.a)
        # gaussian distribution
        aux += 1/np.sqrt(2*np.pi*self.my_scale**2) * \
                 np.exp(-(x-self.my_loc)**2/2/self.my_scale**2)
        return aux/2  # divide by 2?

    def _cdf(self, x):
        # uniform distribution
        aux = (x-self.a)/(self.b-self.a)
        # gaussian distribution
        aux += 0.5*(1+sc.erf((x-self.my_loc)/(self.my_scale*np.sqrt(2))))
        return aux/2  # divide by 2?


# Testing the class
if __name__ == "__main__":
    my_cust_distr = custom_distr(name="my_dist", my_loc=0.3, my_scale=0.02)

    x = np.linspace(0.0, 1.0, 10000)

    start_t = time.time()
    the_pdf = my_cust_distr.pdf(x)
    print("PDF calc time: {:4.4f}".format(time.time()-start_t))
    plt.plot(x, the_pdf, label='pdf')

    start_t = time.time()
    the_cdf = my_cust_distr.cdf(x)
    print("CDF calc time: {:4.4f}".format(time.time()-start_t))
    plt.plot(x, the_cdf, 'r', alpha=0.8, label='cdf')

    # Get 10000 random values according to the custom distribution
    start_t = time.time()
    r = my_cust_distr.rvs(size=10000)
    print("RVS calc time: {:4.4f}".format(time.time()-start_t))

    plt.hist(r, density=True, histtype='stepfilled', alpha=0.3, bins=40)

    plt.ylim([0.0, the_pdf.max()])
    plt.grid(which='both')
    plt.legend()

    print("Maximum of CDF is: {:2.1f}".format(the_cdf[-1]))

    plt.show()

The generated image is: enter image description here

The output is:

PDF calc time: 0.0010
CDF calc time: 0.0010
RVS calc time: 11.1120
Maximum of CDF is: 1.0

The time computing the RVS method is too slow in my approach.




Aucun commentaire:

Enregistrer un commentaire