Wednesday, 26 April 2017

Combining two samples from numpy.random does not produce a random sequence

I implemented the Wald-Wolfowitz runs test in Python, but during testing I ran into behaviour I cannot explain. The steps I take are the following:

  1. I take two samples from the same distribution:
    import numpy as np
    list_dist_A = np.random.chisquare(2, 1000)
    list_dist_B = np.random.chisquare(2, 1000)

  2. I concatenate the two samples and sort the result, keeping track of which sample each value came from. The following function does that and returns the list of labels in sorted order, e.g. ["A", "B", "A", "A", ..., "B"]:
    def _get_runs_list(list1, list2):
        # Tag every value with the sample it came from
        l1 = [(x, "A") for x in list1]
        l2 = [(x, "B") for x in list2]
        # Pool the two labelled samples
        lst = l1 + l2
        # Sort by value only
        sorted_list = sorted(lst, key=lambda x: x[0])
        # Return only the labels, in sorted order
        return [label for _, label in sorted_list]
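
For what it's worth, the same labelling step can also be sketched with NumPy directly. This is only an illustration: `labels_by_sorted_value` is a hypothetical name I made up and is not part of the code above, while `np.concatenate` and `np.argsort` are ordinary NumPy calls.

```python
import numpy as np

def labels_by_sorted_value(sample_a, sample_b):
    """Return the "A"/"B" labels of the pooled samples, ordered by value."""
    # Pool the values and build a parallel array of labels
    values = np.concatenate([sample_a, sample_b])
    labels = np.array(["A"] * len(sample_a) + ["B"] * len(sample_b))
    # argsort gives the permutation that sorts the pooled values
    order = np.argsort(values, kind="stable")
    return labels[order].tolist()

print(labels_by_sorted_value([0.5, 2.0, 3.1], [0.1, 2.5]))
# ['B', 'A', 'A', 'B', 'A']
```

On the small input shown, the pooled values sort to 0.1 (B), 0.5 (A), 2.0 (A), 2.5 (B), 3.1 (A), which is where the printed labels come from.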

  3. Now I want to calculate the number of runs, where a run is a maximal consecutive sequence of identical labels, e.g.:

    • a,b,a,b has 4 runs
    • a,a,a,b,b has 2 runs
    • a,b,b,b,a,a has 3 runs

    For this I use the following code:

    def _calculate_nruns(labels):
        nruns = 0
        last_seen = None

        for label in labels:
            if label != last_seen:
                nruns += 1
            last_seen = label

        return nruns
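
As a sanity check on the run counting itself, the three examples above can be reproduced with `itertools.groupby` from the standard library. This is just a cross-check under my own naming (`count_runs` is a hypothetical helper), not a replacement for the function above.

```python
from itertools import groupby

def count_runs(labels):
    # groupby yields one group per maximal block of equal consecutive items,
    # so the number of groups is the number of runs
    return sum(1 for _ in groupby(labels))

print(count_runs("abab"))    # 4
print(count_runs("aaabb"))   # 2
print(count_runs("abbbaa"))  # 3
```

Both approaches give the same counts on these inputs, so the counting step does not look like the source of the discrepancy.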

Since all elements are drawn at random from the same distribution, I expected to end up with a sequence that roughly alternates (a,b,a,b,a,b...), which would mean the number of runs is roughly 2000. However, as can be seen on repl.it, this is not the case: the number of runs is always around 1000. Can anyone explain why?
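
For reference, the textbook mean of the number of runs under the null hypothesis of the Wald-Wolfowitz test is 2·n1·n2/(n1 + n2) + 1, which for n1 = n2 = 1000 is 1001 rather than 2000. Informally, in a random interleaving each neighbouring pair has different labels only about half of the time, so a strict a,b,a,b alternation is the most regular arrangement possible, not the typical one. Below is a minimal sketch that compares the formula against a simulation on shuffled labels. The helper names, the number of trials and the seed are my own assumptions, not part of the code above.

```python
import numpy as np

def expected_runs(n_a, n_b):
    # Mean number of runs when the labels are in random order
    return 2.0 * n_a * n_b / (n_a + n_b) + 1.0

def simulate_mean_runs(n_a, n_b, n_trials=200, seed=0):
    # n_trials and seed are arbitrary choices for this illustration
    rng = np.random.default_rng(seed)
    labels = np.array([0] * n_a + [1] * n_b)
    counts = []
    for _ in range(n_trials):
        shuffled = rng.permutation(labels)
        # A new run starts wherever two neighbouring labels differ
        boundaries = np.count_nonzero(shuffled[1:] != shuffled[:-1])
        counts.append(1 + int(boundaries))
    return float(np.mean(counts))

print(expected_runs(1000, 1000))  # 1001.0
print(simulate_mean_runs(1000, 1000))
```

The simulated mean should land close to the value from the formula, which is in line with the roughly 1000 runs observed above.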



