I implemented the Wald-Wolfowitz runs test in Python, but during testing I encountered weird behaviour. The steps I take are the following:
- I take two samples from the same distribution:
import numpy as np
list_dist_A = np.random.chisquare(2, 1000)
list_dist_B = np.random.chisquare(2, 1000)
- I concatenate the two lists and sort them, while remembering which sample each number came from. The following function does that and returns a list of labels ["A","B","A","A", ... "B"]:
def _get_runs_list(list1, list2):
    # Add labels
    l1 = list(map(lambda x: (x, "A"), list1))
    l2 = list(map(lambda x: (x, "B"), list2))
    # Concatenate
    lst = l1 + l2
    # Sort by value
    sorted_list = sorted(lst, key=lambda x: x[0])
    # Return only the labels:
    return [l[1] for l in sorted_list]
- Now I want to calculate the number of runs (a run is a consecutive sequence of identical labels), e.g.:
  - a,b,a,b has 4 runs
  - a,a,a,b,b has 2 runs
  - a,b,b,b,a,a has 3 runs
  For this I use the following code:
def _calculate_nruns(labels):
    nruns = 0
    last_seen = None
    for label in labels:
        # A new run starts whenever the label changes
        if label != last_seen:
            nruns += 1
            last_seen = label
    return nruns
Since all elements are drawn at random, I expected the sorted sequence to look roughly like a,b,a,b,a,b..., which would mean the number of runs is close to 2000. However, as can be seen on repl.it, this is not the case: the number of runs is always around 1000. Can anyone explain why?
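One way to check the observation is to compare it with the textbook mean of the number of runs under the null hypothesis, 2*n1*n2/(n1 + n2) + 1, which gives 1001 for n1 = n2 = 1000. The sketch below repeats the experiment many times and prints both numbers. The helper names `count_runs` and `expected_runs`, the seed, and the 200 repetitions are arbitrary choices for illustration, and the run counting is a vectorised variant of the loop above rather than the original code.

```python
import numpy as np


def count_runs(sample_a, sample_b):
    """Count runs of sample labels after sorting the pooled values.

    Assumes the pooled sample contains at least one observation.
    """
    values = np.concatenate([sample_a, sample_b])
    # 0 marks sample A, 1 marks sample B
    labels = np.concatenate([
        np.zeros(len(sample_a), dtype=int),
        np.ones(len(sample_b), dtype=int),
    ])
    sorted_labels = labels[np.argsort(values, kind="stable")]
    # One run for the first element, plus one for every label change
    changes = np.count_nonzero(sorted_labels[1:] != sorted_labels[:-1])
    return 1 + int(changes)


def expected_runs(n_a, n_b):
    """Textbook mean number of runs when both samples share one distribution."""
    return 2 * n_a * n_b / (n_a + n_b) + 1


rng = np.random.default_rng(0)  # seed chosen arbitrarily
n = 1000
runs = [count_runs(rng.chisquare(2, n), rng.chisquare(2, n)) for _ in range(200)]

print("mean observed runs:", np.mean(runs))
print("expected under H0: ", expected_runs(n, n))
```

The intuition behind the formula is that two neighbouring elements of the sorted pooled sequence carry different labels only about half of the time, so only about half of the 1999 boundaries start a new run. A strictly alternating a,b,a,b,... sequence would be the maximum possible number of runs, not the typical one.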