Monday, August 20, 2018

How to efficiently sample time series stream data

I am trying to work out how to efficiently sample every n-th data point from a stream, where n is dynamic: it changes over time and equals total data sent / buffer capacity.

I have a buffer that can hold 10,000 points, and the data source is a stream that keeps sending one point at a time. If 20,000 raw points have been sent in total, then the buffer should hold the points at indices 2, 4, 6, 8, 10, ..., 20,000, i.e. the 20,000 points scaled into 10,000 slots. The sampled indices do not need to be exactly 2, 4, 6, ...; the interval between two consecutive indices just needs to be about 2 on average.
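In other words, if the whole history were kept around, the buffer I want is what this sketch produces (downsample, history, and capacity are just illustrative names of my own):

def downsample(history, capacity=10_000):
    # dynamic stride n = total data sent / buffer capacity
    n = len(history) / capacity
    if n <= 1:
        return list(history)
    # pick one point per slot, about n indices apart on average
    return [history[int(i * n)] for i in range(capacity)]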

Because the total number of data points keeps growing, redoing the sampling every time a new point arrives is slow (it means picking 10,000 points out of N points on every arrival, with N increasing over time), as in the sketch below. So I would like to know: is there a better algorithm that reduces the computation while still keeping the sampling precise?
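For example, the straightforward version recomputes the whole buffer from the full history on every arrival (a sketch reusing the downsample() helper above):

history = []
buffer = []

def on_point(point):
    global buffer
    history.append(point)
    buffer = downsample(history)  # O(capacity) work for every single point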

I tried using probability to handle this problem. It can do something similar, but the kept points end up randomly spaced rather than evenly spaced, so the result is not precise, and I have no idea how to achieve my goal. Here is my attempt:

from random import randint

samplingCount = 10

reservoir1 = []
reservoir2 = []
reservoir3 = []
avg = []

def sample(arr, data, count):
    # count is the number of points seen so far in the stream
    if len(arr) < samplingCount:
        arr.append(data)
    # otherwise keep the new point with probability about samplingCount / count
    elif randint(0, count // samplingCount) == 0:
        # evict a random element (so the reservoir stays in time order)
        # and append the new point at the end
        index = randint(0, samplingCount - 1)
        del arr[index]
        arr.append(data)

for i in range(1, 1000):
    sample(reservoir1, i, i)
    sample(reservoir2, i, i)
    sample(reservoir3, i, i)

# average the three reservoirs element-wise to smooth out the randomness
for i in range(samplingCount):
    avg.append(int((reservoir1[i] + reservoir2[i] + reservoir3[i]) / 3))

print(avg)

Thanks.



