Thursday, June 17, 2010

Reservoir Sampling - K samples from an infinite stream

Question;
You have a stream of infinite queries (ie Google searches). Describe how you would find K uniformly distributed samples and write code for it.

Reservoir Sampling is an algorithm designed to take K samples from a stream of samples of unknown size in linear time.

How it works.
The idea is simple. Firstly note that until the samples appear you don't know the total number of samples. To avoid this problem you fill up a reservoir of samples and then adjust it as each new sample appears.

First record samples until you have filled the reservoir(K samples). Once you have the first K samples the probability of each sample being in the final result was K/K or 100%.

Now when the K+i sample appears the probability of the new sample appearing in one _PARTICULAR_ output slot becomes 1/K+i. Since there are K potential output slots there is K/(K+i) chance of it being in the output and i/(K+i) chance that it wont be.

So choosing whether or not to add the new sample is easy we toss the dice. If we decide that the new sample is in the output then one of the existing items needs to be removed. Since all the items that where added to output and have equal chance of being present they also have equal chance of being removed. Hence we can simply choose to remove one of the existing samples with a 1/K probability.

This make sense logically but how do we prove it.
  • Assuming the prior stage worked correctly each element in the output should have K/(K+i-1) chance of being present.
  • There is i/(K+i) chance that prior output will be unchanged this round. Hence each element has Ki/(K+i)(K+i-1) chance of being output if the unchanging choice is made for this stage.
  • There is K/(K+i) chance that this something in the output will change
    • And for each element there is (K-1)/K chance it will not be replaced thus a total of (K-1)/K * K/(K+i) * K/(K+i-1) chance it will not be removed.
    • So conversely there is a 1/K * K/(K+i) * K/(K+i-1) chance that it will be replaced.

Hence the net chance of the existing elements making it to the next round is the sum of the chance that nothing changed with the chance that it was not the one that changed.
= Ki/(K+i)(K+i-1) + (K-1)K/(K+i-1)(K+i)
=> (Ki + (K-1)K)/(K+i-1)(K+i)
=> K(i+K-1)/(K+i-1)(K+i)
=> K/(K+i)
Q.E.D.

And of course the chance of the new element making to the output was K/(K+i). As you can see it has nicely balanced out for the possible output items for this stage.

BUT HERE IS THE CATCH.. Note that the mathematics/algorithm considered only the choice between output and non-output state of the samples. IT MADE NO STATEMENT AS TO THE OUTPUTS ORDERING. Hence the output samples need to be randomly shuffled in the final stages of the algorithm or there will be an issue with the ordering of the samples.

The code:
//compile using g++
#include <iostream>
using namespace std;

int reservoirSample(int sample, int* samples, int size, int count)
{
  if(count < size)
    samples[count] = sample;
  else
    if((rand()%count) < size)
      samples[rand()%size] = sample;
  
  return ++count;
}

#define SIZE 10
// #define STREAM_AVERAGE 300
#define STREAM_AVERAGE 10
int main()
{
  int count = 0;
  int samples[SIZE];
  int sample;
  int i = 0;

  srand(time(NULL));

  cout << "Sample Stream: " << endl;
  while(
        (count < SIZE) || 
        (rand()%STREAM_AVERAGE > 0)
        )
    {
      sample = rand()%1000;
      cout << sample << " ";

      count = reservoirSample(sample, samples, SIZE, count);
    }
  cout << endl;
  cout << "Total samples: " << count << endl;

  cout << "Output samples: " << endl;
  for(i = 0;i < SIZE;i++)
    cout << samples[i] << " ";
  cout << endl;
}

7 comments:

  1. Very Nice and correct explanation...thank you
    just a small query..in ur explanation:
    And for each element there is (K-1)/K chance it will not be replaced thus a total of
    a: (K-1)/K * K/(K+i) * K/(K+i-1) chance it will not be removed.
    b: So conversely there is a 1/K * K/(K+i) * K/(K+i-1) chance that it will be replaced.

    If I look at point 'b'...it is just a reverse of point 'a'.If that so...shouldn't be the probability of 'b' = 1 - prob(a)...that is:
    prob(b) = 1 - (k - 1)/k *k/(k+1)*k/(k + i -1 )

    Please reply.

    ReplyDelete
  2. Shwetank... Something is a miss with your sums you seem to have a (k+1) in the denominator instead of (K+i) i think that will untangle the problem for you.

    ReplyDelete
  3. @Ashley: Thanks for the reply.
    No, my confusion wasn't related with (k + i) [it is an editing mistake]. My real concern was that the probability of point 'b' = 1 - prob(a), but if we calculate the same then the result would not be equivalent to "1/K * K/(K+i) * K/(K+i-1)".
    But now it seems to me that in this probability the last two factors "K/(K+i) * K/(K+i-1)" are showing the probability of "the element exists in the reservoir till (i - 1) iterations and now the reservoir has to be modified", by multiplying by "1/k" into this factor it shows "the probability of an reservoir element to be removed".
    Correct me if I am wrong.

    ReplyDelete
  4. would you explain in the term of for loop (i< size). ?

    ReplyDelete
  5. The code is wrong. e.g. test with SIZE=1 and the data set has 2 points. The second sample is always kept. "if((rand()%count) < size)" is always true for count==size

    ReplyDelete
  6. Hi unknown, if SIZE=1 the "int samples[SIZE]" line would become "int samples[1] " accessing anything but the 0th index of such an array would be out of range

    ReplyDelete