Sampling from a huge data file

The size of data we are handling is growing rapidly. We now have to handle  millions of reads from the Illumina sequencers or millions of trees samples from MCMC runs.

Probably, you have experienced difficulty of sampling a small number of elements from a very large data set. For example, you may have wanted to sample one hundred sequence reads from a huge Illumina fastq file. You cannot load all reads on the memory. So, normal functions to sample elements from a list is not feasible in this situation.

A naive approach is iterating over a file and deciding if an element is sampled with a fixed probability. However, this can not return a fixed number of samples. Also, you need to know the total number of elements beforehand, which requires the iteration of file twice.

After some literature search and googling, I found an elegant algorithm called “reservoir sampling” solves this problem. (If you already know the algorithm, you don’t have to read this post.)

The reservoir sampling algorithm for sampling n elements from N elements is described like this.

1. Iterate over each element.
2. For i-th element, If i is smaller than or equal to n, put it in the reservoir.
3. If i is larger than n, draw a random number, r, between 0 and i.
4. If r is smaller than n, randomly replace an element in the reservoir with the i-th element.
5. Continue until i = N.

That’s it. If I write it in Python, the code has only 19 lines including spaces. (and can be reduced more.)

import sys
import random

def reservoir_sample(stream, n):
res = []

for i, el in enumerate(stream):
if i <= n:
res.append(el)
else:
rand = random.sample(xrange(i), 1)[0]
if rand < n:
res[random.sample(xrange(n), 1)[0]] = el
return res

if __name__=="__main__":
samp = reservoir_sample(sys.stdin, 10)
for s in samp:
sys.stdout.write(s)


This code randomly samples 10 lines from standard input.

python reservorir_sample.py < file.txt


If you replace sys.stdin with other types of stream, for example, a file stream or an iterator of sequences from SeqIO.parse in BioPython, it randomly samples 10 elements from them. Because file iterators of Python do not load whole file onto the memory, you do not need large memories to complete sampling as far as the size of reservoir is small. In addition, you do not need to know how many elements there are in the stream.

I found this algorithm truly useful. But, why does it work?

Let’s consider a list with N elements, from which we sample n elements.

Probability that i-th element is sampled must be a product of probability that i-th element is put into reservoir and probability that i-th element in the reservoir is not replaced by other elements sampled subsequently.

When i = N, the probability that the N-th element is sampled is simply,

$P_{i=N} = \frac{n}{N}$

, because i = N and there is no subsequent sampling.

When i = N-1, the probability is a product of the N-1-th element is chosen and it is not replaced by the N-th element in the reservoir.

$P_{i=N-1} = \frac{n}{N-1} \cdot (1-\frac{n}{N} \cdot \frac{1}{n}) = \frac{n}{N-1} \frac{N-1}{N} = \frac{n}{N}$

The probability that N-1-th element is put in reservoir is clearly n/(N-1), and that the element in reservoir is not replaced is 1-n/N*1/n because the N-th element is chosen with n/N and it replaces N-1-th element with probability 1/n.

It appears that a denominator is reduced by a numerator in the following term.
Similar arguments go down to N-2, N-3, and so forth to the i-th element.

$P_{N-i} = \frac{n}{i} \cdot (1-\frac{n}{i+1} \cdot \frac{1}{n}) \cdot (1-\frac{n}{i+2} \cdot \frac{1}{n}) \cdots (1-\frac{n}{N} \cdot \frac{1}{n}) \\=\frac{n}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{N-1}{N} = \frac{n}{N}$

Great. Probability for each element to be sampled is always n/N.

Thanks to this clever algorithm, now I can save memory and time a lot.