A couple of years ago, I wrote a post about a method for sampling from a very large data file. The “reservoir sampling” can efficiently sample a fixed number of samples from a huge data file while you read through it only once. Now, the natural step forward is sampling *with replacement*.

I sometimes need to resample large data, like a huge alignment or SNP table, with replacement for statistical analysis, that is, I need to bootstrap them. Is there a way to efficiently bootstrap a huge file? This looks a lot more difficult than sampling a fixed number of samples without replacement because all items in a file stream must have chance to be sampled several times.

I did some google searches and found that this issue has been studied in various fields from traditional biostatistics to cutting-edge data science. As I expected, the methods for efficient bootstraping are more complicated than the reservoir sampling.

Let’s start with a simple approach. The following code does bootstrapping of lines in a file. You need to load all lines on the memory to resample lines, and bootstrapping will become memory-demanding when you process a data file with millions of lines.

import sys import numpy lines = [l for l in open(sys.argv[1], "r")] for i in numpy.random.randint(0, len(lines), size=len(lines)): sys.stdout.write(lines[i])

The memory usage will be slightly reduced if you use a multinomial distribution instead of simply sampling lines. This is justified because the number of occurrence of one entry in the bootstrap procedure follows multinomial distribution. This idea appears to have been proposed even in 1980s.

import sys import numpy size = sum((1 for line in open(sys.argv[1], "r"))) cnt = numpy.random.multinomial(size, [1.0/size]*size) for line, c in zip(open(sys.argv[1], "r"), cnt): for i in range(c): sys.stdout.write(line)

A clear problem of the code above is that it requires to read a file twice; it counts the number of lines first then bootstraps them following a multinomial distribution. Because it needs to read the file twice, it does not work on stream inputs.

Is there a way to bootstarp samples while you read a file only once? According to some literatures, an algorithm called “Poisson bootstrap” does this job. Instead of sampling with multinomial distribution of size=N (N is the total number of samples), you can approximate the number of occurrence of an item by Poisson distribution of lambda=1. This means that the bootstrap procedure is approximated by sampling each item *n* times while reading lines and drawing *n* from a Poisson distribution.

import sys import numpy for line in open(sys.argv[1], "r"): cnt = numpy.random.poisson(lam=1) if cnt == 0: continue else: for i in range(cnt): sys.stdout.write(line)

This simple, but powerful code (it’s even simpler than the reservoir sampling code) does the approximated bootstrap while reading a file only once.

I could not fully understand the mathematical justifications of this procedure, but if you are interested, there are several papers discussing about the properties of Poisson boot strap. For example, Hanely&MacBibbon(2006) or Chamandy et al. (2012). (I think the first paper is more approachable than the second one.)

A disadvantage of this code is you can not sample a fixed number of samples. Bootstrap replicates have different numbers of samples, which probably introduces extra variances to the bootstrap estimates. However, according to the papers above, the variation of sample size will be negligible when the sample size is huge.

I just generated bootstrap replicates of 100 and 10000 samples with three algorithms above, and checked how many times each sample is sampled. The three algorithms appears to be identical when the sample size is large (N=10000). (Dashed line is the expected number of count with Poisson distribution.)

while the variation is not quite negligible when sample size is 100.

This Poisson bootstrap procedure appears to be particularly useful when you need to resample very large data. It gets closer to the true bootstrap as you have a larger number of samples.