The Universe of Disco


Tue, 13 Feb 2018

Weighted Reservoir Sampling

(If you already know about reservoir sampling, just skip to the good part.)

The basic reservoir sampling algorithm asks us to select a random item from a list, easy peasy, except:

  1. Each item must be selected with equal probability
  2. We don't know ahead of time how big the list is
  3. We may only make one pass over the list
  4. We may use only constant memory

Maybe the items are being read from a pipe or some other lazy data structure. There might be zillions of them, so we can't simply load them into an array. Obviously something like this doesn't work:

# Python
from random import random
selected = inputs.next()
for item in inputs:
    if random() < 0.5:
        selected = item

because it doesn't select the items with equal probability. Far from it! The last item is selected as often as all the preceding items put together.

The requirements may seem at first impossible to satisfy, but it can be done and it's not even difficult:

from random import random
n = 0
selected = None

for item in inputs:
    n += 1
    if random() < 1/n:
        selected = item

The inputs here is some sort of generator that presents the list of items, one at a time. After the loop completes, the selected item is in selected. A proof that this selects each item equiprobably is left as an easy exercise, or see this math StackExchange post. A variation for selecting !!k!! items instead of only one is quite easy.

The good part

Last week I thought of a different simple variation. Suppose each item !!s_i!! is presented along with an arbitrary non-negative weight !!w_i!!, measuring the relative likelihood of its being selected for the output. For example, an item with weight 6 should be selected twice as often as an item with weight 3, and three times as often as an item with weight 2.

The total weight is !!W = \sum w_i!! and at the end, whenever that is, we want to have selected each item !!s_i!! with probability !!\frac{w_i}{W}!!:

total_weight = 0
selected = None

for item, weight in inputs:
    if weight == 0: continue
    total += weight
    if random() < weight/total:
        selected = item

The correctness proof is almost the same. Clearly this reduces to the standard algorithm when all the weights are equal.

This isn't a major change, but it seems useful and I hadn't seen it before.


[Other articles in category /prog] permanent link