The Universe of Discourse

Mark Dominus (陶敏修)
mjd@pobox.com

12 recent entries

A puzzle about balancing test tubes in a centrifuge
Proof by insufficient information
Willie Singletary will you please go now?
How our toy octopuses got revenge on a Philadelphia traffic court judge
Does someone really have to do the dirty jobs?
The mathematical past is a foreign country
Baseball on the Moon
Hangeul sign-engraving machine
Claude and Merle Miller let me down
Reflector grids
Jonathan Chait
Claude chokes on graph theory

Archive:

2025: JF M A M
2024: JF M A M J
J ASOND
2023: JF M A M J
J A S O N D
2022: J F M A M J
JAS O N D
2021: J F M AMJ
J A S O N D
2020: J F M A M J
J A S O N D
2019: JFM A M J
J A S O N D
2018: J F M A M J
J A S O N D
2017: J F M A M J
J A S O N D
2016: JF M A M J
JASON D
2015: JFM A M J
J A S O N D
2014: J F M AMJ
JASON D
2013: JFMAMJ
JAS OND
2012: J F MAMJ
JASOND
2011: JFMAM J
JASOND
2010: JFMAMJ
JA S O ND
2009: J F MAM J
JASOND
2008: J F M A M J
JAS O ND
2007: J F M A M J
J A S O N D
2006: J F M A M J
JAS O N D
2005: O N D

Subtopics:

Mathematics 245

Programming 99

Language 95

Miscellaneous 75

Book 50

Tech 49

Etymology 35

Haskell 33

Oops 30

Unix 27

Cosmic Call 25

Math SE 25

Law 22

Physics 21

Perl 17

Biology 15

Brain 15

Calendar 15

Food 15

Comments disabled

Tue, 13 Feb 2018

Weighted Reservoir Sampling

(If you already know about reservoir sampling, just skip to the good part.)

The basic reservoir sampling algorithm asks us to select a random item from a list, easy peasy, except:

Each item must be selected with equal probability
We don't know ahead of time how big the list is
We may only make one pass over the list
We may use only constant memory

Maybe the items are being read from a pipe or some other lazy data structure. There might be zillions of them, so we can't simply load them into an array. Obviously something like this doesn't work:

# Python
from random import random
selected = inputs.next()
for item in inputs:
    if random() < 0.5:
        selected = item

because it doesn't select the items with equal probability. Far from it! The last item is selected as often as all the preceding items put together.

The requirements may seem at first impossible to satisfy, but it can be done and it's not even difficult:

from random import random
n = 0
selected = None

for item in inputs:
    n += 1
    if random() < 1/n:
        selected = item

The inputs here is some sort of generator that presents the list of items, one at a time. After the loop completes, the selected item is in selected. A proof that this selects each item equiprobably is left as an easy exercise, or see this math StackExchange post. A variation for selecting !!k!! items instead of only one is quite easy.

The good part

Last week I thought of a different simple variation. Suppose each item !!s_i!! is presented along with an arbitrary non-negative weight !!w_i!!, measuring the relative likelihood of its being selected for the output. For example, an item with weight 6 should be selected twice as often as an item with weight 3, and three times as often as an item with weight 2.

The total weight is !!W = \sum w_i!! and at the end, whenever that is, we want to have selected each item !!s_i!! with probability !!\frac{w_i}{W}!!:

total_weight = 0
selected = None

for item, weight in inputs:
    if weight == 0: continue
    total += weight
    if random() < weight/total:
        selected = item

The correctness proof is almost the same. Clearly this reduces to the standard algorithm when all the weights are equal.

This isn't a major change, but it seems useful and I hadn't seen it before.

[Other articles in category /prog] permanent link