Archive:
Subtopics:
Comments disabled |
Tue, 13 Feb 2018 (If you already know about reservoir sampling, just skip to the good part.) The basic reservoir sampling algorithm asks us to select a random item from a list, easy peasy, except:
Maybe the items are being read from a pipe or some other lazy data structure. There might be zillions of them, so we can't simply load them into an array. Obviously something like this doesn't work:
because it doesn't select the items with equal probability. Far from it! The last item is selected as often as all the preceding items put together. The requirements may seem at first impossible to satisfy, but it can be done and it's not even difficult:
The The good partLast week I thought of a different simple variation. Suppose each item !!s_i!! is presented along with an arbitrary non-negative weight !!w_i!!, measuring the relative likelihood of its being selected for the output. For example, an item with weight 6 should be selected twice as often as an item with weight 3, and three times as often as an item with weight 2. The total weight is !!W = \sum w_i!! and at the end, whenever that is, we want to have selected each item !!s_i!! with probability !!\frac{w_i}{W}!!:
The correctness proof is almost the same. Clearly this reduces to the standard algorithm when all the weights are equal. This isn't a major change, but it seems useful and I hadn't seen it before. [Other articles in category /prog] permanent link |