The Universe of Discourse

Fri, 09 Mar 2007

Bernoulli processes
A family has four children. Assume that the sexes of the four children are independent, and that boys and girls are equiprobable. What's the most likely distribution of boys and girls?

Well,it depends how you count. Are there three possibilities or five?

All four the same
Three the same, one different
Two-and-two

Four boys, no girls
Three boys, one girl
Two boys, two girls
One boy, three girls
No boys, four girls

If we group outcomes into five categories, as in the pink division on the right, the most likely distribution is two-and-two, as you would probably guess:

Boys Girls Probability
0 4 0.0625
1 3 0.25
2 2 0.375
3 1 0.25
4 0 0.0625

This distribution is depicted in the graph at right. Individually, (3, 1) and (1, 3) are less likely than (2, 2). But "three-and-one" includes both (1, 3) and (3, 1), whereas "two-and-two" includes only (2, 2). So if you group outcomes into three categories, as in the green division above left, "three-and-one" comes out more frequent overall than "two-and-two":

One sex	The other	Total probability
4	0	0.125
3	1	0.5
2	2	0.375

It makes a difference whether you specify the sexes in the distribution. If a "distribution" is a thing like "b of the children are boys and g are girls", then the most frequent distribution is (2, 2). But if a distribution is "x of one sex and y of the other", then the most frequent distribution [3, 1], where I've used square brackets to show that the order is not important. [3, 1] is the same as [1, 3].

This is true in general. Suppose someone has 1,000 kids. What's the most likely distribution of sexes? It's 500 boys and 500 girls, which I've been writing (500, 500). This is more likely than either (499, 501) or (501, 499). But if you consider "Equal numbers" versus "501-to-499", which I've been writing as [500, 500] and [501, 499], then [501, 499] wins:

Boys Girls Probability
501 499 0.02517
500 500 0.02522
499 501 0.02517

One sex	The other	Total probability
501	499	0.05035
500	500	0.02522

For odd numbers of kids, this anomaly doesn't occur, because there's no symmetric value like [500, 500] to get shorted.

Distribution Number
of hands Frequency
[4, 4, 3, 2] 10810800 0.16109347
[5, 4, 3, 1] 8648640 0.12887478
[5, 3, 3, 2] 8648640 0.12887478
[5, 4, 2, 2] 6486480 0.09665608
[4, 3, 3, 3] 4804800 0.07159710
[6, 4, 2, 1] 4324320 0.06443739
[6, 3, 2, 2] 4324320 0.06443739
[6, 3, 3, 1] 2882880 0.04295826
[5, 5, 2, 1] 2594592 0.03866243
[7, 3, 2, 1] 2471040 0.03682137
[4, 4, 4, 1] 1801800 0.02684891
[6, 4, 3, 0] 1441440 0.02147913
[5, 4, 4, 0] 1081080 0.01610935
[6, 5, 2, 0] 864864 0.01288748
[6, 5, 1, 1] 864864 0.01288748
[5, 5, 3, 0] 864864 0.01288748
[7, 4, 2, 0] 617760 0.00920534
[7, 4, 1, 1] 617760 0.00920534
[7, 2, 2, 2] 617760 0.00920534
[8, 2, 2, 1] 463320 0.00690401
[7, 3, 3, 0] 411840 0.00613689
[8, 3, 2, 0] 308880 0.00460267
[8, 3, 1, 1] 308880 0.00460267
[7, 5, 1, 0] 247104 0.00368214
[8, 4, 1, 0] 154440 0.00230134
[6, 6, 1, 0] 144144 0.00214791
[9, 2, 1, 1] 102960 0.00153422
[9, 3, 1, 0] 68640 0.00102282
[9, 2, 2, 0] 51480 0.00076711
[10, 2, 1, 0] 20592 0.00030684
[7, 6, 0, 0] 20592 0.00030684
[8, 5, 0, 0] 15444 0.00023013
[9, 4, 0, 0] 8580 0.00012785
[10, 1, 1, 1] 6864 0.00010228
[10, 3, 0, 0] 3432 0.00005114
[11, 1, 1, 0] 1872 0.00002789
[11, 2, 0, 0] 936 0.00001395
[12, 1, 0, 0] 156 0.00000232
[13, 0, 0, 0] 4 0.00000006

Similar behavior appears in related problems. What's the most likely distribution of suits in a bridge hand? People often guess (4, 3, 3, 3), and this is indeed the most likely distribution of particular suits. That is, if you consider distributions of the form "a hearts, b spades, c diamonds, and d clubs", then (4, 3, 3, 3) gives the most likely distribution. (The distributions (3, 4, 3, 3), (3, 3, 4, 3), and (3, 3, 3, 4) are of course equally frequent.) But if distributions have the form "a cards of one suit, b of another, c of another, and d of the fourth"—which is what is usually meant by a suit distribution in a bridge hand—then [4, 4, 3, 2] is the most likely distribution, and [4, 3, 3, 3] is in fifth place.

Why is this? [4, 3, 3, 3] covers the four most frequent distributions: (4, 3, 3, 3), (3, 4, 3, 3), (3, 3, 4, 3), and (3, 3, 3, 4). But [4, 4, 3, 2] covers twelve quite frequent distributions: (4, 4, 3, 2), (4, 3, 2, 4), and so on. Even though the individual distributions aren't as common as (4, 4, 4, 3), there are twelve of them instead of 4. This gives [4, 4, 3, 2] the edge.

[5, 4, 3, 1] includes 24 distributions, and ends up tied for second place. A complete table is in the sidebar at left.

(For 5-card poker hands, the situation is much simpler. [2, 2, 1, 0] is most common, followed by [2, 1, 1, 1] and [3, 1, 1, 0] (tied), then [3, 2, 0, 0], [4, 1, 0, 0], and [5, 0, 0, 0].)

This same issue arose in my recent article on Yahtzee roll probabilities. There we had six "suits", which represented the six possible rolls of a die, and I asked how frequent each distribution of "suits" was when five dice were rolled. For distribution [p₁, p₂, ...], we let n_i be the number of p's that are equal to i. Then the expression for probability of the distribution has a factor of $\prod {n_i}!$ in the denominator, with the result that distributions with a lot of equal-sized parts tend to appear less frequently than you might otherwise expect.

I'm not sure how I got so deep into this end of the subject, since I didn't really want to compare complex distributions to each other so much as to compare simple distributions under different conditions. I had originally planned to discuss the World Series, which is a best-four-of-seven series of baseball games that we play here in the U.S. and sometimes in that other country to the north. Sometimes one team wins four games in a row ("sweeps"); other times the Series runs the full seven games.

You might expect that even splits would tend to occur when the two teams playing were evenly matched, but that when one team was much better than the other, the outcome would be more likely to be a sweep. Indeed, this is generally so. The chart below graphs the possible outcomes. The x-axis represents the probability of the Philadelphia Phillies winning any individual game. The y-axis is the probability that the Phillies win the entire series (red line), which in turn is the sum of four possible events: the Phillies win in 4 games (green), in 5 games (dark blue), in 6 games (light blue), or in 7 games (magenta). The probabilities of the Nameless Opponents winning are not shown, because they are exactly the opposite. (That is, you just flip the whole chart horizontally.)

(The Opponents are a semi-professional team that hails from Nameless, Tennessee.)

Clearly, the Phillies have a greater-than-even chance of winning the Series if and only if they have a greater-than-even chance of winning each game. If they are playing a better team, they are likely to lose, but if they do win they are most likely to do so in 6 or 7 games. A sweep is the most likely outcome only if the Opponents are seriously overmatched, and have a less than 25% chance of winning each game. (The lines for the 4-a outcome and the 4-b outcome cross at 1-(p_a / p_b)^1/(b-a), where p_i is 1, 4, 10, 20 for i = 0, 1, 2, 3.)

If we consider just the first four games of the World Series, there are five possible outcomes, ranging from a Phillies sweep, through a two-and-two split, to an Opponents sweep. Let p be the probability of the Phillies winning any single game. As p increases, so does the likelihood of a Phillies sweep. The chart below plots the likelihood of each of the five possible outcomes, for various values of p, charted here on the horizontal axis:

The leftmost red curve is the probability of an Opponents sweep; the red curve on the right is the probability of a Phillies sweep. The green curves are the probabilities of 3-1 outcomes favoring the Opponents and the Phillies, respectively, with the Phillies on the right as before. The middle curve, in dark blue, is the probability of a 2-2 split.

When is the 2-2 split the most likely outcome? Only when the Phillies and the Opponents are approximately evenly matched, with neither team no more than 60% likely to win any game.

But just as with the sexes of the four kids, we get a different result if we consider the outcomes that don't distinguish the teams. For the first four games of the World Series, there are only three outcomes: a sweep (which we've been writing [4, 0]), a [3, 1] split, and a [2, 2] split:

Here the green lines in the earlier chart have merged into a single outcome; similarly the red lines have merged. As you can see from the new chart, there is no pair of teams for which a [2, 2] split predominates; the even split is buried. When one team is grossly overmatched, winning less than about 19% of its games, a sweep is the most likely outcome; otherwise, a [3, 1] split is most likely.

Here are the corresponding charts for series of various lengths.

Series length
(games) Distinguish teams Don't
distinguish teams
2
3
4
5
6
7
8
9
10

I have no particular conclusion to announce about this; I just thought that the charts looked cool.

Coming later, maybe: reasoning backwards: if the Phillies sweep the World Series, what can we conclude about the likelihood that they are a much better team than the Opponents? (My suspicion is that you can conclude a lot more by looking at the runs scored and runs allowed totals.)

(Incidentally, baseball players get a share of the ticket money for World Series games, but only for the first four games. Otherwise, they could have an an incentive to prolong the series by playing less well than they could, which is counter to the ideals of sport. I find this sort of rule, which is designed to prevent conflicts of interest, deeply satisfying.)

[Other articles in category /math] permanent link