The Universe of Discourse

Fri, 12 Oct 2007

The square of the Catalan sequence
Yesterday I went to a talk by Val Tannen about his work on "provenance semirings".

The idea is that when you calculate derived data in a database, such as a view or a selection, you can simultaneously calculate exactly which input tuples contributed to each output tuple's presence in the output. Each input tuple is annotated with an identifier that says who was responsible for putting it there, and the output annotations are polynomials in these identifiers. (The complete paper is here.)

A simple example may make this a bit clearer. Suppose we have the following table R:
    R
    a  a
    a  b
    a  c
    b  c
    c  e
    d  e
We'll write R(p, q) when the tuple (p, q) appears in this table. Now consider the join of R with itself. That is, consider the relation S where S(x, z) is true whenever both R(x, y) and R(y, z) are true:

    S
    a  a
    a  b
    a  c
    a  e
    b  e
Now suppose you discover that the R(a, b) information is untrustworthy. What tuples of S are untrustworthy?

If you annotate the tuples of R with identifiers like this:

    R
    a  a   u
    a  b   v
    a  c   w
    b  c   x
    c  e   y
    d  e   z
then the algorithm in the paper calculates polynomials for the tuples of S like this:
    S
    a  a   u^2
    a  b   uv
    a  c   uw + xv
    a  e   wy
    b  e   xy
If you decide that R(a, b) is no good, you assign the value 0 to v, which reduces the S table to:

    S
    a  a   u^2
    a  b   0
    a  c   uw
    a  e   wy
    b  e   xy
So we see that tuple S(a, b) is no good any more, but S(a, c) is still okay, because it can be derived from u and w, which we still trust.

This assignment of polynomials generalizes a lot of earlier work on tuple annotation. For example, suppose each tuple in R is annotated with a probability of being correct. You can propagate the probabilities to S just by substituting the appropriate numbers for the variables in the polynomials. Or suppose each tuple in R might appear multiple times and is annotated with the number of times it appears. Then ditto.
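Here is a minimal sketch of that substitution idea in Haskell. Rather than building the polynomials symbolically, it evaluates them on the fly in whichever semiring you pick: booleans for trust, integers for multiplicity. The names joinWith, tableR, trustS, and countS are mine, not the paper's, and the multiplicities are made up for illustration.

```haskell
import qualified Data.Map.Strict as M

-- A relation whose tuples each carry an annotation of type k.
type Rel k = M.Map (Char, Char) k

-- S(x, z) collects R1(x, y) `times` R2(y, z) over every matching y.
-- `plus` combines alternative derivations; `times` combines joint use.
joinWith :: (k -> k -> k) -> (k -> k -> k) -> Rel k -> Rel k -> Rel k
joinWith plus times r1 r2 =
  M.fromListWith plus
    [ ((x, z), a `times` b)
    | ((x, y), a)  <- M.toList r1
    , ((y', z), b) <- M.toList r2
    , y == y'
    ]

-- The table R from the text; the values stand for u, v, w, x, y, z in order.
tableR :: [k] -> Rel k
tableR vals = M.fromList (zip keys vals)
  where
    keys = [('a','a'), ('a','b'), ('a','c'), ('b','c'), ('c','e'), ('d','e')]

-- Trust: v is False (R(a, b) is distrusted), everything else is True.
trustS :: Rel Bool
trustS = joinWith (||) (&&) r r
  where r = tableR [True, False, True, True, True, True]

-- Multiplicity (illustrative numbers): R(a, a) appears twice, the rest once.
countS :: Rel Integer
countS = joinWith (+) (*) r r
  where r = tableR [2, 1, 1, 1, 1, 1]
```

In GHCi, M.lookup ('a','b') trustS comes out Just False while M.lookup ('a','c') trustS is Just True, matching the v = 0 table above; M.lookup ('a','c') countS is Just 3, which is uw + xv with u = 2 and the rest 1.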

If your queries are recursive, then the polynomials might be infinite. For example, suppose you are calculating the transitive closure T of relation R. This is like the previous example, except that instead of having S(x, z) = R(x, y) and R(y, z), we have T(x, z) = R(x, z) or (T(x, y) and R(y, z)). This is a recursive equation, so we need to do a fixpoint solution for it, using certain well-known techniques. The result in this example is:

    T
    a  a   u^+
    a  b   u^* v
    a  c   u^* (vx + w)
    a  e   u^* (vx + w) y
    b  c   x
    b  e   xy
    c  e   y
    d  e   z
In such a case there might be an infinite number of paths through R to derive the provenance of a certain tuple of T. In this example, R contains a loop, namely R(a, a), so there are an infinite number of derivations of some of the tuples in T, because you can go around the loop as many times as you like. u^+ here is an abbreviation for the infinite polynomial u + u^2 + u^3 + ...; u^* here is an abbreviation for 1 + u^+.

    Terms  Parenthesizations
    1      a
    2      (a + b)
    3      ((a + b) + c)
           (a + (b + c))
    4      (((a + b) + c) + d)
           ((a + (b + c)) + d)
           ((a + b) + (c + d))
           (a + ((b + c) + d))
           (a + (b + (c + d)))
    5      ((((a + b) + c) + d) + e)
           (((a + (b + c)) + d) + e)
           (((a + b) + (c + d)) + e)
           (((a + b) + c) + (d + e))
           ((a + ((b + c) + d)) + e)
           ((a + (b + (c + d))) + e)
           ((a + (b + c)) + (d + e))
           ((a + b) + ((c + d) + e))
           ((a + b) + (c + (d + e)))
           (a + (((b + c) + d) + e))
           (a + ((b + (c + d)) + e))
           (a + ((b + c) + (d + e)))
           (a + (b + ((c + d) + e)))
           (a + (b + (c + (d + e))))

In one example in the paper, the method produces a recursive relation of the form V = s + V^2, which can be solved by the same well-known techniques to come up with an (infinite) polynomial for V, namely V = 1 + s + 2s^2 + 5s^3 + 14s^4 + ... . Mathematicians will recognize the sequence 1, 1, 2, 5, 14, ... as the Catalan numbers, which come up almost as often as the better-known Fibonacci numbers. For example, the nth Catalan number counts the binary trees with n nodes; it also counts the ways of parenthesizing an expression with n+1 terms, as shown in the table above.

Anyway, in his talk, Val referred to the sequence as "bizarre", and I had to jump in to point out that it was not at all bizarre, it was the Catalan numbers, which are just what you would expect from a relation like V = s + V^2, blah blah, and he cut me off, because of course he knows all about the Catalan numbers. He only called them bizarre as a rhetorical flourish, meant to echo the presumed puzzlement of the undergraduates in the room.

(I never know how much of what kind of math to expect from computer science professors. Sometimes they know things I don't expect at all, and sometimes they don't know things that I expect everyone to know.

(This was indeed what was going on, and the professor seemed to think it was a surprising insight. I am not relating this boastfully, because I truly don't think it was a particularly inspired guess.

(Now that I think about it, maybe the answer here is that computer science professors know more about math than I expect, and less about computation.)

Anyway, I digress, and the whole article up to now was not really what I wanted to discuss anyway. What I wanted to discuss was that when I started blathering about Catalan numbers, Val said that if I knew so much about Catalan numbers, I should calculate the coefficient of the x^59 term in V^2, which also appeared as one of the annotations in his example.

So that's the puzzle: what is the coefficient of the x^59 term in V^2, where V = 1 + s + 2s^2 + 5s^3 + 14s^4 + ... ?

After I had thought about this for a couple of minutes, I realized that it was going to be much simpler than it first appeared, for two reasons.

The first thing that occurred to me was that the definition of multiplication of polynomials is that the coefficient of the x^n term in the product of A and B is Σ a_i b_{n-i}, summed over i from 0 to n. When A = B, this reduces to Σ a_i a_{n-i}. Now, it just so happens that the Catalan numbers obey the relation c_{n+1} = Σ c_i c_{n-i}, which is exactly the same form. Since the coefficients of V are the c_i, the coefficients of V^2 are going to have the form Σ c_i c_{n-i}, which is just the Catalan numbers again, but shifted up by one place.
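To make the coincidence concrete, here is a small sketch with lazy lists. The names convolve, catalans, and vSquared are mine, and convolve assumes both of its arguments are infinite lists.

```haskell
-- Coefficients of the product of two infinite power series:
-- entry n is the sum of a_i * b_(n-i) for i = 0..n.
-- Assumes both lists are infinite.
convolve :: Num a => [a] -> [a] -> [a]
convolve xs ys =
  [ sum (zipWith (*) (take (n + 1) xs) (reverse (take (n + 1) ys)))
  | n <- [0 ..]
  ]

-- Catalan numbers from c_0 = 1 and c_(n+1) = sum of c_i * c_(n-i).
catalans :: [Integer]
catalans = 1 : convolve catalans catalans

-- Coefficients of V^2, computed directly as a product of V with itself.
vSquared :: [Integer]
vSquared = convolve catalans catalans
```

In GHCi, take 5 vSquared and take 5 (tail catalans) both give [1,2,5,14,42]. That is no accident: the recurrence that defines the tail of catalans is the same expression as the product.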

The next thing I thought was that the Catalan numbers have a pretty simple generating function f(x). This just means that you pretend that the sequence V is a Taylor series, and figure out what function it is the Taylor series of, and use that as a shorthand for the whole series, ignoring all questions of convergence and other such analytic fusspottery. If V is the Taylor series for f(x), then V^2 is the Taylor series for f(x)^2. And if f has a compact representation, say as sin(x) or something, it might be much easier to square than the original V was. Since I knew in this case that the generating function is simple, this seemed likely to win. In fact the generating function of V is not sin(x) but (1-√(1-4x))/(2x). When you square this, you get almost the same thing back, which matches my prediction from the previous paragraph. This would have given me the right answer, but before I actually finished that calculation, I had an "oho" moment.

The generating function is known to satisfy the relation f(x) = 1 + x f(x)^2. This relation is where the (1-√(1-4x))/(2x) thing comes from in the first place; it is the function that satisfies that relation. (You can see this relation prefigured in the equation that Val had, with V = s + V^2. There the notation is a bit different, though.) We can just rearrange the terms here, putting the f(x)^2 by itself, and get f(x)^2 = (f(x)-1)/x.
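As a cross-check, squaring the closed form directly lands in the same place. This is only a sketch of the algebra, treating the series formally and ignoring convergence as before:

```latex
\begin{align*}
f(x)^2 &= \left(\frac{1-\sqrt{1-4x}}{2x}\right)^2
        = \frac{1 - 2\sqrt{1-4x} + (1-4x)}{4x^2}
        = \frac{1 - 2x - \sqrt{1-4x}}{2x^2}, \\
\frac{f(x)-1}{x} &= \frac{1}{x}\left(\frac{1-\sqrt{1-4x}}{2x} - 1\right)
        = \frac{1 - 2x - \sqrt{1-4x}}{2x^2}.
\end{align*}
```

The two right-hand sides agree, so the rearranged relation and the brute-force squaring give the same function.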

Now we are pretty much done, because f(x) = V = 1 + x + 2x^2 + 5x^3 + 14x^4 + ... , so f(x)-1 = x + 2x^2 + 5x^3 + 14x^4 + ..., and (f(x)-1)/x = 1 + 2x + 5x^2 + 14x^3 + ... . Lo and behold, the terms are the Catalan numbers again.

So the answer is that the coefficient of the x^59 term is just c_60, calculation of which is left as an exercise for the reader.
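For readers who would rather not do the exercise by hand, here is a sketch that leans on the standard closed form c_n = (2n)! / ((n+1)! n!). That formula is not derived in this article, so take it as an assumption; the names catalan and squareCoeff are mine.

```haskell
-- Closed form c_n = (2n)! / ((n+1)! n!).
-- product [n+2 .. 2n] equals (2n)! / (n+1)!, so the division below is exact.
catalan :: Integer -> Integer
catalan n = product [n + 2 .. 2 * n] `div` product [1 .. n]

-- The coefficient of x^k in V^2 is c_(k+1), by the shift argument above.
squareCoeff :: Integer -> Integer
squareCoeff k = catalan (k + 1)
```

Evaluating squareCoeff 59 in GHCi then produces c_60; map catalan [0 .. 4] gives [1,1,2,5,14] as a sanity check.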

I don't know what the point of all that was, but I thought it was fun how the hairy-looking problem seemed likely to be simple when I looked at it a little more carefully, and then how it did turn out to be quite simple.

This blog has had a recurring dialogue between subtle technique and the sawed-off shotgun method, and I often favor the sawed-off shotgun method. Often programmers' big problem is that they are very clever and learned, and so they want to be clever and learned all the time, even when being a knucklehead would work better. But I think this example provides some balance, because it shows a big win for the clever, learned method, which does produce a lot more understanding.

Then again, it really doesn't take long to whip up a program to multiply infinite polynomials. I did it in chapter 6 of Higher-Order Perl, and here it is again in Haskell:

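        -- A polynomial (or power series) as its list of coefficients,
        -- constant term first; the list may be infinite.
        data Poly a = P [a] deriving Show

        instance (Eq a) => Eq (Poly a) where
            (P x) == (P y) = (x == y)

        -- Add coefficient lists termwise, keeping the longer tail.
        polySum x [] = x
        polySum [] y = y
        polySum (x:xs) (y:ys) = (x+y) : (polySum xs ys)

        -- (x + s*xs) * (y + s*ys) = x*y + s*(x*ys + y*xs + s*(xs*ys))
        polyTimes  [] _ = []
        polyTimes  _ [] = []
        polyTimes  (x:xs) (y:ys) = (x*y) : more
            where
                more = (polySum (polySum (map (x *) ys) (map (* y) xs))
                                (0 : (polyTimes xs ys)))

        instance (Num a) => Num (Poly a) where
            (P x) + (P y) = P (polySum x y)
            (P x) * (P y) = P (polyTimes x y)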