The Universe of Discourse

Wed, 27 Feb 2008

Uniquely-decodable codes
Ricardo J.B. Signes asked me a few days ago if there was a way to decide whether a given set S of strings had the property that any two distinct sequences of strings from S have distinct concatenations.

For example, consider S₁ = { "ab", "abba", "b" }. This set does not have the specified property, because you can take the two sequences [ "ab", "b", "ab" ] and [ "abba", "b" ], and both concatenate to "abbab". But S₂ = { "a", "ab", "abb" } does have this property.

Coding theory

In coding theory, the property has the awful name "unique decodability". The idea is that you have some input symbols, and each input symbol is represented with one output symbol, which is one of the strings from S. Then suppose you receive some message like "abbab". Can you figure out what the original input was? For S₂, yes: it must have been ZY. But for S₁, no: it could have been either YZ or XZX.

In coding theory, the strings are called "code words" and the set of strings is a "code". So the question is how to tell whether a code is uniquely-decodable. One obvious way to take a non-uniquely-decodable code and turn it into a uniquely-decodable code is to append delimiters to the code words. Consider S₁ again. If we delimit the code words, it becomes { "(ab)", "(abba)", "(b)" }, and the two problem sequences are now distinguishable, since "(ab)(b)(ab)" looks nothing like "(abba)(b)". It should be clear that one doesn't need to delimit both ends; the important part is that the words are separated, so one could use { "ab-", "abba-", "b-" } instead, and the problem sequences translate to "ab-b-ab-" and "abba-b-". So every non-uniquely-decodable code corresponds to a uniquely-decodable code in at least this trivial way, and often the uniquely-decodable property is not that important in practice because you can guarantee uniquely-decodableness so easily just by sticking delimiters on the code words.

But if you don't want to transmit the extra delimiters, you can save bandwidth by making your code uniquely-decodable even without delimiters. The delimiters are a special case of a more general principle, which is that a prefix code is always uniquely-decodable. A prefix code is one where no code word is a prefix of another. Or, formally, there are no code words x and y such that x = ys for some nonempty s. Adding the delimiters to a code turns it into a prefix code. But not all prefix codes have delimiters. { "a", "ba", "bba", "bbba" } is an example, as are { "aa", "ab", "ba", "bb" } and { "a", "baa", "bab", "bb" }.

The proof of this is pretty simple: you have some concatenation of code words, say T. You can decode it as follows: Find the unique code word c such that c is a prefix of T; that is, such that T = cU. There must be such a c, because T is a concatenation of code words. And c must be unique, because if there were c' and U' with both cU = T and c'U' = T, then cU = c'U', and whichever of c or c' is shorter must be a prefix of the one that is longer, and that can't happen because this is a prefix code. So c is the first code word in T, and we can pull it off and repeat the process for U, getting a unique sequence of code words, unless U is empty, in which case we are done.

There is a straightforward correspondence between prefix codes and trees; the code words can be arranged at the leaves of a tree, and then to decode some concatenation T you can scan its symbols one at a time, walking the tree, until you get to a leaf, which tells you which code word you just saw. This is the basis of Huffman coding.

Prefix codes include, as a special case, codes where all the words are the same length. For those codes, the tree is balanced, and has all branches the same length.

But uniquely-decodable codes need not be prefix codes. Most obviously, a suffix code is uniquely-decodable and may not be a prefix code. For example, {"a", "aab", "bab", "bb" } is uniquely-decodable but is not a prefix code, because "a" is a prefix of "aab". The proof of uniquely-decodableness is obvious: this is just the last prefix code example from before, with all the code words reversed. If there were two sequences of words with the same concatenation, then the reversed sequences of reversed words would also have the same concatenation, and this would show that the code of the previous paragraph was not uniquely-decodable. But that was a prefix code, and so must be uniquely-decodable.

But codes can be uniquely-decodable without being either prefix or suffix codes. For example, { "aabb", "abb", "bb", "bbba" } is uniquely-decodable but is neither a prefix nor a suffix code. Rik wanted a method for deciding.

I told Rik about the prefix code stuff, which at least provides a sufficient condition for uniquely-decodableness, and then started poking around to see what else I could learn. Ahem, I mean, researching. I suppose that a book on elementary coding theory would have a discussion of the problem, but I didn't have one at hand, and all I could find online concerned prefix codes, which are of more practical interest because of the handy tree method for speedy decoding.

But after tinkering with it for a couple of days (and also making an utterly wrong intermediate guess that it was undecidable, based on a surface resemblance to the Post correspondence problem) I did eventually figure out an algorithm, which I wrote up and released on CPAN, my first CPAN post in about a year and a half.

An example

The idea is pretty simple, and I think best illustrated by an example, as so many things are. We will consider { "ab", "abba", "b" } again. We want to find two sequences of code words whose concatenations are the same. So say we want pX₁ = qY₁, where p and q are code words and X₁ and Y₁ are some longer strings. This can only happen if p and q are different lengths and if one is a prefix of the other, since otherwise the two strings pX₁ and qY₁ don't begin with the same symbols. So we consider just the cases where p is a prefix of q, which means that in this example we want to find "ab"X₁ = "abba"Y₁, or, equivalently, X₁ = "ba"Y₁.

Now X₁ must begin with "ba", so we need to either find a code word that begins with "ba", or we need to find a code word that is a prefix of "ba". The only choice is "b", so we have X₁ = "b"X₂, and so X₁ = "b"X₂ = "ba"Y₁, or equivalently, X₂ = "a"Y₁.

Now X₂ must begin with "a", so we need to either find a code word that begins with "a", or we need to find a code word that is a prefix of "a". This occurs for "abba" and "ab". So we now have two situations to investigate: "ab"X₃ = "a"Y₁, and "abba"X₄ = "a"Y₁. Or, equivalently, "b"X₃ = Y₁, and "bba"X₄ = Y₁.

The first of these, "b"X₃ = Y₁ wins immediately, because "b" is a code word: we can take X₃ to be empty, and Y₁ to be "b", and we have what we want:

"ab" X₁ = "abba" Y₁
"ab" "b" X₂ = "abba" Y₁
"ab" "b" "ab" X₃ = "abba" Y₁
"ab" "b" "ab" = "abba" "b"

where the last line of the table is exactly the solution we seek.

Following the other one, "bba"X₄ = Y₁, fails, and in a rather interesting way. Y₁ must begin with two "b" words, so put "bb"Y₂ = Y₁, so "bba"X₄ = "bb"Y₂, then "a"X₄ = Y₂.

But this last equation is essentially the same as the X₂ = "a"Y₁ situation we were investigating earlier; we are just trying to make two strings that are the same except that one has an extra "a" on the front. So this investigation tells us that if we could find two strings with "a"X = Y, we could make longer strings "abba"Y = "b" "b" "a"X. This may be interesting, but it does not help us find what we really want.

The algorithm

Having seen an example, here's the description of the algorithm. We will tabulate solutions to Xs = Y, where X and Y are sequences of code words, for various strings s. If s is empty, we win.

We start the tabulation by looking for pairs of keywords c₁ and c₂ with c₁ a prefix of c₂, because then we have c₁s = c₂ for some s. We maintain a queue of s-values to investigate. At one point in our example, we had X₁ = "ba"Y₁; here s is "ba".

If s begins with a code word, then s = cs', so we can put s' on the queue. This is what happened when we went from X₁ = "ba"Y₁ to "b"X₂ = "ba"Y₁ to X₂ = "a"Y₁. Here s was "ba" and s' was "a".

If s is a prefix of some code word, say ss' = c, then we can also put s' on the queue. This is what happened when we went from X₂ = "a"Y₁ to "abba"X₄ = "a"Y₁ to "bba"X₄ = Y₁. Here s was "a" and s' was "bba".

If we encounter some queue item that we have seen before, we can discard it; this will prevent us from going in circles. If the next queue item is the empty string, we have proved that the code is not uniquely-decodable. (Alternatively, we can stop just before queueing the empty string.) If the queue is empty, we have investigated all possibilities and the code is uniquely-decodable.

Pseudocode

Here's the summary:

Initialization: For each pair of code words c₁ and c₂ with c₁s = c₂, put s in the queue.
Main loop: Repeat the following until termination
- If the queue is empty, terminate. The code is uniquely-decodable.
- Otherwise:
  1. Take an item s from the queue.
  2. For each code word c:
    - If c = s, terminate. The code is not uniquely-decodable.
    - If cs' = s, and s' has not been seen before, queue s'.
    - If c = ss', and s' has not been seen before, queue s'.

To this we can add a little bookkeeping so that the algorithm emits the two ambiguous sequences when the code is not uniquely-decodable. The implementation I wrote uses a hash to track which strings s have appeared in the queue already. Associated with each string s in the hash are two sequences of code words, P and Q, such that Ps = Q. When s begins with a code word, so that s = cs', the program adds s' to the hash with the two sequences [P, c] and Q. When s is a prefix of a code word, so that ss' = c, the program adds s' to the hash with the two sequences Q and [P, c]; the order of the sequences is reversed in order to maintain the Ps = Q property, which has become Qs' = Pss' = Pc in this case.

Notes

As I said, I suspect this is covered in every elementary coding theory text, but I couldn't find it online, so perhaps this writeup will help someone in the future.

After solving this problem I meditated a little on my role in the programming community. The kind of job I did for Rik here is a familiar one to me. When I was in college, I was the math guy who hung out in the computer lab with the hackers. Periodically one of them would come to me with some math problem: "Crash, I am writing a ray tracer. If I have a ray and a triangle in three dimensions, how can I figure out if the ray intersects the triangle?" And then I would go off and figure out how to do that and come back with the algorithm, perhaps write some code, or perhaps provide some instruction in matrix computations or whatever was needed. In physics class, I partnered with Jim Kasprzak, a physics major, and we did all the homework together. We would read the problem, which would be some physics thing I had no idea how to solve. But Jim understood physics, and could turn the problem from physics into some mathematics thing that he had no idea how to solve. Then I would do the mathematics, and Jim would turn my solution back into physics. I wish I could make a living doing this.

Puzzle: Is { "ab", "baab", "babb", "bbb", "bbba" } uniquely-decodable? If not, find a pair of sequences that concatenate to the same string.

Research question: What's the worst-case running time of the algorithm? The queue items are all strings that are strictly shorter than the longest code word, so if this has length n, then the main loop of the algorithm runs at most (aⁿ-1) / (a-1) times, where a is the number of symbols in the alphabet. But can this worst case really occur, or is the real worst case much faster? In practice the algorithm always seems to complete very quickly.

Project to do: Reimplement in Haskell. Compare with Perl implementation. Meditate on how they can suck in such completely different ways.

[ There is a brief followup to this article. ]

[ Addendum 20170318: Marius Buzea has brought to my attention that this is exactly the Sardinas-Patterson Algorithm. The Wikipedia article claims that the worst-case running time is !!O(nk)!! where !!k!! is the number of code words and !!n!! is their total length. ]

[Other articles in category /CS] permanent link