The Universe of Discourse


Wed, 08 Feb 2006

Generating strings of balanced parentheses
About a year and a half ago, the Perl Quiz of the Week question was to write a program which, given an argument n, printed out all the strings containing n balanced pairs of parentheses. For example, if the argument was 3, the output was to contain the lines:

        ()()()
        (())()
        (()())
        ()(())
        ((()))
in some order.

Enumerating parentheses is important because the parentheses obviously correspond to all the forms that an infix expression can take, so any attempt to enumerate all possible expressions of some form can be built atop a parenthesis-enumerator. But also, there's a straightforward correspondence between parentheses and tree structures, so by enumerating parentheses you are also enumerating all possible tree structures.

There were quite a few correct solutions posted to the list, and also some wrong ones. The commonest mistake to make was to assume that any string of n+1 pairs of balanced parentheses must have the form S(), ()S, or (S), where S is a string of n pairs of balanced parentheses. But this isn't so; (())(()) doesn't have this form. The commonest strategy that worked correctly was to generate all the strings of the right length of the form (S)S, where S is a shorter balanced string. This can be done recursively as follows:

        sub parens {
          my ($N) = @_;
          return ("") if $N == 0;
          my @result;
          for my $i (0 .. $N-1) {
            for $a (parens($i)) {
              for $b (parens($N-$i-1)) {
                push @result, "($a)$b";
              }
            }
          }
          return @result;
        }

        print join "\n", parens(shift()), "";
Or you can organize the logic differently:
        sub parens {
          my $N = shift;
          $pattern = @_ ? shift : "S";
          if ($N == 0) {
            $pattern =~ tr/S//d;
            return $pattern;
          }
          return unless $pattern =~ /S/;
          my $new_pattern_a = my $new_pattern_b = $pattern;
          $new_pattern_a =~ s/S/S(S)/;
          $new_pattern_b =~ s/S//;
          return parens($N-1, $new_pattern_a), parens($N, $new_pattern_b);
        }

        print join "\n", parens(shift()), "";
A somewhat different approach was to build up the string from left to right. Here $s tracks the string built so far, $N counts the number of )'s that are still required, and $unclosed counts the number of ('s that have been added without a corresponding ). The function may append a ( to the string, increasing the number of unclosed open parentheses, if there are fewer than $N such unclosed pairs total, and it may append a ), decreasing both the number of required close parentheses and the number of unmatched open parentheses, as long as at least one open parenthesis remains unclosed.

        sub parens {
            my ($N, $unclosed, $s) = @_;
            $unclosed ||= 0;
            $s ||= "";
            return $s if $N == 0;
            my @result;
            push @result, parens($N,   $unclosed+1, "$s(")  if $unclosed < $N;
            push @result, parens($N-1, $unclosed-1, "$s)")  if $unclosed > 0 ;
            return @result;
        }

        print join "\n", parens(shift()), "";
I had originally planned to do something like the first of these. But a couple of days of intermittent tinkering revealed a completely different and unexpected algorithm:

        #!/usr/bin/perl -l

        print $_ = "()" x shift();
        print while s{^  ( \(+ )  ( \)+ ) \(  }
                     {"()" x (length($2) - 1)
                      . "("  x (length($1) - length($2) + 2) 
                      . ")" 
                     }xe;
This is understandably mystifying, since in tidying it up (and getting it to run as fast as possible) I also erased all the clues as to how I got here to begin with. Folks on the mailing list asked how I came up with this. What follows is a somewhat edited version of my reply. I said at the time:

I can explain how I thought it up, although I'm not sure whether the explanation will be helpful or whether it will sound like "I just counted the legs and divided by four."

By this I meant that the explanation might leave people even more mystified than the algorithm itself. I'm still not sure it won't sound like that, so I'm going to include the following one-paragraph summary that omits the details:

One of the other list members suggested an alternative representation for the strings that seemed promising, so I tried out some examples. Close examination of those examples reminded me of a technique I had used to solve some other problems, so I tried to apply it in this case with the necesary variations. It worked well enough to continue the investigation, but the alternative representation was getting in the way. So I took what I had learned from applying the technique to the alternative representation and tried to write an algorithm to do the same thing directly to the strings of parentheses. Then I optimized the result for speed and code compactness.
The details follow.

I had originally planned to write a recursive solution, but before I started I thought it would be smart to investigate alternative representations. (Someday I'm going to write a long essay about this. I mentioned to Kurt Starsinic last week that the one sure sign of a program written by a novice programmer is that it will represent the data internally in the same format in which it needs to be output, typically as strings.)

One of the representations I looked at was one that had earlier been mentioned by Roger Burton West on the list. In this representation, a string of one "(" followed by n ")"'s becomes the number n, thus:

        ()()()()        1111
        ()()(())        1102
        ()(())()        1021
        ()(()())        1012
        ()((()))        1003
        (())()()        0211
        (())(())        0202
        (()())()        0121
        (()()())        0112
        (()(()))        0103
        ((()))()        0031
        ((())())        0022
        ((()()))        0013
        (((())))        0004
Now if you look at the right-hand column, and at the way the numbers, change, it will seem awfully familiar. Or at least, it felt familiar to me. It reminded me very strongly of a counting process.

The essence of a counting process is that you find the rightmost column that has a certain property, and then you do a little transformation on it so that it gets a little less of that property and the columns to the right get reset to have more of it.

I realize that if you've never thought of it that way that is going to sound totally bizarre, so here's an example. Let's count base-10 numerals. The magic property in this case is the property of being less than 9. 0 has the largest possible amount of this property, since it is as much less than 9 as any digit can be. 8 has very little of this property, since it is just a little less than 9. 9 itself does not have any of this property, since it is not less than 9.

When you count, you find the rightmost column that is less than 9 and then you do a little transformation on it so that it has a little less of that property, so that it a little closer to 9:

        387
        388
        389
After you have done that a few times, you get to a point where the rightmost column's less-than-9-ness has been entirely depleted and you can't change it any more. So you move to the next column to the left and deplete that one instead, and you reset the rightmost column so that it is full of less-than-9-ness again:

        390
Then you continue:
        391
        392
        ...
        399
and now you have to deplete the less-than-9-ness of the third column, and reset the two to its right:

        400
This may be a weird way to look at counting, but it describes all sorts of useful processes. To count in base 2, you allow the columns to hold 0's and 1's, and the property of interest is the property of being a 0; you find the rightmost column that is a 0, and change it to a 1, and change the columns to its right back to 0's.

Probably the next simplest example is when the property of interest is that the n'th column contains a number less than or equal to n:

        0000
        0010
        0100
        0110
        0200
        0210
        1000
        1010
        1100
        1110
        1200
        1210
        2000
        2010
        ...
        3210
This turns out to be useful if you are trying to enumerate permutations; I wrote about it on pages 128–135 of Higher-Order Perl, where I called it the "odometer method". It's used again in the following section to enumerate DNA sequences. So it's a pretty fundamental technique, and turns out to be useful in a wide variety of settings.

Another example I was already familiar with is like the base-2 counting example, but with the added restriction that you are not allowed to have two adjacent 1's:

        0000000
        0000001
        0000010
        0000100
        0000101
        0001000
        0001001
        0001010
        0010000
        0010001
        0010010
        0010100
        0010101
        0100000
        ...
We can generate these strings by repeatedly doing this:

        s/00((10)*1?)$/"01" . "0" x length $1/e;
(This pattern has close relations to the Fibonacci sequence. For example, the nth string contains a single 1 exactly when n is a Fibonacci number. Moreover, every positive integer has a unique representation as a sum of distinct nonconsecutive Fibonacci numbers, the so-called "Zeckendorff representation", and this counting thing tells you what it is. This is analogous to the way that every positive integer has a unique representation as a sum of distinct powers of 2, and the binary expansion tells you what it is.)

So anyway, I was already familiar with this idea, and when I saw the parenthesis numbers, it reminded me strongly of one of these counting processes. Here they are again:

        1111
        1102        
        1021
        1012
        1003
        0211
        0202
        0121
        0112
        0103
        0031
        0022
        0013
        0004
Here the rule seems to be that you always decrement the rightmost nonzero digit that isn't in the last column and increment the following digit—I imagine that the digits are little heaps of 1's, and you are allowed to transfer a 1 from its heap to the heap on its right. But there are some additional constraints.

Since a number n here represents the string ()))...), with n close parentheses, we have the constraints that the leftmost column may not exceed 1; the sum of the two leftmost columns may not exceed 2, and so on. Under these constraints, 1111 is clearly an extreme case, and no string has the 1's any farther to the left. To go from 1111 to 1102, the rightmost moveable 1, which is in the third column, moves to the right, into the fourth column, which I now imagine contains a heap of two 1's.

Now the rightmost 1 is in the second column. So I reset the columns to the right of that (getting back 1111) and then move the 1 from the second to the third column, yielding 1021. Again, I imagine that the 2 here represents a heap of two 1's. These 1's can move to the right, yielding 1012 and then 1003.

Now I'll need to move the 1 from the first to the second column, so I reset the columns to the right of that (getting back 1111 again) and move the 1 from the first to the second column, yielding 0211. Then the 1 in the third column moves, yielding 0202, and then I reset the third and fourth columns back to 0211 so that I can move a 1 from the second to the third column, yielding 0121.

My first cut at implementing this actually manipulated these digit strings directly, like this:

        output while s/([1-9])(0*)([1-9])$/($1-1).($3-length($2)+1).1x length($2)/e;
As a result, it wouldn't work for n > 9. But after I thought about it some more, I realized I didn't need to deal with the digit strings; I could directly manipulate the parentheses in the corresponding way. This was faster, simpler, and got rid of the n < 10 restriction.

Translated back into parentheses, the algorithm might even be simpler to understand. The property we are trying to reduce is appearances of )(, in which ( appears to the right of ). In ()()()...(), the open parentheses are as far to the right as they can be, so this represents the maximal configuration. At each step, we're going to find the rightmost moveable open parenthesis, and we're going to move it one step to the left. That is, we're going to find the rightmost )( and change it to (). If there are other parentheses to the right of the ones that moved, we'll reset them into their minimal configuration. In what follows, the )( that just changed to () is in red, and the parentheses to its right that were reset are in blue. The )( that is about to change to () in the next step is in boldface.

In the initial configuration, the rightmost )( is in positions 6-7:

        ()()()()                        1111
We replace it with ():

        ()()(())                        1102
Now the rightmost )( is in positions 4 and 5. We replace it with () and reset the following parentheses to the minimal configuration:

        ()(())()                       1021
The rightmost )( is on the right again, so we get:

        ()(()())                        1012
Now the rightmost )( is at positions 5 and 6, so the string changes to:

        ()((()))                        1003
Now the only )( is far to the left, at positions 2–3, so when we change it, we need to reset the parentheses to its right:

        (())()()                        0211
And it continues in the same way:

        (())(())                        0202
        (()())()                        0121
        (()()())                        0112
        (()(()))                        0103
        ((()))()                        0031
        ((())())                        0022
        ((()()))                        0013
        (((())))                        0004
After I had implemented that, I realized there was nothing stopping me from writing all the strings backward, and doing the substitutions at the left end of the string instead of at the right end. This is always much cheaper in perl's regex engine because it doesn't have to guess where to start matching in the target string. So I reversed the whole thing. It no longer produced the parentheses in lexicographic order (unless you read right to left) but it was substantially faster.

Another solution to the problem is "consult Knuth". Volume IV of The Art of Computer Programming will contain an entire section devoted to this problem, and includes the algorithm I described. Volume IV isn't published yet, but the relevant section is. As of this writing, it is also still available from Knuth's web site.


[Other articles in category /CS] permanent link