The Universe of Discourse

Thu, 30 Nov 2017

Another public service announcement about Git.

There are a number of commands everyone learns when they first start out using Git. And there are some that almost nobody learns right away, but that should be the first thing you learn once you get comfortable using Git day to day.

One of these has the uninteresting-sounding name git-rev-parse. Git has a bewildering variety of notations for referring to commits and other objects. If you type something like origin/master~3, which commit is that? git-rev-parse is your window into Git's understanding of names:

% git rev-parse origin/master~3
37f2bc78b3041541bb4021d2326c5fe35cbb5fbb

A pretty frequent question is: How do I find out the commit ID of the current HEAD? And the answer is:

2536fdd82332846953128e6e785fbe7f717e117a

or if you want it abbreviated:

2536fdd

But more important than the command itself is the manual for the command. Whether you expect to use this command, you should read its manual. Because every command uses Git's bewildering variety of notations, and that manual is where the notations are completely documented.

When you use a ref name like master, Git finds it in .git/refs/heads/master, but when you use origin/master, Git finds it in .git/refs/remotes/origin/master, and when you use HEAD Git finds it in .git/HEAD. Why the difference? The git-rev-parse manual explains what Git is doing here.

Did you know that if you have an annoying long branch name like origin/martin/f42876-change-tracking you can create a short alias for it by sticking

ref: origin/martin/f42876-change-tracking

into .git/CT, and from then on you can do git log CT or git rebase --onto CT or whatever?

Did you know that you can write topic@{yesterday} to mean “whatever commit topic was pointing to yesterday”?

Did you know that you can write ':/penguin system' to refer to the most recent commit whose commit message mentions the penguin system, and that 'HEAD:/penguin system' means the most recent such commit on the HEAD branch?

Did you know that there's a powerful sublanguage for ranges that you can give to git-log to specify all sorts of useful things about which commits you want to look at?

Once I got comfortable with Git I got in the habit of rereading the git-rev-parse manual every few months, because each time I would notice some new useful tool.

Check it out. It's an important next step.

[ Previous PSAs:

]

Thu, 16 Nov 2017

My recent article about system software errors kinda blew up the Reddit / Hacker News space, and even got listed on Voat, which I understand is the Group W Bench where they send you if you aren't moral enough to be in Reddit. Many people on these fora were eager to tell war stories of times that they had found errors in the compiler or other infrastructural software.

This morning I remembered another example that had happened to me. In the middle 1990s, I was just testing some network program on one of the Sun Solaris machines that belonged to the Computational Linguistics program, when the entire machine locked up. I had to go into the machine room and power-cycle it to get it to come back up.

I returned to my desk to pick up where I had left off, and the machine locked up, again just as I ran my program. I rebooted the machine again, and putting two and two together I tried the next run on a different, less heavily-used machine, maybe my desk workstation or something.

The problem turned out to be a bug in that version of Solaris: if you bound a network socket to some address, and then tried to connect it to the same address, everything got stuck. I wrote a five-line demonstration program and we reported the bug to Sun. I don't know if it was fixed.

My boss had an odd immediate response to this, something along the lines that connecting a socket to itself is not a sanctioned use case, so the failure is excusable. Channeling Richard Stallman, I argued that no user-space system call should ever be able to crash the system, no matter what stupid thing it does. He at once agreed.

I felt I was on safe ground, because I had in mind the GNU GCC bug reporting instructions of the time, which contained the following unequivocal statement:

If the compiler gets a fatal signal, for any input whatever, that is a compiler bug. Reliable compilers never crash.

I love this paragraph. So clear, so pithy! And the second sentence! It could have been left off, but it is there to articulate the writer's moral stance. It is a rock-firm committment in a wavering and uncertain world.

Stallman was a major influence on my writing for a long time. I first encountered his work in 1985, when I was browsing in a bookstore and happened to pick up a copy of Dr. Dobb's Journal. That issue contained the very first publication of the GNU Manifesto. I had never heard of Unix before, but I was bowled over by Stallman's vision, and I read the whole thing then and there, standing up.

 Order The Crazy Ape from Powell's

(It hit the same spot in my heart as Albert Szent-Györgyi's The Crazy Ape, which made a similarly big impression on me at about the same time. I think programmers don't take moral concerns seriously enough, and this is one reason why so many of them find Stallman annoying. But this is what I think makes Stallman so important. Perhaps Dan Bernstein is a similar case.)

I have very vague memories of perhaps finding a bug in gcc, which is perhaps why I was familiar with that particular section of the gcc documentation. But more likely I just read it because I read a lot of stuff. Also Stallman was probably on my “read everything he writes” list.

Why was I trying to connect a socket to itself, anyway? Oh, it was a bug. I meant to connect it somewhere else and used the wrong variable or something. If the operating system crashes when you try, that is a bug. Reliable operating systems never crash.

[ Final note: I looked for my five-line program that connected a socket to itself, but I could not find it. But I found something better instead: an email I sent in April 1993 reporting a program that caused g++ version 2.3.3 to crash with an internal compiler error. And yes, my report does quote the same passage I quoted above. ]

Sun, 12 Nov 2017

When I used to hang out in the comp.lang.c Usenet group, back when there was a comp.lang.c Usenet group, people would show up fairly often with some program they had written that didn't work, and ask if their compiler had a bug. The compiler did not have a bug. The compiler never had a bug. The bug was always in the programmer's code and usually in their understanding of the language.

When I worked at the University of Pennsylvania, a grad student posted to one of the internal bulletin boards looking for help with a program that didn't work. Another graduate student, a super-annoying know-it-all, said confidently that it was certainly a compiler bug. It was not a compiler bug. It was caused by a misunderstanding of the way arguments to unprototyped functions were automatically promoted.

This is actually a subtle point, obscure and easily misunderstood. Most examples I have seen of people blaming the compiler are much sillier. I used to be on the mailing list for discussing the development of Perl 5, and people would show up from time to time to ask if Perl's if statement was broken. This is a little mind-boggling, that someone could think this. Perl was first released in 1987. (How time flies!) The if statement is not exactly an obscure or little-used feature. If there had been a bug in if it would have been discovered and fixed by 1988. Again, the bug was always in the programmer's code and usually in their understanding of the language.

Here's something I wrote in October 2000, which I think makes the case very clearly, this time concerning a claimed bug in the stat() function, another feature that first appeared in Perl 1.000:

On the one hand, there's a chance that the compiler has a broken stat and is subtracting 6 or something. Maybe that sounds likely to you but it sounds really weird to me. I cannot imagine how such a thing could possibly occur. Why 6? It all seems very unlikely.

Well, in the absence of an alternative hypothesis, we have to take what we can get. But in this case, there is an alternative hypothesis! The alternative hypothesis is that [this person's] program has a bug.

Now, which seems more likely to you?

• Weird, inexplicable compiler bug that nobody has ever seen before

or

• Programmer fucked up

Hmmm. Let me think.

I'll take Door #2, Monty.

Presumably I had to learn this myself at some point. A programmer can waste a lot of time looking for the bug in the compiler instead of looking for the bug in their program. I have a file of (obnoxious) Good Advice for Programmers that I wrote about twenty years ago, and one of these items is:

Looking for a compiler bug is the strategy of LAST resort. LAST resort.

Anyway, I will get to the point. As I mentioned a few months ago, I built a simple phone app that Toph and I can use to find solutions to “twenty-four puzzles”. In these puzzles, you are given four single-digit numbers and you have to combine them arithmetically to total 24. Pennsylvania license plates have four digits, so as we drive around we play the game with the license plate numbers we see. Sometimes we can't solve a puzzle, and then we wonder: is it because there is no solution, or because we just couldn't find one? Then we ask the phone app.

The other day we saw the puzzle «5 4 5 1», which is very easy, but I asked the phone app, to find out if there were any other solutions that we missed. And it announced No solutions.” Which is wrong. So my program had a bug, as my programs often do.

The app has a pre-populated dictionary containing all possible solutions to all the puzzles that have solutions, which I generated ahead of time and embedded into the app. My first guess was that bug had been in the process that generated this dictionary, and that it had somehow missed the solutions of «5 4 5 1». These would be indexed under the key 1455, which is the same puzzle, because each list of solutions is associated with the four input numbers in ascending order. Happily I still had the original file containing the dictionary data, but when I looked in it under 1455 I saw exactly the two solutions that I expected to see.

So then I looked into the app itself to see where the bug was. Code Studio's underlying language is Javascript, and Code Studio has a nice debugger. I ran the app under the debugger, and stopped in the relevant code, which was:

var x = [getNumber("a"), getNumber("b"), getNumber("c"), getNumber("d")].sort().join("");

This constructs a hash key (x) that is used to index into the canned dictionary of solutions. The getNumber() calls were retrieving the four numbers from the app's menus, and I verified that the four numbers were «5 4 5 1» as they ought to be. But what I saw next astounded me: x was not being set to 1455 as it should have been. It was set to 4155, which was not in the dictionary. And it was set to 4155 because

the built-in sort() function

was sorting the numbers

into

the

wrong

order.

For a while I could not believe my eyes. But after another fifteen or thirty minutes of tinkering, I sent off a bug report… no, I did not. I still didn't believe it. I asked the front-end programmers at my company what my mistake had been. Nobody had any suggestions.

Then I sent off a bug report that began:

I think that Array.prototype.sort() returned a wrongly-sorted result when passed a list of four numbers. This seems impossible, but …

But to my astonishment, the reply came back only an hour later:

Wow! You're absolutely right. We'll investigate this right away.

In case you're curious, the bug was as follows: The sort() function was using a bubble sort. (This is of course a bad choice, and I think the maintainers plan to replace it.) The bubble sort makes several passes through the input, swapping items that are out of order. It keeps a count of the number of swaps in each pass, and if the number of swaps is zero, the array is already ordered and the sort can stop early and skip the remaining passes. The test for this was:

if (changes <= 1) break;

but it should have been:

if (changes == 0) break;

Ouch.

The Code Studio folks handled this very creditably, and did indeed fix it the same day. (The support system ticket is available for your perusal, as is the Github pull request with the fix, in case you are interested.)

I still can't quite believe it. I feel as though I have accidentally spotted the Loch Ness Monster, or Bigfoot, or something like that, a strange and legendary monster that until now I thought most likely didn't exist.

A bug in the sort() function. O day and night, but this is wondrous strange!

[ Addendum 20171113: Thanks to Reddit user spotter for pointing me to a related 2008 blog post of Jeff Atwood's, “The First Rule of Programming: It's Always Your Fault”. ]

[ Addendum 20171113: Yes, yes, I know sort() is in the library, not in the compiler. I am using “compiler error” as a synecdoche for “system software error”. ]

[ Addendum 20171116: I remembered examples of two other fundamental system software errors I have discovered, including one honest-to-goodness compiler bug. ]

Mon, 19 Jun 2017

On Saturday I posted an article explaining how remote branches and remote-tracking branches work in Git. That article is a prerequisite for this one. But here's the quick summary:

When dealing with a branch (say, master) copied from a remote repository (say, remote), there are three branches one must consider:
1. The copy of master in the local repository
2. The copy of master in the remote repository
3. The local branch origin/master that records the last known position of the remote branch
Branch 3 is known as a “remote-tracking branch”. This is because it tracks the remote branch, not because it is itself a remote branch. Actually it is a local copy of the remote branch. From now on I will just call it a “tracking branch”.

The git-fetch command (green) copies branch (2) to (3).

The git-push command (red) copies branch (1) to (2), and incidentally updates (3) to match the new (2).

The diagram at right summarizes this.

We will consider the following typical workflow:

1. Fetch the remote master branch and check it out.
2. Do some work and commit it on the local master.
3. Push the new work back to the remote.

But step 3 fails, saying something like:

! [rejected]        master -> master (fetch first)
error: failed to push some refs to '../remote/'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

In older versions of Git the hint was a little shorter:

hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Merge the remote changes (e.g. 'git pull')
hint: before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Everyone at some point gets one of these messages, and in my experience it is one of the most confusing and distressing things for beginners. It cannot be avoided, worked around, or postponed; it must be understood and dealt with.

Not everyone gets a clear explanation. (Reading it over, the actual message seems reasonably clear, but I know many people find it long and frighting and ignore it. It is tough in cases like this to decide how to trade off making the message shorter (and perhaps thereby harder to understand) or longer (and frightening people away). There may be no good solution. But here we are, and I am going to try to explain it myself, with pictures.)

In a large project, the remote branch is always moving, as other people add to it, and they do this without your knowing about it. Immediately after you do the fetch in step 1 above, the tracking branch origin/master reflects the state of the remote branch. Ten seconds later, it may not; someone else may have come along and put some more commits on the remote branch in the interval. This is a fundamental reality that new Git users must internalize.

Typical workflow

We were trying to do this:

1. Fetch the remote master branch and check it out.
2. Do some work and commit it on the local master.
3. Push the new work back to the remote.

and the failure occurred in step 3. Let's look at what each of these operations actually does.

1. Fetch the remote master branch and check it out.

git fetch origin master
git checkout master

The black circles at the top represent some commits that we want to fetch from the remote repository. The fetch copies them to the local repository, and the tracking branch origin/master points to the local copy. Then we check out master and the local branch master also points to the local copy.

Branch names like master or origin/master are called “refs”. At this moment all three refs refer to the same commit (although there are separate copies in the two repositories) and the three branches have identical contents.

2. Do some work and commit it on the local master.

edit…
git commit …

The blue dots on the local master branch are your new commits. This happens entirely inside your local repository and doesn't involve the remote one at all.

But unbeknownst to you, something else is happening where you can't see it. Your collaborators or co-workers are doing their own work in their own repositories, and some of them have published this work to the remote repository. These commits are represented by the red dots in the remote repository. They are there, but you don't know it yet because you haven't looked at the remote repository since they appeared.

3. Push the new work back to the remote.

git push origin master

Here we are trying to push our local master, which means that we are asking the remote repo to overwrite its master with our local one. If the remote repo agreed to this, the red commits would be lost (possibly forever!) and would be completely replaced by the blue commits. The error message that is the subject of this article is Git quite properly refusing to fulfill your request:

! [rejected]        master -> master (fetch first)
error: failed to push some refs to '../remote/'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Updates were rejected because the remote contains work that you do not have locally.

This refers specifically to the red commits.

This is usually caused by another repository pushing to the same ref.

In this case, the other repository is your co-worker's repo, not shown in the diagram. They pushed to the same ref (master) before you did.

You may want to first integrate the remote changes (e.g., 'git pull ...') before pushing again.

This is a little vague. There are many ways one could conceivably “integrate the remote changes” and not all of them will solve the problem.

One alternative (which does not integrate the changes) is to use git push -f. The -f is for “force”, and instructs the remote repository that you really do want to discard the red commits in favor of the blue ones. Depending on who owns it and how it is configured, the remote repository may agree to this and discard the red commits, or it may refuse. (And if it does agree, the coworker whose commits you just destroyed may try to feed you poisoned lemonade, so use -f with caution.)

See the 'Note about fast-forwards' in 'git push --help' for details.

To “fast-forward” the remote ref means that your local branch is a direct forward extension of the remote branch, containing everything that the remote branch does, in exactly the same order. If this is the case, overwriting the remote branch with the local branch is perfectly safe. Nothing will be lost or changed, because the local branch contains everything the remote branch already had. The only change will be the addition of new commits at the end.

There are several ways to construct such a local branch, and choosing between them depends on many factors including personal preference, your familiarity with the Git tool set, and the repository owner's policies. Discussing all of this is outside the scope of the article, so I'll just use one as an example: We are going to rebase the blue commits onto the red ones.

4. Refresh the tracking branch.

git fetch origin master

The first thing to do is to copy the red commits into the local repo; we haven't even seen them yet. We do that as before, with git-fetch. This updates the tracking branch with a copy of the remote branch just as it did in step 1.

If instead of git fetch origin master we did git pull --rebase origin master, Git would do exactly the same fetch, and then automatically do a rebase as described in the next section. If we did git pull origin master without --rebase, it would do exactly the same fetch, and then instead of a rebase it would do a merge, which I am not planning to describe. The point to remember is that git pull is just a convenient way to combine the commands of this section and the next one, nothing more.

5. Rewrite the local changes.

git rebase origin/master

Now is the moment when we “integrate the remote changes” with our own changes. One way to do this is git rebase origin/master. This tells Git to try to construct new commits that are just like the blue ones, but instead of starting from the last black commit, the will start from the last red one. (For more details about how this works, see my talk slides about it.) There are many alternatives here to rebase, some quite elaborate, but that is a subject for another article, or several other articles.

If none of the files modified in the blue commits have also been modified in any of the red commits, there is no issue and everything proceeds automatically. And if some of the same files are modified, but only in non-overlapping portions, Git can automatically combine them. But if some of the files are modified in incompatible ways, the rebase process will stop in the middle and asks how to proceed, which is another subject for another article. This article will suppose that the rebase completed automatically. In this case the blue commits have been “rebased onto” the red commits, as in the diagram at right.

The diagram is a bit misleading here: it looks as though those black and red commits appear in two places in the local repository, once on the local master branch and once on the tracking branch. They don't. The two branches share those commits, which are stored only once.

Notice that the command is git rebase origin/master. This is different in form from git fetch origin master or git push origin master. Why a slash instead of a space? Because with git-fetch or git-push, we tell it the name of the remote repo, origin, and the name of the remote branch we want to fetch or push, master. But git-rebase operates locally and has no use for the name of a remote repo. Instead, we give it the name of the branch onto which we want to rebase the new commits. In this case, the target branch is the tracking branch origin/master.

6. Try the push again.

git push origin master

We try the exact same git push origin master that failed in step 3, and this time it succeeds, because this time the operation is a “fast-forward”. Before, our blue commits would have replaced the red commits. But our rewritten local branch does not have that problem: it includes the red commits in exactly the same places as they are already on the remote branch. When the remote repository replaces its master with the one we are pushing, it loses nothing, because the red commits are identical. All it needs to do is to add the blue commits onto the end and then move its master ref forward to point to the last blue commit instead of to the last red commit. This is a “fast-forward”.

At this point, the push is successful, and the git-push command also updates the tracking branch to reflect that the remote branch has moved forward. I did not show this in the illustration.

But wait, what if someone else had added yet more commits to the remote master while we were executing steps 4 and 5? Wouldn't our new push attempt fail just like the first one did? Yes, absolutely! We would have to repeat steps 4 and 5 and try a third time. It is possible, in principle, to be completely prevented from pushing commits to a remote repo because it is always changing so quickly that you never get caught up on its current state. Repeated push failures of this type are sign that the project is large enough that repository's owner needs to set up a more structured code release mechanism than “everyone lands stuff on master whenever they feel like it”.

Unavoidable problems

Everyone suffers through this issue at some point or another. It is tempting to wonder if Git couldn't somehow make it easier for people to deal with. I think the answer is no. Git has multiple, distributed repositories. To abandon that feature would be to go back to the dark ages of galley slaves, smallpox, and SVN. But if you have multiple distributed anythings, you must face the issue of how to synchronize them. This is intrinsic to distributed systems: two components receive different updates at the same time, and how do you reconcile them?

For reasons I have discussed before, it does not appear possible to automate the reconciliation in every case in a source code control system, because sometimes the reconciliation may require going over to a co-worker's desk and arguing for two hours, then calling in three managers and the CTO and making a strategic decision which then has to be approved by a representative of the legal department. The VCS is not going to do this for you.

I'm going to digress a bit and then come back to the main point. Twenty-five years ago I taught an introductory programming class in C. The previous curriculum had tried hard to defer pointers to the middle of the semester, as K&R does (chapter 7, I think). I decided this was a mistake. Pointers are everywhere in C and without them you can't call scanf or pass an array to a function (or access the command-line arguments or operate on strings or use most of the standard library or return anything that isn't a number…). Looking back a few years later I wrote:

Pointers are an essential part of [C's] solution to the data hiding problem, which is an essential issue. Therefore, they cannot be avoided, and in fact should be addressed as soon as possible. … They presented themselves in the earliest parts of the material not out of perversity, but because they were central to the topic.

I developed a new curriculum that began treating pointers early on, as early as possible, and which then came back to them repeatedly, each time elaborating on the idea. This was a big success. I am certain that it is the right way to do it.

(And I've been intending since 2006 to write an article about K&R's crappy discussion of pointers and how its deficiencies and omissions have been replicated down the years by generation after generation of C programmers.)

I think there's an important pedagogical principle here. A good teacher makes the subject as simple as possible, but no simpler. Many difficult issues, perhaps most, can be ignored, postponed, hidden, prevaricated, fudged, glossed over, or even solved. But some must be met head-on and dealt with, and for these I think the sooner they are met and dealt with, the better.

Push conflicts in Git, like pointers in C, are not minor or peripheral; they are an intrinsic and central issue. Almost everyone is going to run into push conflicts, not eventually, but right away. They are going to be completely stuck until they have dealt with it, so they had better be prepared to deal with it right away.

If I were to write a book about Git, this discussion would be in chapter 2. Dealing with merge conflicts would be in chapter 3. All the other stuff could wait.

That is all I have to say about this. Thank you for your kind attention, and thanks to Sumana Harihareswara and AJ Jordan for inspiration.

Sat, 17 Jun 2017

Beginning and even intermediate Git users have several common problem areas, and one of these is the relationship between remote and local branches. I think the basic confusion is that it seems like there ought to be two things, the remote branch and the local one, and you copy back and forth between them. But there are not two but three, and the Git documentation does not clearly point this out or adopt clear terminology to distinguish between the three.

Let's suppose we have a remote repository, which could be called anything, but is typically named origin. And we have a local repository which has no name; it's just the local repo. And let's suppose we're working on a branch named master, as one often does.

There are not two but three branches of interest, and they might all be pointing to different commits:

1. The branch named master in the local repo. This is where we do our work and make our commits. This is the local branch. It is at the lower left in the diagram.

2. The branch named master in the remote repo. This is the remote branch, at the top of the diagram. We cannot normally see this at all because it is (typically) on another computer and (typically) requires a network operation to interact with it. So instead, we mainly deal with…

3. The branch named origin/master in the local repo. This is the tracking branch, at the lower right in the diagram.

We never modify the tracking branch ourselves. It is automatically maintained for us by Git. Whenever Git communicates with the remote repo and learns something about the disposition of the remote master branch, it updates the local branch origin/master to reflect what it has learned.

I think this triangle diagram is the first thing one ought to see when starting to deal with remote repositories and with git-fetch and git-push.

The Git documentation often calls the tracking branch the “remote-tracking branch”. It is important to understand that the remote-tracking branch is a local branch in the local repository. It is called the “remote-tracking” branch because it tracks the state of the remote branch, not because it is itself remote. From now on I will just call it the “tracking branch”.

Now let's consider a typical workflow:

1. We use git fetch origin master. This copies the remote branch master from the remote repo to the tracking branch origin/master in the local repo. This is the green arrow in the diagram.

If other people have added commits to the remote master branch since our last fetch, now is when we find out what they are. We can compare the local branch master with the tracking branch origin/master to see what is new. We might use git log origin/master to see the new commits, or git diff origin/master to compare the new versions of the files with the ones we had before. These commands do not look at the remote branch! They look at the copy of the remote branch that Git retrieved for us. If a long time elapses between the fetch and the compare, the actual remote branch might be in a completely different place than when we fetched at it.

(Maybe you use pull instead of fetch. But pull is exactly like fetch except that it does merge or rebase after the fetch completes. So the process is the same; it merely combines this step and the next step into one command. )

2. We decide how to combine our local master with origin/master. We might use git merge origin/master to merge the two branches, or we might use git rebase origin/master to copy our new local commits onto the commits we just fetched. Or we could use git reset --hard origin/master to throw away our local commits (if any) and just take the ones on the tracking branch. There are a lot of things that could happen here, but the blue arrow in the diagram shows the general idea: we see new stuff in origin/master and update the local master to include that new stuff in some way.

3. After doing some more work on the local master, we want to publish the new work. We use git push origin master. This is the red arrow in the diagram. It copies the local master to the remote master, updating the remote master in the process. If it is successful, it also updates the tracking branch origin/master to reflect the new position of the remote master.

In the last step, why is there no slash in git push origin master? Because origin/master is the name of the tracking branch, and the tracking branch is not involved. The push command gets two arguments: the name of the remote (origin) and the branch to push (master) and then it copies the local branch to the remote one of the same name.

Deleting a branch

How do we delete branches? For the local branch, it's easy: git branch -d master does it instantly.

For the tracking branch, we include the -r flag: git branch -d -r origin/master. This deletes the tracking branch, and has no effect whatever on the remote repo. This is a very unusual thing to do.

To delete the remote branch, we have to use git-push because that is the only way to affect the remote repo. We use git push origin :master. As is usual with a push, if this is successful Git also deletes the tracking branch origin/master.

This section has glossed over an important point: git branch -d master does not delete the master branch, It only deletes the ref, which is the name for the branch. The branch itself remains. If there are other refs that refer to it, it will remain as long as they do. If there are no other refs that point to it, it will be deleted in due course, but not immediately. Until the branch is actually deleted, its contents can be recovered.

Hackery

Another way to delete a local ref (whether tracking or not) is just to go into the repository and remove it. The repository is usually in a subdirectory .git of your working tree, and if you cd .git/refs you can see where Git records the branch names and what they refer to. The master branch is nothing more nor less than a file heads/master in this directory, and its contents are the commit ID of the commit to which it refers. If you edit this commit ID, you have pointed the ref at a different commit. If you remove the file, the ref is gone. It is that simple.

Tracking branches are similar. The origin/master ref is in .git/refs/remotes/origin/master.

The remote master branch, of course, is not in your repository at all; it's in the remote repository.

Poking around in Git's repository is fun and rewarding. (If it worries you, make another clone of the repo, poke around in the clone, and throw it away when you are finished poking.) Tinkering with the refs is a good place to start Git repo hacking: create a couple of branches, move them around, examine them, delete them again, all without using git-branch. Git won't know the difference. Bonus fun activity: HEAD is defined by the file .git/HEAD. When you make a new commit, HEAD moves forward. How does that work?

There is a gitrepository-layout manual that says what else you can find in the repository.

Failed pushes

We're now in a good position to understand one of the most common problems that Git beginners face: they have committed some work, and they want to push it to the remote repository, but Git says

! [rejected]        master -> master (fetch first)
error: failed to push some refs to 'remote'
something something fast-forward, whatever that is

My article explaining this will appear here on Monday. (No, I really mean it.)

Terminology problems

I think one of the reasons this part of Git is so poorly understood is that there's a lack of good terminology in this area. There needs to be a way to say "the local branch named master” and “the branch named master in the remote named origin” without writing a five- or nine-word phrase every time. The name origin/master looks like it might be the second of these, but it isn't. The documentation uses the descriptive but somewhat confusing term “remote-tracking branch” to refer to it. I think abbreviating this to “tracking branch” would tend to clear things up more than otherwise.

I haven't though of a good solution to the rest of it yet. It's tempting to suggest that we should abbreviate “the branch named master in the remote named origin” to something like “origin:master” but I think that would be a disaster. It would be too easy to confuse with origin/master and also with the use of the colon in the refspec arguments to git-push. Maybe something like origin -> master that can't possibly be mistaken for part of a shell command and that looks different enough from origin/master to make clear that it's related but not the same thing.

Git piles yet another confusion on this:

$git checkout master Branch master set up to track remote branch master from origin. This sounds like it has something to with the remote-tracking branch, but it does not! It means that the local branch master has been associated with the remote origin so that fetches and pushes that pertain to it will default to using that remote. I will think this over and try to come up with something that sucks a little less. Suggestions are welcome. Thu, 16 Feb 2017 Over the past couple of days I've written about how I committed a syntax error on a cron script, and a co-worker had to fix it on Saturday morning. I observed that I should have remembered to check the script for syntax errors before committing it, and several people wrote to point out to me that this is the sort of thing one should automate. (By the way, please don't try to contact me on Twitter. It won't work. I have been on Twitter Vacation for months and have no current plans to return.) Git has a “pre-commit hook” feature, which means that you can set up a program that will be run every time you attempt a commit, and which can abort the commit if it doesn't like what it sees. This is the natural place to put an automatic syntax check. Some people suggested that it should be part of the CI system, or even the deployment system, but I don't control those, and anyway it is much better to catch this sort of thing as early as possible. I decided to try to implement a pre-commit hook to check syntax. Unlike some of the git hooks, the pre-commit hook is very simple to use. It gets run when you try to make a commit, and the commit is aborted if the hook exits with a nonzero status. I made one mistake right off the bat: I wrote the hook in Bourne shell, even though I swore years ago to stop writing shell scripts. Everything that I want to write in shell should be written in Perl instead or in some equivalently good language like Python. But the sample pre-commit hook was written in shell and when I saw it I went into automatic shell scripting mode and now I have yet another shell script that will have to be replaced with Perl when it gets bigger. I wish I would stop doing this. Here is the hook, which, I should say up front, I have not yet tried in day-to-day use. The complete and current version is on github. #!/bin/bash function typeof () { filename=$1
case $filename in *.pl | *.pm) echo perl; exit ;; esac line1=$(head -1 $1) case$line1 in '#!'*perl )
echo perl; exit ;;
esac
}

Some of the sample programs people showed me decided which files needed to be checked based only on the filename. This is not good enough. My most important Perl programs have filenames with no extension. This typeof function decides which set of checks to apply to each file, and the minimal demonstration version here can do that based on filename or by looking for the #!...perl line in the first line of the file contents. I expect that this function will expand to include other file types; for example

*.py ) echo python; exit ;;

is an obvious next step.

if [ ! -z $COMMIT_OK ]; then exit 0; fi This block is an escape hatch. One day I will want to bypass the hook and make a commit without performing the checks, and then I can COMMIT_OK=1 git commit …. There is actually a --no-verify flag to git-commit that will skip the hook entirely, but I am unlikely to remember it. (I am also unlikely to remember COMMIT_OK=1. But I know from experience that I will guess that I might have put an escape hatch into the hook. I will also guess that there might be a flag to git-commit that does what I want, but that will seem less likely to be true, so I will look in the hook program first. This will be a good move because my hook is much shorter than the git-commit man page. So I will want the escape hatch, I will look for it in the best place, and I will find it. That is worth two lines of code. Sometimes I feel like the guy in Memento. I have not yet resorted to tattooing COMMIT_OK=1 on my chest.) exec 1>&2 This redirects the standard output of all subsequent commands to go to standard error instead. It makes it more convenient to issue error messages with echo and such like. All the output this hook produces is diagnostic, so it is appropriate for it to go to standard error. allOK=true badFiles= for file in$(git diff --cached --name-only | sort) ; do

allOK is true if every file so far has passed its checks. badFiles is a list of files that failed their checks. the git diff --cached --name-only function interrogates the Git index for a list of the files that have been staged for commit.

type=$(typeof "$file")

This invokes the typeof function from above to decide the type of the current file.

When a check discovers that the current file is bad, it will signal this by setting BAD to true.

echo
echo "##  Checking file $file (type$type)"
case $type in perl ) perl -cw$file || BAD=true
[ -x $file ] || { echo "File is not executable"; BAD=true; } ;; * ) echo "Unknown file type:$file; no checks"
;;
esac

This is the actual checking. To check Python files, we would add a python) … ;; block here. The * ) case is a catchall. The perl checks run perl -cw, which does syntax checking without executing the program. It then checks to make sure the file is executable, which I am sure is a mistake, because these checks are run for .pm files, which are not normally supposed to be executable. But I wanted to test it with more than one kind of check.

if $BAD; then allOK=false; badFiles="$badFiles;$file" fi done If the current file was bad, the allOK flag is set false, and the commit will be aborted. The current filename is appended to badFiles for a later report. Bash has array variables but I don't remember how they work and the manual made it sound gross. Already I regret not writing this in a real language. After the modified files have been checked, the hook exits successfully if they were all okay, and prints a summary if not: if$allOK; then
exit 0;
else
echo ''
echo '## Aborting commit.  Failed checks:'
for file in $(echo$badFiles | tr ';' ' '); do
echo "    $file" done exit 1; fi This hook might be useful, but I don't know yet; as I said, I haven't really tried it. But I can see ahead of time that it has a couple of drawbacks. Of course it needs to be built out with more checks. A minor bug is that I'd like to apply that is-executable check to Perl files that do not end in .pm, but that will be an easy fix. But it does have one serious problem I don't know how to fix yet. The hook checks the versions of the files that are in the working tree, but not the versions that are actually staged for the commit! The most obvious problem this might cause is that I might try to commit some files, and then the hook properly fails because the files are broken. Then I fix the files, but forget to add the fixes to the index. But because the hook is looking at the fixed versions in the working tree, the checks pass, and the broken files are committed! A similar sort of problem, but going the other way, is that I might make several changes to some file, use git add -p to add the part I am ready to commit, but then the commit hook fails, even though the commit would be correct, because the incomplete changes are still in the working tree. I did a little tinkering with git stash save -k to try to stash the unstaged changes before running the checks, something like this: git stash save -k "pre-commit stash" || exit 2 trap "git stash pop" EXIT but I wasn't able to get anything to work reliably. Stashing a modified index has never worked properly for me, perhaps because there is something I don't understand. Maybe I will get it to work in the future. Or maybe I will try a different method; I can think of several offhand: • The hook could copy each file to a temporary file and then run the check on the temporary file. But then the diagnostics emitted by the checks would contain the wrong filenames. • It could move each file out of the way, check out the currently-staged version of the file, check that, and then restore the working tree version. (It can skip this process for files where the staged and working versions are identical.) This is not too complicated, but if it messes up it could catastrophically destroy the unstaged changes in the working tree. • Check out the entire repository and modified index into a fresh working tree and check that, then discard the temporary working tree. This is probably too expensive. • This one is kind of weird. It could temporarily commit the current index (using --no-verify), stash the working tree changes, and check the files. When the checks are finished, it would unstash the working tree changes, use git-reset --soft to undo the temporary commit, and proceed with the real commit if appropriate. • Come to think of it, this last one suggests a much better version of the same thing: instead of a pre-commit hook, use a post-commit hook. The post-commit hook will stash any leftover working tree changes, check the committed versions of the files, unstash the changes, and, if the checks failed, undo the commit with git-reset --soft. Right now the last one looks much the best but perhaps there's something straightforward that I didn't think of yet. [ Thanks to Adam Sjøgren, Jeffrey McClelland, and Jack Vickeridge for discussing this with me. Jeffrey McClelland also suggested that syntax checks could be profitably incorporated as a post-receive hook, which is run on the remote side when new commits are pushed to a remote. I said above that running the checks in the CI process seems too late, but the post-receive hook is earlier and might be just the thing. ] [ Addendum: Daniel Holz wrote to tell me that the Yelp pre-commit frameworkhandles the worrisome case of unstaged working tree changes. The strategy is different from the ones I suggested above. If I'm reading this correctly, it records the unstaged changes in a patch file, which it sticks somewhere, and then checks out the index. If all the checks succeed, it completes the commit and then tries to apply the patch to restore the working tree changes. The checks in Yelp's framework might modify the staged files, and if they do, the patch might not apply; in this case it rolls back the whole commit. Thank you M. Holtz! ] Wed, 15 Feb 2017 Yesterday I wrote, in great irritation, about a line of code I had written that contained three errors. I said: What can I learn from this? Most obviously, that I should have tested my code before I checked it in. Afterward, I felt that this was inane, and that the matter required a little more reflection. We do not test every single line of every program we write; in most applications that would be prohibitively expensive, and in this case it would have been excessive. The change I was making was in the format of the diagnostic that the program emitted as it finished to report how long it had taken to run. This is not an essential feature. If the program does its job properly, it is of no real concern if it incorrectly reports how long it took to run. Two of my errors were in the construction of the message. The third, however, was a syntax error that prevented the program from running at all. Having reflected on it a little more, I have decided that I am only really upset about the last one, which necessitated an emergency Saturday-morning repair by a co-worker. It was quite acceptable not to notice ahead of time that the report would be wrong, to notice it the following day, and to fix it then. I would have said “oops” and quietly corrected the code without feeling like an ass. The third problem, however, was serious. And I could have prevented it with a truly minimal amount of effort, just by running: perl -cw the-script This would have diagnosed the syntax error, and avoided the main problem at hardly any cost. I think I usually remember to do something like this. Had I done it this time, the modified script would have gone into production, would have run correctly, and then I could have fixed the broken timing calculation on Monday. In the previous article I showed the test program that I wrote to test the time calculation after the program produced the wrong output. I think it was reasonable to postpone writing this until after program ran and produced the wrong output. (The program's behavior in all other respects was correct and unmodified; it was only its report about its running time that was incorrect.) To have written the test ahead of time might be an excess of caution. There has to be a tradeoff between cautious preparation and risk. Here I put everything on the side of risk, even though a tiny amount of caution would have eliminated most of the risk. In my haste, I made a bad trade. [ Addendum 20170216: I am looking into automating the perl -cw check. ] Mon, 12 Dec 2016 My co-worker X had been collaborating with a front-end designer on a very large change, consisting of about 406 commits in total. The sum of the changes was to add 18 new files of code to implement the back end of the new system, and also to implement the front end, a multitude of additions to both new and already-existing files. Some of the 406 commits modified just the 18 back-end files, some modified just the front-end files, and many modified both. X decided to merge and deploy just the back-end changes, and then, once that was done and appeared successful, to merge the remaining front-end changes. His path to merging the back-end changes was unorthodox: he checked out the current master, and then, knowing that the back-end changes were isolated in 18 entirely new files, did git checkout topic-branch -- new-file-1 new-file-2 … new-file-18 He then added the 18 files to the repo, committed them, and published the resulting commit on master. In due course this was deployed to production without incident. The next day he wanted to go ahead and merge the front-end changes, but he found himself in “a bit of a pickle”. The merge didn't go forward cleanly, perhaps because of other changes that had been made to master in the meantime. And trying to rebase the branch onto the new master was a complete failure. Many of those 406 commits included various edits to the 18 back-end files that no longer made sense now that the finished versions of those files were in the master branch he was trying to rebase onto. So the problem is: how to land the rest of the changes in those 406 commits, preferably without losing the commit history and messages. The easiest strategy in a case like this is usually to back in time: If the problem was caused by the unorthodox checkout-add-commit, then reset master to the point before that happened and try doing it a different way. That strategy wasn't available because X had already published the master with his back-end files, and a hundred other programmers had copies of them. The way I eventually proceeded was to rebase the 406-commit work branch onto the current master, but to tell Git meantime that conflicts in the 18 back-end files should be ignored, because the version of those files on the master branch was already perfect. Merge drivers There's no direct way to tell Git to ignore merge conflicts in exactly 18 files, but there is a hack you can use to get the same effect. The repo can contain a .gitattributes file that lets you specify certain per-file options. For example, you can use .gitattributes to say that the files in a certain directory are text, that when they are checked out the line terminators should be converted to whatever the local machine's line terminator convention is, and they should be converted back to NLs when changes are committed. Some of the per-file attributes control how merge conflicts are resolved. We were already using this feature for a certain frequently-edited file that was a list of processes to be performed in a certain order: do A then do B Often different people would simultaneously add different lines to the end of this file: # Person X's change: do A then do B then do X # Person Y's change: do A then do B then do Y X would land their version on master and later there would be a conflict when Y tried to land their own version: do A then do B <<<<<<<< then do X -------- then do Y >>>>>>>> Git was confused: did you want new line X or new line Y at the end of the file, or both, and if both then in what order? But the answer was always the same: we wanted both, X and then Y, in that order: do A then do B then do X then do Y With the merge attribute set to union for this file, Git automatically chooses the correct resolution. So, returning to our pickle, I wanted to set the merge attribute for the 18 back-end files to tell Git to always choose the version already in master, and always ignore the changes from the branch I was merging. There is not exactly a way to do this, but the mechanism that is provided is extremely general, and it is not hard to get it to do what we want in this case. The merge attribute in .gitattributes specifies the name of a “driver” that resolves merge conflicts. The driver can be one of a few built-in drivers, such as the union driver I just described, or it can be the name of a user-supplied driver, configured in .gitconfig. The first step is to use .gitattributes to tell Git to use our private, special-purpose driver for the 18 back-end files: new-file-1 merge=ours new-file-2 merge=ours … new-file-18 merge=ours (The name ours here is completely arbitrary. I chose it because its function was analogous to the -s ours and -X ours options of git-merge.) Then we add a section to .gitconfig to say what the ours driver should do: [merge "ours"] name = always prefer our version to the one being merged driver = true The name is just a human-readable description and is ignored by Git. The important part is the deceptively simple-appearing driver = true line. The driver is actually a command that is run when there is a merge conflict. The command is run with the names of three files containing different versions of the target file: the main file being merged into, and temporary files containing the version with the conflicting changes and the common ancestor of the first two files. It is the job of the driver command to examine the three files, figure out how to resolve the conflict, and modify the main file appropriately. In this case merging the two or three versions of the file is very simple. The main version is the one on the master branch, already perfect. The proposed changes are superfluous, and we want to ignore them. To modify the main file appropriately, our merge driver command needs to do exactly nothing. Unix helpfully provides a command that does exactly nothing, called true, so that's what we tell Git to use to resolve merge conflicts. With this configured, and the changes to .gitattributes checked in, I was able to rebase the 406-commit topic branch onto the current master. There were some minor issues to work around, so it was not quite routine, but the problem was basically solved and it wasn't a giant pain. I didn't actually use git-rebase I should confess that I didn't actually use git-rebase at this point; I did it semi-manually, by generating a list of commit IDs and then running a loop that cherry-picked them one at a time: tac /tmp/commit-ids | while read commit; do git cherry-pick$commit || break
done

I don't remember why I thought this would be a better idea than just using git-rebase, which is basically the same thing. (Superstitious anxiety, perhaps.) But I think the process and the result were pretty much the same. The main drawback of my approach is that if one of the cherry-picks fails, and the loop exits prematurely, you have to hand-edit the commit-ids file before you restart the loop, to remove the commits that were successfully picked.

Also, it didn't work on the first try

My first try at the rebase didn't quite work. The merge driver was working fine, but some commits that it wanted to merge modified only the 18 back-end files and nothing else. Then there were merge conflicts, which the merge driver said to ignore, so that the net effect of the merged commit was to do nothing. But git-rebase considers that an error, says something like

The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

git commit --allow-empty

and stops and waits for manual confirmation. Since 140 of the 406 commits modified only the 18 perfect files I was going to have to intervene manually 140 times.

I wanted an option that told git-cherry-pick that empty commits were okay and just to ignore them entirely, but that option isn't in there. There is something almost as good though; you can supply --keep-redundant-commits and instead of failing it will go ahead and create commits that make no changes. So I ended up with a branch with 406 commits of which 140 were empty. Then a second git-rebase eliminated them, because the default behavior of git-rebase is to discard empty commits. I would have needed that final rebase anyway, because I had to throw away the extra commit I added at the beginning to check in the changes to the .gitattributes file.

A few conflicts remained

There were three or four remaining conflicts during the giant rebase, all resulting from the following situation: Some of the back-end files were created under different names, edited, and later moved into their final positions. The commits that renamed them had unresolvable conflicts: the commit said to rename A to B, but to Git's surprise B already existed with different contents. Git quite properly refused to resolve these itself. I handled each of these cases manually by deleting A.

I made this up as I went along

I don't want anyone to think that I already had all this stuff up my sleeve, so I should probably mention that there was quite a bit of this I didn't know beforehand. The merge driver stuff was all new to me, and I had to work around the empty-commit issue on the fly.

Also, I didn't find a working solution on the first try; this was my second idea. My notes say that I thought my first idea would probably work but that it would have required more effort than what I described above, so I put it aside planning to take it up again if the merge driver approach didn't work. I forget what the first idea was, unfortunately.

Named commits

This is a minor, peripheral technique which I think is important for everyone to know, because it pays off far out of proportion to how easy it is to learn.

There were several commits of interest that I referred to repeatedly while investigating and fixing the pickle. In particular:

• The last commit on the topic branch
• The first commit on the topic branch that wasn't on master
• The commit on master from which the topic branch diverged

Instead of trying to remember the commit IDs for these I just gave them mnemonic names with git-branch: last, first, and base, respectively. That enabled commands like git log base..last … which would otherwise have been troublesome to construct. Civilization advances by extending the number of important operations which we can perform without thinking of them. When you're thinking "okay, now I need to rebase this branch" you don't want to derail the train of thought to remember where the bottom of the branch is every time. Being able to refer to it as first is a big help.

Other approaches

After it was all over I tried to answer the question “What should X have done in the first place to avoid the pickle?” But I couldn't think of anything, so I asked Rik Signes. Rik immediately said that X should have used git-filter-branch to separate the 406 commits into two branches, branch A with just the changes to the 18 back-end files and branch B with just the changes to the other files. (The two branches together would have had more than 406 commits, since a commit that changed both back-end and front-end files would be represented in both branches.) Then he would have had no trouble landing branch A on master and, after it was deployed, landing branch B.

At that point I realized that git-filter-branch also provided a less peculiar way out of the pickle once we were in: Instead of using my merge driver approach, I could have filtered the original topic branch to produce just branch B, which would have rebased onto master just fine.

I was aware that git-filter-branch was not part of my personal toolkit, but I was unaware of the extent of my unawareness. I would have hoped that even if I hadn't known exactly how to use it, I would at least have been able to think of using it. I plan to set aside an hour or two soon to do nothing but mess around with git-filter-branch so that next time something like this happens I can at least consider using it.

It occurred to me while I was writing this that it would probably have worked to make one commit on master to remove the back-end files again, and then rebase the entire topic branch onto that commit. But I didn't think of it at the time. And it's not as good as what I did do, which left the history as clean as was possible at that point.

I think I've written before that this profusion of solutions is the sign of a well-designed system. The tools and concepts are powerful, and can be combined in many ways to solve many problems that the designers didn't foresee.

Thu, 21 Jul 2016

Today I invented a pretty good hack.

Suppose I have branch topic checked out. It often happens that I want to

git push origin topic:mjd/topic

which pushes the topic branch to the origin repository, but on origin it is named mjd/topic instead of topic. This is a good practice when many people share the same repository. I wanted to write a program that would do this automatically.

So the question arose, how should the program figure out the mjd part? Almost any answer would be good here: use some selection of environment variables, the current username, a hard-wired default, and the local part of Git's user.email configuration setting, in some order. Getting user.email is easy (git config get user.email) but it might not be set and then you get nothing. If you make a commit but have no user.email, Git doesn't mind. It invents an address somehow. I decided that I would like my program to to do exactly what Git does when it makes a commit.

But what does Git use for the committer's email address if there is no user.email set? This turns out to be complicated. It consults several environment variables in some order, as I suggested before. (It is documented in git-commit-tree if you are interested.) I did not want to duplicate Git's complicated procedure, because it might change, and because duplicating code is a sin. But there seemed to be no way to get Git to disgorge this value, short of actually making a commit and examining it.

So I wrote this command, which makes a commit and examines it:

git log -1 --format=%ce $(git-commit-tree HEAD^{tree} < /dev/null) This is extremely weird, but aside from that it seems to have no concrete drawbacks. It is pure hack, but it is a hack that works flawlessly. What is going on here? First, the$(…) part:

The git-commit-tree command is what git-commit uses to actually create a commit. It takes a tree object, reads a commit message from standard input, writes a new commit object, and prints its SHA1 hash on standard output. Unlike git-commit, it doesn't modify the index (git-commit would use git-write-tree to turn the index into a tree object) and it doesn't change any of the refs (git-commit would update the HEAD ref to point to the new commit.) It just creates the commit.

Here we could use any tree, but the tree of the HEAD commit is convenient, and HEAD^{tree} is its name. We supply an empty commit message from /dev/null.

Then the outer command runs:

git log -1 --format=%ce $(…) The$(…) part is replaced by the SHA1 hash of the commit we just created with git-commit-tree. The -1 flag to git-log gets the log information for just this one commit, and the --format=%ce tells git-log to print out just the committer's email address, whatever it is.

This is fast—nearly instantaneous—and cheap. It doesn't change the state of the repository, except to write a new object, which typically takes up 125 bytes. The new commit object is not attached to any refs and so will be garbage collected in due course. You can do it in the middle of a rebase. You can do it in the middle of a merge. You can do it with a dirty index or a dirty working tree. It always works.

(Well, not quite. It will fail if run in an empty repository, because there is no HEAD^{tree} yet. Probably there are some other similarly obscure failure modes.)

I called the shortcut git-push program git-pusho but I dropped the email-address-finder into git-get, which is my storehouse of weird “How do I find out X” tricks.

I wish my best work of the day had been a little bit more significant, but I'll take what I can get.

[ Addendum: Twitter user @shachaf has reminded me that the right way to do this is

git var GIT_COMMITTER_IDENT

which prints out something like

Mark Jason Dominus (陶敏修) <mjd@plover.com> 1469102546 -0400

which you can then parse. @shachaf also points out that a Stack Overflow discussion of this very question contains a comment suggesting the same weird hack! ]

Thu, 14 Jul 2016

[ Danielle Sucher reminded me of this article I wrote in 1998, before I had a blog, and I thought I'd repatriate it here. It should be interesting as a historical artifact, if nothing else. Thanks Danielle! ]

I avoided syntax coloring for years, because it seemed like a pretty stupid idea, and when I tried it, I didn't see any benefit. But recently I gave it another try, with Ilya Zakharevich's cperl-mode' for Emacs. I discovered that I liked it a lot, but for surprising reasons that I wasn't expecting.

I'm not trying to start an argument about whether syntax coloring is good or bad. I've heard those arguments already and they bore me to death. Also, I agree with most of the arguments about why syntax coloring is a bad idea. So I'm not trying to argue one way or the other; I'm just relating my experiences with syntax coloring. I used to be someone who didn't like it, but I changed my mind.

When people argue about whether syntax coloring is a good idea or not, they tend to pull out the same old arguments and dust them off. The reasons I found for using syntax coloring were new to me; I'd never seen anyone mention them before. So I thought maybe I'd post them here.

Syntax coloring is when the editor understands something about the syntax of your program and displays different language constructs in different fonts. For example, cperl-mode displays strings in reddish brown, comments in a sort of brick color, declared variables (in my) in gold, builtin function names (defined) in green, subroutine names in blue, labels in teal, and keywords (like my and foreach) in purple.

The first thing that I noticed about this was that it was easier to recognize what part of my program I was looking at, because each screenful of the program had its own color signature. I found that I was having an easier time remembering where I was or finding that parts I was looking for when I scrolled around in the file. I wasn't doing this consciously; I couldn't describe the color scheme any particular part of the program was, but having red, gold, and purple blotches all over made it easier to tell parts of the program apart.

The other surprise I got was that I was having more fun programming. I felt better about my programs, and at the end of the day, I felt better about the work I had done, just because I'd spent the day looking at a scoop of rainbow sherbet instead of black and white. It was just more cheerful to work with varicolored text than monochrome text. The reason I had never noticed this before was that the other coloring editors I used had ugly, drab color schemes. Ilya's scheme won here by using many different hues.

I haven't found many of the other benefits that people say they get from syntax coloring. For example, I can tell at a glance whether or not I failed to close a string properly—unless the editor has screwed up the syntax coloring, which it does often enough to ruin the benefit for me. And the coloring also slows down the editor. But the two benefits I've described more than outweigh the drawbacks for me. Syntax coloring isn't a huge win, but it's definitely a win.

If there's a lesson to learn from this, I guess it's that it can be valuable to revisit tools that you rejected, to see if you've changed your mind. Nothing anyone said about it was persuasive to me, but when I tried it I found that there were reasons to do it that nobody had mentioned. Of course, these reasons might not be compelling for anyone else.

Looking back on this from a distance of 18 years, I am struck by the following thoughts:

1. Syntax highlighting used to make the editor really slow. You had to make a real commitment to using it or not. I had forgotten about that. Another victory for Moore’s law!

2. Programmers used to argue about it. Apparently programmers will argue about anything, no matter how ridiculous. Well okay, this is not a new observation. Anyway, this argument is now finished. Whether people use it or not, they no longer find the need to argue about it. This is a nice example that sometimes these ridiculous arguments eventually go away.

3. I don't remember why I said that syntax highlighting “seemed like a pretty stupid idea”, but I suspect that I was thinking that the wrong things get highlighted. Highlighters usually highlight the language keywords, because they're easy to recognize. But this is like highlighting all the generic filler words in a natural language text. The words you want to see are exactly the opposite of what is typically highlighted.

Syntax highlighters should be highlighting the semantic content like expression boundaries, implied parentheses, boolean subexpressions, interpolated variables and other non-apparent semantic features. I think there is probably a lot of interesting work to be done here. Often you hear programmers say things like “Oh, I didn't see the that the trailing comma was actually a period.” That, in my opinion, is the kind of thing the syntax highlighter should call out. How often have you heard someone say “Oh, I didn't see that while there”?

4. I have been misspelling “arguments” as “argmuents” for at least 18 years.

Sat, 16 Apr 2016

If you lose something [in Git], don't panic. There's a good chance that you can find someone who will be able to hunt it down again.

I was not expecting to have a demonstration ready so soon. But today I finished working on a project, I had all the files staged in the index but not committed, and for some reason I no longer remember I chose that moment to do git reset --hard, which throws away the working tree and the staged files. I may have thought I had committed the changes. I hadn't.

If the files had only been in the working tree, there would have been nothing to do but to start over. Git does not track the working tree. But I had added the files to the index. When a file is added to the Git index, Git stores it in the repository. Later on, when the index is committed, Git creates a commit that refers to the files already stored. If you know how to look, you can find the stored files even before they are part of a commit.

(If they are part of a commit, the problem is much easier. Typically the answer is simply “use git-reflog to find the commit again and check it out”. The git-reflog command is probably the first thing anyone should learn on the path from being a Git beginner to becoming an intermediate Git user.)

Each file added to the Git index is stored as a “blob object”. Git stores objects in two ways. When it's fetching a lot of objects from a remote repository, it gets a big zip file with an attached table of contents; this is called a pack. Getting objects from a pack can be a pain. Fortunately, not all objects are in packs. When when you just use git-add to add a file to the index, git makes a single object, called a “loose” object. The loose object is basically the file contents, gzipped, with a header attached. At some point Git will decide there are too many loose objects and assemble them into a pack.

To make a loose object from a file, the contents of the file are checksummed, and the checksum is used as the name of the object file in the repository and as an identifier for the object, exactly the same as the way git uses the checksum of a commit as the commit's identifier. If the checksum is 0123456789abcdef0123456789abcdef01234567, the object is stored in

.git/objects/01/23456789abcdef0123456789abcdef01234567

The pack files are elsewhere, in .git/objects/pack.

So the first thing I did was to get a list of the loose objects in the repository:

cd .git/objects
find ?? -type f  | perl -lpe 's#/##' > /tmp/OBJ

This produces a list of the object IDs of all the loose objects in the repository:

00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a
0093a412d3fe23dd9acb9320156f20195040a063
01f3a6946197d93f8edba2c49d1bb6fc291797b0
…
ffd505d2da2e4aac813122d8e469312fd03a3669
fff732422ed8d82ceff4f406cdc2b12b09d81c2e

There were 500 loose objects in my repository. The goal was to find the eight I wanted.

There are several kinds of objects in a Git repository. In addition to blobs, which represent file contents, there are commit objects, which represent commits, and tree objects, which represent directories. These are usually constructed at the time the commit is done. Since my files hadn't been committed, I knew I wasn't interested in these types of objects. The command git cat-file -t will tell you what type an object is. I made a file that related each object to its type:

for i in $(cat /tmp/OBJ); do echo -n "$i ";
git type $i; done > /tmp/OBJTYPE The git type command is just an alias for git cat-file -t. (Funny thing about that: I created that alias years ago when I first started using Git, thinking it would be useful, but I never used it, and just last week I was wondering why I still bothered to have it around.) The OBJTYPE file output by this loop looks like this: 00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a blob 0093a412d3fe23dd9acb9320156f20195040a063 tree 01f3a6946197d93f8edba2c49d1bb6fc291797b0 commit … fed6767ff7fa921601299d9a28545aa69364f87b tree ffd505d2da2e4aac813122d8e469312fd03a3669 tree fff732422ed8d82ceff4f406cdc2b12b09d81c2e blob Then I just grepped out the blob objects: grep blob /tmp/OBJTYPE | f 1 > /tmp/OBJBLOB The f 1 command throws away the types and keeps the object IDs. At this point I had filtered the original 500 objects down to just 108 blobs. Now it was time to grep through the blobs to find the ones I was looking for. Fortunately, I knew that each of my lost files would contain the string org-service-currency, which was my name for the project I was working on. I couldn't grep the object files directly, because they're gzipped, but the command git cat-file disgorges the contents of an object: for i in$(cat /tmp/OBJBLOB ) ; do
git cat-file blob $i | grep -q org-service-curr && echo$i;
done > /tmp/MATCHES

The git cat-file blob $i produces the contents of the blob whose ID is in$i. The grep searches the contents for the magic string. Normally grep would print the matching lines, but this behavior is disabled by the -q flag—the q is for “quiet”—and tells grep instead that it is being used only as part of a test: it yields true if it finds the magic string, and false if not. The && is the test; it runs echo $i to print out the object ID$i only if the grep yields true because its input contained the magic string.

So this loop fills the file MATCHES with the list of IDs of the blobs that contain the magic string. This worked, and I found that there were only 18 matching blobs, so I wrote a very similar loop to extract their contents from the repository and save them in a directory:

for i in $(cat /tmp/OBJBLOB ) ; do git cat-file blob$i |
grep -q org-service-curr
&& git cat-file blob $i > /tmp/rescue/$i;
done

Instead of printing out the matching blob ID number, this loop passes it to git cat-file again to extract the contents into a file in /tmp/rescue.

The rest was simple. I made 8 subdirectories under /tmp/rescue representing the 8 different files I was expecting to find. I eyeballed each of the 18 blobs, decided what each one was, and sorted them into the 8 subdirectories. Some of the subdirectories had only 1 blob, some had up to 5. I looked at the blobs in each subdirectory to decide in each case which one I wanted to keep, using diff when it wasn't obvious what the differences were between two versions of the same file. When I found one I liked, I copied it back to its correct place in the working tree.

Finally, I went back to the working tree and added and committed the rescued files.

It seemed longer, but it only took about twenty minutes. To recreate the eight files from scratch might have taken about the same amount of time, or maybe longer (although it never takes as long as I think it will), and would have been tedious.

But let's suppose that it had taken much longer, say forty minutes instead of twenty, to rescue the lost blobs from the repository. Would that extra twenty minutes have been time wasted? No! The twenty minutes spent to recreate the files from scratch is a dead loss. But the forty minutes to rescue the blobs is time spent learning something that might be useful in the future. The Git rescue might have cost twenty extra minutes, but if so it was paid back with forty minutes of additional Git expertise, and time spent to gain expertise is well spent! Spending time to gain expertise is how you become an expert!

Git is a core tool, something I use every day. For a long time I have been prepared for the day when I would try to rescue someone's lost blobs, but until now I had never done it. Now, if that day comes, I will be able to say “Oh, it's no problem, I have done this before!”

So if you lose something in Git, don't panic. There's a good chance that you can find someone who will be able to hunt it down again.

Fri, 08 Apr 2016

I'm becoming one of the people at my company that people come to when they want help with git, so I've been thinking a lot about what to tell people about it. It's always tempting to dive into the technical details, but I think the first and most important things to explain about it are:

1. Git has a very simple and powerful underlying model. Atop this model is piled an immense trashheap of confusing, overlapping, inconsistent commands. If you try to just learn what commands to run in what order, your life will be miserable, because none of the commands make sense. Learning the underlying model has a much better payoff because it is much easier to understand what is really going on underneath than to try to infer it, Sherlock-Holmes style, from the top.

2. One of Git's principal design criteria is that it should be very difficult to lose work. Everything is kept, even if it can sometimes be hard to find. If you lose something, don't panic. There's a good chance that you can find someone who will be able to hunt it down again. And if you make a mistake, it is almost always possible to put things back exactly the way they were, and you can find someone who can show you how to do it.

One exception is changes that haven't been committed. These are not yet under Git's control, so it can't help you with them. Commit early and often.

[ Addendum 20160415: I wrote a detailed account of a time I recovered lost files. ]

[ Addendum 20160505: I don't know why I didn't mention it before, but if you want to learn Git's underlying model, you should read Git from the Bottom Up (which is what worked for me) or Git from the Inside Out which is better illustrated. ]

Thu, 13 Aug 2015

On Tuesday I discussed an interesting solution to the problem of turning this:

no X              X on

A --------------- C

into this:

no X     X off    X on

A ------ B ------ C

Dave Du Cros has suggested an alternative solution: Make the changes required to turn off feature X, and commit them as B, as in my solution:

no X     X on     X off

A ------ C ------ B

Then use git-revert to revert the changes, making a new C commit in the right place:

no X     X on     X off     X on

A ------ C ------ B ------ C'

C' and C have identical trees.

Then use git-rebase to squash together C and B:

no X              X off     X on

A --------------- B ------ C'

This has the benefit of not requiring anything strange. I think my solution is more general, but it's also weird, and it's not clear that the increased generality is useful.

However, what if there were a git-reorder-commits command? Then my solution would seem much less weird. It would look like this: create B, as before, and do:

git reorder-commits 0 1

This last command would mean that the previous two commits, normally HEAD~1 and HEAD~0, should switch places. This might be a useful standard tool. Or similarly to turn

B -- 3 -- 2 -- 1 -- 0

into

B -- 2 -- 0 -- 3 -- 1

one would use

git reorder-commits 2 0 3 1

I think git-reorder-commits would be easy to implement, as a loop atop git-commit-tree, as in the previous article.

Tue, 11 Aug 2015

I know, you want to say “Why didn't you just use git-rebase?” Because git-rebase wouldn't work here, that's why. Let me back up.

Say I have commit A, in which feature X does not exist yet. Then in commit C, I implement feature X.

But I realize what I really wanted was to have A, then B, in which feature X was implemented but disabled, and then C in which feature X was enabled. The C I want is just like the C that I have, but I don't have the intervening B.

I have:

no X              X on

A --------------- C

I want:

no X     X off    X on

A ------ B ------ C

One way to do this is to use git-rebase in edit mode to split C into B and C. To do this I would pause while rebasing C, edit C to disable feature X, commit the result, which is B, then undo the previous edits to re-enable X, and continue the rebase, creating C. That's two sets of edits. I could backup the files before the first edit and then copy them back for the second edit, but that's the SVN way, so I'm not going to do that.

Now someone wants me to use git-rebase to “reorder the commits”. Their idea is: I have C. Edit C to disable feature X and commit the result as B':

no X     X on     X off

A ------ C ------ B'

Now use interactive git-rebase to reorder B and C. But this will not work. git-rebase will construct a patch for turning C into B' and will try to apply it to A. This will fail completely, because a patch for turning C into B' is a patch for turning off feature X once it is implemented. Feature X is not in A and you can't turn something off that isn't there. So the rebase will fail to apply the patch.

What I did instead was rather bizarre, using a plumbing command, but worked well. I wrote the code to disable X, and committed it as B, obtaining this:

no X     X on     X off

A ------ C ------ B

Now B and C have the files I want in them, but their parents are wrong. That is, the history is in the wrong order, but if the parent of C was B and the parent of B was A, eveything would be perfect.

But we can't just change the parents; we have to create a new commit, say B', which has the same files as B but whose parent is A instead of C, and we have to create a new commit C' which has the same files as C but whose parent is B' instead of A.

This is what git-commit-tree does. You give it a tree object containing the files you want, a list of parents, and a commit message, and it creates the commit you asked for and prints its SHA1.

When we use git-commit, it first turns the index into a tree, with git-write-tree, then creates the commit, with git-commit-tree, and then moves the current head ref up to the new commit. Here we will use git-commit-tree directly.

So I did:

% git checkout -b XX A
Switched to a new branch 'XX'
% git commit-tree -p HEAD B^{tree}
% git commit-tree -p HEAD C^{tree}
ce46beb90d4aa4e2c9fe0e2e3d22eea256edceac
% git reset --hard ce46beb90d4aa4e2c9fe0e2e3d22eea256edceac

The first git-commit-tree

% git commit-tree -p HEAD B^{tree}

says to make a commit whose tree is the same as B's, and whose parent is the current HEAD, which is A. (B^{tree} is a special notation that means to get the tree from commit B.) Git pauses here to read the commit message from standard input (not shown), and prints the SHA of the new commit on the terminal. I then use git-reset to move the current head ref, XX, up to the new commit. Normally git-commit would do this for us, but we're not using git-commit today.

Then I do the same thing with C:

% git commit-tree -p HEAD C^{tree}

makes a new commit whose tree is the same as C's, and whose parent is the current head, which looks just like B. Again it reads a commit message from standard input, and prints the SHA of the new commit on the terminal, and again I use git-reset to move XX up to the new commit.

Now I have what I want and I only had to edit the files once. To complete the task I just reset the head of my working branch to wherever XX is now, discarding the old A-C-B branch in favor of the new A-B-C branch. If there's an easier way to do this, I don't know it.

It seems to me that there have been a number of times in the past when I wanted to do something like reordering commits, and git-rebase did not do what I wanted because it reorders patches and not commits. I should keep my eyes open, and see if this comes up again, and if it is worth automating.

[ Thanks to Jeremy Leader for suggesting I write this up and to Jeremy Leader and Rik Signes for advance editing. ]

[ Adendum 20150813: a followup article ]

Tue, 04 Aug 2015

A few months ago I wrote an article about using Haskell's list monad to do exhaustive search, with the running example of solving this cryptarithm puzzle:

S E N D
+   M O R E
-----------
M O N E Y

(This means that we want to map the letters S, E, N, D, M, O, R, Y to distinct digits 0 through 9 to produce a five-digit and two four-digit numerals which, when added in the indicated way, produce the indicated sum.)

At the end, I said:

It would be an interesting and pleasant exercise to try to implement the same underlying machinery in another language. I tried this in Perl once, and I found that although it worked perfectly well, between the lack of the do-notation's syntactic sugar and Perl's clumsy notation for lambda functions (sub { my ($s) = @_; … } instead of \s -> …) the result was completely unreadable and therefore unusable. However, I suspect it would be even worse in Python because of semantic limitations of that language. I would be interested to hear about this if anyone tries it. I was specifically worried about Python's peculiar local variable binding. But I did receive the following quite clear solution from Peter De Wachter, who has kindly allowed me to reprint it: digits = set(range(10)) def to_number(*digits): n = 0 for d in digits: n = n * 10 + d return n def let(x, f): return f(x) def unit(x): return [x] def bind(xs, f): ys = [] for x in xs: ys += f(x) return ys def guard(b, f): return f() if b else [] after which the complete solution looks like: def solutions(): return bind(digits - {0}, lambda s: bind(digits - {s}, lambda e: bind(digits - {s,e}, lambda n: bind(digits - {s,e,n}, lambda d: let(to_number(s,e,n,d), lambda send: bind(digits - {0,s,e,n,d}, lambda m: bind(digits - {s,e,n,d,m}, lambda o: bind(digits - {s,e,n,d,m,o}, lambda r: let(to_number(m,o,r,e), lambda more: bind(digits - {s,e,n,d,m,o,r}, lambda y: let(to_number(m,o,n,e,y), lambda money: guard(send + more == money, lambda: unit((send, more, money)))))))))))))) print(solutions()) I think this shows that my fears were unfounded. This code produces the correct answer in about 1.8 seconds on my laptop. Thus inspired, I tried doing it again in Perl, and it was not as bad as I remembered: sub bd { my ($ls, $f) = @_; [ map @{$f->($_)}, @$ls ]      # Yow
}
sub guard { $_[0] ? [undef] : [] } I opted to omit unit/return since an idiomatic solution doesn't really need it. We can't name the bind function bind because that is reserved for a built-in function; I named it bd instead. We could use Perl's operator overloading to represent binding with the >> operator, but that would require turning all the lists into objects, and it didn't seem worth doing. We don't need to_number, because Perl does it implicitly, but we do need a set subtraction function, because Perl has no built-in set operators: sub remove { my ($b, $a) = @_; my %h = map {$_ => 1 } @$a; delete$h{$_} for @$b;
return [ keys %h ];
}

After which the solution, although cluttered by Perl's verbose notation for lambda functions, is not too bad:

my $digits = [0..9]; my$solutions =
bd remove([0],        $digits) => sub { my ($s) = @_;
bd remove([$s],$digits) => sub { my ($e) = @_; bd remove([$s,$e],$digits) => sub { my ($n) = @_; bd remove([$s,$e,$n], $digits) => sub { my ($d) = @_;
my $send = "$s$e$n$d"; bd remove([0,$s,$e,$n,$d],$digits) => sub { my ($m) = @_; bd remove([$s,$e,$n,$d,$m],    $digits) => sub { my ($o) = @_;
bd remove([$s,$e,$n,$d,$m,$o], $digits) => sub { my ($r) = @_;
my $more = "$m$o$r$e"; bd remove([$s,$e,$n,$d,$m,$o,$r], $digits) => sub { my ($y) = @_;
my $money = "$m$o$n$e$y";
bd guard($send +$more == $money) => sub { [[$send, $more,$money]] }}}}}}}}};

for my $s (@$solutions) {
print "@$s\n"; } This runs in about 5.5 seconds on my laptop. I guess, but am not sure, that remove is mainly at fault for this poor performance. An earlier version of this article claimed, incorrectly, that the Python version had lazy semantics. It does not; it is strict. [ Addendum: Aaron Crane has done some benchmarking of the Perl version. A better implementation of remove (using an array instead of a hash) does speed up the calculation somewhat, but contrary to my guess, the largest part of the run time is bd itself, apparently becuse Perl function calls are relatively slow. HN user masklinn tried a translation of the Python code into a version that returns a lazy iterator; I gather the changes were minor. ] Wed, 13 May 2015 I did a residency at the Recurse Center last month. I made a profile page on their web site, which asked me to list some projects I was interested in working on while there. Nobody took me up on any of the projects, but I'm still interested. So if you think any of these projects sounds interesting, drop me a note and maybe we can get something together. They are listed roughly in order of their nearness to completion, with the most developed ideas first and the vaporware at the bottom. I am generally language-agnostic, except I refuse to work in C++. Or if you don't want to work with me, feel free to swipe any of these ideas yourself. Share and enjoy. Linogram Linogram is a constraint-based diagram-drawing language that I think will be better than prior languages (like pic, Metapost, or, god forbid, raw postscript or SVG) and very different from WYSIWYG drawing programs like Inkscape or Omnigraffle. I described it in detail in chapter 9 of Higher-Order Perl and it's missing only one or two important features that I can't quite figure out how to do. It also needs an SVG output module, which I think should be pretty simple. Most of the code for this already exists, in Perl. Orthogonal polygons Each angle of an orthogonal polygon is either 90° or 270°. All 4-sided orthogonal polygons are rectangles. All 6-sided orthogonal polygons are similar-looking letter Ls. There are essentially only four different kinds of 8-sided orthogonal polygons. There are 8 kinds of 10-sided orthogonal polygons: There are 29 kinds of 12-sided orthogonal polygons. I want to efficiently count the number of orthogonal polygons with N sides, and have the computer draw exemplars of each type. I have a nice method for systematically generating descriptions of all simple orthogonal polygons, and although it doesn't scale to polygons with many sides I think I have an idea to fix that, making use of group-theoretic (mathematical) techniques. (These would not be hard for anyone to learn quickly; my ten-year-old daughter picked them right up. Teaching the computer would be somewhat trickier.) For making the pictures, I only have half the ideas I need, and I haven't done the programming yet. The little code I have is written in Perl, but it would be no trouble to switch to a different language. [ Addendum 20150607: the orthogonal polygon sequence is now in OEIS! ] Simple Android app I want to learn to build Android apps for my Android phone. I think a good first project would be a utility where you put in a sequence of letters, say FBS, and it displays all the words that contain those letters in order. (For FBS the list contains "afterburners", "chlorofluorocarbons", "fables", "fabricates", …, "surfboards".) I play this game often with my kid (the letters are supplied by license plates we pass) and we want a way to cheat when we are stumped. My biggest problem with Android development in the past has been getting the immense Android SDK set up. The project would need to be done in Java, because that is what Android uses. gi Git is great, but its user interface is awful. The command set is obscure and non-orthogonal. Error messages are confusing. gi is a thinnish layer that tries to present a more intuitive and uniform command set, with better error messages and clearer advice, without removing any of git's power. There's no code written yet, and we could do it in any language. Perl or Python would be good choices. The programming is probably easy; the hard part of this project is (a) design and (b) user testing. Twingler Twingler takes an example of an input data structure and and output data structure, and writes code in your favorite language for transforming the input into the output. Or maybe it takes some sort of simplified description of what is wanted and writes the code from that. The description would be declarative, not procedural. I'm really not at all sure what it should do or how it should work, but I have a lot of notes, and if we could make it happen a lot of people would love it. No code is written; we could do this in your favorite language. Haskell maybe? Bonus: Whatever your favorite language is, I bet it needs something like this. Crapspad I want a simple library that can render simple pixel graphics and detect and respond to mouse events. I want people to be able to learn to use it in ten minutes. It should be as easy as programming graphics on an Apple II and easier than a Commodore 64. It should not be a gigantic object-oriented windowing system with widgets and all that stuff. It should be possible to whip up a simple doodling program in Crapspad in 15 minutes. I hope to get Perl bindings for this, because I want to use it from Perl programs, but we could design it to have a language-independent interface without too much trouble. Git GUI There are about 17 GUIs for Git and they all suck in exactly the same way: they essentially provide a menu for running all the same Git commands that you would run at the command line, obscuring what is going on without actually making Git any easier to use. Let's fix this. For example, why can't you click on a branch and drag it elsewhere to rebase it, or shift-drag it to create a new branch and rebase that? Why can't you drag diff hunks from one commit to another? I'm not saying this stuff would be easy, but it should be possible. Although I'm not convinced I really want to put ion the amount of effort that would be required. Maybe we could just submit new features to someone else's already-written Git GUI? Or if they don't like our features, fork their project? I have no code yet, and I don't even know what would be good to use. Fri, 24 Apr 2015 (Haskell people may want to skip this article about Haskell, because the technique is well-known in the Haskell community.) Suppose you would like to perform an exhaustive search. Let's say for concreteness that we would like to solve this cryptarithm puzzle: S E N D + M O R E ----------- M O N E Y This means that we want to map the letters S, E, N, D, M, O, R, Y to distinct digits 0 through 9 to produce a five-digit and two four-digit numerals which, when added in the indicated way, produce the indicated sum. (This is not an especially difficult example; my 10-year-old daughter Katara was able to solve it, with some assistance, in about 30 minutes.) If I were doing this in Perl, I would write up either a recursive descent search or a solution based on a stack or queue of partial solutions which the program would progressively try to expand to a full solution, as per the techniques of chapter 5 of Higher-Order Perl. In Haskell, we can use the list monad to hide all the searching machinery under the surface. First a few utility functions: import Control.Monad (guard) digits = [0..9] to_number = foldl (\a -> \b -> a*10 + b) 0 remove rs ls = foldl remove' ls rs where remove' ls x = filter (/= x) ls to_number takes a list of digits like [1,4,3] and produces the number they represent, 143. remove takes two lists and returns all the things in the second list that are not in the first list. There is probably a standard library function for this but I don't remember what it is. This version is !!O(n^2)!!, but who cares. Now the solution to the problem is: -- S E N D -- + M O R E -- --------- -- M O N E Y solutions = do s <- remove [0] digits e <- remove [s] digits n <- remove [s,e] digits d <- remove [s,e,n] digits let send = to_number [s,e,n,d] m <- remove [0,s,e,n,d] digits o <- remove [s,e,n,d,m] digits r <- remove [s,e,n,d,m,o] digits let more = to_number [m,o,r,e] y <- remove [s,e,n,d,m,o,r] digits let money = to_number [m,o,n,e,y] guard$ send + more == money
return (send, more, money)

Let's look at just the first line of this:

solutions = do
s <- remove [0] digits
…

The do notation is syntactic sugar for

(remove [0] digits) >>= \s -> …

where “…” is the rest of the block. To expand this further, we need to look at the overloading for >>= which is implemented differently for every type. The mote on the left of >>= is a list value, and the definition of >>= for lists is:

concat $map (\s -> …) (remove [0] digits) where “…” is the rest of the block. So the variable s is bound to each of 1,2,3,4,5,6,7,8,9 in turn, the rest of the block is evaluated for each of these nine possible bindings of s, and the nine returned lists of solutions are combined (by concat) into a single list. The next line is the same: e <- remove [s] digits for each of the nine possible values for s, we loop over nine value for e (this time including 0 but not including whatever we chose for s) and evaluate the rest of the block. The nine resulting lists of solutions are concatenated into a single list and returned to the previous map call. n <- remove [s,e] digits d <- remove [s,e,n] digits This is two more nested loops. let send = to_number [s,e,n,d] At this point the value of send is determined, so we compute and save it so that we don't have to repeatedly compute it each time through the following 300 loop executions. m <- remove [0,s,e,n,d] digits o <- remove [s,e,n,d,m] digits r <- remove [s,e,n,d,m,o] digits let more = to_number [m,o,r,e] Three more nested loops and another computation. y <- remove [s,e,n,d,m,o,r] digits let money = to_number [m,o,n,e,y] Yet another nested loop and a final computation. guard$ send + more == money
return (send, more, money)

This is the business end. I find guard a little tricky so let's look at it slowly. There is no binding (<-) in the first line, so these two lines are composed with >> instead of >>=:

(guard $send + more == money) >> (return (send, more, money)) which is equivalent to: (guard$ send + more == money) >>= (\_ -> return (send, more, money))

which means that the values in the list returned by guard will be discarded before the return is evaluated.

If send + more == money is true, the guard expression yields [()], a list of one useless item, and then the following >>= loops over this one useless item, discards it, and returns yields a list containing the tuple (send, more, money) instead.

But if send + more == money is false, the guard expression yields [], a list of zero useless items, and then the following >>= loops over these zero useless items, never runs return at all, and yields an empty list.

The result is that if we have found a solution at this point, a list containing it is returned, to be concatenated into the list of all solutions that is being constructed by the nested concats. But if the sum adds up wrong, an empty list is returned and concated instead.

After a few seconds, Haskell generates and tests 1.36 million choices for the eight bindings, and produces the unique solution:

[(9567,1085,10652)]

That is:

S E N D            9 5 6 7
+   M O R E        +   1 0 8 5
-----------        -----------
M O N E Y          1 0 6 5 2

It would be an interesting and pleasant exercise to try to implement the same underlying machinery in another language. I tried this in Perl once, and I found that although it worked perfectly well, between the lack of the do-notation's syntactic sugar and Perl's clumsy notation for lambda functions (sub { my ($s) = @_; … } instead of \s -> …) the result was completely unreadable and therefore unusable. However, I suspect it would be even worse in Python because of semantic limitations of that language. I would be interested to hear about this if anyone tries it. [ Addendum: Thanks to Tony Finch for pointing out the η-reduction I missed while writing this at 3 AM. ] [ Addendum: Several people so far have misunderstood the question about Python in the last paragraph. The question was not to implement an exhaustive search in Python; I had no doubt that it could be done in a simple and clean way, as it can in Perl. The question was to implement the same underlying machinery, including the list monad and its bind operator, and to find the solution using the list monad. [ Peter De Wachter has written in with a Python solution that clearly demonstrates that the problems I was worried about will not arise, at least for this task. I hope to post his solution in the next few days. ] [ Addendum 20150803: De Wachter's solution and one in Perl ] Thu, 17 Jul 2014 A few weeks ago I asked people to predict, without trying it first, what this would print: perl -le 'print(two + two == five ? "true" : "false")' (If you haven't seen this yet, I recommend that you guess, and then test your guess, before reading the rest of this article.) People familiar with Perl guess that it will print true; that is what I guessed. The reasoning is as follows: Perl is willing to treat the unquoted strings two and five as strings, as if they had been quoted, and is also happy to use the + and == operators on them, converting the strings to numbers in its usual way. If the strings had looked like "2" and "5" Perl would have treated them as 2 and 5, but as they don't look like decimal numerals, Perl interprets them as zeroes. (Perl wants to issue a warning about this, but the warning is not enabled by default. Since the two and five are treated as zeroes, the result of the == comparison are true, and the string "true" should be selected and printed. So far this is a little bit odd, but not excessively odd; it's the sort of thing you expect from programming languages, all of which more or less suck. For example, Python's behavior, although different, is about equally peculiar. Although Python does require that the strings two and five be quoted, it is happy to do its own peculiar thing with "two" + "two" == "five", which happens to be false: in Python the + operator is overloaded and has completely different behaviors on strings and numbers, so that while in Perl "2" + "2" is the number 4, in Python is it is the string 22, and "two" + "two" yields the string "twotwo". Had the program above actually printed true, as I expected it would, or even false, I would not have found it remarkable. However, this is not what the program does do. The explanation of two paragraphs earlier is totally wrong. Instead, the program prints nothing, and the reason is incredibly convoluted and bizarre. First, you must know that print has an optional first argument. (I have plans for an article about how optional first arguments are almost always a bad move, but contrary to my usual practice I will not insert it here.) In Perl, the print function can be invoked in two ways: print HANDLE$a, $b,$c, …;
print $a,$b, $c, …; The former prints out the list$a, $b,$c, … to the filehandle HANDLE; the latter uses the default handle, which typically points at the terminal. How does Perl decide which of these forms is being used? Specifically, in the second form, how does it know that $a is one of the items to be printed, rather than a variable containing the filehandle to print to? The answer to this question is further complicated by the fact that the HANDLE in the first form could be either an unquoted string, which is the name of the handle to print to, or it could be a variable containing a filehandle value. Both of these prints should do the same thing: my$handle = \*STDERR;
print STDERR $a,$b, $c; print$handle $a,$b, $c; Perl's method to decide whether a particular print uses an explicit or the default handle is a somewhat complicated heuristic. The basic rule is that the filehandle, if present, can be distinguished because its trailing comma is omitted. But if the filehandle were allowed to be the result of an arbitrary expression, it might be difficult for the parser to decide where there was a a comma; consider the hypothetical expression: print$a += EXPRESSION, $b$c, $d,$e;

Here the intention is that the $a += EXPRESSION,$b expression calculates the filehandle value (which is actually retrieved from $b, the$a += … part being executed only for its side effect) and the remaining $c,$d, $e are the values to be printed. To allow this sort of thing would be way too confusing to both Perl and to the programmer. So there is the further rule that the filehandle expression, if present, must be short, either a simple scalar variable such as$fh, or a bare unquoted string that is in the right format for a filehandle name, such as HANDLE. Then the parser need only peek ahead a token or two to see if there is an upcoming comma.

So for example, in

print STDERR $a,$b, $c; the print is immediately followed by STDERR, which could be a filehandle name, and STDERR is not followed by a comma, so STDERR is taken to be the name of the output handle. And in print$x, $a,$b, $c; the print is immediately followed by the simple scalar value$x, but this $x is followed by a comma, so is considered one of the things to be printed, and the target of the print is the default output handle. In print STDERR,$a, $b,$c;

Perl has a puzzle: STDERR looks like a filehandle, but it is followed by a comma. This is a compile-time error; Perl complains “No comma allowed after filehandle” and aborts. If you want to print the literal string STDERR, you must quote it, and if you want to print A, B, and C to the standard error handle, you must omit the first comma.

perl -le 'print(two + two == five ? "true" : "false")'

Here Perl sees the unquoted string two which could be a filehandle name, and which is not followed by a comma. So it takes the first two to be the output handle name. Then it evaluates the expression

+ two == five ? "true" : "false"

and obtains the value true. (The leading + is a unary plus operator, which is a no-op. The bare two and five are taken to be string constants, which, compared with the numeric == operator, are considered to be numerically zero, eliciting the same warning that I mentioned earlier that I had not enabled. Thus the comparison Perl actually does is is 0 == 0, which is true, and the resulting string is true.)

This value, the string true, is then printed to the filehandle named two. Had we previously opened such a filehandle, say with

open two, ">", "output-file";

then the output would have been sent to the filehandle as usual. Printing to a non-open filehandle elicits an optional warning from Perl, but as I mentioned, I have not enabled warnings, so the print silently fails, yielding a false value.

Had I enabled those optional warnings, we would have seen a plethora of them:

Unquoted string "two" may clash with future reserved word at -e line 1.
Unquoted string "two" may clash with future reserved word at -e line 1.
Unquoted string "five" may clash with future reserved word at -e line 1.
Name "main::two" used only once: possible typo at -e line 1.
Argument "five" isn't numeric in numeric eq (==) at -e line 1.
Argument "two" isn't numeric in numeric eq (==) at -e line 1.
print() on unopened filehandle two at -e line 1.

(The first four are compile-time warnings; the last three are issued at execution time.) The crucial warning is the one at the end, advising us that the output of print was directed to the filehandle two which was never opened for output.

[ Addendum 20140718: I keep thinking of the following remark of Edsger W. Dijkstra:

[This phenomenon] takes one of two different forms: one programmer places a one-line program on the desk of another and … says, "Guess what it does!" From this observation we must conclude that this language as a tool is an open invitation for clever tricks; and while exactly this may be the explanation for some of its appeal, viz., to those who like to show how clever they are, I am sorry, but I must regard this as one of the most damning things that can be said about a programming language.

But my intent is different than what Dijkstra describes. His programmer is proud, but I am disgusted. Incidentally, I believe that Dijkstra was discussing APL here. ]

[ Addendum 20150508: I never have much sympathy for the school of thought that says that you should always always enable warnings in every Perl program; I think Perl produces too many spurious warnings for that. But I also think this example is part of a cogent argument in the other direction. ]

Wed, 16 Jul 2014

Here's a Perl quiz that I confidently predict nobody will get right. Without trying it first, what does the following program print?

perl -le 'print(two + two == five ? "true" : "false")'

Sat, 01 Feb 2014

My current employer uses an online quiz to pre-screen applicants for open positions. The first question on the quiz is a triviality, just to let the candidate get familiar with the submission and testing system. The question is to write a program that copies standard input to standard output. Candidates are allowed to answer the questions using whatever language they prefer.

Sometimes we get candidates who get a zero score on the test. When I see the report that they failed to answer even the trivial question, my first thought is that this should not reflect badly on the candidate. Clearly, the testing system itself is so hard to use that the candidate was unable to submit even a trivial program, and this is a failure of the testing system and not the candidate.

But it has happened more than once that when I look at the candidate's incomplete submissions I see that the problem, at least this time, is not necessarily in the testing system. There is another possible problem that had not even occurred to me. The candidate failed the trivial question because they tried to write the answer in Java.

I am reminded of Dijkstra's remark that the teaching of BASIC should be rated as a criminal offense. Seeing the hapless candidate get bowled over by a question that should be a mere formality makes me wonder if the same might be said of Java.

I'm not sure. It's possible that this is still a failure of the quiz. It's possible that the Java programmers have valuable skills that we could use, despite their inability to produce even a trivial working program in a short amount of time. I could be persuaded, but right now I have a doubtful feeling.

When you learn Perl, Python, Ruby, or Javascript, one of the things you learn is a body of technique for solving problems using hashes, which are an integral part of the language. When you learn Haskell, you similarly learn a body of technique for solving problems with lazy lists and monads. These kinds of powerful general-purpose tools are at the forefront of the language.

But when you learn Java, there aren't any powerful language features you can use to solve many problems. Instead, you spend your time learning a body of technique for solving problems in the language. Java has hashes, but if you are aware of them at all, they are just another piece of the immense Collections library, lost among the many other sorts of collections, and you have no particular reason to know about them or think about them. A good course of Java instruction might emphasize the more useful parts of the Collections, but since they're just another part of the library it may not be obvious that hashes are any more or less useful than, say, AbstractAction or zipOutputStream.

I really like Java

I am glad to have had the experience of programming in Java. I liked programming in Java mainly because I found it very relaxing. With a bad language, like say Fortran or csh, you struggle to do anything at all, and the language fights with you every step of the way forward. With a good language there is a different kind of struggle, to take advantage of the language's strengths, to get the maximum amount of functionality, and to achieve the clearest possible expression.

Java is neither a good nor a bad language. It is a mediocre language, and there is no struggle. In Haskell or even in Perl you are always worrying about whether you are doing something in the cleanest and the best way. In Java, you can forget about doing it in the cleanest or the best way, because that is impossible. Whatever you do, however hard you try, the code will come out mediocre, verbose, redundant, and bloated, and the only thing you can do is relax and keep turning the crank until the necessary amount of code has come out of the spout. If it takes ten times as much code as it would to program in Haskell, that is all right, because the IDE will generate half of it for you, and you are still being paid to write the other half.

So you turn the crank, draw your paycheck, and you don't have to worry about the fact that it takes at least twice as long and the design is awful. You can't solve any really hard design problems, but there is a book you can use to solve some of the medium-hard ones, and solving those involves cranking out a lot more Java code, for which you will also be paid. You are a coder, your job is to write code, and you write a lot of code, so you are doing your job and everyone is happy.

You will not produce anything really brilliant, but you will probably not produce anything too terrible either. The project might fail, but if it does you can probably put the blame somewhere else. After all, you produced 576 classes that contain 10,000 lines of Java code, all of it seemingly essential, so you were doing your job. And nobody can glare at you and demand to know why you used 576 classes when you should have used 50, because in Java doing it with only 50 classes is probably impossible.

(Different languages have different failure modes. With Perl, the project might fail because you designed and implemented a pile of shit, but there is a clever workaround for any problem, so you might be able to keep it going long enough to hand it off to someone else, and then when it fails it will be their fault, not yours. With Haskell someone probably should have been fired in the first month for choosing to do it in Haskell.)

So yes, I enjoyed programming in Java, and being relieved of the responsibility for producing a quality product. It was pleasant to not have to worry about whether I was doing a good job, or whether I might be writing something hard to understand or to maintain. The code was ridiculously verbose, of course, but that was not my fault. It was all out of my hands.

So I like Java. But it is not a language I would choose for answering test questions, unless maybe the grade was proportional to the number of lines of code written. On the test, you need to finish quickly, so you need to optimize for brevity and expressiveness. Java is many things, but it is neither brief nor expressive.

When I see that some hapless job candidate struggled for 15 minutes and 14 seconds to write a Java program for copying standard input to standard output, and finally gave up, without even getting to the real questions, it makes me sad that their education, which was probably expensive, has not equipped them with with better tools or to do something other than grind out Java code.

Fri, 10 Jan 2014

The DateTime suite is an impressive tour de force, but I hate its interface. The methods it provides are usually not the ones you want, and the things it makes easy are often things that are not useful.

Mutators

The most obvious example is that it has too many mutators. I believe that date-time values are a kind of number, and should be treated like numbers. In particular they should be immutable. Rik Signes has a hair-raising story about an accidental mutation that caused a hard to diagnose bug, because the add_duration method modifies the object on which it is called, instead of returning a new object.

DateTime::Duration

But the most severe example, the one that drives me into a rage, is that the subtract_datetime method returns a DateTime::Duration object, and this object is never what you want, because it is impossible to use it usefully.

For example, suppose you would like to know how much time elapses between 1969-04-02 02:38:17 EST and 2013-12-25 21:00:00 EST. You can set up the two DateTime objects for the time, and subtract them using the overloaded minus operator:

#!perl
my ($a) = DateTime->new( year => 1969, month => 04, day => 02, hour => 2, minute => 38, second => 17, time_zone => "America/New_York" ) ; my ($b) = DateTime->new( year => 2013, month => 12, day => 25,
hour => 21, minute => 0, second => 0,
time_zone => "America/New_York" ) ;

my $diff =$b - $a; Internally this invokes subtract_datetime to yield a DateTime::Duration object for the difference. The DateTime::Duration object$diff will contain the information that this is a difference of 536 months, 23 days, 1101 minutes, and 43 seconds, a fact which seems to me to be of very limited usefulness.

You might want to know how long this interval is, so you can compare it to similar intervals. So you might want to know how many seconds this is. It happens that the two times are exactly 1,411,669,328 seconds apart, but there's no way to get the $diff object to tell you this. It seems like there are methods that will get you the actual elapsed time in seconds, but none of them will do it. For example,$diff->in_units('seconds') looks promising, but will return 43, which is the 43 seconds left over after you've thrown away the 536 months, 23 days, and 1101 minutes. I don't know what the use case for this is supposed to be.

And indeed, no method can tell you how long the duration really is, because the subtraction has thrown away all the information about how long the days and months and years were—days, months and years vary in length—so it simply doesn't know how much time this object actually represents.

Similarly if you want to know how many days there are between the two dates, the DateTime::Duration object won't tell you because it can't tell you. If you had the elapsed seconds difference, you could convert it to the correct number of days simply by dividing by 86400 and rounding off. This works because, even though days vary in length, they don't vary by much, and the variations cancel out over the course of a year. If you do this you find that the elapsed number of days is approximately 16338.7653, which rounds off to 16338 or 16339 depending on how you want to treat the 18-hour time-of-day difference. This result is not quite exact, but the error is on the order of 0.000002%. So the elapsed seconds are useful, and you can compute other useful values with them, and get useful answers. In contrast, DateTime::Duration's answer of "536 months and 23 days" is completely useless because months vary in length by nearly 10% and DateTime has thrown away the information about how long the months were. The best you can do to guess the number of days from this is to multiply the 536 months by 30.4375, which is the average number of days in a month, and add 23. This is clumsy, and gets you 16337.5 days—which is close, but wrong.

To get what I consider a useful answer out of the DateTime objects you must not use the overloaded subtraction operator; instead you must do this:

#!perl
$b->subtract_datetime_absolute($a)->in_units('seconds')

What's DateTime::Moonpig for?

DateTime::Moonpig attempts to get rid of the part of DateTime I don't like and keep the part I do like, by changing the interface and leaving the internals alone. I developed it for the Moonpig billing system that Rik Signes and I did; hence the name.

DateTime::Moonpig introduces five main changes to the interface of DateTime:

1. Most of the mutators are gone. They throw fatal exceptions if you try to call them.

2. The overridden addition and subtraction operators have been changed to eliminate DateTime::Duration entirely. Subtracting two DateTime::Moonpig objects yields the difference in seconds, as an ordinary Perl number. This means that instead of

#!perl
$x =$b->subtract_datetime_absolute($a)->in_units('seconds') one can write #!perl$x = $b -$a

From here it's easy to get the approximate number of days difference: just divide by 86400. Similarly, dividing this by 3600 gets the number of hours difference.

An integer number of seconds can be added to or subtracted from a DateTime::Moonpig object; this yields a new object representing a time that is that many seconds later or earlier. Writing $date + 2 is much more convenient than writing$date->clone->add( seconds => 2 ).

If you are not concerned with perfect exactness, you can write

#!perl
sub days { $_[0] * 86400 } my$tomorrow = $now + days(1); This might be off by an hour if there is an intervening DST change, or by a second if there is an intervening leap second, but in many cases one simply doesn't care. There is nothing wrong with the way DateTime overloads < and >, so DateTime::Moonpig leaves those alone. 3. The constructor is extended to accept an epoch time such as is returned by Perl's built-in time() or stat() functions. This means that one can abbreviate this: #!perl DateTime->from_epoch( epoch =>$epoch )

to this:

#!perl
DateTime::Moonpig->new( $epoch ) 4. The default time zone has been changed from DateTime's "floating" time zone to UTC. I think the "floating" time zone is a mistake, and best avoided. It has bad interactions with set_time_zone, which DateTime::Moonpig does not disable, because it is not actually a mutator—unless you use the "floating" time zone. An earlier blog article discusses this. 5. I added a few additional methods I found convenient. For example there is a$date->st that returns the date and time in the format YYYY-MM-DD HH:MM::SS, which is sometimes handy for quick debugging. (The st is for "string".)

Under the covers, it is all just DateTime objects, which seem to do what one needs. Other than the mutators, all the many DateTime methods work just the same; you are even free to use ->subtract_datetime to obtain a DateTime::Duration object if you enjoy being trapped in an absurdist theatre production.

When I first started this module, I thought it was likely to be a failed experiment. I expected that the Moonpig::DateTime objects would break once in a while, or that some operation on them would return a DateTime instead of a Moonpig::DateTime, which would cause some later method call to fail. But to my surprise, it worked well. It has been in regular use in Moonpig for several years.

I recently split it out of Moonpig, and released it to CPAN. I will be interested to find out if it works well in other contexts. I am worried that disabling the mutators has left a gap in functionality that needs to be filled by something else. I will be interested to hear reports from people who try.

Mon, 23 Dec 2013

(This is a companion piece to my article about DateTime::Moonpig on the Perl Advent Calendar today. One of the ways DateTime::Moonpig differs from DateTime is by defaulting to UTC time instead of to DateTime's "floating" time zone. This article explains some of the reasons why.)

Perl's DateTime module lets you create time values in a so-called "floating" time zone. What this means really isn't clear. It would be coherent for it to mean a time with an unknown or unspecified time zone, but it isn't treated that way. If it were, you wouldn't be allowed to compare "floating" times with regular times, or convert "floating" times to epoch times. If "floating" meant "unspecified time zone", the computer would have to honestly say that it didn't know what to do in such cases. But it doesn't.

Unfortunately, this confused notion is the default.

Here are two demonstrations of why I don't like "floating" time zones.

1.

The behavior of the set_time_zone method may not be what you were expecting, but it makes sense and it is useful:

my $a = DateTime->new( second => 0, minute => 0, hour => 5, day => 23, month => 12, year => 2013, time_zone => "America/New_York", ); printf "The time in New York is %s.\n",$a->hms;

$a->set_time_zone("Asia/Seoul"); printf "The time in Seoul is %s.\n",$a->hms;

Here we have a time value and we change its time zone from New York to Seoul. There are at least two reasonable ways to behave here. This could simply change the time zone, leaving everything else the same, so that the time changes from 05:00 New York time to 05:00 Seoul time. Or changing the time zone could make other changes to the object so that it represents the same absolute time as it did before: If I pick up the phone at 05:00 in New York and call my mother-in-law in Seoul, she answers the call at 19:00 in Seoul, so if I change the object's time zone from New York to Seoul, it should change from 05:00 to 19:00.

DateTime chooses the second of these: setting the time zone retains the absolute time stored by the object, so this program prints:

The time in New York is 05:00:00.
The time in Seoul is 19:00:00.

Very good. And we can get to Seoul by any route we want:

$a->set_time_zone("Europe/Berlin");$a->set_time_zone("Chile/EasterIsland");
$a->set_time_zone("Asia/Seoul"); printf "The time in Seoul is still %s.\n",$a->hms;

This prints:

The time in Seoul is still 19:00:00.

We can hop all around the globe, but the object always represents 19:00 in Seoul, and when we get back to Seoul it's still 19:00.

But now let's do the same thing with floating time zones:

my $b = DateTime->new( second => 0, minute => 0, hour => 5, day => 23, month => 12, year => 2013, time_zone => "America/New_York", ); printf "The time in New York is %s.\n",$b->hms;

$b->set_time_zone("floating");$b->set_time_zone("Asia/Seoul");
printf "The time in Seoul is %s.\n", $b->hms; Here we take a hop through the imaginary "floating" time zone. The output is now: The time in New York is 05:00:00. The time in Seoul is 05:00:00. The time has changed! I said there were at least two reasonable ways to behave, and that set_time_zone behaves in the second reasonable way. Which it does, except that conversions to the "floating" time zone behave the first reasonable way. Put together, however, they are unreasonable. 2. use DateTime; sub dec23 { my ($hour, $zone) = @_; return DateTime->new( second => 0, minute => 0, hour =>$hour,
day => 23,
month => 12,
year => 2013,
time_zone => $zone, ); } my$a = dec23(  8, "Asia/Seoul" );
my $b = dec23( 6, "America/New_York" ); my$c = dec23(  7, "floating" );

printf "A is %s B\n", $a <$b ? "less than" : "not less than";
printf "B is %s C\n", $b <$c ? "less than" : "not less than";
printf "C is %s A\n", $c <$a ? "less than" : "not less than";

With DateTime 1.04, this prints:

A is less than B
B is less than C
C is less than A

There are non-transitive relations in the world, but comparison of times is not among them. And if your relation is not transitive, you have no business binding it to the < operator.

However...

Rik Signes points out that the manual says:

If you are planning to use any objects with a real time zone, it is strongly recommended that you do not mix these with floating datetimes.

However, while a disclaimer in the manual can document incorrect behavior, it does not annul it. A bug doesn't stop being a bug just because you document it in the manual. I think it would have been possible to implement floating times sanely, but DateTime didn't do that.

[ Addendum: Rik has now brought to my attention that while the main ->new constructor defaults to the "floating" time zone, the ->now method always returns the current time in the UTC zone, which seems to me to be a mockery of the advice not to mix the two. ]

Tue, 17 Dec 2013

Moonpig: a billing system that doesn't suck
I'm in Amsterdam now, because Booking.com brought me out to tell them about Moonpig, the billing and accounting system that Rik Signes and I wrote. The talk was mostly a rehash of one I gave a Pittsburgh Perl Workshop a couple of months ago, but I think it's of general interest.

The assumption behind the talk is that nobody wants to hear about how the billing system actually works, because most people either have their own billing system already or else don't need one at all. I think I could do a good three-hour talk about the internals of Moonpig, and it would be very interesting to the right group of people, but it would be a small group. So instead I have this talk, which lasts less than an hour. The takeaway from this talk is a list of several basic design decisions that Rik and I made while building Moonpig which weren't obviously good ideas at the time, but which turned out well in hindsight. That part I think everyone can learn from. You may not ever need to write a billing system, but chances are at some point you'll consider using an ORM, and it might be useful to have a voice in your head that says “Dominus says it might be better to do something completely different instead. I wonder if this is one of those times?”

So because I think the talk was pretty good, and it's fresh in my mind right now, I'm going to try to write it down. The talk slides are here if you want to see them. The talk is mostly structured around a long list of things that suck, and how we tried to design Moonpig to eliminate, avoid, or at least mitigate these things.

Moonpig, however, does not suck.

Sometimes I see other people fuck up a project over and over, and I say “I could do that better”, and then I get a chance to try, and I discover it was a lot harder than I thought, I realize that those people who tried before are not as stupid as as I believed.

That did not happen this time. Moonpig is a really good billing system. It is not that hard to get right. Those other guys really were as stupid as I thought they were.

Brief explanation of IC Group

When I tell people I was working for IC Group, they frown; they haven't heard of it. But quite often when I say that IC Group runs pobox.com, those same people smile and say “Oh, pobox!”.

ICG is a first wave dot-com. In the late nineties, people would often have email through their employer or their school, and then they would switch jobs or graduate and their email address would go away. The basic idea of pobox was that for a small fee, something like $15 per year, you could get a pobox.com address that would forward all your mail to your real email address. Then when you changed jobs or schools you could just tell pobox to change the forwarding record, and your friends would continue to send email to the same pobox.com address as before. Later, ICG offered mail storage, web mail, and, through listbox.com, mailing list management and bulk email delivery. Moonpig was named years and years before the project to write it was started. ICG had a billing and accounting system already, a terrible one. ICG employees would sometimes talk about the hypothetical future accounting system that would solve all the problems of the current one. This accounting system was called Moonpig because it seemed clear that it would never actually be written, until pigs could fly. And in fact Moonpig wouldn't have been written, except that the existing system severely constrained the sort of pricing structures and deals that could actually be executed, and so had to go. Even then the first choice was to outsource the billing and accounting functions to some company that specialized in such things. The Moonpig project was only started as a last resort after ICG's president had tried for 18 months to find someone to take over the billing and collecting. She was unsuccessful. A billing provider would seem perfect and then turn out to have some bizarre shortcoming that rendered it unsuitable for ICG's needs. The one I remember was the one that did everything we wanted, except it would not handle checks. “Don't worry,” they said. “It's 2010. Nobody pays by check any more.” Well, as it happened, many of our customers, including some of the largest institutional ones, had not gotten this memo, and did in fact pay by check. So with some reluctance, she gave up and asked Rik and me to write a replacement billing and accounting system. As I mentioned, I had always wanted to do this. I had very clear ideas, dating back many years, about mistakes I would not make, were I ever called upon to write a billing system. For example, I have many times received a threatening notice of this sort: Your account is currently past due! Pay the outstanding balance of$      0 . 00   or we will be forced to refer your account for collection.
What I believe happened here is: some idiot programmer knows that money amounts are formatted with decimal points, so decides to denominate the money with floats. The amount I paid rounds off a little differently than the amount I actually owed, and the result after subtraction is all roundoff error, and leaves me with a nominal debt on the order of !!2^{-64}!! dollars.

So I have said to myself many times “If I'm ever asked to write a billing system, it's not going to use any fucking floats.” And at the meeting at which the CEO told me and Rik that we would write it, those were nearly the first words out of my mouth: No fucking floats.

Moonpig conceptual architecture

I will try to keep this as short as possible, including only as much as is absolutely required to understand the more interesting and generally applicable material later.

Pobox and Listbox accounts

ICG has two basic use cases. One is Pobox addresses and mailboxes, where the customer pays us a certain amount of money to forward (or store) their mail for a certain amount of time, typically a year. The other is Listbox mailing lists, where the customer pays us a certain amount to attempt a certain number of bulk email deliveries on their behalf.

The basic model is simple…

The life cycle for a typical service looks like this: The customer pays us some money: a flat fee for a Pobox account, or a larger or smaller pile for Listbox bulk mailing services, depending on how much mail they need us to send. We deliver service for a while. At some point the funds in the customer's account start to run low. That's when we send them an invoice for an extension of the service. If they pay, we go back and continue to provide service and the process repeats; if not, we stop providing the service.

…just like all basic models

But on top of this basic model there are about 10,019 special cases:

• Customers might cancel their service early.

• Pobox has a long-standing deal where you get a sixth year free if you pay for five years of service up front.

• Sometimes a customer with only email forwarding ($20 per year) wants to upgrade their account to one that does storage and provides webmail access ($50 per year), or vice-versa, in the middle of a year. What to do in this case? Business rules dictate that they can apply their current balance to the new service, and it should be properly pro-rated. So if I have 64 days of $50-per-year service remaining, and I downgrade to the$20-per-year service, I now have 160 days of service left.

Well, that wasn't too bad, except that we should let the customer know the new expiration date. And also, if their service will now expire sooner than it would have, we should give them a chance to pay to extend the service back to the old date, and deal properly with their payment or nonpayment.

Also something has to be done about any 6th free year that I might have had. We don't want someone to sign up for 5 years of $50-per-year service, get the sixth year free, then downgrade their account and either get a full free year of$50-per-year service or get a full free year of $20-per-year service after only !!\frac{20}{50}!! of five full years. • Sometimes customers do get refunds. • Sometimes we screw up and give people a credit for free service, as an apology. Unlike regular credits, these are not refundable! • Some customers get gratis accounts. The other cofounder of ICG used to hand these out at parties. • There are a number of cases for coupons and discounts. For example, if you refer a friend who signs up, you get some sort of credit. Non-profit institutions get some sort of discount off the regular rates. Customers who pay for many accounts get some sort of bulk discount. I forget the details. • Most customers get their service cut off if they don't pay. Certain large and longstanding customers should not be treated so peremptorily, and are allowed to run a deficit. • And so to infinity and beyond. Ledgers and Consumers The Moonpig data store is mostly organized as a huge pile of ledgers. Each represents a single customer or account. It contains some contact information, a record of all the transactions associated with that customer, a history of all the invoices ever sent to that customer, and so forth. A ledger also contains some consumer objects. Each consumer represents some service that we have promised to perform in exchange for money. The consumer has methods in it that you can call to say “I just performed a certain amount of service; please charge accordingly”. It has methods for calculating how much money has been allotted to it, how much it has left, how fast it is consuming its funds, how long it expects to last, and when it expects to run out of money. And it has methods for constructing its own replacement and for handing over control to that replacement when necessary. Heartbeats Every day, a cron job sends a heartbeat event to each ledger. The ledger doesn't do anything with the heartbeat itself; its job is to propagate the event to all of its sub-components. Most of those, in turn, ignore the heartbeat event entirely. But consumers do handle heartbeats. The consumer will wake up and calculate how much longer it expects to live. (For Pobox consumers, this is simple arithmetic; for mailing-list consumers, it guesses based on how much mail has been sent recently.) If it notices that it is going to run out of money soon, it creates a successor that can take over when it is gone. The successor immediately sends the customer an invoice: “Hey, your service is running out, do you want to renew?” Eventually the consumer does run out of money. At that time it hands over responsibility to its replacement. If it has no replacement, it will expire, and the last thing it does before it expires is terminate the service. Things that suck: manual repairs Somewhere is a machine that runs a daily cron job to heartbeat each ledger. What if one day, that machine is down, as they sometimes are, and the cron job never runs? Or what if the machine crashes while the cron job is running, and the cron job only has time to heartbeat 3,672 of the 10,981 ledgers in the system? In a perfect world, every component would be able to depend on exactly one heartbeat arriving every day. We don't live in that world. So it was an ironclad rule in Moonpig development that anything that handles heartbeat events must be prepared to deal with missing heartbeats, duplicate heartbeats, or anything else that could screw up. When a consumer gets a heartbeat, it must not cheerfully say "Oh, it's the dawn of a new day! I'll charge for a day's worth of service!". It must look at the current date and at its own charge record and decide on that basis whether it's time to charge for a day's worth of service. Now the answers to those questions of a few paragraphs earlier are quite simple. What if the machine is down and the cron job never runs? What to do? A perfectly acceptable response here is: Do nothing. The job will run the next day, and at that time everything will be up to date. Some customers whose service should have been terminated today will have it terminated tomorrow instead; they will have received a free day of service. This is an acceptable loss. Some customers who should have received invoices today will receive them tomorrow. The invoices, although generated and sent a day late, will nevertheless show the right dates and amounts. This is also an acceptable outcome. What if the cron job crashes after heartbeating 3,672 of 10,981 ledgers? Again, an acceptable response is to do nothing. The next day's heartbeat will bring the remaining 7,309 ledgers up to date, after which everything will be as it should. And an even better response is available: simply rerun the job. 3,672 of the ledgers will receive the same event twice, and will ignore it the second time. Contrast this with the world in which heartbeats were (mistakenly) assumed to be reliable. In this world, the programming staff must determine precisely which ledgers received the event before the crash, either by trawling through the log files or by grovelling over the ledger data. Then someone has to hack up a program to send the heartbeats to just the 7,309 ledgers that still need it. And there is a stiff deadline: they have to get it done before tomorrow's heartbeat issues! Making everything robust in the face of heartbeat failure is a little more work up front, but that cost is recouped the first time something goes wrong with the heartbeat process, when instead of panicking you smile and open another beer. Let N be the number of failures and manual repairs that are required before someone has had enough and makes the heartbeat handling code robust. I hypothesize that you can tell a lot about an organization from the value of N. Here's an example of the sort of code that is required. The non-robust version of the code would look something like this: sub charge { my ($self, $event) = @_;$self->charge_one_day();
}
The code, implemented by a role called Moonpig::Role::Consumer::ChargesPeriodically, actually looks something like this:

has last_charge_date => ( … );

sub charge {
my ($self,$event) = @_;

my $now = Moonpig->env->now; CHARGE: until ($self->next_charge_date->follows($now)) { my$next = $self->next_charge_date;$self->charge_one_day();
$self->last_charge_date($next);
if ($self->is_expired) {$self->replacement->handle_event($event) if$self->replacement;
last CHARGE;
}
}
}
The last_charge_date member records the last time the consumer actually issued a charge. The next_charge_date method consults this value and returns the next day on which the consumer should issue a charge—not necessarily the following day, since the consumer might issue weekly or monthly charges. The consumer will issue charge after charge until the next_charge_date is the future, when it will stop. It runs the until loop, using charge_one_day to issue another charge each time through, and updating last_charge_date each time, until the next_charge_date is in the future.

The one tricky part here the if block. This is because the consumer might run out of money before the loop completes. In that case it passes the heartbeat event on to its successor (replacement) and quits the loop. The replacement will run its own loop for the remaining period.

Things that suck: real-time testing

A customer pays us $20. This will cover their service for 365 days. The business rules say that they should receive their first invoice 30 days before the current service expires; that is, after 335 days. How are we going to test that the invoice is in fact sent precisely 335 days later? Well, put like that, the answer is obvious: Your testing system must somehow mock the time. But obvious as this is, I have seen many many tests that made some method call and then did sleep 60, waiting and hoping that the event they were looking for would have occurred by then, reporting a false positive if the system was slow, and making everyone that much less likely to actually run the tests. I've also seen a lot of tests that crossed their fingers and hoped that a certain block of code would execute between two ticks of the clock, and that failed nondeterministically when that didn't happen. So another ironclad law of Moonpig design was that no object is ever allowed to call the time() function to find out what time it actually is. Instead, to get the current time, the object must call Moonpig->env->now. The tests run in a test environment. In the test environment, Moonpig->env returns a Moonpig::Env::Test object, which contains a fake clock. It has a stop_clock method that stops the clock, and an elapse_time method that forces the clock forward a certain amount. If you need to check that something happens after 40 days, you can call Moonpig->env->elapse_time(86_400 * 40), or, more likely: for (1..40) { Moonpig->env->elapse_time(86_400);$test_ledger->heartbeat;
}
In the production environment, the environment object still has a now method, but one that returns the true current time from the system clock. Trying to stop the clock in the production environment is a fatal error.

Similarly, no Moonpig object ever interacts directly with the database; instead it must always go through the mediator returned by Moonpig->env->storage. In tests, this can be a fake storage object or whatever is needed. It's shocking how many tests I've seen that begin by allocating a new MySQL instance and executing a huge pile of DDL. Folks, this is not how you write a test.

Again, no Moonpig object ever posts email. It asks Moonpig->env->email_sender to post the email on its behalf. In tests, this uses the CPAN Email::Sender::Transport suite, and the test code can interrogate the email_sender to see exactly what emails would have been sent.

We never did anything that required filesystem access, but if we had, there would have been a Moonpig->env->fs for opening and writing files.

The Moonpig->env object makes this easy to get right, and hard to screw up. Any code that acts on the outside world becomes a red flag: Why isn't this going through the environment object? How are we going to test it?

Things that suck: floating-point numbers

I've already complained about how I loathe floating-point numbers. I just want to add that although there are probably use cases for floating-point arithmetic, I don't actually know what they are. I've had a pretty long and varied programming career so far, and legitimate uses for floating point numbers seem very few. They are really complicated, and fraught with traps; I say this as a mathematical expert with a much stronger mathematical background than most programmers.

The law we adopted for Moonpig was that all money amounts are integers. Each money amount is an integral number of “millicents”, abbreviated “m¢”, worth !!\frac1{1000}!! of a cent, which in turn is !!\frac1{100}!! of a U.S. dollar. Fractional millicents are not allowed. Division must be rounded to the appropriate number of millicents, usually in the customer's favor, although in practice it doesn't matter much, because the amounts are so small.

For example, a $20-per-year Pobox account actually bills $$\\left\lfloor\frac{20,00,000}{365}\right\rfloor = 5479$$ m¢ each day. (5464 in leap years.) Since you don't want to clutter up the test code with a bunch of numbers like 1000000 ($10), there are two utterly trivial utility subroutines:

sub cents   { $_[0] * 1000 } sub dollars {$_[0] * 1000 * 100 }
Now $10 can be written dollars(10). Had we dealt with floating-point numbers, it would have been tempting to write test code that looked like this: cmp_ok(abs($actual_amount - $expected_amount), "<",$EPSILON, …);
That's because with floats, it's so hard to be sure that you won't end up with a leftover !!2^{-64}!! or something, so you write all the tests to ignore small discrepancies. This can lead to overlooking certain real errors that happen to result in small discrepancies. With integer amounts, these discrepancies have nowhere to hide. It sometimes happened that we would write some test and the money amount at the end would be wrong by 2m¢. Had we been using floats, we might have shrugged and attributed this to incomprehensible roundoff error. But with integers, that is a difference of 2, and you cannot shrug it off. There is no incomprehensible roundoff error. All the calculations are exact, and if some integer is off by 2 it is for a reason. These tiny discrepancies usually pointed to serious design or implementation errors. (In contrast, when a test would show a gigantic discrepancy of a million or more m¢, the bug was always quite easy to find and fix.)

There are still roundoff errors; they are unavoidable. For example, a consumer for a $20-per-year Pobox account bills only 365·5479m¢ = 1999835m¢ per year, an error in the customer's favor of 165m¢ per account; after 12,121 years the customer will have accumulated enough error to pay for an extra year of service. For a business of ICG's size, this loss was deemed acceptable. For a larger business, it could be significant. (Imagine 6,000,000 customers times 165m¢ each; that's$9,900.) In such a case I would keep the same approach but denominate everything in micro-cents instead.

Happily, Moonpig did not have to deal with multiple currencies. That would have added tremendous complexity to the financial calculations, and I am not confident that Rik and I could have gotten it right in the time available.

Things that suck: dates and times

Dates and times are terribly complicated, partly because the astronomical motions they model are complicated, and mostly because the world's bureaucrats keep putting their fingers in. It's been suggested recently that you can identify whether someone is a programmer by asking if they have an opinion on time zones. A programmer will get very red in the face and pound their fist on the table.

After I wrote that sentence, I then wrote 1,056 words about the right way to think about date and time calculations, which I'll spare you, for now. I'm going to try to keep this from turning into an article about all the ways people screw up date and time calculations, by skipping the arguments and just stating the main points:

1. Date-time values are a kind of number, and should be considered as such. In particular:
1. Date-time values inside a program should be immutable
2. There should be a single canonical representation of date-time values in the program, and it should be chosen for ease of calculation.
2. If the program does have to deal with date-time values in some other representation, it should convert them to the canonical representation as soon as possible, or from the canonical representation as late as possible, and in any event should avoid letting non-canonical values percolate around the program.
The canonical representation we chose was DateTime objects in UTC time. Requiring that the program deal only with UTC eliminates many stupid questions about time zones and DST corrections, and simplifies all the rest as much as they can be simplified. It also avoids DateTime's unnecessarily convoluted handling of time zones.

We held our noses when we chose to use DateTime. It has my grudging approval, with a large side helping of qualifications. The internal parts of it are okay, but the methods it provides are almost never what you actually want to use. For example, it provides a set of mutators. But, as per item 1 above, date-time values are numbers and ought to be immutable. Rik has a good story about a horrible bug that was caused when he accidentally called the ->subtract method on some widely-shared DateTime value and so mutated it, causing an unexpected change in the behavior of widely-separated parts of the program that consulted it afterward.

So instead of using raw DateTime, we wrapped it in a derived class called Moonpig::DateTime. This removed the mutators and also made a couple of other convenient changes that I will shortly describe.

Things that really really suck: DateTime::Duration

If you have a pair of DateTime objects and you want to know how much time separates the two instants that they represent, you have several choices, most of which will return a DateTime::Duration object. All those choices are wrong, because DateTime::Duration objects are useless. They are a kind of Roach Motel for date and time information: Data checks into them, but doesn't check out. I am not going to discuss that here, because if I did it would take over the article, but I will show the simple example I showed in the talk:

my $then = DateTime->new( month => 4, day => 2, year => 1969, hour => 0, minute => 0, second => 0); my$now = DateTime->now();
my $elapsed =$now - $then; print$elapsed->in_units('seconds'), "\n";
You might think, from looking at this code, that it might print the number of seconds that elapsed between 1969-04-02 00:00:00 (in some unspecified time zone!) and the current moment. You would be mistaken; you have failed to reckon with the $elapsed object, which is a DateTime::Duration. Computing this object seems reasonable, but as far as I know once you have it there is nothing to do but throw it away and start over, because there is no way to extract from it the elapsed amount of time, or indeed anything else of value. In any event, the print here does not print the correct number of seconds. Instead it prints ME CAGO EN LA LECHE, which I have discovered is Spanish for “I shit in the milk”. So much for DateTime::Duration. When a and b are Moonpig::DateTime objects, a-b returns the number of seconds that have elapsed between the two times; it is that simple. You can divide it by 86,400 to get the number of days. Other arithmetic is similarly overloaded: If i is a number, then a+i and a-i are the times obtained by adding or subtracting i seconds to a, respectively. (C programmers should note the analogy with pointer arithmetic; C's pointers, and date-time values—also temperatures—are examples of a mathematical structure called an affine space, and study of the theory of affine spaces tells you just what rules these objects should obey. I hope to discuss this at length another time.) Going along with this arithmetic are a family of trivial convenience functions, such as: sub hours {$_[0] * 3600  }
sub days  { $_[0] * 86400 } so that you can use$a + days(7) to find the time 7 days after $a. Programmers at the Amsterdam talk were worried about this: what about leap seconds? And they are correct: the name days is not quite honest, because it promises, but does not deliver, exactly 7 days. It can't, because the definition of the day varies widely from place to place and time to time, and not only can't you know how long 7 days unless you know where it is, but it doesn't even make sense to ask. That is all right. You just have to be aware, when you add days(7), the resulting time might not be the same time of day 7 days later. (Indeed, if the local date and time laws are sufficiently bizarre, it could in principle be completely wrong. But since Moonpig::DateTime objects are always reckoned in UTC, it is never more than one second wrong.) Anyway, I was afraid that Moonpig::DateTime would turn out to be a leaky abstraction, producing pleasantly easy and correct results thirty times out of thirty-one, and annoyingly wrong or bizarre results the other time. But I was surprised: it never caused a problem, or at least none has come to light. I am working on releasing this module to CPAN, under the name DateTime::Moonpig. [ Addendum: DateTime::Moonpig is now available on CPAN. ] Things that suck: mutable data I left this out of the talk, by mistake, but this is a good place to mention it: mutable data is often a bad idea. In the billing system we wanted to avoid it for accountability reasons: We never wanted the customer service agent to be in the position of being unable to explain to the customer why we thought they owed us$28.39 instead of the $28.37 they claimed they owed; we never wanted ourselves to be in the position of trying to track down a billing system bug only to find that the trail had been erased. One of the maxims Rik and I repeated freqently was that the moving finger writes, and, having writ, moves on. Moonpig is full of methods with names like is_expired, is_superseded, is_canceled, is_closed, is_obsolete, is_abandoned and so forth, representing entities that have been replaced by other entities but which are retained as part of the historical record. For example, a consumer has a successor, to which it will hand off responsibility when its own funds are exhausted; if the customer changes their mind about their future service, this successor might be replaced with a different one, or replaced with none. This doesn't delete or destroy the old successor. Instead it marks the old successor as "superseded", simultaneously recording the supersession time, and pushes the new successor (or undef, if none) onto the end of the target consumer's replacement_history array. When you ask for the current successor, you are getting the final element of this array. This pattern appeared in several places. In a particularly simple example, a ledger was required to contain a Contact object with contact information for the customer to which it pertained. But the Contact wasn't simply this: has contact => ( is => 'rw', isa => role_type( 'Moonpig::Role::Contact' ), required => 1, ); Instead, it was an array; "replacing" the contact actually pushed the new contact onto the end of the array, from which the contact accessor returned the final element: has contact_history => ( is => 'ro', isa => ArrayRef[ role_type( 'Moonpig::Role::Contact' ) ], required => 1, traits => [ 'Array' ], handles => { contact => [ get => -1 ], replace_contact => 'push', }, ); Similarly, what happens if we send the customer an invoice for three services, and they inform customer service that they want to continue two of the services but cancel the third? We need to throw away the old invoice, which will never be paid, and issue a new one. The old invoice remains in the system, marked "abandoned", with a pointer to the new invoice. Things that suck: relational databases Why do we use relational databases, anyway? Is it because they cleanly and clearly model the data we want to store? No, it's because they are lightning fast. When your data truly is relational, a nice flat rectangle of records, each with all the same fields, RDBs are terrific. But Moonpig doesn't have much relational data. It basic datum is the Ledger, which has a bunch of disparate subcomponents, principally a heterogeneous collection of Consumer objects. And I would guess that most programs don't deal in relational data; Like Moonpig, they deal in some sort of object network. Nevertheless we try to represent this data relationally, because we have a relational database, and when you have a hammer, you go around hammering everything with it, whether or not that thing needs hammering. When the object model is mature and locked down, modeling the objects relationally can be made to work. But when the object model is evolving, it is a disaster. Your relational database schema changes every time the object model changes, and then you have to find some way to migrate the existing data forward from the old schema. Or worse, and more likely, you become reluctant to let the object model evolve, because reflecting that evolution in the RDB is so painful. The RDB becomes a ball and chain locked to your program's ankle, preventing it from going where it needs to go. Every change is difficult and painful, so you avoid change. This is the opposite of the way to design a good program. A program should be light and airy, its object model like a string of pearls. In theory the mapping between the RDB and the objects is transparent, and is taken care of seamlessly by an ORM layer. That would be an awesome world to live in, but we don't live in it and we may never. Things that really really suck: ORM software Right now the principal value of ORM software seems to be if your program is too fast and you need it to be slower; the ORM is really good at that. Since speed was the only benefit the RDB was providing in the first place, you have just attached two large, complex, inflexible systems to your program and gotten nothing in return. Watching the ORM try to model the objects is somewhere between hilariously pathetic and crushingly miserable. Perl's DBIx::Class, to the extent it succeeds, succeeds because it doesn't even try to model the objects in the database. Instead it presents you with objects that represent database rows. This isn't because a row needs to be modeled as an object—database rows have no interesting behavior to speak of—but because the object is an access point for methods that generate SQL. DBIx::Class is not for modeling objects, but for generating SQL. I only realized this recently, and angrily shouted it at the DBIx::Class experts, expecting my denunciation to be met with rage and denial. But they just smiled with amusement. “Yes,” said the DBIx::Class experts on more than one occasion, “that is exactly correct.” Well then. So Rik and I believe that for most (or maybe all) projects, trying to store the objects in an RDB, with an ORM layer mediating between the program and the RDB, is a bad, bad move. We determined to do something else. We eventually brewed our own object store, and this is the part of the project of which I'm least proud, not because the object store itself was a bad idea, but because I believe we probably made every possible mistake that could be made, even the ones that everyone writing an object store should already know not to make. For example, the object store has a method, retrieve_ledger, which takes a ledger's ID number, reads the saved ledger data from the disk, and returns a live Ledger object. But it must make sure that every such call returns not just a Ledger object with the right data, but the same object. Otherwise two parts of the program will have different objects to represent the same data, one part will modify its object, and the other part, looking at a different object, will not see the change it should see. It took us a while to figure out problems like this; we really did not know what we were doing. What we should have done, instead of building our own object store, was use someone else's object store. KiokuDB is frequently mentioned in this context. After I first gave this talk people asked “But why didn't you use KiokuDB?” or, on hearing what we did do, said “That sounds a lot like KiokuDB”. I had to get Rik to remind me why we didn't use KiokuDB. We had considered it, and decided to do our own not for technical but for political reasons. The CEO, having made the unpleasant decision to have me and Rik write a new billing system, wanted to see some progress. If she had asked us after the first week what we had accomplished, and we had said “Well, we spent a week figuring out KiokuDB,” her head might have exploded. Instead, we were able to say “We got the object store about three-quarters finished”. In the long run it was probably more expensive to do it ourselves, and the result was certainly not as good. But in the short run it kept the customer happy, and that is the most important thing; I say this entirely in earnest, without either sarcasm or bitterness. (On the other hand, when I ran this article by Rik, he pointed out that KiokuDB had later become essentially unmaintained, and that had we used it he would have had to become the principal maintainer of a large, complex system which which he did not help design or implement. The Moonpig object store may be technically inferior, but Rik was with it from the beginning and understands it thoroughly.) Our object store All that said, here is how our object store worked. The bottom layer was an ordinary relational database with a single table. During the test phase this database was SQLite, and in production it was IC Group's pre-existing MySQL instance. The table had two fields: a GUID (globally-unique identifier) on one side, and on the other side a copy of the corresponding Ledger object, serialized with Perl's Storable module. To retrieve a ledger, you look it up in the table by GUID. To retrieve a list of all the ledgers, you just query the GUID field. That covers the two main use-cases, which are customer service looking up a customer's account history, and running the daily heartbeat job. A subsidiary table mapped IC Group's customer account numbers to ledger GUIDs, so that the storage engine could look up a particular customer's ledger starting from their account number. (Account numbers are actually associated with Consumers. Once you have the right ledger a simple method call to the ledger will retrieve the consumer object, but finding the right ledger requires a table.) There were a couple of other tables of that sort, but overall it was a small thing. There are some fine points to consider. For example, you can choose whether to store just the object data, or the code as well. The choice is clear: you must store only the data, not the code. Otherwise, you would have to update all the stored objects every time you made a code change such as a bug fix. It should be clear that this would have discouraged bug fixes, and that had we gone this way the project would have ended as a pile of smoking rubble. Since the code is not stored in the database, the object store must be responsible, whenever it loads an object, for making sure that the correct class for that object actually exists. The solution for this was that along with every object is stored a list of all the roles that it must perform. At object load time, if the object's class doesn't exist yet, the object store retrieves this list of roles (stored in a third column, parallel to the object data) and uses the MooseX::ClassCompositor module to create a new class that does those roles. MooseX::ClassCompositor was something Rik wrote for the purpose, but it seems generally useful for such applications. Every once in a while you may make an upward-incompatible change to the object format. Renaming an object field is such a change, since the field must be renamed in all existing objects, but adding a new field isn't, unless the field is mandatory. When this happened—much less often than you might expect—we wrote a little job to update all the stored objects. This occurred only seven times over the life of the project; the update programs are all very short. We did also make some changes to the way the objects themselves were stored: Booking.Com's Sereal module was released while the project was going on, and we switched to use it in place of Storable. Also one customer's Ledger object grew too big to store in the database field, which could have been a serious problem, but we were able to defer dealing with the problem by using gzip to compress the serialized data before storing it. The relational database provides transactions The use of the RDB engine for the underlying storage got us MySQL's implementation of transactions and atomicity guarantees, which we trusted. This gave us a firm foundation on which to build the higher functions; without those guarantees you have nothing, and it is impossible to build a reliable system. But since they are there, we could build a higher-level transactional system on top of them. For example, we used an opportunistic locking scheme to prevent race conditions while updating a single ledger. For performance reasons you typically don't want to force all updates to be done through a single process (although it can be made to work; see Rochkind's Advanced Unix Programming). In an optimistic locking scheme, you store a version number with each record. Suppose you are the low-level storage manager and you get a request to update a ledger with a certain ID. Instead of doing this: update ledger set serialized_data = … where ledger_id = 789 You do this: update ledger set serialized_data = … , version = 4 where ledger_id = 789 and version = 3 and you check the return value from the SQL to see how many records were actually updated. The answer must be 0 or 1. If it is 1, all is well and you report the successful update back to your caller. But if it is 0, that means that some other process got there first and updated the same ledger, changing its version number from the 3 you were expecting to something bigger. Your changes are now in limbo; they were applied to a version of the object that is no longer current, so you throw an exception. But is the exception safe? What if the caller had previously made changes to the database that should have been rolled back when the ledger failed to save? No problem! We had exposed the RDB transactions to the caller, so when the caller requested that a transaction be begun, we propagated that request into the RDB layer. When the exception aborted the caller's transaction, all the previous work we had done on its behalf was aborted back to the start of the RDB transaction, just as one wanted. The caller even had the option to catch the exception without allowing it to abort the RDB transaction, and to retry the failed operation. Drawbacks of the object store The major drawback of the object store was that it was very difficult to aggregate data across ledgers: to do it you have to thaw each ledger, one at a time, and traverse its object structure looking for the data you want to aggregate. We planned that when this became important, we could have a method on the Ledger or its sub-objects which, when called, would store relevant numeric data into the right place in a conventional RDB table, where it would then be available for the usual SELECT and GROUP BY operations. The storage engine would call this whenever it wrote a modified Ledger back to the object store. The RDB tables would then be a read-only view of the parts of the data that were needed for building reports. A related problem is some kinds of data really are relational and to store them in object form is extremely inefficient. The RDB has a terrible impedance mismatch for most kinds of object-oriented programming, but not for all kinds. The main example that comes to mind is that every ledger contains a transaction log of every transaction it has ever performed: when a consumer deducts its 5479 m¢, that's a transaction, and every day each consumer adds one to the ledger. The transaction log for a large ledger with many consumers can grow rapidly. We planned from the first that this transaction data would someday move out of the ledger entirely into a single table in the RDB, access to which would be mediated by a separate object, called an Accountant. At present, the Accountant is there, but it stores the transaction data inside itself instead of in an external table. The design of the object store was greatly simplified by the fact that all the data was divided into disjoint ledgers, and that only ledgers could be stored or retrieved. A minor limitation of this design was that there was no way for an object to contain a pointer to a Ledger object, either its own or some other one. Such a pointer would have spoiled Perl's lousy garbage collection, so we weren't going to do it anyway. In practice, the few places in the code that needed to refer to another ledger just store the ledger's GUID instead and looked it up when it was needed. In fact every significant object was given its own GUID, which was then used as needed. This was Rik's strategy, and it was a good one. I was surprised to find how often it was useful to have a simple, reliable identifier for every object, and how much time I had formerly spent on programming problems that would have been trivially solved if objects had had GUIDs. The object store was a success In all, I think the object store technique worked well and was a smart choice that went strongly against prevailing practice. I would recommend the technique for similar projects, except for the part where we wrote the object store ourselves instead of using one that had been written already. Had we tried to use an ORM backed by a relational database, I think the project would have taken at least a third longer; had we tried to use an RDB without any ORM, I think we would not have finished at all. Things that suck: multiple inheritance After I had been using Moose for a couple of years, including the Moonpig project, Rik asked me what I thought of it. I was lukewarm. It introduces a lot of convenience for common operations, but also hides a lot of complexity under the hood, and the complexity does not always stay well-hidden. It is very big and very slow to start up. On the whole, I said, I could take it or leave it. “Oh,” I added. “Except for Roles. Roles are awesome.” I had a long section in the talk about what is good about Roles, but I moved it out to a separate talk, so I am going to take that as a hint about what I should do here. As with my theory of dates and times, I will present only the thesis, and save the arguments for another post: 1. Object-oriented programming is centered around objects, which are encapsulated groups of related data, and around methods, which are opaque functions for operating on particular kinds of objects. 2. OOP does not mandate any particular theory of inheritance, either single or multiple, class-based or prototype based, etc., and indeed, while all OOP systems have objects and methods that are pretty much the same, each has an inheritance system all its own. 3. Over the past 30 years of OOP, many theories of inheritance have been tried, and all of them have had serious problems. 4. If there were no alternative to inheritance, we would have to struggle on with inheritance. However, Roles are a good alternative to inheritance: • Every problem solved by inheritance is solved at least as well by Roles. • Many problems not solved at all by inheritance are solved by Roles. • Many problems introduced by inheritance do not arise when using Roles. • Roles introduce some of their own problems, but none of them are as bad as the problems introduced by inheritance. 5. It's time to give up on inheritance. It was worth a try; we tried it as hard as we could for thirty years or more. It didn't work. 6. I'm going to repeat that: Inheritance doesn't work. It's time to give up on it. Moonpig doesn't use any inheritance (except that Moonpig::DateTime inherits from DateTime, which we didn't control). Every class in Moonpig is composed from Roles. This wasn't because it was our policy to avoid inheritance. It's because Roles did everything we needed, usually in simple and straightforward ways. I plan to write more extensively on this later on. This section is the end of the things I want to excoriate. Note the transition from multiple inheritance, which was a tremendous waste of everyone's time, to Roles, which in my opinion are a tremendous success, the Right Thing, and gosh if only Smalltalk-80 had gotten this right in the first place look how much trouble we all would have saved. Things that are GOOD: web RPC APIs Moonpig has a web API. Moonpig applications, such as the customer service dashboard, or the heartbeat job, invoke Moonpig functions through the API. The API is built using a system, developed in parallel with Moonpig, called Stick. (It was so-called because IC Group had tried before to develop a simple web API system, but none had been good enough to stick. This one, we hoped, would stick.) The basic principle of Stick is distributed routing, which allows an object to have a URI, and to delegate control of the URIs underneath it to other objects. To participate in the web API, an object must compose the Stick::Role::Routable role, which requires that it provide a _subroute method. The method is called with an array containing the path components of a URI. The _subroute method examines the array, or at least the first few elements, and decides whether it will handle the route. To refuse, it can throw an exception, or just return an undefined value, which will turn into a 404 error in the web protocol. If it does handle the path, it removes the part it handled from the array, and returns another object that will handle the rest, or, if there is nothing left, a public resource of some sort. In the former case the routing process continues, with the remaining route components passed to the _subroute method of the next object. If the route is used up, the last object in the chain is checked to make sure it composes the Stick::Role::PublicResource role. This is to prevent accidentally exposing an object in the web API when it should be private. Stick then invokes one final method on the public resource, either resource_get, resource_post, or similar. Stick collects the return value from this method, serializes it and sends it over the network as the response. So for example, suppose a ledger wants to provide access to its consumers. It might implement _subroute like this: sub _subroute { my ($self, $route) = @_; if ($route->[0] eq "consumer") {
shift @$route; my$consumer_id = shift @$route; return$self->find_consumer( id => $consumer_id ); } else { return; # 404 } } Then if /path/to/ledger is any URI that leads to a certain ledger, /path/to/ledger/consumer/12435 will be a valid URI for the specified ledger's consumer with ID 12345. A request to /path/to/ledger/FOOP/de/DOOP will yield a 404 error, as will a request to /path/to/ledger/consumer/98765 whenever find_consumer(id => 98765) returns undefined. A common pattern is to have a path that invokes a method on the target object. For example, suppose the ledger objects are already addressable at certain URIs, and one would like to expose in the API the ability to tell a ledger to handle a heartbeat event. In Stick, this is incredibly easy to implement: publish heartbeat => { -http_method => 'post' } => sub { my ($self) = @_;
$self->handle_event( event('heartbeat') ); }; This creates an ordinary method, called heartbeat, which can be called in the usual way, but which is also invoked whenever an HTTP POST request arrives at the appropriate URI, the appropriate URI being anything of the form /path/to/ledger/heartbeat. The default case for publish is that the method is expected to be GET; in this case one can omit mentioning it: publish amount_due => sub { my ($self) = @_;
…
return abs($due -$avail);
};
More complicated published methods may receive arguments; Stick takes care of deserializing them, and checking that their types are correct, before invoking the published method. This is the ledger's method for updating its contact information:

publish _replace_contact => {
-path        => 'contact',
-http_method => 'put',
attributes   => HashRef,
} => sub {
my ($self,$arg) = @_;
my $contact = class('Contact')->new($arg->{attributes});
$self->replace_contact($contact);

return $contact; }; Although the method is named _replace_contact, is is available in the web API via a PUT request to /path/to/ledger/contact, rather than one to /path/to/ledger/_replace_contact. If the contact information supplied in the HTTP request data is accepted by class('Contact')->new, the ledger's contact is updated. (class('Contact') is a utility method that returns the name of the class that represents a contact. This is probably just the string Moonpig::Class::Contact.) In some cases the ledger has an entire family of sub-objects. For example, a ledger may have many consumers. In this case it's also equipped with a "collection" object that manages the consumers. The ledger can use the collection object as a convenient way to look up its consumers when it needs them, but the collection object also provides routing: If the ledger gets a request for a route that begins /consumers, it strips off /consumers and returns its consumer collection object, which handles further paths such as /guid/XXXX and /xid/1234 by locating and returning the appropriate consumer. The collection object is a repository for all sorts of convenient behavior. For example, if one composes the Stick::Role::Collection::Mutable role onto it, it gains support for POST requests to …/consumers/add, handled appropriately. Adding a new API method to any object is trivial, just a matter of adding a new published method. Unpublished methods are not accessible through the web API. After I wrote this talk I wished I had written a talk about Stick instead. I'm still hoping to write one and present it at YAPC in Orlando this summer. Things that are GOOD: Object-oriented testing Unit tests often have a lot of repeated code, to set up test instances or run the same set of checks under several different conditions. Rik's Test::Routine makes a test program into a class. The class is instantiated, and the tests are methods that are run on the test object instance. Test methods can invoke one another. The test object's attributes are available to the test methods, so they're a good place to put test data. The object's initializer can set up the required test data. Tests can easily load and run other tests, all in the usual ways. If you like OO-style programming, you'll like all the same things about building tests with Test::Routine. Things that are GOOD: Free software All this stuff is available for free under open licenses: (This has been a really long article. Thanks for sticking with me. Headers in the article all have named anchors, in case you want to refer someone to a particular section.) (I suppose there is a fair chance that this will wind up on Hacker News, and I know how much the kids at Hacker News love to dress up and play CEO and Scary Corporate Lawyer, and will enjoy posting dire tut-tuttings about whether my disclosure of ICG's secrets is actionable, and how reluctant they would be to hire anyone who tells such stories about his previous employers. So I may as well spoil their fun by mentioning that I received the approval of ICG's CEO before I posted this.) [ Addendum: A detailed description of DateTime::Moonpig is now available. ] [ Addendum 20140208: Jesper Andersen has written an account of a surprisingly similar system that he wrote in Erlang. ] Tue, 17 Sep 2013 Overlapping intervals Our database stores, among other things, "budgets", which have a lifetime with a start and end time. A business rule is that no two budgets may be in force at the same time. I wanted to build a method which, given a proposed start and end time for a new budget, decided whether there was already a budget in force during any part of the proposed period. The method signature is: sub find_overlapping_budgets { my ($self, $start,$end) = @_;
...
}
and I want to search the contents of $self->budgets for any budgets that overlap the time interval from$start to $end. Budgets have a start_date and an end_date property. My first thought was that for each existing budget, it's enough to check to see if its start_date or its end_date lies in the interval of interest, so I wrote it like this: sub find_overlapping_budgets { my ($self, $start,$end) = @_;

return $self->budgets->search({ [ { start_date => { ">=" ,$start },
start_date => { "<=" , $end }, }, { end_date => { ">=" ,$start },
end_date => { "<=" , $end }, }, ] }); } People ridicule Lisp for having too many parentheses, and code like this, a two-line function which ends with },},]});}, should demonstrate that that is nothing but xenophobia. I'm not gonna explain the ridiculous proliferation of braces and brackets here, except to say that this is expressing the following condition: $$\begin{array}{} ( start_A \le & start_B & & \wedge & \\ & start_B & \le end_A & & ) \vee \\ ( start_A \le & end_B & & \wedge & \\ & end_B & \le end_A & & ) \\ \end{array}$$ which we can abbreviate as: $$start_A \le start_B \le end_A \vee \\ start_A \le end_B \le end_A \\$$ And if this condition holds, then the intervals overlap. Anyway, this seemed reasonable at the time, but is totally wrong, and happily, the automated tests I wrote for the method caught the error. Say that we ask whether we can create a budget that runs from June 1 to June 10. Say there is a budget that already exists, running from June 6 to June 7. Then the query asks : $$\text{June 5} \le \text{June 1} \le \text{June 6} \vee \\ \text{June 5} \le \text{June 10} \le \text{June 6} \\$$ Both of the disjuncts are false, so the method reports that there is no overlap. My implementation was just completely wrong. it's not enough to check to see if either endpoint of the proposed interval lies within an existing interval; you also have to check to see if any of the endpoints of the existing intervals lie within the proposed interval. (Alert readers will have noticed that although the condition "Intervals A and B overlap" is symmetric in A and B, the condition as I wrote it is not symmetric, and this should raise your suspicions.) This was yet another time when I felt slightly foolish as I wrote the automated tests, assuming that the time and effort I spent on testing this trivial function would would be time and effort thrown away on nothing—and then they detected a real fault. Someday perhaps I'll stop feeling foolish writing tests for functions like this one; until then, many cases just like this one will help me remember that I must write the tests even though I feel foolish doing it. Okay, how to get this right? I tried a bunch of things, mostly involving writing out a conjunction of every required condition and then using boolean algebra to simplify the resulting expression: $$start_A \le start_B \le end_A \vee \\ start_A \le end_B \le end_A \vee \\ start_B \le start_A \le end_B \vee \\ start_B \le end_A \le end_B \\$$ This didn't work well, partly because I was doing it at two in the morning, partly because there are many conditions, all very similar, and I kept getting them mixed up, and partly because, for implementation reasons, the final expression must be a query on interval A, even though it is most naturally expressed symmetrically between the two intervals. But then I had a happy idea: For some reason it seemed much simpler to express the opposite condition, that the two intervals do not conflict. If they don't conflict, then interval A must be entirely to the left of interval B, so that $$end_A \lt start_B,$$ or vice-versa, so that $$end_B\lt start_A.$$ Then the intervals do not overlap if either of these is true: $$end_A \lt start_B \vee end_B \lt start_A$$ and the condition that we want, that the two intervals do overlap, is simply its negation: $$end_A \ge start_B \wedge end_B \ge start_A$$ This is correct, or at least all the tests now pass, and it is even simpler than the incorrect condition I wrote in the first place. The code looks like this: sub find_overlapping_budgets { my ($self, $start,$end) = @_;

return $self->budgets->search({ end_date => { '>=',$start },
start_date =>   { '<=', $end }, }); } Usually I like to draw some larger lesson from this sort of thing. What comes to mind now (other than “Just write the tests, fool!”) is this: The end result is quite clever. Often I see the final version of the code and say "Oh, I wonder why I didn't see that right off?" Not this time. I want to say I couldn't have found it by myself, except that I did find it by myself, not by just pulling it magically out of my head, but by applying technique. Instead of "not by magically pulling it out of my head" I was about to write "not by just thinking", but that is not quite right. I did solve it by "just thinking", but it was a different sort of thinking. Sometimes I consider a problem, and a solution leaps to mind, as it did in this case, except that it was wrong. That is what I call "just thinking". But applying carefully-learned and practiced technique is also thinking. The techniques I applied in this problem included: noticing and analyzing symmetries of the original problem, and application of laws of boolean algebra, both in the unsuccessful and the successful attempt. Higher-level strategies included trying more than one approach, and working backwards. Learning and correctly applying technique made me effectively a better thinker, not just in general, but in this particular case. [ Addendum 20130917: Dfan Schmidt remarks: "I'm astonished you didn't know the interval-overlap trick already." I was a little surprised, also, when I tried to pull the answer out of my head and didn't find one there already, either from having read it somewhere before, or from having solved the problem before. ] Sun, 16 Dec 2012 How I got four errors into a one-line program At my current job, each task is assigned a ticket number of the form e12345. The git history is extremely convoluted, and it's been observed that it's easier to find things if you include the ticket number at the front of the commit message. I got tired of inserting it manually, and thought I would write a prepare-commit-message hook to insert it automatically. A prepare-commit-message hook is a program that you stick in the file .git/hooks/prepare-commit-hook. When you run git-commit, git first writes the commit message to a file, then invokes the prepare-commit-message program on file; the program can modify the contents of the message, or abort the commit if it wants to. Then git runs the editor on the message, if it was going to do that, and creates the commit with the edited message. The hook I wrote was basically a one-liner, and the reason I am posting this note is because I found three significant programming errors in it in the first day of use. Here's the first cut: case$2 in
message)
perl -i -lpe "s/^(e\d+:\s+)?/$(cs -): /"$1
;;
esac
This is a shell script, but the main purpose is to run the perl one-liner. The shell script gets two arguments: $1 is the path to the file that contains the proposed commit message. The$2 argument is a tag which describes the commit's context; it's merge if the commit is a merge commit, for example; it's template if the commit message is supplied from a template via -t on the command line or the commit.template configuration option. The default is the empty string, and message, which I have here, means that the message was supplied with the -m command-line option.

The Perl script edits the commit message file, named in $1, in-place, looking for something like e12345: at the beginning of a line, and replacing it with the output of the cs - command, which is a little program I wrote to print the current ticket number. (cs is run by the shell, and its output is inserted into the Perl script before perl is run, so that the program that Perl sees is something like s/^(e\d+:\s+)?/e12345: /.) Simple enough. There is already an error here, although it's a design error, not an implementation error: the Perl one-liner is only invoked when$2 is message. For some reason I decided that I would want it only when I supplied git-commit with the -m message option. This belief lasted exactly until the first time I ran git-commit in default mode it popped up the editor to edit the commit message, and I had to insert the ticket number manually.

So the first change was to let the hook run in the default case as well as the message case:

case $2 in ""|message) perl -i -lpe "s/^(e\d+:\s+)?/$(cs -): /" $1 ;; esac This was wrong because it inserts the ticket number at the start of each line; I wanted it only at the start of the first line. So that was programming error number 1: case$2 in
""|message)
perl -i -lpe "$. == 1 && s/^(e\d+:\s+)?/$(cs -): /" $1 ;; esac So far, so good. Bug #2 appeared the first time I tried a rebase. The cs command infers the ticket number from the name of the current branch. If it fails, it issues a warning and emits the string eXXXXX instead. During a rebase, the head is detached and there is no current branch. So the four commits I rebased all had their formerly-correct ticket numbers replaced with the string eXXXXX. There are several ways to fix this. The best way would be to make sure that the current ticket number was stashed somewhere that cs could always get it. Instead, I changed the Perl script to recognize when the commit message already began with a ticket number, and to leave it alone if so: case$2 in
""|message)
perl -i -lpe "\$. == 1 && !/^e\d+:\s+/ && s/^/$(cs -): /" $1 ;; esac It probably would have been a good idea to leave an escape hatch, and have cs emit the value of$ENV{TICKET_NUMBER} if that is set, to allow invocations like TICKET_NUMBER=e71828 git commit -m …, but I didn't do it, yet.

The third bug appeared when I did git commit --fixup for the first time. With --fixup you tell it which commit you are trying to fix up, and it writes the commit message in a special form that tells a subsequent git-rebase --interactive that this new commit should be handled specially. (It should be applied immediately after that other one, and should be marked as a "fixup", which means that it is squashed into the other one and that its log message is discarded in favor of the other one.) If you are fixing up a commit whose message was Frobulate the veeblefetzers, the fixup commit's message is automatically generated as fixup! Frobulate the veeblefetzers. Or it would have been, if you were not using my prepare-commit-message hook, which would rewrite it to e12345: fixup! Frobulate the veeblefetzers. This is not in the right form, so it's not recognized by git-rebase --interactive for special handling.

So the hook became:

case $2 in ""|message) perl -i -lpe "\$. == 1 && !/^(squash|fixup)! / && !/^e\d+:\s+/ && s/^/$(cs -): /"$1
;;
esac
(The exception for squash is similar to the one for fixup. I never use squash, but it seemed foolish not to put it in while I was thinking of it.)

This is starting to look a little gross, but in a program this small I can tolerate a little grossness.

I thought it was remarkable that such a small program broke in so many different ways. Much of that is because it must interact with git, which is very large and complicated, and partly it is that it must interact with git, which is in many places not very well designed. The first bug, where the ticket number was appended to each line instead of just the first, is not git's fault. It was fallout from my initial bad design decision to apply the script only to messages supplied with -m, which are typically one-liners, so that's what I was thinking of when I wrote the Perl script.

But the other two errors would have been avoided had the interface to the hook been more uniform. There seems to be no reason that rebasing (or cherry-picking) and git-commit --fixup contexts couldn't have been communicated to the hook via the same $2 argument that communicates other contexts. Had this been done in a more uniform way, my program would have worked more correctly. But it wasn't done, and it's probably too late to change it now, since such a change risks breaking many existing prepare-commit-message hooks. (“The enemy of software is software.”) A well-written hook will of course have a catchall: case$2 in
""|message)
perl -i -lpe "\$. == 1 && !/^(squash|fixup)! / && !/^e\d+:\s+/ && s/^/$(cs -): /" $1 ;; merge|template|squash|commit) # do nothing ;; *) # wat echo "prepare-message-hook: unknown context '$2'" 1>&2
exit 1;
;;

esac
But mine doesn't and I bet a lot of others don't either.

Sun, 26 Aug 2012

Rewriting published history in Git
My earlier article about my habits using Git attracted some comment, most of which was favorable. But one recurring comment was puzzlement about my seeming willingness to rewrite published history. In practice, this was not at all a problem, I think for three reasons:

1. Rewriting published history is not nearly as confusing as people seem to think it will be.
2. I worked in a very small shop with very talented developers, so the necessary communication was easy.
3. Our repository setup and workflow were very well-designed and unusually effective, and made a lot of things easier, including this one.

If there are N developers, there are N+1 repositories.

There is a master repository to which only a few very responsible persons can push. It is understood that history in this repository should almost never be rewritten, only in the most exceptional circumstances. We usually call this master repository gitbox. It has only a couple of branches, typically master and deployed. You had better not push incomplete work to master, because if you do someone is likely to deploy it. When you deploy a new version from master, you advance deployed up to master to match.

In addition, each developer has their own semi-public repository, named after them, which everyone can read, but which nobody but them can write. Mine is mjd, and that's what we call it when discussing it, but my personal git configuration calls it origin. When I git push origin master I am pushing to this semi-public repo.

It is understood that this semi-public repository is my sandbox and I am free to rewrite whatever history I want in it. People building atop my branches in this repo, therefore, know that they should be prepared for me to rewrite the history they see there, or to contact me if they want me to desist for some reason.

When I get the changes in my own semi-public repository the way I want them, then I push the changes up to gitbox. Nothing is considered truly "published" until it is on the master repo.

When a junior programmer is ready to deploy to the master repository, they can't do it themselves, because they only have read access on the master. Instead, they publish to their own semi-private repository, and then notify a senior programmer to review the changes. The senior programmer will then push those changes to the master repository and deploy them.

The semi-public mjd repo has lots of benefits. I can rewrite my branches 53 times a day (and I do!) but nobody will care. Conversely, I don't need to know or care how much my co-workers vacillate.

If I do work from three or four different machines, I can use the mjd repo to exchange commits between them. At the end of the day I will push my work-in-progress up to the mjd repo, and then if I want to look at it later that evening, I can fetch the work-in-progress to my laptop or another home computer.

I can create and abandon many topic branches without cluttering up the master repository's history. If I want to send a change or a new test file to a co-worker, I can push it to mjd and then point them at the branch there.

A related note: There is a lot of FUD around the rewriting of published history. For example, the "gitinfo" robot on the #git IRC channel has a canned message:

Rewriting public history is a very bad idea. Anyone else who may have pulled the old history will have to git pull --rebase and even worse things if they have tagged or branched, so you must publish your humiliation so they know what to do. You will need to git push -f to force the push. The server may not allow this. See receive.denyNonFastForwards (git-config)

I think this grossly exaggerates the problems. Very bad! Humiliation! The server may deny you! But dealing with a rebased upstream branch is not very hard. It is at worst annoying: you have to rebase your subsequent work onto the rewritten branch and move any refs that pointed to that branch. If you don't have any subsequent work, you might still have to move refs, if you have any that point to it, but you might not have any.

[ Thanks to Rik Signes for helping me put this together. ]

Thu, 15 Mar 2012

My Git Habits
Miles Gould asked his Twitter followers whether they used git-add -p or git-commit -a and how often. My reply was too long for Twitter, so here it is.

First the short version: I use git-add -p frequently, and git-commit -a almost never. The exception is when I'm working on the repo that holds my blog, where I rarely commit changes to more than one or two files at a time. Then I'll usually just git-commit -a -m ....

But I use git-add -p all the time. Typically what will happen is that I will be developing some fairly complicated feature. It will necessitate a bunch of changes and reshuffling elsewhere in the system. I'll make commits on the topic branch as I go along without worrying too much about whether the commits are neatly packaged.

Often I'll be in the middle of something, with a dirty work tree, when it's time to leave for the day. Then I'll just commit everything with the subject WIP ("work-in-progress"). First thing the next morning I'll git-reset HEAD^ and continue where I left off.

So the model is that the current head is usually a terrible mess, accumulating changes as it moves forward in time. When I'm done, I will merge the topic into master and run the tests.

If they pass, I am not finished. The merge I just created is only a draft merge. The topic branch is often full of all sorts of garbage, commits where I tried one approach, found it didn't work later on, and then tried a different approach, places where I committed debugging code, and so on. So it is now time to clean up the topic branch. Only the cleaned-up topic branch gets published.

Cleaning up messy topic branches

The core of the cleanup procedure is to reset the head back to the last place that look good, possibly all the way back to the merge-base if that is not too long ago. This brings all the topic changes into the working directory. Then:

1. Compose the commits: Repeat until the working tree is clean:
1. Eyeball the output of git-diff
2. Think of an idea for an intelligible commit
3. Use git-add -p to stage the planned commit
4. Use git diff --cached to make sure it makes sense
5. Commit it
2. Order the commits: Use git-rebase --interactive
Notice that this separates the work of composing the commits from the work of ordering them. This is more important than it might appear. It would be extremely difficult to try to do these at the same time. I can't know the sensible order for the commits until I know what the commits are! But it's very hard to know what the commits are without actually making them.

By separating these tasks, I can proceed something like this: I eyeball the diff, and the first thing I see is something about the penguin feature. I can immediately say "Great, I'll make up a commit of all the stuff related to the penguin feature", and proceed to the git-add -p step without worrying that there might be other stuff that should precede the penguin feature in the commit sequence. I can focus on just getting the penguin commit right without needing to think about any of the other changes.

When the time comes to put the commits in order, I can do it well because by then I have abstracted away all the details, and reduced each group of changes to a single atomic unit with a one-line description.

For the most complicated cases, I will print out the diffs, read them over, and mark them up in six colors of highlighter: code to throw away gets marked in orange; code that I suspect is erroneous is pink. I make many notes in pen to remind me how I want to divide up the changes into commits. When a commit occurs to me I'll jot a numbered commit message, and then mark all the related parts of the diff with that number. Once I have the commits planned, I'll reset the topic ref and then run through the procedure above, using git-add -p repeatedly to construct the commits I planned on paper. Since I know ahead of time what they are I might do them in the right order, but more likely I'll just do them in the order I thought of them and then reorder them at the end, as usual.

For simple cases I'll just do a series of git-rebase --interactive passes, pausing at any leftover WIP commits to run the loop above, reordering the commits to squash related commits together, and so on.

The very simplest cases of all require no cleanup, of course.

For example, here's my current topic branch, called c-domain, with the oldest commits at the top:

055a2f7 correction to bulk consumer template
d9630bd DomainActivator half of Pobox Domain consumer
e170e77 start templates for Pobox domain consumers
067ca81 stubbed Domain::ThumbTwiddler
685a3ee cost calculations for DomainActivator
ec8b1cc test fixes; trivial domain test passes now
845b1f2 rename InvoiceCharge::CreateDomain to ..::RegisterDomain
(e)     6083a97 add durations to Domain consumers and charges
c64fda0 tests for Domain::Activator consumer
41e4292 repeat activator tests for 1-year and 3-year durations
7d68065 tests for activator's replacement
(d)     87f3b09 move days_in_year to Moonpig::Util
3cd9f3b WIP
e5063d4 add test for sent invoice in domain.t
c8dbf41 WIP
fc13059 bring in Net::OpenSRS module
(c)     52c18fb OpenSRS interface
893f16f notes about why domain queries might fail
(b)     f64361f rename "croak" method to "fail" to avoid conflicts
4e500ec Domain::Activator initial_invoice_charge_pairs
(a)     3c5cdd4 WIP
3c5cdd4 (a) was the end-of-day state for yesterday; I made it and pushed it just before I dashed out the door to go home. Such commits rarely survive beyond the following morning, but if I didn't make them, I wouldn't be able to continue work from home if the mood took me to do that.

f64361f (b) is a prime candidate for later squashing. 5c218fb (c) introduced a module with a "croak" method. This turned out to be a stupid idea, because this conflicted with the croak function from Perl's Carp module, which we use everywhere. I needed to rename it. By then, the intervening commit already existed. I probably should have squashed these right away, but I didn't think of it at the time. No problem! Git means never having to say "If only I'd realized sooner."

Similarly, 6083a97 (e) added a days_in_year function that I later decided at 87f3b09 (d) should be in a utility module in a different repository. 87f3b09 will eventually be squashed into 6083a97 so that days_in_year never appears in this code at all.

I don't know what is in the WIP commits c8dbf41 or 3cd9f3b, for which I didn't invent commit messages. I don't know why those are left in the tree, but I can figure it out later.

An example cleanup

Now I'm going to clean up this branch. First I git-checkout -b cleanup c-domain so that if something goes awry I can start over completely fresh by doing git-reset --hard c-domain. That's probably superfluous in this case because origin/c-domain is also pointing to the same place, and origin is my private repo, but hey, branches are cheap.

The first order of business is to get rid of those WIP commits. I'll git-reset HEAD^ to bring 3c5cdd4 into the working directory, then use git-status to see how many changes there are:

M lib/Pobox/Moonpig/Consumer/Domain/Activator.pm
M lib/Pobox/Moonpig/Role/HasDomain.pm
M lib/Pobox/Moonpig/TemplateSet.pm
?? bin/register_domains
M t/consumer/domain.t
?? t/lib/MockOpenSRS.pm
(This is the output from git-status --short, for which I have an alias, git s. I use this probably 99 times as often as plain git-status.)

Not too bad, probably no need for a printout. The new bin/register-domains program can go in right away by itself:

% git commit -m 'new register_domains utility program'
Next I'll deal with that new mock object class in t/lib/MockOpenSRS.pm. I'll add that, then use git-add -p to add the related changes from the other files:

...
% git s
MM lib/Pobox/Moonpig/Consumer/Domain/Activator.pm
M lib/Pobox/Moonpig/Role/HasDomain.pm
M lib/Pobox/Moonpig/TemplateSet.pm
A  t/lib/MockOpenSRS.pm
MM t/consumer/domain.t
% git ix
...
The git ix command at the end there is an alias for git diff --cached: it displays what's staged in the index. The output looks good, so I'll commit it:

% git commit -m 'mock OpenSRS object; add tests'
Now I want to see if those tests actually pass. Maybe I forgot something!
% git stash
% make test
...
OK
% git stash pop
The git-stash command hides the unrelated changes from the test suite so that I can see if the tests I just put into t/consumer/domain.t work properly. They do, so I bring back the stashed changes and continue. If they didn't, I'd probably amend the last commit with git commit --amend and try again.

Continuing:

% git diff
...
...
% git commit -m 'Domains do not have explicit start dates'
% git diff
...
...
% git commit --fixup :/mock
That last bit should have been part of the "mock OpenSRS object" commit, but I forgot it. So I make a fixup commit, which I'll merge into the main commit later on. A fixup commit is one whose subject begins with fixup!. Did you know that you can name a commit by writing :/text, and it names the most recent commit whose message contains that text?

It goes on like that for a while:

% git diff
...
...
% git commit -m 'Activator consumer can generate special charges'
% git diff
...
% git checkout lib/Pobox/Moonpig/Role/HasDomain.pm
The only uncommitted change left in HasDomain.pm was a superfluous line, so I just threw it away.

% git diff
...
% git commit -m 'separate templates for domain-registering and domain-renewing consumers'
By this time all the remaining changes belong in the same commit, so I use git-add -u to add them all at once. The working tree is now clean. The history is as I showed above, except that in place of the final WIP commit, I have:

a3c0b92 new register_domains utility program
53d704d mock OpenSRS object; add tests
a24acd8 Domains do not have explicit start dates
17a915d fixup! mock OpenSRS object; add tests
86e472b Activator consumer can generate special charges
5b2ad2b separate templates for domain-registering and domain-renewing consumers
(Again the oldest commit is first.) Now I'll get rid of that fixup!:

% git rebase -i --autosquash HEAD~6
Because of --autosquash, the git-rebase menu is reordered so that the fixup commit is put just after the commit it fixes up, and its default action is 'fixup' instead of 'pick'. So I don't need to edit the rebase instructions at all. But I might as well take the opportunity to put the commits in the right order. The result is:

a3c0b92 new register_domains utility program
ea8dacd Domains do not have explicit start dates
297366a separate templates for domain-registering and domain-renewing consumers
4ef0e28 mock OpenSRS object; add tests
c3ab1eb Activator consumer can generate special charges
I have two tools for dealing with cleaned-up branches like this one. One is git-vee, which compares two branches. It's just a wrapper around the command git log --decorate --cherry-mark --oneline --graph --boundary A"..."B.

Here's a comparison the original c-domain branch and my new cleanup version:

% git vee c-domain
* c3ab1eb (HEAD, cleanup) Activator consumer can generate special charges
* 4ef0e28 mock OpenSRS object; add tests
* 297366a separate templates for domain-registering and domain-renewing consumer
* ea8dacd Domains do not have explicit start dates
* a3c0b92 new register_domains utility program
| * 3c5cdd4 (origin/c-domain, c-domain) WIP
|/
o 4e500ec Domain::Activator initial_invoice_charge_pairs
This clearly shows where the original and cleaned up branches diverge, and what the differences are. I also use git-vee to compare pre- and post-rebase versions of branches (with git-vee ORIG_HEAD) and local branches with their remote tracking branches after fetching (with git-vee remote or just plain git-vee).

A cleaned-up branch should usually have the same final tree as the tree at the end of the original branch. I have another tool, git-treehash, which compares trees. By default it compares HEAD with ORIG_HEAD, so after I use git-rebase to squash or to split commits, I sometimes run "git treehash" to make sure that the tree hasn't changed. In this example, I do:

f0fd6ea0de7dbe60520e2a69fbec210260370d78 c-domain
which tells me that they are not the same. Most often this happens because I threw away all the debugging code that I put in earlier, but this time it was because of that line of superfluous code I eliminated from HasDomain.pm. When the treehashes differ, I'll use git-diff to make sure that the difference is innocuous:

% git diff c-domain
diff --git a/lib/Pobox/Moonpig/Role/HasDomain.pm b/lib/Pobox/Moonpig/Role/HasDomain.pm
index 3d8bb8c..21cb752 100644
--- a/lib/Pobox/Moonpig/Role/HasDomain.pm
+++ b/lib/Pobox/Moonpig/Role/HasDomain.pm
@@ -5,7 +5,6 @@ use Carp qw(croak confess);
use ICG::Handy qw(is_domain);
use Moonpig::Types qw(Factory Time);
use Moose::Util::TypeConstraints qw(duck_type enum subtype);
-use MooseX::SetOnce;

with (
'Moonpig::Role::StubBuild',

Okay then.

The next task is probably to deal with the older WIP commits. This time I'll omit all the details. But the enclosing procedure looks like this:

% git checkout -b wip-cleanup c8dbf41
% ... (a lot of git-add -p as above) ...
...

% git vee c8dbf41
* 4c6ff45 (wip-cleanup) get rid of unused twiddler test
* b328de5 test full payment cycle
* 201a4f2 abstract out pay_invoice operation
* 55ae45e add upper limit (default 30d) to wait_until utility
| * c8dbf41 WIP
|/
o e5063d4 add test for sent invoice in domain.t

7f52ba68923e2ede8fda407ffa9c06c5c48338ae
% git checkout cleanup
% git rebase wip-cleanup
The output of git-treehash says that the tree at the end of the wip-cleanup branch is identical to the one in the WIP commit it is supposed to replace, so it's perfectly safe to rebase the rest of the cleanup branch onto it, replacing the one WIP commit with the four new commits in wip-cleanup. Now the cleaned up branch looks like this:

% git vee c-domain
* a425aa1 (HEAD, cleanup) Activator consumer can generate special charges
* 2bb0932 mock OpenSRS object; add tests
* a77bfcb separate templates for domain-registering and domain-renewing consumer
* 4c44db2 Domains do not have explicit start dates
* fab500f new register_domains utility program
= 38018b6 Domain::Activator initial_invoice_charge_pairs
= aebbae6 rename "croak" method to "fail" to avoid conflicts
= 45a224d notes about why domain queries might fail
= 80e4a90 OpenSRS interface
= 27f4562 bring in Net::OpenSRS module
= f5cb624 add missing MakesReplacement stuff
* 4c6ff45 (wip-cleanup) get rid of unused twiddler test
* b328de5 test full payment cycle
* 201a4f2 abstract out pay_invoice operation
* 55ae45e add upper limit (default 30d) to wait_until utility
| * 3c5cdd4 (origin/c-domain, c-domain) WIP
| = 4e500ec Domain::Activator initial_invoice_charge_pairs
| = f64361f rename "croak" method to "fail" to avoid conflicts
| = 893f16f notes about why domain queries might fail
| = 52c18fb OpenSRS interface
| = fc13059 bring in Net::OpenSRS module
| = 9e6ffa4 add missing MakesReplacement stuff
| * c8dbf41 WIP
|/
o e5063d4 add test for sent invoice in domain.t
git-vee marks a commit with an equal sign instead of a star if it's equivalent to a commit in the other branch. The commits in the middle marked with equals signs are the ones that weren't changed. The upper WIP was replaced with five commits, and the lower one with four.

I've been planning for a long time to write a tool to help me with breaking up WIP commits like this, and with branch cleanup in general: It will write each changed hunk into a file, and then let me separate the hunk files into several subdirectories, each of which represents one commit, and then it will create the commits automatically from the directory contents. This is still only partly finished, but I think when it's done it will eliminate the six-color diff printouts.

[ Addendum 20120404: Further observation has revealed that I almost never use git-commit -a, even when it would be quicker to do so. Instead, I almost always use git-add -u and then git-commit the resulting index. This is just an observation, and not a claim that my practice is either better or worse than using git-commit -a. ]

[ Addendum 20120825: There is now a followup article about how to manage rewriting of published history. ]

Sun, 04 Mar 2012

Why can't Git resolve all conflicted merges?
I like to be prepared ahead of time for questions, and one such question is why Git can't resolve all merge conflicts automatically. People do show up on IRC asking this from time to time. If you're a sophisticated user the answer is obvious, but I've made a pretty good living teaching classes to people who don't find such things obvious.

What we need is a nice example. In the past my example was sort of silly. You have a file that contains the instruction:

Pay potato tax every April 15
Pay potato tax every April 15
(Except in years of potato blight.)
While another branch broadens the original instruction:

Pay all tax due every April 15
What's the correct resolution here? It's easy to understand that mashing together the two changes is a recipe for potential catastrophe:

Pay all tax due every April 15
(Except in years of potato blight.)
You get fined for tax evasion after the next potato blight. And it's similarly easy to construct scenarios in which the correct resolution is to leave the whole thing in place including the modifier, change the thing to something else completely, delete the whole thing, or to refer the matter to Legal and shut down the whole system until you hear back. Clearly it's outside Git's scope to recognize when to call in the lawyers, much less to predict what their answer will be.

But a few months ago I ran into a somewhat less silly example. At work we had two seprate projects, "Moonpig" and "Stick", each in its own repository. Moonpig contained a subsystem, "Collections", which we decided would make more sense as part of Stick. I did this work, removing the Collections code from the Moonpig project and integrating it into the Stick project. From the point of view of the Moonpig repository, the Collections system was deleted entirely.

Meanwhile, on a parallel branch of Moonpig, R.J.B. Signes made some changes that included bug fixes to the Collections. After I removed the collections, he tried to merge his changes into the master branch, and got a merge conflict, because some of the files to which he was making bug fixes were no longer there.

The correct resolution was to perform the rest of the merge without the bug fixes, which Git could conceivably have done. But then the unapplied bug fixes needed to be applied to the Collections module that was now in the completely separate Stick project, and there is no way Git could have done this, or even to have known it should be done. Human intervention was the only answer.

Wed, 15 Feb 2012

Insane calculations in bash
A few weeks ago I wrote an article about various methods of arithmetic calculation in shell scripts and in bash in particular, but it was all leading up to today's article, which I think is more interesting technically.

A while back, Zach Holman (who I hadn't heard of before, but who is apparently a bigwig at GitHub) implemented a kind of cute little hack, called "spark". It's a little shell utility, spark, which gets a list of numbers as its input and uses Unicode block characters to print a little bar graph of the numbers on the output. For example, the invocation:

spark 2,4,6,8
will print out something like:

▃▄▆▇
To do this in one of the 'P' languages (Perl, Python, PHP, Puby, or maybe Pickle) takes something like four lines of code. But M. Holman decided to implement it in bash for maximum portability, so it took 72 lines, not counting comments, whitespace, etc.

Let's begin by discussing the (very simple) mathematics that underlies drawing bar graphs. Suppose you want to generate a set of bars for the numbers $1,$9, $20. And suppose you can actually generate bars of integer heights only, say integers from 0–7: 0 1 ▁ 2 ▂ 3 ▃ 4 ▄ 5 ▅ 6 ▆ 7 ▇ (M. Holman 's original program did this, even though a height-8 bar █ is available. But the mathematics is the same either way.) Absolute scaling The first step is to scale the input numbers onto the range of the bars. To do this, we find a scale factor f that maps dollars onto bar heights, say that f bar units =$1.

A reasonable thing to try is to say that since your largest number is $20, we will set 7 bar units =$20. Then 0.35 bar units = $1, and 3.45 bar units =$9. We'll call these the "natural heights" for the bars.

Unfortunately we can't render the bars at their natural heights; we can only render them at integer heights, so we have to round off. 0.35 bar units rounds off to 0, so we will represent $1 as no bar at all. 3.45 bar units rounds off, badly, to 3, but that's the way it goes; if you try to squeeze the numbers from 1 to 20 into the range 0 to 7, something has to give. Anyway, this gives (1,9,20) → ( ▃▇) The formula is: Let max be the largest input number (here, 20) and let n be the size of the largest possible bar (here, 7). Then an input number x becomes a bar of size n·x / max: $$x\rightarrow {n\cdot x \over max }$$ Note that this maps max itself to n, and 0 to 0. I'll call this method "absolute scaling", because big numbers turn into big bars. (It fails for negative numbers, but we'll assume that the numbers are non-negative.) (0…20) → ( ▁▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▆▇▇) There are a couple of variations we might want to apply. First, maybe we don't like that$1 mapped to no bar at all; it's too hard to see, depending on the context. Perhaps we would like to guarantee that only 0 maps to 0. One way to ensure that is to round everything up, instead of rounding to the nearest integer:

(0…20) → ( ▁▁▂▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇▇)
(1,9,20)      → (▁▄▇)
Another benefit of always rounding up is that it uses the bars equally. Suppose we're mapping numbers in the range 1–100 to bars of heights 1–7. If we round off to the nearest integer, each bar represents 14 or 15 different numbers, except that the tallest bar only represents the 8 numbers 93–100. This is a typical situation. If we always round up, each bar corresponds to a nearly equal range of numbers. (Another way to adjust this is to replace n with n+½ in the formula.)

Now consider the numbers $18,$19, $20. Under the absolute scaling method, we get: (18,19,20) → (▆▇▇) or, if you're rounding up, (18,19,20) → (▇▇▇) which obscures the difference between the numbers. There's only an 11% difference between the tallest and shortest bar, and that doesn't show up at this resolution. Depending on your application, this might be what you want, but we might also want to avail ourselves of the old trick of adjusting the baseline. Instead of the bottom of the bar being 0, we can say it represents 17. This effectively reduces every bar by 17 before scaling it, so that the number x is now represented by a bar with natural height n·(x−17) / (max−17). Then we get these bars: (18,19,20) → (▃▅▇) Whether this "relative scaling" is a better representation than ▇▇▇ depends on the application. It emphasizes different properties of the data. In general, if we put the baseline at b, the natural height for a bar representing number x is: $$x\rightarrow {n\cdot (x-b) \over (max-b) }$$ That is the same formula as before, except that everything has been shifted down by b. A reasonable choice of b would be the minimum input value, or perhaps a bit less than the minimum input value. The shell sucks But anyway, what I really wanted to talk about was how to fix this program, because I think my solution was fun and interesting. There is a tricky problem, which is that you need to calculate values like (n-b)/(x-b), which so you might like to do some division, but as I wrote earlier, bash has no facilities for doing fractional arithmetic. The original program used$((…)) everywhere, which throws away fractions. You can work around that, because you don't actually the fractional part of (n-b)/(x-b); you only need the greatest integer part. But the inputs to the program might themselves be fractional numbers, like say 3.5, and $((…)) barfs if you try to operate on such a number:$ x=3.5; echo $((x + 1)) bash: 3.5: syntax error: invalid arithmetic operator (error token is ".5") and you seemingly cannot work around that. My first response to this was to replace all the uses of$((…)) with bc, which, as I explained in the previous article, does not share this problem. M. Holman rejected this, saying that calling out to bc all the time made the program too slow. And there is something to be said for this. M. Holman also said that bc is non-portable, which I find astounding, since it has been in Unix since 1974, but sadly plausible.

So supposing that you take this complaint seriously, what can you do? Are you just doomed? No, I found a solution to the problem that solves all the problems. It is portable, efficient, and correct. It is also slightly insane.

Portable fractions in bash

We cannot use decimal numbers:

$x=3.5; echo$((x + 1))
bash: 3.5: syntax error: invalid arithmetic operator (error token is ".5")
But we can use fractions:

$x_n=7; x_d=2; echo$((x_n + x_d))/$((x_d)) 9/2 And we can convert decimal inputs to fractions without arithmetic: # given an input number which might be a decimal, convert it to # a rational number; set n and d to its numerator and # denominator. For example, 3.3 becomes n=33 and d=10; # 17 becomes n=17 and d=1. to_rational() { # Crapulent bash can't handle decimal numbers, so we will convert # the input number to a rational if [[$1 =~ (.*)\.(.*) ]] ; then
i_part=${BASH_REMATCH[1]} f_part=${BASH_REMATCH[2]}
n="$i_part$f_part";
d=$(( 10 **${#f_part} ))
else
n=$1 d=1 fi } This processes a number like 35.17 in a purely lexical way, extracting the 35 and the 17, and turning them into the numerator 3517 and the denominator 100. If the input number contains no decimal point, our task is trivial: 23 has a numerator of 23 and a denominator of 1. Now we can rewrite all the shell arithmetic in terms of rational numbers. If a_n and a_d are the numerator and denominator of a, and b_n and b_d are the numerator and denominator of b, then addition, subtraction, multiplication, and even division of a and b are fast, easy, and even portable: # a + b sum_n =$((a_n * b_d + a_d * b_n))
sum_d = $((a_d * b_d)) # a - b diff_n =$((a_n * b_d - a_d * b_n))
diff_d = $((a_d * b_d)) # a * b prod_n =$((a_n * b_n))
prod_d = $((a_d * b_d)) # a / b quot_n =$((a_n * b_d))
quot_d = $((a_d * b_n)) We can easily truncate a number to produce an integer, because the built-in division does this for us: greatest_int =$((a_n / a_d))
And we can round to the nearest integer by adding 1/2 before truncating:

nearest_int = $(( (a_n * 2 + a_d) / (a_d * 2) )) (Since n/d + 1/2 = (2n+d)/2d.) For complicated calculations, you can work the thing out as several steps, or you can solve it on paper and then just embed a big rational expression. For example, suppose you want to calculate ((x-minnumber_of_tiers)/range, where number_of_tiers is known to be an integer. You could do each operation in a separate step, or you could use instead: tick_index_n=$(( ( x_n * min_d - min_n * x_d ) * number_of_tiers * range_d ))
tick_index_d=$(( range_n * x_d * min_d )) Should you need to convert to decimals for output, the following is a proof-of-concept converter: function to_dec { n=$1
d=$2 maxit=$(( 1 + ${3:-10} )) while [$n != 0 -a $maxit -gt -1 ]; do next=$((n/d))
if [ "$r" = "" ]; then r="$next."; else r="$r$next"; fi
n=$(( (n - d * next) * 10 )) maxit=$(( maxit - 1 ))
done
r=${r:-'0.'} } For example, to_dec 13 8 sets r to 1.625, and to_dec 13 7 sets r to 1.857142857. The optional third argument controls the maximum number of digits after the decimal point, and defaults to 10. The principal defect is that it doesn't properly round off; frac2dec 19 10 0 yields 1. instead of 2., but this could be fixed without much trouble. Extending it to convert to arbitrary base output is quite easy as well. Coming next month, libraries in bash for computing with continued fractions using Gosper's algorithms. Ha ha, just kidding. The obvious next step is to implement base-10 floating-point numbers in bash like this: prod_mantissa=$((a_mantissa * b_mantissa))
prod_exponent=$((a_exponent + b_exponent)) [ Addendum 20120306: David Jones corrects a number of portability problems in my implementation. ] [ Addendum 20180101: [Shane Hansen did something similar](https://gist.github.com/shanemhansen/cd8f4178d33740b3e835) to calculate Euler's number (2.71818…) in Bash a while back. It might be fun to compare our implementations. ] Thu, 09 Feb 2012 Testing for exceptions The Test::Fatal module makes it very easy to test code that is supposed to throw an exception. It provides an exception function that takes a code block. If the code completes normally, exception { code } returns undefined; if the code throws an exception, exception { code } returns the exception value that was thrown. So for example, if you want to make sure that some erroneous call is detected and throws an exception, you can use this: isnt( exception { do_something( how_many_times => "W" ) }, undef, "how_many_times argument requires a number" ); which will succeed if do_something(…) throws an exception, and fail if it does not. You can also write a stricter test, to look for the particular exception you expect: like( exception { do_something( how_many_times => "W" ) }, qr/how_many_times is not numeric/, "how_many_times argument requires a number" ); which will succeed if do_something(…) throws an exception that contains how_many_times is not numeric, and fail otherwise. Today I almost made the terrible mistake of using the first form instead of the second. The manual suggests that you use the first form, but it's a bad suggestion. The problem is that if you completely screw up the test and write a broken code block that dies, the first test will cheerfully succeed anyway. For example, suppose you make a typo in the test code: isnt( exception { do_something( how_many_tims => "W" ) }, undef, "how_many_times argument requires a number" ); Here the do_something(…) call throws some totally different exception that we are not interested in, something like unknown argument 'how_many_tims' or mandatory 'how_many_times' argument missing, but the exception is swallowed and the test reports success, even though we know nothing at all about the feature we were trying to test. But the test looks like it passed. In my example today, the code looked like this: isnt( exception { my$invoice = gen_invoice();
$invoice->abandon; }, undef, "Can't abandon invoice with no abandoned charges"); }); The abandon call was supposed to fail, for reasons you don't care about. But in fact, the execution never got that far, because there was a totally dumb bug in gen_invoice() (a missing required constructor argument) that caused it to die with a completely different exception. I would never have noticed this error if I hadn't spontaneously decided to make the test stricter: like( exception { my$invoice = gen_invoice();
$invoice->abandon; }, qr/Can't.*with no abandoned charges/, "Can't abandon invoice with no abandoned charges"); }); This test failed, and the failure made clear that gen_invoice(), a piece of otherwise unimportant test apparatus, was completely broken, and that several other tests I had written in the same style appeared to be passing but weren't actually running the code I thought they were. So the rule of thumb is: even though the Test::Fatal manual suggests that you use isnt( exception { … }, undef, …), do not. I mentioned this to Ricardo Signes, the author of the module, and he released a new version with revised documentation before I managed to get this blog post published. Wed, 16 Nov 2011 Arithmetic expressions in shell scripts This spring will be the 25th anniversary of my involvement with Unix, and I have spent way too much of that time writing shell scripts. Back before we had Perl and the other 'P' languages (Python, PHP, Puby, and Pickle) you programmed in C or you programmed in shell. Bourne shell, to be specific. (It was named for its author, Steven Bourne. There was a time before there was a Bourne shell, when there was only "the shell", written by Ken Thompson, but that predates even my experience.) People did sometimes try to program the C shell, but only the very foolish tried it more than once. (Tom Christiansen once wrote a very detailed article explaining why, if you are interested.) C is still used, but it is still C, and, as they say, C is a language that combines the power of raw assembly with the expressiveness of raw assembly. If you wanted to do systems programming, you wrote in C, because that was what there was, but if you wanted to do almost anything else, you wrote in Bourne shell, because otherwise you spent a lot of time counting bytes and groveling over core dumps. If you knew what you were doing, you wrote as much as possible in Bourne shell, and for the parts where your shell script needed to do something interesting, you had it invoke some small utility program that you or someone else had written in C. "Interesting" in this case had an extremely low threshhold. You called out to a C utility to sort data. You called out to a C utility to remove or rename a file. You called out to a C utility to test for the existence of a file. You called out to a C utility to compare two strings. In early versions of the shell, you called out to a C utility to perform file globbing—that is, to expand something like dir?/*.c to a list of files—although this function had been absorbed into the shell itself by 1979, several years before I arrived. You called out to a C utility to print a string to the terminal. And you called out to a C utility if you wanted to do arithmetic. Even including languages that nobody is expected to actually use, Bourne shell is probably the only programming language I have ever used that does not have any built-in operators for performing arithmetic. Instead, there is a C utility program called expr which interprets its command-line arguments as an arithmetic expression, evaluates the expression, and prints the result on the standard output. So for example, if your script has variables x and y and you want to add these and store the result into z, you write: z=expr$x + $y This will fork a subprocess, which will execute the command expr 3 + 4 (or whatever). The command will emit the string 7 into a pipe, and the shell will read the string out of the pipe and store it into z. Astounding! The expr program is a real piece of crap. The following reasonable-seeming invocations of expr all fail: z=expr$x + 1.5
z=expr $x+$y
z=expr $x *$y
The first fails because the craptastic yacc parser in expr has a value stack that is integer-only, so the program was not written to handle fractional values, and will instantly abort with the message non-numeric argument upon encountering the string 1.5 in the input. The second fails because the craptastrophic lexer (a whole 12 lines of C code) assumes that each command argument will be a single token, and makes no effort to actually do any, you know, lexing. The third fails because expr is a command run in a subshell, and since the * character is special in the shell it expands to a list of the files in the current directory, so although you thought you were going to run expr 3 * 4 you actually ran expr 3 hostid sys3 sys3.tar.gz v5root v5root.tar.gz v6doc v6doc.tar.gz v6root v6root.tar.gz v6src v6src.tar.gz v7 v7.tar.gz 4. The whole thing is a craptaclysm of craptitude.

A better way to do arithmetic in a shell script was to invoke a different utility program, bc, the "basic calculator". You sent your arithmetic expression to bc on the standard input (which avoided the craptysmal shell expansion of *) and got the answer on the standard output, typically something like this:

z=echo "$x +$y" | bc -l
You needed the -l flag to enable floating-point calculations; it also enabled certain higher functions such as square roots and trigonometry.

I had assumed that bc was a later development than expr, but it appeared in Unix version 6, while expr did not appear until version 7. So then I thought perhaps expr had been thrown in as a demonstration of yacc, but no, yacc was already present in version 5, and anyway, bc was written with yacc. So I no longer have any workable theory about who perpetrated expr, or why. (I have emailed Brian Kernighan to ask, and if he says anything interesting I will post an addendum.)

Anyway, about ten years after all this, the GNU project was in full swing and was reimplementing all the standard Unix tools, including the shell. Since they wanted their implementations to displace the standard implementations, they added all sorts of bells and whistles to them. So their shell, bash, contained all sorts of stuff. Among other things, it had built-in arithmetic. In bash, if you want to add x and y and put the result into z you can write:

z=$(( x + y )) or even: z=$((x+y))
The nifty $(( punctuation was necessary because the syntax had to be backward compatible with the Bourne shell, and every clean syntax was already used for something else. The$((…)) feature was a great improvement over expr, and in some ways, it was even an improvement over bc. It is much faster, for one thing. And since it does not invoke a subshell, you don't have to worry about * doing something weird.

But in other ways it was a step backwards. It does not have any of bc's higher mathematical functions. It doesn't do radix conversion. And it does all its calculation in machine integers, so not only does it fall short of bc's arbitrary-precision arithmetic, it can't even handle fractions:

x=3; y=4.5
echo $((x+y)) bash: 4.5: syntax error: invalid arithmetic operator (error token is ".5") Why? Why why why??? Who ordered that? I mean, I hate floating-point arithmetic as much as the next guy—probably more—but even I recognize that people need to do it sometimes. Well, here we are, eleven hundred words into this article and I have still not come to the point. That is typical for me, but I think that contrary to my usual practice, I will cut the scroll here and get to the real point in a day or two. [ Addendum 20120215: At last, I got to the real point. ] Wed, 24 Nov 2010 git-reset The Git subcommand git-reset is very frequently used, and is one of very few commonly-used Git commands that can permanently destroy real work. Once work is in the repository, it is almost completely safe from any catastrophe. But git-reset also affects the working tree, and it is quite possible to utterly destroy a day's work by doing git-reset --hard at the wrong time. Unfortunately, the manual is unusually bad, with a huge pile of this stuff: working index HEAD target working index HEAD ---------------------------------------------------- A B C D --soft A B D --mixed A D D --hard D D D --merge (disallowed) working index HEAD target working index HEAD ---------------------------------------------------- A B C C --soft A B C --mixed A C C --hard C C C --merge (disallowed) Six more of these tables follow, giving the impression that git-reset is quite complicated. Sure, I'm gonna memorize 256 table entries. Or look up the results on the table before every git-reset. Not. The thing to notice about the two tables I quoted above is that they are redundant, because the second one is simply a special case of the first, with D replaced by C. So if you were really in love with the tables, you might abbreviate the 64 table entries to 28: working index target working index HEAD ---------------------------------------------------- A B C --soft A B C --mixed A C C --hard C C C --merge (disallowed) But even this is much more complicated than it should be. git-reset does up to three things: 1. It points the HEAD ref at a new 'target' commit, if you specified one. 2. Then it copies the tree of the HEAD commit to the index, unless you said --soft. 3. Finally, it copies the contents of the index to the working tree, if you said --hard. If you compare this with the table above, that is what you will see. The three points above replace at least 60% of the tables. Most of the rest concerns the less-frequently used --merge and --keep options and the circumstances in which the tree is considered to be in "good order". Tables are good for computers to understand, because they have a uniform format and computers are unfazed by giant masses of redundant data. The computer will not understand the data regardless of how well-structured they are, so there is no reason to adopt a representation that showcases the structure. For humans, however, tables are most useful when there is no deeper understanding of the structure to be had, because the structure tends to get lost in the profusion of data, as it did here. [ Thanks to Aristotle Pagaltzis for pointing out that git checkout can also destroy the working tree, and for other corrections. ] Wed, 10 Nov 2010 Revert-all-buffers This is another article about a trivial tool that is worth more to me than it cost to make. It's my new revert-all-buffers function for Emacs. Here's the use case: I'm editing 17 files, and I've saved a bunch of changes to them. Then I commit the changes with git, and then I change the working copy of the files out from under Emacs by doing some other git operation—I merge in another branch, or do a rebase, or something like that. Now when I go back to edit the files, the Emacs buffers are out of date. Emacs notices that, and for each file, it will at some point ask me "Contents of ... have changed on disk; do you really want to edit the buffer?", interrupting my train of thought. I can answer the question by typing r, which will refresh the buffer from the disk version, but having to do that for every buffer is a pain, because I know all those files have changed, and I don't want to be asked each time. Here's the solution: (defun revert-all-buffers () "Refreshes all open buffers from their respective files" (interactive) (let* ((list (buffer-list)) (buffer (car list))) (while buffer (when (and (buffer-file-name buffer) (not (buffer-modified-p buffer))) (set-buffer buffer) (revert-buffer t t t)) (setq list (cdr list)) (setq buffer (car list)))) (message "Refreshed open files")) I have this function bound to some otherwise useless key: it runs through all the buffers, and for each one that has an associated file, and has no unsaved changes, it reverts the contents from the version on the disk. This occasionally fails, most often because I have removed or renamed a file from the disk that I still have open in Emacs. Usually the response is to close the buffer, or reopen it from the new name. I could probably handle that properly in 99% of cases just by having Emacs close the buffer, but the other cases could be catastrophic, so I'm leaving it the way it is for a while. I swiped the code, with small changes, from EmacsWiki. Fri, 27 Aug 2010 A dummy generator for mock objects I am not sure how useful this actually is, but I after having used it once it was not yet obvious that it was a bad idea, so I am writing it up here. Suppose you are debugging some method, say someMethod, which accepts as one of its arguments complicated, annoying objects$annoying that you either can't or don't want to instantiate. This might be because $annoying is very complicated, with many sub-objects to set up, or perhaps you simply don't know how to build$annoying and don't care to find out.

That is okay, because you can get someMethod to run without the full behavior of $annoying. Say for example someMethod calls$annoying->foo_manager->get_foo(...)->get_user_id. You don't understand or care about the details because for debugging someMethod it is enough to suppose that the end result is the user ID 3. You could supply a mock object, or several, that implement the various methods, but that requires some work up front.

Instead, use this canned Dummy class. Instead of instantiating a real $annoying (which is difficult) or using a bespoke mock object, use Dummy->new("annoying"): package Dummy; use Data::Dumper;$Data::Dumper::Terse = 1;
our $METHOD; my @names = qw(bottle corncob euphonium octopus potato slide); my$NAME = "aaa";

sub new {
my ($class,$name) = @_;
$name ||=$METHOD || shift(@names) || $NAME++; bless { N =>$name } => $class; } The call Dummy->new("annoying") will generate an ad-hoc mock object; whenever any method is called on this dummy object, the call will be caught by an AUTOLOAD that will prompt you for the return value you want it to produce: sub AUTOLOAD { my ($self, @args) = @_;
my ($p,$m) = $AUTOLOAD =~ /(.*)::(.*)/; local$METHOD = $m; print STDERR "<<$_[0]{N}\->$m >>\n"; print STDERR "Arguments: " . Dumper(\@args) . "\n"; my$v;
do {
print STDERR "Value?  ";
chomp($v = <STDIN>); } until eval "$v; 1";
return(eval $v); } sub DESTROY { } 1; The prompt looks like this: << annoying->foo_manager >> Arguments: [] Value? If the returned value should be a sub-object, no problem: just put in new Dummy and it will make a new Dummy object named foo_manager, and the next prompt will be: << foo_manager->get_foo >> Arguments: ... ... Value? Now you can put in new Dummy "(Fred's foo)" or whatever. Eventually it will ask you for a value for (Fred's foo)->id and you can have it return 4. It's tempting to add caching, so that it won't ask you twice for the results of the same method call. But that would foreclose the option to have the call return different results twice. Better, I think, is for the user to cache the results themselves if they plan to use them again; there is nothing stopping the user from entering a value expression like$::val = ....

This may turn out to be one of those things that is mildly useful, but not useful enough to actually use; we'll see.

Thu, 26 Aug 2010

I think one problem (of many) that beginners might have with Haskell monads is the confusing terminology. The word "monad" can refer to four related but different things:

2. When a type constructor T of kind ∗ → ∗ is an instance of Monad we say that T "is a monad". For example, "Tree is a monad"; "((→) a) is a monad". This is the only usage that is strictly corrrect.

3. Types resulting from the application of monadic type constructors (#2) are sometimes referred to as monads. For example, "[Integer] is a monad".

Usage #1 is not a real problem; it does not occur that often, and is readily distinguished by context, capitalization, type font, and other markers. #2 is actually correct, so there is no problem there. #3 seems to be an uncommon colloquialism.

The most serious problem here is #4, that people refer to individual values of monadic types as "monads". Even when they don't do this, they are hampered by the lack of a good term for it. As I know no good alternative has been proposed. People often say "monadic value" (I think), which is accurate, but something of a mouthful.

One thing I have discovered in my writing life is that the clarity of a confusing document can sometimes be improved merely by replacing a polysyllabic noun phrase with a monosyllable. For example, chapter 3 of Higher-Order Perl discussed the technique of memoizing a function by generating an anonymous replacement for it that maintains a cache and calls the real function on a cache miss. Early drafts were hard to understand, and improved greatly when I replaced the phrase "anonymous replacement function" with "stub". The Perl documentation was significantly improved merely by replacing "associative array" everywhere with "hash" and "funny punctuation character" with "sigil".

I think a monosyllabic replacement for "monadic value" would be a similar boon to discussion of monads, not just for beginners but for everyone else too. The drawback, of introducing yet another jargon term, would in this case be outweighed by the benefits. Jargon can obscure, but sometimes it can clarify.

The replacement word should be euphonious, clear but not overly specific, and not easily confused with similar jargon words. It would probably be good for it to begin with the letter "m". I suggest:

mote

So return takes a value and returns a mote. The >>= function similarly lifts a function on pure values to a function on motes; when the mote is a container one may think of >>= as applying the function to the values in the container. [] is a monad, so lists are motes. The expression on the right-hand side of a var ← expr in a do-block must have mote type; it binds the mote on the right to the name on the left, using the >>= operator.

I have been using this term privately for several months, and it has been a small but noticeable success. Writing and debugging monadic programs is easier because I have a simple name for the motes that the program manipulates, which I can use when I mumble to myself: "What is the type error here? Oh, commit should be returning a mote." And then I insert return in the right place.

I'm don't want to oversell the importance of this invention. But there is clearly a gap in the current terminology, and I think it is well-filled by "mote".

(While this article was in progress I discovered that What a Monad is not uses the nonceword "mobit". I still prefer "mote".)

Sun, 03 Jan 2010

A short bibliography of probability monads
Several people helpfully wrote to me to provide references to earlier work on probability distribution monads. Here is a summary:

My thanks to Stephen Tetley, Gaal Yahas, and Luke Palmer for these.

I did not imagine that my idea was a new one. I arrived at it by thinking about List as a representation of non-deterministic computation. But if you think of it that way, the natural interpretation is that every list element represents an equally likely outcome, and so annotating the list elements with probabilities is the obvious next step. So the existence of the Erwig library was not a big surprise.

A little more surprising though, were the references in the Erwig paper. Specifically, the idea dates back to at least 1981; Erwig cites a paper that describes the probability monad in a pure-mathematics context.

Nobody responded to my taunting complaint about Haskell's failure to provide support a good monad of sets. It may be that this is because they all agree with me. (For example, the documentation of the Erwig package says "Unfortunately we cannot use a more efficient data structure because the key type must be of class Ord, but the Monad class does not allow constraints for result types.") But a number of years ago I said that the C++ macro processor blows goat dick. I would not have put it so strongly had I not naïvely believed that this was a universally-held opinion. But no, plenty of hapless C++ programmers wrote me indignant messages defending their macro system. So my being right is no guarantee that language partisans will not dispute with me, and the Haskell community's failure to do so in this case reflects well on them, I think.

Thu, 31 Dec 2009

A monad for probability and provenance
I don't quite remember how I arrived at this, but it occurred to me last week that probability distributions form a monad. This is the first time I've invented a new monad that I hadn't seen before; then I implemented it and it behaved pretty much the way I thought it would. So I feel like I've finally arrived, monadwise.

Suppose a monad value represents all the possible outcomes of an event, each with a probability of occurrence. For concreteness, let's suppose all our probability distributions are discrete. Then we might have:

data ProbDist p a = ProbDist [(a,p)] deriving (Eq, Show)
unpd (ProbDist ps) = ps
Each a is an outcome, and each p is the probability of that outcome occurring. For example, biased and unbiased coins:

unbiasedCoin = ProbDist [ ("heads", 0.5),
("tails", 0.5) ];

biasedCoin   = ProbDist [ ("heads", 0.6),
("tails", 0.4) ];

Or a couple of simple functions for making dice:

import Data.Ratio

d sides = ProbDist [(i, 1 % sides) | i <- [1 .. sides]]
die = d 6

d n is an n-sided die.

The Functor instance is straightforward:

instance Functor (ProbDist p) where
fmap f (ProbDist pas) = ProbDist $map (\(a,p) -> (f a, p)) pas The Monad instance requires return and >>=. The return function merely takes an event and turns it into a distribution where that event occurs with probability 1. I find join easier to think about than >>=. The join function takes a nested distribution, where each outcome of the outer distribution specifies an inner distribution for the actual events, and collapses it into a regular, overall distribution. For example, suppose you put a biased coin and an unbiased coin in a bag, then pull one out and flip it: bag :: ProbDist Double (ProbDist Double String) bag = ProbDist [ (biasedCoin, 0.5), (unbiasedCoin, 0.5) ] The join operator collapses this into a single ProbDist Double String: ProbDist [("heads",0.3), ("tails",0.2), ("heads",0.25), ("tails",0.25)] It would be nice if join could combine the duplicate heads into a single ("heads", 0.55) entry. But that would force an Eq a constraint on the event type, which isn't allowed, because (>>=) must work for all data types, not just for instances of Eq. This is a problem with Haskell, not with the monad itself. It's the same problem that prevents one from making a good set monad in Haskell, even though categorially sets are a perfectly good monad. (The return function constructs singletons, and the join function is simply set union.) Maybe in the next language. Perhaps someone else will find the >>= operator easier to understand than join? I don't know. Anyway, it's simple enough to derive once you understand join; here's the code: instance (Num p) => Monad (ProbDist p) where return a = ProbDist [(a, 1)] (ProbDist pas) >>= f = ProbDist$ do
(a, p) <- pas
let (ProbDist pbs) = f a
(b, q) <- pbs
return (b, p*q)
So now we can do some straightforward experiments:

liftM2 (+) (d 6) (d 6)

ProbDist [(2,1 % 36),(3,1 % 36),(4,1 % 36),(5,1 % 36),(6,1 %
36),(7,1 % 36),(3,1 % 36),(4,1 % 36),(5,1 % 36),(6,1 %
36),(7,1 % 36),(8,1 % 36),(4,1 % 36),(5,1 % 36),(6,1 %
36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(5,1 % 36),(6,1 %
36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(10,1 % 36),(6,1 %
36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(10,1 % 36),(11,1 %
36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(10,1 % 36),(11,1 %
36),(12,1 % 36)]
This is nasty-looking; we really need to merge the multiple listings of the same event. Here is a function to do that:

agglomerate :: (Num p, Eq b) => (a -> b) -> ProbDist p a -> ProbDist p b
agglomerate f pd = ProbDist $foldr insert [] (unpd (fmap f pd)) where insert (k, p) [] = [(k, p)] insert (k, p) ((k', p'):kps) | k == k' = (k, p+p'):kps | otherwise = (k', p'):(insert (k,p) kps) agg :: (Num p, Eq a) => ProbDist p a -> ProbDist p a agg = agglomerate id Then agg$ liftM2 (+) (d 6) (d 6) produces:

ProbDist [(12,1 % 36),(11,1 % 18),(10,1 % 12),(9,1 % 9),
(8,5 % 36),(7,1 % 6),(6,5 % 36),(5,1 % 9),
(4,1 % 12),(3,1 % 18),(2,1 % 36)]
Hey, that's correct.

There must be a shorter way to write insert. It really bothers me, because it looks look it should be possible to do it as a fold. But I couldn't make it look any better.

You are not limited to calculating probabilities. The monad actually will count things. For example, let us throw three dice and count how many ways there are to throw various numbers of sixes:

eq6 n = if n == 6 then 1 else 0
agg $liftM3 (\a b c -> eq6 a + eq6 b + eq6 c) die die die ProbDist [(3,1),(2,15),(1,75),(0,125)] There is one way to throw three sixes, 15 ways to throw two sixes, 75 ways to throw one six, and 125 ways to throw no sixes. So ProbDist is a misnomer. It's easy to convert counts to probabilities: probMap :: (p -> q) -> ProbDist p a -> ProbDist q a probMap f (ProbDist pds) = ProbDist$ (map (\(a,p) -> (a, f p))) pds

normalize :: (Fractional p) => ProbDist p a -> ProbDist p a
normalize pd@(ProbDist pas) = probMap (/ total) pd where
total = sum . (map snd) $pas normalize$ agg $probMap toRational$
liftM3 (\a b c -> eq6 a + eq6 b + eq6 c) die die die

ProbDist [(3,1 % 216),(2,5 % 72),(1,25 % 72),(0,125 % 216)]
I think this is the first time I've gotten to write die die die in a computer program.

The do notation is very nice. Here we calculate the distribution where we roll four dice and discard the smallest:

stat = do
a <- d 6
b <- d 6
c <- d 6
d <- d 6
return (a+b+c+d - minimum [a,b,c,d])

probMap fromRational $agg stat ProbDist [(18,1.6203703703703703e-2), (17,4.1666666666666664e-2), (16,7.253086419753087e-2), (15,0.10108024691358025), (14,0.12345679012345678), (13,0.13271604938271606), (12,0.12885802469135801), (11,0.11419753086419752), (10,9.41358024691358e-2), (9,7.021604938271606e-2), (8,4.7839506172839504e-2), (7,2.9320987654320986e-2), (6,1.6203703703703703e-2), (5,7.716049382716049e-3), (4,3.0864197530864196e-3), (3,7.716049382716049e-4)] One thing I was hoping to get didn't work out. I had this idea that I'd be able to calculate the outcome of a game of craps like this: dice = liftM2 (+) (d 6) (d 6) point n = do roll <- dice case roll of 7 -> return "lose" _ | roll == n = "win" _ | otherwise = point n craps = do roll <- dice case roll of 2 -> return "lose" 3 -> return "lose" 4 -> point 4 5 -> point 5 6 -> point 6 7 -> return "win" 8 -> point 8 9 -> point 9 10 -> point 10 11 -> return "win" 12 -> return "lose" This doesn't work at all; point is an infinite loop because the first value of dice, namely 2, causes a recursive call. I might be able to do something about this, but I'll have to think about it more. It also occurred to me that the use of * in the definition of >>= / join could be generalized. A couple of years back I mentioned a paper of Green, Karvounarakis, and Tannen that discusses "provenance semirings". The idea is that each item in a database is annotated with some "provenance" information about why it is there, and you want to calculate the provenance for items in tables that are computed from table joins. My earlier explanation is here. One special case of provenance information is that the provenances are probabilities that the database information is correct, and then the probabilities are calculated correctly for the joins, by multiplication and addition of probabilities. But in the general case the provenances are opaque symbols, and the multiplication and addition construct regular expressions over these symbols. One could generalize ProbDist similarly, and the ProbDist monad (even more of a misnomer this time) would calculate the provenance automatically. It occurs to me now that there's probably a natural way to view a database table join as a sort of Kleisli composition, but this article has gone on too long already. Happy new year, everyone. [ Addendum 20100103: unsurprisingly, this is not a new idea. Several readers wrote in with references to previous discussion of this monad, and related monads. It turns out that the idea goes back at least to 1981. ] My thanks to Graham Hunter for his donation. Tue, 15 Dec 2009 Monads are like burritos A few months ago Brent Yorgey complained about a certain class of tutorials which present monads by explaining how monads are like burritos. At first I thought the choice of burritos was only a facetious reference to the peculiar and sometimes strained analogies these tutorials make. But then I realized that monads are like burritos. I will explain. A monad is a special kind of a functor. A functor F takes each type T and maps it to a new type FT. A burrito is like a functor: it takes a type, like meat or beans, and turns it into a new type, like beef burrito or bean burrito. A functor must also be equipped with a map function that lifts functions over the original type into functions over the new type. For example, you can add chopped jalapeños or shredded cheese to any type, like meat or beans; the lifted version of this function adds chopped jalapeños or shredded cheese to the corresponding burrito. A monad must also possess a unit function that takes a regular value, such as a particular batch of meat, and turns it into a burrito. The unit function for burritos is obviously a tortilla. Finally, a monad must possess a join function that takes a ridiculous burrito of burritos and turns them into a regular burrito. Here the obvious join function is to remove the outer tortilla, then unwrap the inner burritos and transfer their fillings into the outer tortilla, and throw away the inner wrappings. The map, join, and unit functions must satisfy certain laws. For example, if B is already a burrito, and not merely a filling for a burrito, then join(unit(B)) must be the same as B. This means that if you have a burrito, and you wrap it in a second tortilla, and then unwrap the contents into the outer tortilla, the result is the same as what you started with. This is true because tortillas are indistinguishable. I know you are going to point out that some tortillas have the face of Jesus. But those have been toasted, and so are unsuitable for burrito-making, and do not concern us here. So monads are indeed like burritos. I asked Brent if this was actually what he had in mind when he first suggested the idea of tutorials explaining monads in terms of burritos, and if everyone else had understood this right away. But he said no, I was the lone genius. [ Addendum 20120106: Chris Done has presented this theory in cartoon form. ] Sat, 01 Aug 2009 Dijkstra was not insane Recently, a reader on the Higher-Order Perl discussion mailing list made a remark about Edsger Dijkstra and his well-known opposition to the break construction (in Perl, last) that escapes prematurely from a loop. People often use this as an example to show that Dijkstra was excessively doctrinaire, and out of touch with the reality of programming[1], but usually it's because they don't know what his argument was. I wrote a response, explaining where Dijkstra was coming from, and I am very happy with how it came out, so I'm reposting it here. The list subscriber said, in part: On a side note, I never read anything by Dijkstra that wasn't noticeably out of touch with the reality of programming, which qualifies them as screeds to me. And I say that as a former Pascal programmer, and as one who has read, and bought into, things like Kernighan's "Why Pascal is Not My Favorite Programming Language" and the valid rants about how some form of breaking out of a loop without having to proceed to the end is very useful, without destroying structure (except by Dijkstra's definition of structure)... A lot of people bring up the premature-loop-exit prohibition without understanding why Dijkstra suggested it; it wasn't just that he was a tightassed Dutchman. Dijkstra's idea was this: suppose you want to prove, mathematically, that your program does what it is supposed to do. Please, everyone, suspend your judgment of this issue for a few paragraphs, and bear with me. Let's really suppose that we want to do this. Dijkstra's idea is that the program is essentially a concatenation of blocks, each of which is trying to accomplish something or other, and each of which does not make sense to run unless some part of the program state is set up for it ahead of time. For example, the program might be to print a sorted list of links from a web page. Then the obvious blocks are: A get the web page and store it in a variable B extract the links from the text in the variable into an array C sort the array D print out the array contents Section C is trying to sort the array; if it is correct then the array will be sorted by the time step D commences. But it doesn't make sense to commence step C unless the array is populated. Garbage in, garbage out, as they used to say when I was in elementary school. We say that the "precondition" for C is that the array be populated with URLs, and the "postcondition" is that the array be in sorted order. What you would want to prove about C is that if the precondition holds—that is, if the array is properly populated before C begins—then the postcondition will hold too—that is, the array will be in sorted order when C completes. It occurs to me that calling this a "proof" is probably biasing everyone's thinking. Let's forget about mathematical proofs and just think about ordinary programmers trying to understand if the program is correct. If the intern in the next cubicle handed you his code for this program, and you were looking it over, you would probably think in very much this way: you would identify block C (maybe it's a subroutine, or maybe not) and then you would try to understand if C, given an array of URLs, would produce a properly sorted array by the time it was done. C itself might depend on some sub-blocks or subroutines that performed sub-parts of the task; you could try to understand them similarly. Having proved (or convinced yourself) that C will produce the postcondition "array contains sorted list of URLs", you are in an excellent position to prove (or convince yourself) that block D prints out a sorted array of URLs, which is what you want. Without that belief about C, you are building on sand; you have almost nothing to go on, and you can conclude hardly anything useful about the behavior of D. Now consider a more complex block, one of the form: if (q) { E; } else { F; } Suppose you believe that code E, given precondition x, is guaranteed to produce postcondition y. And suppose you believe the same thing about F. Then you can conclude the same thing about the entire if-else block: if x was true before it began executing, then y will be true when it is done.[2] So you can build up proofs (or beliefs) about small bits of code into proofs (or beliefs) about larger ones. We can understand while loops similarly. Suppose we know that condition p is true prior to the commencement of some loop, and that if p is true before G executes, then p will also be true when G finishes. Then what can we say about this loop? while (q) { G; } We can conclude that if p was true before the loop began, then p will still be true, and q will be false, when the loop ends. BUT BUT BUT BUT if your language has break, then that guarantee goes out the window and you can conclude nothing. Or at the very least your conclusions will become much more difficult. You can no longer treat G atomically; you have to understand its contents in detail. So this is where Dijkstra is coming from: features like break[3] tend to sabotage the benefits of structured programming, and prevent the programmer from understanding the program as a composition of independent units. The other subscriber made a seemingly disparaging reference to "Dijkstra's idea of structure", but I hope it is clear that it was not an arbitrary idea. Dijkstra's idea of structure is what will allow you to understand a large program as a collection of modules. Regardless of your opinion about formal verification methods, or correctness proofs, or the practicality of omitting break from your language, it should at least be clear that Dijkstra was not being doctrinaire just for the sake of doctrine. Some additional notes Here are some interesting peripheral points that I left out of my main discussion because I wanted to stick to the main point, which was: "Dijkstra was not insane". 1. I said in an earlier post that "I often find Dijkstra's innumerable screeds very tiresome in their unkind, unforgiving, and unrealistic attitudes toward programmers." But despite this, I believe he was a brilliant thinker, and almost every time he opened his mouth it was to make a carefully-considered argument. You may not like him, and you may not agree with him, but you'll be better off listening to him. An archive of Dijkstra's miscellaneous notes and essays (a pre-blogging blog, if you like) is maintained at the University of Texas. I recommend it. 2. I said: if (q) { E; } else { F; } Suppose you believe that code E, given precondition x, is guaranteed to produce postcondition y. And suppose you believe the same thing about F. Then you can conclude the same thing about the entire if-else block. Actually, your job is slightly easier. Let's write this: [x] E [y] to mean that code E, given precondition x, produces postcondition y. That is, if we know that x is true when E begins execution, then we know that y is true when E finishes. Then my quoted paragraph above says that from these: [x] E [y] [x] F [y] we can conclude this: [x] if (q) {E} else {F} [y] But actually we can make a somewhat stronger statement. We can make the same conclusion from weaker assumptions. If we believe these: [x and q] E [y] [x and not q] F [y] then we can conclude this: [x] if (q) {E} else {F} [y] In fact this precisely expresses the complete semantics of the if-else construction. Why do we use if-else blocks anyway? This is the reason: we want to be able to write code to guarantee something like this: [x] BLAH [y] but we only know how to guarantee [x and q] FOO [y] and [x and not q] BAR [y] for some q. So we write two blocks of code, each of which accomplishes y under some circumstances, and use if-else to make sure that the right one is selected under the right circumstances. 3. Similar to break (but worse), in the presence of goto you are on very shaky ground in trying to conclude anything about whether the program is correct. Suppose you know that C is correct if its precondition (an array of URLs) is satisfied. And you know that B will set up that precondition (that is, the array) if its precondition is satisfied, so it seems like you are all right. But no, because block W somewhere else might have goto C; and transfer control to C without setting up the precondition, and then C could cause winged demons to fly out of your nose. Further reading • For a quick overview, see the Wikipedia article on Hoare logic. Hoare logic is the [x] E [y] notation I used above, and a set of rules saying how to reason with claims of that form. For example, one rule of Hoare logic defines the meaning of the null statement: if ; is the null statement, then [p] ; [p] for all conditions p. Hoare logic was invented by Tony Hoare, who also invented the Quicksort algorithm.  Order A Discipline of Programming from Powell's • For further details, see Dijkstra's book "A Discipline of Programming". Dijkstra introduces a function called wp for "weakest precondition". Given a piece of code C and a desired postcondition q, wp(C, q) is the weakest precondition that is sufficient for code C to accomplish q. That is, it's the minimum prerequisite for C to accomplish q. Most of the book is about how to figure out what these weakest preconditions are, and, once you know them, how they can guide you to through the implementation of your program. I have an idea that the Dijkstra book might be easier to follow after having read this introduction than without it. • No discussion of structured programming and goto is complete without a mention of Donald Knuth's wonderful paper Stuctured Programming with go to Statements. This is my single all-time favorite computer science paper. Download it here.  Order Software Tools in Pascal from Powell's • Software Tools in Pascal is a book by Kernighan and Plauger that tries to translate the tool suite of their earlier Software Tools book into Pascal. They were repeatedly screwed by deficiencies in the Pascal language, and this was the inspiration for Kernighan's famous "Why Pascal is not my Favorite Programming Language" paper. In effect, Software Tools in Pascal is a book-length case study of the deficiencies of Pascal for practical programming tasks. Tue, 16 Jun 2009 Haskell logo fail The Haskell folks have chosen a new logo. Ouch. Thu, 14 May 2009 Product types in Java Recently I wanted a Java function that would return two Person objects. Java functions return only a single value. I could, of course, make a class that encapsulates two Persons: class Persons2 { Person personA, personB; Persons2(Person a, Person b) { personA = a; personB = b; } Person getPersonA() { return personA; } ... } Java is loathsome in its verbosity, and this sort of monkey code is Java's verbosity at its most loathsome. So I did not do this. Haskell functions return only one value also, but this is no limitation, because Haskell has product types. And starting in Java 5, the Java type system is a sort of dented, bolted-on version of the type systems that eventually evolved into the Haskell type system. But product types are pretty simple. I can make a generic product type in Java: class Pair<A,B> { A a; B b; Pair(A a, B b) { this.a = a; this.b = b; } A fst() { return a; } B snd() { return b; } } Then I can declare my function to return a Pair<Person,Person>: Pair<Person,Person> findMatch() { ... return new Pair(husband, wife); } Okay, that worked just fine. The boilerplate is still there, but you only have to do it once. This trick seems sufficiently useful that I can imagine that I will use it again, and that someone else reading this will want to use it too. I've been saying for a while that up through version 1.4, Java was a throwback to the languages of the 1970s, but that with the introduction of generics in Java 5, it took a giant step forward into the 1980s. I think this is a point of evidence in favor of that claim. I wonder why this class isn't in the standard library. I was not the first person to think of doing this; web search turns up several others, who also wonder why this class isn't in the standard library. I wrote a long, irrelevant coda regarding my use of the identifiers husband and wife in the example, but, contrary to my usual practice, I will publish it another day. [ Addendum 20090517: Here's the long, irrelevant coda. ] I gratefully acknowledge the gift of Petr Kiryakov. Thank you! Sun, 22 Mar 2009 Worst error messages this month This month's winner is: Line 319 in XML document from class path resource [applicationContext-standalone.xml] is invalid; nested exception is org.xml.sax.SAXParseException: cvc-complex-type.2.3: Element 'beans' cannot have character [children], because the type's content type is element-only. Experienced technicians will of course want to look at line 319. Silly! If looking at line 319 were any help, this would not be this month's lucky winner. Line 319 is the last line of the document, and says, in whole, "</beans>". What this actually means is that there is a stray plus sign at the end of line 54. Well, that is the ultimate cause. The Fregean Bedeutung, as it were. What it really means (the Sinn) is that the <beans>...</beans> element is allowed to contain sub-elements, but not naked text ("content type is element-only") and the stray plus sign is naked text. The mixture of weird jargon ("cvc-complex-type.2.3") and obscure anaphora ("character [children]" for "plus sign") got this message nominated for the competition. The totally wrong line number is a bonus. But what won this message the prize is that even if you somehow understand what it means, it doesn't help you find the actual problem! You get to grovel over the 319-line XML file line-by-line, looking for the extra character. Come on, folks, it's a SAX parser, so how hard is it to complain about the plus sign as soon as it shows up? What do we have for the lucky winner, Johnny? You'll be flown to lovely Centralia, Pennsylvania, where you'll enjoy four days and three nights of solitude in an abandoned coal mine being flogged with holly branches and CAT-5 ethernet cable by the cast of "The Hills"! Thank you, Johnny. And there is a runner-up! The badblocks utility that is distributed as part of the Linux e2fsprogs package, produces the following extremely useful error message: % badblocks /home badblocks: invalid starting block (0): must be less than 0 Apparently this is Linux-speak for "This program needs the name of a device file, and the programmer was too lazy to have it detect that you supplied the name of the mount point instead". Happy spring, everyone! Thu, 12 Feb 2009 More Uzi-clubbing: a counter­example Last year I wrote an article about iterating over a hash, searching for a certain key. Larry Wall called said this was like "clubbing someone to death with a loaded Uzi", because the whole point of a hash is that you don't have to scan all the keys to find the one you want. I ended the article by saying: I had already realized that you could, in principle, commit this error with a regular array instead of with a hash, but I had never seen an example until... Just recently I saw another example, which I think is interesting because it seems to be a counterexample. It's part of a somewhat longer Java program. The crucial section is: ... LINE: while ( ( line = in.readLine()) != null ) { String[] fields = line.split("\t"); ... for ( int i = 0; i < fields.length; i++ ) { if ( ! isEmpty(fields[i]) ) { switch(i) { case 0: citation.setCitationType(fields[i]); break; case 1: setAuthors(citation,fields[i],personHome,false); break; case 2: citation.setPublishYear(Integer.parseInt(fields[i])); break; case 3: citation.setTitle(fields[i]); break; ... case 19: citation.setURL(fields[i]); break; case 20: citation.setDoi(fields[i]); break; default: warn("Empty field expected, found: " + fields[i] + " for line: " + line); break; } } } } ... The Perlishness of this Java code might lead you to think that I wrote it, but I did not. My temptation here was to replace the loop and the switch with code like this: citation.setCitationType(fields[0]); setAuthors(citation,fields[1],personHome,false); citation.setPublishYear(Integer.parseInt(fields[2])); citation.setTitle(fields[3]); ... citation.setURL(fields[19]); citation.setDoi(fields[20]); We lost the warnings, but there were only 4 of those, so we can add them back explicitly: if (! isEmpty(fields[13])) warn("Empty field expected..."); This might have been an improvement, except that we also lost the isEmpty tests on the nonempty fields. To get them back we must spend at least all our gains, possibly more: if (! isEmpty(fields[0])) citation.setCitationType(fields[0]); if (! isEmpty(fields[1])) setAuthors(citation,fields[1],personHome,false); if (! isEmpty(fields[2])) citation.setPublishYear(Integer.parseInt(fields[2])); if (! isEmpty(fields[3])) citation.setTitle(fields[3]); ... if (! isEmpty(fields[13])) warn("Empty field expected..."); ... if (! isEmpty(fields[19])) citation.setURL(fields[19]); if (! isEmpty(fields[20])) citation.setDoi(fields[20]); So at least in this case, my instinct to eliminate the loop-switch was not helpful. There are plenty of Java-esque techniques for cutting up the complexity and sweeping each little piece underneath its own little carpet ("Replace fields with an object! Or with a series of 20 objects!") but nothing that actually reduces the entia multiplicantis. There may be ways to easily improve this code, but I have not been able to think of any. Sat, 24 Jan 2009 Higher-Order Perl: nonmemoizing streams The first version of tail() in the streams chapter looks like this: sub tail { my$s = shift;
if (is_promise($s->[1])) { return$s->[1]->();  # Force promise
} else {
return $s->[1]; } } But this is soon replaced with a version that caches the value returned by the promise: sub tail { my$s = shift;
if (is_promise($s->[1])) {$s->[1] = $s->[1]->(); # Force and save promise } return$s->[1];
}
The reason that I give for this in the book is a performance reason. It's accompanied by an extremely bad explanation. But I couldn't do any better at the time.

There are much stronger reasons for the memoizing version, also much easier to explain.

Why use streams at all instead of the iterators of chapter 4? The most important reason, which I omitted from the book, is that the streams are rewindable. With the chapter 4 iterators, once the data comes out, there is no easy way to get it back in. For example, suppose we want to process the next bit of data from the stream if there is a carrot coming up soon, and a different way if not. Consider:

# Chapter 4 iterators
my $data =$iterator->();
if (carrot_coming_soon($iterator)) { # X } else { # Y } sub carrot_coming_soon { my$it = shift;
my $soon = shift || 3; while ($soon-- > 0) {
my $next =$it->();
return 1 if is_carrot($next); } return; # No carrot } Well, this probably doesn't work, because the carrot_coming_soon() function extracts and discards the upcoming data from the iterator, including the carrot itself, and now that data is lost. One can build a rewindable iterator: sub make_rewindable { my$it = shift;
my @saved;  # upcoming values in LIFO order
return sub {
my $action = shift || "next"; if ($action eq "put back") {
push @saved, @_;
} elsif ($action eq "next") { if (@saved) { return pop @saved; } else { return$it->(); }
}
};
}
But it's kind of a pain in the butt to use:

sub carrot_coming_soon {
my $it = shift; my$soon = shift || 3;
my @saved;
my $saw_carrot; while ($soon-- > 0) {
push @saved, $it->();$saw_carrot = 1, last if is_carrot($saved[-1]); }$it->("put back", @saved);
return $saw_carrot; } Because you have to explicitly restore the data you extracted. With the streams, it's all much easier: sub carrot_coming_soon { my$s = shift;
my $soon = shift || 3; while ($seen-- > 0) {
return 1 if is_carrot($s->head); drop($s);
}
return;
}
The working version of carrot_coming_soon() for streams looks just like the non-working version for iterators.

But this version of carrot_coming_soon() only works for memoizing streams, or for streams whose promise functions are pure. Let's consider a counterexample:

my $bad = filehandle_stream(\*DATA); sub filehandle_stream { my$fh = shift;
return node(scalar <$fh>, promise { filehandle_stream($fh) });
}

__DATA__
fish
dog
carrot
goat rectum
Now consider what happens if I do this:

$carrot_soon = carrot_coming_soon($bad);
print "A carrot appears soon after item ", head($bad), "\n" if$carrot_soon;
It says "A carrot appears soon after item fish". Fine. That's because $bad is a node whose head contains "fish". Now let's see what's after the fish: print "After ", head($bad), " is ", head(tail($bad)), "\n"; This should print After fish is dog, and for the memoizing streams I used in the book, it does. But a non-memoizing stream will print "After fish is goat rectum". Because tail($bad) invokes the promise function, which, since the next() was not saved after carrot_coming_soon() examined it, builds a new node, which reads the next item from the filehandle, which is "goat rectum".

I wish I had explained the rewinding property of the streams in the book. It's one of the most significant omissions I know about. And I wish I'd appreciated sooner that the rewinding property only works if the tail() function autosaves the tail node returned from the promise.

Wed, 12 Nov 2008

Flag variables in Bourne shell programs
Who the heck still programs in Bourne shell? Old farts like me, occasionally. Of course, almost every time I do I ask myself why I didn't write it in Perl. Well, maybe this will be of some value to some fart even older than me..

Suppose you want to set a flag variable, and then later you want to test it. You probably do something like this:

if some condition; then
IS_NAKED=1
fi

...

if [ "$IS_NAKED" == "1" ]; then flag is set else flag is not set fi Or maybe you use${IS_NAKED:-0} or some such instead of "$IN_NAKED". Whatever. Today I invented a different technique. Try this on instead: IS_NAKED=false if some condition; then IS_NAKED=true fi ... if$IS_NAKED; then
flag is set
else
flag is not set
fi
The arguments both for and against it seem to be obvious, so I won't make them.

I have never seen this done before, but, as I concluded and R.J.B. Signes independently agreed, it is obvious once you see it.

[ Addendum 20090107: some followup notes ]

Thu, 18 Sep 2008

data Mu f = In (f (Mu f))
Last week I wrote about one of two mindboggling pieces of code that appears in the paper Functional Programming with Overloading and Higher-Order Polymorphism, by Mark P. Jones. Today I'll write about the other one. It looks like this:

data Mu f = In (f (Mu f))                       -- (???)
I bet a bunch of people reading this on Planet Haskell are nodding and saying "Oh, that!"

When I first saw this I couldn't figure out what it was saying at all. It was totally opaque. I still have trouble recognizing in Haskell what tokens are types, what tokens are type constructors, and what tokens are value constructors. Code like (???) is unusually confusing in this regard.

Normally, one sees something like this instead:

data Maybe f = Nothing | Just f
Here f is a type variable; that is, a variable that ranges over types. Maybe is a type constructor, which is like a function that you can apply to a type to get another type. The most familiar example of a type constructor is List:

data List e = Nil | Cons e (List e)
Given any type f, you can apply the type constructor List to f to get a new type List f. For example, you can apply List to Int to get the type List Int. (The Haskell built-in list type constructor goes by the funny name of [], but works the same way. The type [Int] is a synonym for ([] Int).)

Actually, type names are type constructors also; they're argumentless type constructors. So we have type constructors like Int, which take no arguments, and type constructors like List, which take one argument. Haskell also has type constructors that take more than one argument. For example, Haskell has a standard type constructor called Either for making union types:

data Either a b = Left a | Right b;
Then the type Either Int String contains values like Left 37 and Right "Cotton Mather".

To keep track of how many arguments a type constructor has, one can consider the, ahem, type, of the type constructor. But to avoid the obvious looming terminological confusion, the experts use the word "kind" to refer to the type of a type constructor. The kind of List is * → *, which means that it takes a type and gives you back a type. The kind of Either is * → * → *, which means that it takes two types and gives you back a type. Well, actually, it is curried, just like regular functions are, so that Either Int is itself a type constructor of kind * → * which takes a type a and returns a type which could be either an Int or an a. The nullary type constructor Int has kind *.

Continuing the "Maybe" example above, f is a type, or a constructor of kind *, if you prefer. Just is a value constructor, of type fMaybe f. It takes a value of type f and produces a value of type Maybe f.

Now here is a crucial point. In declarations of type constructors, such as these:

data Either a b = ...
data List e = ...
data Maybe f = ...
the type variables a, b, e, and f actually range over type constructors, not over types. Haskell can infer the kinds of the type constructors Either, List, and Maybe, and also the kinds of the type variables, from the definitions on the right of the = signs. In this case, it concludes that all four variables must have kind *, and so really do represent types, and not higher-order type constructors. So you can't ask for Either Int List because List is known to have kind * → *, and Haskell needs a type constructor of kind * to serve as an argument to Either.

But with a different definition, Haskell might infer that a type variable has a higher-order kind. Here is a contrived example, which might be good for something, perhaps. I'm not sure:

data TyCon f = ValCon (f Int)
This defines a type constructor TyCon with kind (* → *) → *, which can be applied to any type constuctor f that has kind * → *, to yield a type. What new type? The new type TyCon f is isomorphic to the type f of Int. For example, TyCon List is basically the same as List Int. The value Just 37 has type Maybe Int, and the value ValCon (Just 37) has type TyCon Maybe.

Similarly, the value [1, 2, 3] has type [Int], which, you remember, is a synonym for [] Int. And the value ValCon [1, 2, 3] has type TyCon [].

Now that the jargon is laid out, let's look at (???) again:

data Mu f = In (f (Mu f))                       -- (???)
When I was first trying to get my head around this, I had trouble seeing what the values were going to be. It looks at first like it has no bottom. The token f here, like in the TyCon example, is a variable that ranges over type constructors with kind * → *, so could be List or Maybe or [], something that takes a type and yields a new type. Mu itself has kind (* → *) → *, taking something like f and yielding a type. But what's an actual value? You need to apply the value constructor In to a value of type f (Mu f), and it's not immediately clear where to get such a thing.

I asked on #haskell, and Cale Gibbard explained it very clearly. To do anything useful you first have to fix f. Let's take f = Maybe. In that particular case, (???) becomes:

data Mu Maybe = In (Maybe (Mu Maybe))
So the In value constructor will take a value of type Maybe (Mu Maybe) and return a value of type Mu Maybe. Where do we get a value of type Maybe (Mu Maybe)? Oh, no problem: the value Nothing is polymorphic, and has type Maybe a for all a, so in particular it has type Maybe (Mu Maybe). Whatever Maybe (Mu Maybe) is, it is a Maybe-type, so it has a Nothing value. So we do have something to get started with.

Since Nothing is a Maybe (Mu Maybe) value, we can apply the In constructor to it, yielding the value In Nothing, which has type Mu Maybe. Then applying Just, of type a → Maybe a, to In Nothing, of type Mu Maybe, produces Just (In Nothing), of type Maybe (Mu Maybe) again. We can repeat the process as much as we want and produce as many values of type Mu Maybe as we want; they look like these:

In Nothing
In (Just (In Nothing))
In (Just (In (Just (In Nothing))))
In (Just (In (Just (In (Just (In Nothing))))))
...
And that's it, that's the type Mu Maybe, the set of those values. It will look a little simpler if we omit the In markers, which don't really add much value. We can just agree to omit them, or we can get rid of them in the code by defining some semantic sugar:

nothing = In Nothing
just = In . Just
Then the values of Mu Maybe look like this:
nothing
just nothing
just (just nothing)
just (just (just nothing))
...
It becomes evident that what the Mu operator does is to close the type under repeated application. This is analogous to the way the fixpoint combinator works on values. Consider the usual definition of the fixpoint combinator:

Y f = f (Y f)
Here f is a function of type aa. Y f is a fixed point of f. That is, it is a value x of type a such that f x = x. (Put x = Y f in the definition to see this.)

The fixed point of a function f can be computed by considering the limit of the following sequence of values:

f(⊥)
f(f(⊥))
f(f(f(⊥)))
...

This actually finds the least fixed point of f, for a certain definition of "least". For many functions f, like xx + 1, this finds the uninteresting fixed point ⊥, but for many f, like x → λ n. if n = 0 then 1 else n * x(n - 1), it's something better.

Mu is analogous to Y. Instead of operating on a function f from values to values, and producing a single fixed-point value, it operates on a type constructor f from types to types, and produces a fixed-point type. The resulting type T is the least fixed point of the type constructor f, the smallest set of values such that f T = T.

Consider the example of f = Maybe again. We want to find a type T such that T = Maybe T. Consider the following sequence:

{ ⊥ }
Maybe { ⊥ }
Maybe(Maybe { ⊥ })
Maybe(Maybe(Maybe { ⊥ }))
...

The first item is the set that contains nothing but the bottom value, which we might call t0. But t0 is not a fixed point of Maybe, because Maybe { ⊥ } also contains Nothing. So Maybe { ⊥ } is a different type from t0, which we can call t1 = { Nothing, ⊥ }.

The type t1 is not a fixed point of Maybe either, because Maybe t1 evidently contains both Nothing and Just Nothing. Repeating this process, we find that the limit of the sequence is the type Mu Maybe = { ⊥, Nothing, Just Nothing, Just (Just Nothing), Just (Just (Just Nothing)), ... }. This type is fixed under Maybe.

It might be worth pointing out that this is not the only such fixed point, but is is the least fixed point. One can easily find larger types that are fixed under Maybe. For example, postulate a special value Q which has the property that Q = Just Q. Then Mu Maybe ∪ { Q } is also a fixed point of Maybe. But it's easy to see (and to show, by induction) that any such fixed point must be a superset of Mu Maybe. Further consideration of this point might take me off to co-induction, paraconsistent logic, Peter Aczel's nonstandard set theory, and I'd never get back again. So let's leave this for now.

So that's what Mu really is: a fixed-point operator for type constructors. And having realized this, one can go back and look at the definition and see that oh, that's precisely what the definition says, how obvious:

Y f =     f  (Y f)             -- ordinary fixed-point operator
data Mu f = In (f (Mu f))            -- (???)
Given f, a function from values to values, Y(f) calculates a value x such that x = f(x). Given f, a function from types to types, Mu(f) calculates a type T such that f(T) = T. That's why the definitions are identical. (Except for that annoying In constructor, which really oughtn't to be there.)

You can use this technique to construct various recursive datatypes. For example, Mu Maybe turns out to be equivalent to the following definition of the natural numbers:

data Number = Zero | Succ Number;
Notice the structural similarity with the definition of Maybe:

data Maybe a = Nothing | Just a;
One can similarly define lists:

data Mu f = In (f (Mu f))
data ListX a b = Nil | Cons a b deriving Show
type List a = Mu (ListX a)

-- syntactic sugar
nil :: List a
nil = In Nil
cons :: a → List a → List a
cons x y = In (Cons x y)

-- for example
ls = cons 3 (cons 4 (cons 5 nil))          -- :: List Integer
lt = (cons 'p' (cons 'y' (cons 'x' nil)))  -- :: List Char
Or you could similarly do trees, or whatever. Why one might want to do this is a totally separate article, which I am not going to write today.

Here's the point of today's article: I find it amazing that Haskell's type system is powerful enough to allow one to defined a fixed-point operator for functions over types.

We've come a long way since FORTRAN, that's for sure.

A couple of final, tangential notes: Google search for "Mu f = In (f (Mu f))" turns up relatively few hits, but each hit is extremely interesting. If you're trying to preload your laptop with good stuff to read on a plane ride, downloading these papers might be a good move.

The Peter Aczel thing seems to be less well-known that it should be. It is a version of set theory that allows coinductive definitions of sets instead of inductive definitions. In particular, it allows one to have a set S = { S }, which standard set theory forbids. If you are interested in co-induction you should take a look at this. You can find a clear explanation of it in Barwise and Etchemendy's book The Liar (which I have read) and possibly also in Aczel's book Non Well-Founded Sets (which I haven't read).

Thu, 11 Sep 2008

Return return
Among the things I read during the past two months was the paper Functional Programming with Overloading and Higher-Order Polymorphism, by Mark P. Jones. I don't remember why I read this, but it sure was interesting. It is an introduction to the new, cool features of Haskell's type system, with many examples. It was written in 1995 when the features were new. They're no longer new, but they are still cool.

There were two different pieces of code in this paper that wowed me. When I started this article, I was planning to write about #2. I decided that I would throw in a couple of paragraphs about #1 first, just to get it out of the way. This article is that couple of paragraphs.

Suppose you have a type that represents terms over some type v of variable names. The v type is probably strings but could possibly be something else:

data Term v = TVar v                -- Type variable
| TInt                  -- Integer type
| TString               -- String type
| Fun (Term v) (Term v) -- Function type
There's a natural way to make the Term type constructor into an instance of Monad:

return v          = TVar v
TVar v   >>= f = f v
TInt     >>= f = TInt
TString  >>= f = TString
Fun d r  >>= f = Fun (d >>= f) (r >>= f)
That is, the return operation just lifts a variable name to the term that consists of just that variable, and the bind operation just maps its argument function over the variable names in the term, leaving everything else alone.

Jones wants to write a function, unify, which performs a unification algorithm over these terms. Unification answers the question of whether, given two terms, there is a third term that is an instance of both. For example, consider the two terms a → Int and String → b, which are represented by Fun (TVar "a") TInt and Fun TString (TVar "b"), respectively. These terms can be unified, since the term String → Int is an instance of both; one can assign a = TString and b = TInt to turn both terms into Fun TString TInt.

The result of the unification algorithm should be a set of these bindings, in this example saying that the input terms can be unified by replacing the variable "a" with the term TString, and the variable "b" with the term TInt. This set of bindings can be represented by a function that takes a variable name and returns the term to which it should be bound. The function will have type v → Term v. For the example above, the result is a function which takes "a" and returns TString, and which takes "b" and returns TInt. What should this function do with variable names other than "a" and "b"? It should say that the variable named "c" is "replaced" by the term TVar "c", and similarly other variables. Given any other variable name x, it should say that the variable x is "replaced" by the term TVar x.

The unify function will take two terms and return one of these substitutions, where the substition is a function of type v → Term v. So the unify function has type:

unify :: Term v → Term v → (v → Term v)
Oh, but not quite. Because unification can also fail. For example, if you try to unify the terms ab and Int, represented by Fun (TVar "a") (TVar "b") and TInt respectively, the unfication should fail, because there is no term that is an instance of both of those; one represents a function and the other represents an integer. So unify does not actually return a substitution of type v → Term v. Rather, it returns a monad value, which might contain a substitution, if the unification is successful, and otherwise contains an error value. To handle the example above, the unify function will contain a case like this:

unify	TInt	(Fun _ _) = fail ("Cannot unify" ....)
It will fail because it is not possible to unify functions and integers.

If unification is successful, then instead of using fail, the unify function will construct a substitution and then return it with return. Let's consider the result of unifying TInt with TInt. This unification succeeds, and produces a trivial substitition with no bindings. Or more precisely, every variable x should be "replaced" by the term TVar x. So in this case the substitution returned by unify should be the trivial one, a function which takes x and returns TVar x for all variable names x.

But we already have such a function. This is just what we decided that Term's return function should do, when we were making Term into a monad. So in this case the code for unify is:

unify	TInt	TInt	  = return return
Yep, in this case the unify function returns the return function.

Wheee!

At this point in the paper I was skimming, but when I saw return return, I boggled. I went back and read it more carefully after that, you betcha.

That's my couple of paragraphs. I was planning to get to this point and then say "But that's not what I was planning to discuss. What I really wanted to talk about was...". But I think I'll break with my usual practice and leave the other thing for tomorrow.

Happy Diada Nacional de Catalunya, everyone!

Sat, 12 Jul 2008

runN revisited
Exactly one year ago I discussed runN, a utility that I invented for running the same command many times, perhaps in parallel. The program continues to be useful to me, and now Aaron Crane has reworked it and significantly improved the interface. I found his discussion enlightening. He put his finger on a lot of problems that had been bothering me that I had not quite been able to pin down.

Check it out. Thank you, M. Crane.

Tue, 17 Jun 2008

De­function­al­iz­a­tion and Java
A couple of weeks ago I was introduced to the notion of defunctionalization by this article on Ken Knowles' blog. Defunctionalization is a program transformation that removes the higher-order functions from a program. The idea is that you replace something like λx.x+y with a data structure that encapsulates a value of y somewhere, say (HOLD y). And instead of using the language's built-in function application to apply this object directly to an argument x, you write a synthetic applicator that takes (HOLD y) and x and returns x + y. And anyone who wanted to apply λx.x+y to some argument x in some context in which y was bound should first construct (HOLD y), then use the synthetic applicator on (HOLD y) and x.

Consider, for example, the following Haskell program:

aux f = f 1 + f 10
res x = aux (λz -> z + x)
The defunctionalization of this example is:

data Hold = HOLD Int
fake_apply (HOLD a) b = a + b
aux held = fake_apply held 1 + fake_apply held 10
res x = aux (HOLD x)
I hope this will make the idea clear.

M. Knowles cites the paper Defunctionalization at work by Olivier Danvy and Lasse R. Nielsen, which was lots of fun. (My Haskell example above is a simplification of the example from page 5 of Danvy and Nielsen.) Among other things, Danvy and Nielsen point out that this defunctionalization transformation is in a certain sense dual to the transformation that turns ordinary data structures into λ-terms in Church encoding. Church encloding turns data items like pairs or booleans into higher-order functions; defunctionalization turns them back again.

Section 1.4 of the Danvy and Nielsen paper lists a whole bunch of contexts in which this technique has been studied and used, but one thing I didn't think I saw there is that this is essentially the transformation that Java programmers use when they want to use closures.

For example, suppose a Java programmer wants to write something like aux in:

aux f = f 1 + f 10
res x = aux (λz -> z + x)
But they can't, because Java doesn't have closures.

/* Java */

class Hold {
private int a;

public Hold(int a) {
this.a = a;
}

public int fake_apply(int b) {
return this.a + b;
}
}

private static int aux(Hold h) {
return h.fake_apply(1) + h.fake_apply(10);
}

static int res(int x) {
Hold h = new Hold(x);
return aux(h);
}
Where the class Hold corresponds directly to the data type Hold in the defunctionalized Haskell code.

Here is a real example. Consider GNU Emacs. When I enter text-mode in Emacs, I want a bunch of subsystems to be notified. Emacs has a text-mode-hook variable, which is basically a list of functions, and when an Emacs buffer is put into text-mode, Emacs invokes the hooks. Any subsystem that wants to be notified puts its own hook function into that variable. If I wanted to accomplish something similar in Haskell or SML, I would similarly use a list of functions.

In Java, the corresponding facility is called java.util.Observable. Were one implementing Emacs in Java (perish the thought!) the mode object would inherit from Observable, and so would provide an addObserver method for adding a hook to a list somewhere. When the mode was switched to text-mode, the mode object would call notifyObservers, which would loop over the hook list, calling the hooks. So far this is just like Emacs Lisp.

But in Java the hooks are not functions, as they are in Emacs, because in Java functions are not first-class entities. Instead, the hooks are objects which conform to the Observer interface specification, and instead of invoking functions directly, the notifyObservers method calls the update method on each hook object.

Here's another example. I wrote a recursive descent parser in Java a while back. An ActionParser is just like a Parser, except that if its parse succeeds, it invokes a callback. If I were programming in SML or Haskell or Perl, an ActionParser would be nothing but a Parser with an associated closure, something like this:

# Perl
package ActionParser;

sub new {
my ($class,$parser, $action) = @_; bless { Parser =>$parser,
Action => $action } =>$class;
}

# Just like the embedded parser, but invoke the action on success
sub parse {
my $self = shift; my$input = shift;
my $result =$self->{Parser}->parse($input); if ($result->success)
$self->{Action}->($result);   # Invoke action
}
return $result; } Here the Action member is expected to be a closure, which is automatically invoked if the parse succeeds. To use this, I would write something like this: # Perl my$missiles;
...
my $parser = ActionParser->new($otherParser,
sub { $missiles->launch() } );$parser->parse($input); And then if the input parses correctly, the parser launches the missiles from the anonymous closure, which has captured the local$missiles object.

But in Java, you have no closures. Instead, you defunctionalize, and represent closures with objects:

/* Java */
abstract class Action {
void invoke(ParseResults results) {}
}

class ActionParser extends Parser {
Action action;
Parser parser;

ActionParser(Parser p, Action a) {
action = a;
parser = p;
}

ParseResults Parse(Input input) {
ParseResults res = this.parser.Parse(input);
if (res.isSuccess) {
this.action.invoke(res);
}
return res;
}
}
To use this, one writes something like this:

/* Java */

class LaunchMissilesAction extends Action {
Missiles m;

LaunchMissilesAction(Missiles m) { this.m = m; }
void invoke(ParseResults results) {
m.launch();
}
}

...

Action a = new LaunchMissilesAction(missiles);
Parser p = new ActionParser(otherParser, a);
p.parse(input);
The constructor argument missiles takes the place of a free variable in a closure. The closure itself has been replaced with an object from an ad hoc class, just as in Danvy and Nielsen's formulation, the closure is replaced with a synthetic data object that holds the values of the free variables. The invoke method plays the role of fake_apply.

Now, it's not a particularly interesting observation that this can be done. The interesting part, I think, is that this is what Java programmers actually do. And also, perhaps, that Danvy and Nielsen didn't mention it in their paper, because I think the technique is pretty widespread.

Fri, 30 May 2008

Last week I needed to mock up a dialog box I was talking about in this article:

I wasn't sure how to do this, and my first draft just had a description. But the day before, I had happened to notice a new item that had appeared in the "Programming" menu on my Ubuntu computer: It said "Glade Interface Designer". I had started it up, for no particular reason, and tinkered with it for about two minutes.

Glade lets you design a window interface, by positioning buttons and sliders and things, and then does something or other. At the time I didn't know what it would do, but I knew I could mock up the window I wanted, and I thought maybe I could screenshot the mockup for the blog article.

The Glade thing was so easy to use that the easiest way to get a mockup of the dialog was to have Glade generate a complete, working windowing application, compile and run the application, and then screenshot the application. I got this done in about fifteen minutes.

The application I made doesn't actually do anything, but it does compile, run, and pop up the dialog box I designed. I'm confident that I could get it to do something pretty easily, if I wanted. The auto-generated code, and some of the Glade controls, are very suggestive.

I give Glade a big gold star. I went from having never heard of it to a working (although trivial) window application in one two-minute session and one fifteen-minute session. Maybe two big gold stars and a "Good work!" sticker.

[ Addendum 20080530: I went ahead with making an application that actually does something. It worked. ]

After writing about Glade Interface Designer today, I decided to go ahead and see if it would be as easy to make a working application as I hoped it would be.

The outcome: big success.

The application has a window with two input fields, a "+" button, and an output field that shows the sum of the input fields when you press the "+" button. It took about half an hour from start to finish, and the only thing I had to look up in the manual was the names of the functions that read and write the values of the text fields. Everything else I got through bricolage and tinkering with the autogenerated monkey code.

The biggest problem that I encountered was that the application didn't exit when I clicked the close box, although the window disappeared. I figured out that the close box was sending a "delete" event and not a "destroy" event and fixed it up right quick.

Gtk+ and Glade Interface Designer get at least two gold stars. Maybe three. Maybe fifty-three.

Fri, 28 Mar 2008

Suffering from "make install"
I am writing application X, which uses the nonstandard perl modules DBI, DBD::SQLite, and Template. These might not be available on the target system, so I got the idea to include them in the distribution for X and have the build process for X build and install the modules. X already carries its own custom Perl modules in X/lib anyway, so I can just install DBI and the others into X/lib and everything will Just Work. Or so I thought.

After building DBI, for example, how do you get it to install itself into X/lib instead of the default system-wide location, which only the super-user has permission to modify?

There are at least five solutions to this common problem.

Uh-oh. If solution #1 had worked, people would not have needed to invent solution #2. If solution #2 had worked, people would not have needed to invent solution #3. Since there are five solutions, there is a good chance that none of them work.

You can, I am informed:

• Set PREFIX=X when building the Makefile
• Set INSTALLDIRS=vendor and VENDORPREFIX=X when building the Makefile
• Or maybe instead of VENDORPREFIX you need to set INSTALLVENDORLIB or something
• Or maybe instead of setting them while building the Makefile you need to set them while running the make install target
• Set LIB=X/lib when building the Makefile
• Use PAR
• Use local::lib
Some of these fail by being excessively complicated. Some fail by addressing a larger problem set that is too large. For example, I do not want to do whatever PAR does; I just want to install the damn modules into X/lib where the application can find them.

Some of these items fail because they just plain fail. For example, the first thing everyone says is that you can just set PREFIX to X. No, because then the module Foo does not go into X/lib/Foo.pm. It goes into X/Foo/lib/perl5/site_perl/5.12.23/Foo.pm. Which means that if X does use lib 'X/lib'; it will not be able to find Foo.

The manual (which goes by the marvelously obvious and easily-typed name of ExtUtils::MakeMaker, by the way) is of limited help. It recommends solving the problem by travelling to Paterson, NJ, gouging your eyes out with your mom's jewelry, and then driving over the Passaic River falls. Ha ha, just kidding. That would be a big improvement on what it actually suggests, for three reasons. First, it is clear and straightforward. Second, it would feel better than the stuff it does suggest. And third, it would actually solve your problem, although obliquely.

It turns out there is a simple solution that doesn't involve travelling to New Jersey. The first thing you have to do is give up entirely on trying to use make install to install the modules. It is completely broken for this application, because even if the destination could somehow be forced to be what you wanted—and, after all, why would you expect that make install would let you configure the destination directory in a simple fashion?—it would still install not only the contents of MODULE/lib, but also the contents of MODULE/bin, MODULE/man, MODULE/share, MODULE/pus, MODULE/dork, MODULE/felch, and MODULE/scrotum, some of which you probably didn't want.

So no. But the solution is actually simple. The normal module build process (as distinct from the install process) puts all this crap under MODULE/blib. The test suite is run against the blib installation. So the test programs have the same problem that X has. If they can find the stuff under blib, so can X, by replicating the layout under blib and then doing what the test suite does.

In fact, the modules are installed into the proper subdirectories of MODULE/blib/lib. So the simple solution is just to build the module and then, instead of trying to get the installer to put the right stuff in the right place, use cp -pr MODULE/blib/lib/* X/lib. Problem solved.

For modules with a shared library, you need to copy MODULE/blib/arch/auto/* into X/lib/auto also.

I remember suffering over this at least ten years ago, when a student in a class I was teaching asked me how to do it and I let ExtUtils::MakeMaker make a monkey of me. I was amazed to find myself suffering over it once again. I am relieved to have found the right answer.

This is one of those days when I am not happy with software. It sometimes surprises me how many of those days involve make.

Dennis Ritchie once said that "make is like Pascal. Everybody likes it, so they go in and change it." I never really thought about this before, but it now occurs to me that probably Ritchie meant that they like make in about the same way that they like bladder stones. Because Dennis Ritchie probably does not like Pascal, and actually nobody else likes Pascal either. They may say they do, and they may even think they do, but if you look a little closer it always turns out that the thing they like is not actually Pascal, but some language that more or less resembles Pascal. Unfortunately, the changes people make to make tend to make it bigger and wartier, and this improves make about as much as it would improve a bladder stone.

I would like to end this article on a positive note. If you haven't already, please read Recursive make Considered Harmful and be prepared to be blinded by the Glorious Truth therein.

Fri, 21 Mar 2008

my $command = shift; for my$file (@ARGV) {
if ($file =~ /\.gz$/) {
my $fh; unless (open$fh, "<", $file) { warn "Couldn't open$file: $!; skipping\n"; next; } my$fd = fileno $fh;$file = "/proc/self/fd/$fd"; } } exec$command, @ARGV;
die "Couldn't run command '$command':$!\n";
When the loop exits, $fh is out of scope, and the filehandle it contains is garbage-collected, closing the file. "Duh." Several people suggested that it was because open files are not preserved across an exec, or because the meaning of /proc/self would change after an exec, perhaps because the command was being run in a separate process; this is mistaken. There is only one process here. The exec call does not create a new process; it reuses the same one, and it does not affect open files, unless they have been flagged with FD_CLOEXEC. Abhijit Menon-Sen ran a slightly different test than I did: % z cat foo.gz bar.gz cat: /proc/self/fd/3: No such file or directory cat: /proc/self/fd/3: No such file or directory As he said, this makes it completely obvious what is wrong, since the two files are both represented by the same file descriptor. Thu, 20 Mar 2008 Closed file descriptors I wasn't sure whether to file this on the /oops section. It is a mistake, and I spent a lot longer chasing the bug than I should have, because it's actually a simple bug. But it isn't a really big conceptual screwup of the type I like to feature in the /oops section. It concerns a program that I'll discuss in detail tomorrow. In the meantime, here's a stripped-down summary, and a stripped-down version of the code: my$command = shift;
for my $file (@ARGV) { if ($file =~ /\.gz$/) { my$fh;
unless (open $fh, "<",$file) {
warn "Couldn't open $file:$!; skipping\n";
next;
}
my $fd = fileno$fh;
$file = "/proc/self/fd/$fd";
}
}

exec $command, @ARGV; die "Couldn't run command '$command': $!\n"; The idea here is that this program, called z, will preprocess the arguments of some command, and then run the command with the modified arguments. For some of the command-line arguments, here the ones named *.gz, the original file will be replaced by the output of some file descriptor. In the example above, the descriptor is attached to the original file, which is pointless. But once this part of the program was working, I planned to change the code so that the descriptor would be attached to a pipe instead. Having written something like this, I then ran a test, which failed: % z cat foo.gz cat: /proc/self/fd/3: No such file or directory "Aha," I said instantly. "I know what is wrong. Perl set the close-on-exec flag on file descriptor 3." You see, after a successful exec, the kernel will automatically close all file descriptors that have the close-on-exec flag set, before the exec'ed image starts running. Perl normally sets the close-on-exec flag on all open files except for standard input, standard output, and standard error. Actually it sets it on all open files whose file descriptor is greater than the value of$^F, but $^F defaults to 2. So there is an easy fix for the problem: I just set$^F = 100000 at the top of the program. That is not the best solution, but it can be replaced with a better one once the program is working properly. Which I expected it would be:

% z cat foo.gz
cat: /proc/self/fd/3: No such file or directory
Huh, something is still wrong.

Maybe I misspelled /proc/self/fd? No, it is there, and contains the special files that I expected to find.

Maybe $^F did not work the way I thought it did? I checked the manual, but it looked okay. Nevertheless I put in use Fcntl and used the fcntl function to remove the close-on-exec flags explicitly. The code to do that looks something like this: use Fcntl; .... my$flags = fcntl($fh, F_GETFD, 0); fcntl($fh, F_SETFD, flags & ~FD_CLOEXEC); And try it again: % z cat foo.gz cat: /proc/self/fd/3: No such file or directory Huh. I then wasted a lot of time trying to figure out an easy way to tell if the file descriptor was actually open after the exec call. (The answer turns out to be something like this: perl -MPOSIX=fstat -le 'print "file descriptor 3 is ", fstat(3) ? "open" : "closed"'.) This told me whether the error from cat meant what I thought it meant. It did: descriptor 3 was indeed closed after the exec. Now your job is to figure out what is wrong. It took me a shockingly long time. No need to email me about it; I have it working now. I expect that you will figure it out faster than I did, but I will also post the answer on the blog tomorrow. Sometime on Friday, 21 March 2008, this link will start working and will point to the answer. [ Addendum 20080321: I posted the answer. ] Fri, 14 Mar 2008 Drawing lines As part of this thing I sometimes do when I'm not writing in my blog—what is it called?—oh, now I remember. As part of my job I had to produce the following display: The idea here is that the user can fill in the names of three organisms into the form blanks, and the application will find all the studies in its database which conclude that those organisms are related in the indicated way. For example, the user can put "whale" and "hippo" in the top two blanks and "cow" in the bottom one, and the result will be all the studies that conclude (perhaps among other things) that whales and hippos are more closely related to each other than either is to cows. (I think "cothurnocystis bifida" is biologist jargon for cows.) If you wanted to hear more about phylogeny, Java programming, or tree algorithms, you are about to be disappointed. The subject of my article today is those fat black lines. The first draft of the page did not have the fat black lines. It had some incredibly awful ASCII-art that was not even properly aligned. Really it was terrible; it would have been better to have left it out completely. I will not make you look at it. I needed the lines, so I popped down the "graphics" menu on my computer and looked for something suitable. I tried the Gimp first. It seems that the Gimp has no tool for drawing straight lines. If someone wants to claim that it does, I will not dispute the claim. The Gimp has a huge and complex control panel covered with all sorts of gizmos, and maybe one of those gizmos draws a straight line. I did not find one. I gave up after a few minutes. Next I tried Dia. It kept selecting the "move the line around on the page" tool when I thought I had selected the "draw another line" tool. The lines were not constrained to a grid by default, and there was no obvious way to tell it that I wanted to draw a diagram smaller than a whole page. I would have had to turn the thing into a bitmap and then crop the bitmap. "By Zeus's Beard," I cried, "does this have to be so difficult?" Except that the oath I actually uttered was somewhat coarser and less erudite than I have indicated. I won't repeat it, but it started with "fuck" and ended with "this". Here's what I did instead. I wrote a program that would read an input like this: >-v-< '-+- and produce a jpeg file that looks like this: Or similarly this: .---, | >--, '--- '- Becomes this: You get the idea. Now I know some of you are just itching to write to me and ask "why didn't you just use...?", so before you do that, let me remind you of two things. First, I had already wasted ten or fifteen minutes on "just use..." that didn't work. And second, this program only took twenty minutes to write. The program depends on one key insight, which is that it is very, very easy to write a Perl program that generates a graphic output in "PBM" ("portable bitmap") format. Here is a typical PBM file: P1 10 10 1111111111 1000000001 1000000001 1001111001 1001111001 1001111001 1001111001 1000000001 1000000001 1111111111 The P1 is a magic number that identifies the file format; it is always the same. The 10 10 warns the processor that the upcoming bitmap is 10 pixels wide and 10 pixels high. The following characters are the bitmap data. I'm not going to insult you by showing the 10×10 bitmap image that this represents. PBM was invented about twenty years ago by Jef Poskanzer. It was intended to be an interchange format: say you want to convert images from format X to format Y, but you don't have a converter. You might, however, have a converter that turns X into PBM and then one that turns PBM into Y. Or if not, it might not be too hard to produce such converters. It is, in the words of the Extreme Programming guys, the Simplest Thing that Could Possibly Work. There are also PGM (portable graymap) and PPM (portable pixmap) formats for grayscale and 24-bit color images as well. They are only fractionally more complicated. Because these formats are so very, very simple, they have been widely adopted. For example, the JPEG reference implementation includes a sample cjpeg program, for converting an input to a JPEG file. The input it expects is a PGM or PPM file. Writing a Perl program to generate a P?M file, and then feeding the output to pbmtoxbm or ppmtogif or cjpeg is a good trick, and I have used it many times. For example, I used this technique to generate a zillion little colored squares in this article about the Pólya-Burnside counting lemma. Sure, I could have drawn them one at a time by hand, and probably gone insane and run amuck with an axe immediately after, but the PPM technique was certainly much easier. It always wins big, and this time was no exception. The program may be interesting as an example of this technique, and possibly also as a reminder of something else. The Perl community luminaries invest a lot of effort in demonstrating that not every Perl program looks like a garbage heap, that Perl can be as bland and aseptic as Java, that Perl is not necessarily the language that most closely resembles quick-drying shit in a tube, from which you can squirt out the contents into any shape you want and get your complete, finished artifact in only twenty minutes and only slightly smelly. No, sorry, folks. Not everything we do is a brilliant, diamond-like jewel, polished to a luminous gloss with pages torn from one of Donald Knuth's books. This line-drawing program was squirted out of a tube, and a fine brown piece of engineering it is. #!/usr/bin/perl my (S) = shift || 50;
$S here is "size". The default is to turn every character in the input into a 50×50 pixel tile. Here's the previous example with$S=10:

my ($h,$w);
my $output = []; while (<>) { chomp;$w ||= length();
$h++; push @$output, convert($_); } The biggest defect in the program is right here: it assumes that each line will have the same width$w. Lines all must be space-padded to the same width. Fixing this is left as an easy exercise, but it wasn't as easy as padding the inputs, so I didn't do it.

The magic happens here:

open STDOUT, "| pnmscale 1 | cjpeg" or die $!; print "P1\n",$w * $S, " ",$h * $S, "\n"; print$_, "\n" for @$output; exit; The output is run through cjpeg to convert the PBM data to JPEG. For some reason cjpeg doesn't accept PBM data, only PGM or PPM, however, so the output first goes through pnmscale, which resizes a P?M input. Here the scale factor is 1, which is a no-op, except that pnmscale happens to turn a PBM input into a PGM output. This is what is known in the business as a "trick". (There is a pbmtopgm program, but it does something different.) If we wanted gif output, we could have used "| ppmtogif" instead. If we wanted output in Symbolics Lisp Machine format, we could have used "| pgmtolispm" instead. Ah, the glories of interchange formats. I'm going to omit the details of convert, which just breaks each line into characters, calls convert_ch on each character, and assembles the results. (The complete source code is here if you want to see it anyway.) The business end of the program is convert_ch: # sub convert_ch { my @rows; my$ch = shift;
my $up =$ch =~ /[<|>^'+]/i;
my $dn =$ch =~ /[<|>V.,+]/i;
my $lt =$ch =~ /[-<V^,+]/i;
my $rt =$ch =~ /[->V^.'+]/i;
These last four variables record whether the tile has a line from its center going up, down, left, or right respectively. For example, "|" produces a tile with lines coming up and down from the center, but not left or right. The /i in the regexes is because I kept writing v instead of V in the inputs.

my $top = int($S * 0.4);
my $mid = int($S * 0.2);
my $bot = int($S * 0.4);
The tile is divided into three bands, of the indicated widths. This probably looks bad, or fails utterly, unless $S is a multiple of 5. I haven't tried it. Do you think I care? Hint: I haven't tried it. my$v0 = "0" x $S; my$v1 = "0" x $top . "1" x$mid . "0" x $bot; push @rows, ($up ? $v1 :$v0) x $top; This assembles the top portion of the tile, including the "up" line, if there is one. Note that despite their names,$top also determines the width of the left portion of the tile, and $bot determines the width of the right portion. The letter "v" here is for "vertical". Perhaps I should explain for the benefit of the readers of Planet Haskell (if any of them have read this far and not yet fainted with disgust) that "$a x $b" in Perl is like concat (replicate b a) in the better sorts of languages. my$ls = $lt ? "1" : "0"; my$ms = ($lt ||$rt || $up ||$dn) ? "1" : "0";
my $rs =$rt ? "1" : "0";
push @rows, ($ls x$top . $ms x$mid . $rs x$bot) x $mid; This assembles the middle section, including the "left" and "right" lines. push @rows, ($dn ? $v1 :$v0) x $bot; This does the bottom section. return @rows; } And we are done. Nothing to it. Adding diagonal lines would be a fairly simple matter. Download the complete source code if you haven't seen enough yet. There is no part of this program of which I am proud. Rather, I am proud of the thing as a whole. It did the job I needed, and it did it by 5 PM. Larry Wall once said that "a Perl script is correct if it's halfway readable and gets the job done before your boss fires you." Thank you, Larry. No, that is not quite true. There is one line in this program that I'm proud of. I noticed after I finished that there is exactly one comment in this program, and it is blank. I don't know how that got in there, but I decided to leave it in. Who says program code can't be funny? Thu, 24 Jan 2008 Emacs and alists [ This article is a few weeks old now. I wrote it and forgot to publish it at the time. ] Yesterday I upgraded Emacs, and since it was an upgrade, something that had been working for me for fifteen years stopped working, because that's what "upgrade" means. My .emacs file contains: (aput 'auto-mode-alist "\\.pl\\'" (function cperl-mode)) (aput 'auto-mode-alist "\\.t\\'" (function cperl-mode)) (aput 'auto-mode-alist "\\.cgi\\'" (function cperl-mode)) (aput 'auto-mode-alist "\\.pm\\'" (function cperl-mode)) (aput 'auto-mode-alist "\\.blog\\'" (function text-mode)) (aput 'auto-mode-alist "\\.sml\\'" (function sml-mode)) I should explain this, since I imagine that most readers of this blog are like me in that they touch Emacs Lisp only once a year on Saint Vibrissa's Day. An alist ("association list") is a common data structure in Lisp programs. It is a list of pairs; the first element of each pair is a key, and the second element is an associated value. The pairs in the special auto-mode-alist variable have regexes as their keys and functions as their values. Whenever Emacs opens a new file, it scans this alist, until it finds a regex that matches the name of the file. It then executes the associated function. Thus the effect of the first line above is to have Emacs enable the cperl-mode function on any file whose name ends in ".pl". The aput function is for maintaining alists. It takes an alist, a key, and a value, scans the alist looking for a matching key, and then if it finds it, it amends the corresponding value. Otherwise, it appends a new association onto the front of the alist. When I upgraded emacs, this broke. The aput function was moved into a separate package, which I now had to load with (require 'assoc). I asked about this on IRC, and was told that the correct way to do this, if I did not want to (require 'assoc), was to use the following abomination: (mapc (lambda (x) (when (eq 'perl-mode (cdr x)) (setcdr x 'cperl-mode))) (append auto-mode-alist interpreter-mode-alist)) The effect of this is to scan over auto-mode-alist (and also interpreter-mode-alist, a related variable) looking for any association whose value was the perl-mode function, and using setcdr to replace perl-mode with cperl-mode. (This does not address the issue of what to do with .t files or .blog files, for which no association exists yet, presumably, but I did not ask about those specifically on IRC.) I was totally boggled. Choosing the right editing mode for a file is a basic function of emacs. I could not believe that the best and simplest way to add or change associations was to use mapc lambda gobhorn oleo potatopudding quote potrzebie. I was assured that this was indeed the only correct method. Struck almost speechless, I managed to come up with "Bullshit." Apparently the issue was that if auto-mode-alist already contains an association for ".pl", there is no guarantee that my new association will be found and preferred to the old one, unless I somehow remove the old one, or edit it to be the way I want. This seemed very unlikely to me. You see, an alist is a list. This means that it is searched from head to tail, because this is the only way a list can be searched. So in particular, if you cons a second association to the front of the list, which has the same key as a later (older) association, the search will find the new one first, and the older one becomes inoperative. I asked if there was not a guarantee that the alist would be searched from front to back. I was told that there is not. I looked in the manual, and reported that the assoc function, which is the getter that corresponds to aput, taking an alist and a key, and returning the corresponding value, is expressly guaranteed to return the first matching item. I was told that there was no guarantee that assoc would be used. I pondered the manual some more and found this passage: However, association lists have their own advantages. Depending on your application, it may be faster to add an association to the front of an association list than to update a property. That is, it is expressly endorsing the technique of adding a new item to the front of an alist in order to override any later item that might have the same key. After finding that the add-to-the-front technique really did work, I reasoned that if someday Emacs stopped searching alists sequentially, I would not be in any more trouble than I had been today when they removed the aput function. So I did not take the advice I was given. Instead, I left it pretty much the way it was. I did take the opportunity to clean up the code a bit: (push '("\\.pl\\'" . cperl-mode) auto-mode-alist) (push '("\\.t\\'" . cperl-mode) auto-mode-alist) (push '("\\.cgi\\'" . cperl-mode) auto-mode-alist) (push '("\\.pm\\'" . cperl-mode) auto-mode-alist) (push '("\\.blog\\'" . text-mode) auto-mode-alist) (push '("\\.sml\\'" . sml-mode) auto-mode-alist) The push function simply appends an element to the front of a list, modifying the list in-place. But wow, the advice I got was phenomenally bad. It was bad in a really interesting way, too. It reminded me of the advice people get on the #math channel, where some guy comes in with some question about triangles and gets the category-theoretic viewpoint on triangles as natural transformations of something or other. The advice was bad because although it was correct, it was completely devoid of common sense. [ Addendum 20080124: It has been brought to my attention that the Emacs FAQ endorses my solution, which makes the category-theoretic advice proposed by the #emacs blockheads even less defensible. ] [ Addendum 20080201: Steve Vinoski suggests replacing the aput function. ] Fri, 11 Jan 2008 Help, help! (Readers of Planet Haskell may want to avert their eyes from this compendium of Perl introspection techniques. Moreover, a very naughty four-letter word appears, a word that begins with "g" and ends with "o". Let's just leave it at that.) Przemek Klosowski wrote to offer me physics help, and also to ask about introspection on Perl objects. Specifically, he said that if you called a nonexistent method on a TCL object, the error message would include the names of all the methods that would have worked. He wanted to know if there was a way to get Perl to do something similar. There isn't, precisely, because Perl has only a conventional distinction between methods and subroutines, and you Just Have To Know which is which, and avoid calling the subroutines as methods, because the Perl interpreter has no idea which is which. But it does have enough introspection features that you can get something like what you want. This article will explain how to do that. Here is a trivial program that invokes an undefined method on an object: use YAML; my$obj = YAML->new;
$obj->nosuchmethod; When run, this produces the fatal error: Can't locate object method "nosuchmethod" via package "YAML" at test.pl line 4. (YAML in this article is just an example; you don't have to know what it does. In fact, I don't know what it does.) Now consider the following program instead: use YAML; use Help 'YAML'; my$obj = YAML->new;
$obj->nosuchmethod; Now any failed method calls to YAML objects, or objects of YAML's subclasses, will produce a more detailed error message: Unknown method 'nosuchmethod' called on object of class YAML Perhaps try: Bless Blessed Dump DumpFile Load LoadFile VALUE XXX as_heavy (inherited from Exporter) die (inherited from YAML::Base) dumper_class dumper_object export (inherited from Exporter) export_fail (inherited from Exporter) export_ok_tags (inherited from Exporter) export_tags (inherited from Exporter) export_to_level (inherited from Exporter) field freeze global_object import (inherited from Exporter) init_action_object loader_class loader_object new (inherited from YAML::Base) node_info (inherited from YAML::Base) require_version (inherited from Exporter) thaw warn (inherited from YAML::Base) ynode Aborting at test.pl line 5 Some of the methods in this list are bogus. For example, the stuff inherited from Exporter should almost certainly not be called on a YAML object. Some of the items may be intended to be called as functions, and not as methods. Some may be functions imported from some other module. A common offender here is Carp, which places a carp function into another module's namespace; this function will show up in a list like the one above, without even an "inherited from" note, even though it is not a method and it does not make sense to call it on an object at all. Even when the items in the list really are methods, they may be undocumented, internal-use-only methods, and may disappear in future versions of the YAML module. But even with all these warnings, Help is at least a partial solution to the problem. The real reason for this article is to present the code for Help.pm, not because the module is so intrinsically useful itself, but because it is almost a catalog of weird-but-useful Perl module hackery techniques. A full and detailed tour of this module's 30 lines of code would probably make a decent 60- or 90-minute class for intermediate Perl programmers who want to become wizards. (I have given many classes on exactly that topic.) Here's the code: package Help; use Carp 'croak'; sub import { my ($selfclass, @classes) = @_;
for my $class (@classes) { push @{"$class\::ISA"}, $selfclass; } } sub AUTOLOAD { my ($bottom_class, $method) =$AUTOLOAD =~ /(.*)::(.*)/;
my %known_method;

my @classes = ($bottom_class); while (@classes) { my$class = shift @classes;
next if $class eq __PACKAGE__; unshift @classes, @{"$class\::ISA"};
for my $name (keys %{"$class\::"}) {
next unless defined &{"$class\::$name"};
$known_method{$name} ||= $class; } } warn "Unknown method '$method' called on object of class $bottom_class\n"; warn "Perhaps try:\n"; for my$name (sort keys %known_method) {
warn "  $name " . ($known_method{$name} eq$bottom_class
? ""
: "(inherited from $known_method{$name})") .
"\n";
}
croak "Aborting";
}

sub help {
$AUTOLOAD = ref($_[0]) . '::(none)';
}

sub DESTROY {}

1;

use Help 'Foo'

When any part of the program invokes use Help 'Foo', this does two things. First, it locates Help.pm, loads it in, and compiles it, if that has not been done already. And then it immediately calls Help->import('Foo').

Typically, a module's import method is inherited from Exporter, which gets control at this point and arranges to make some of the module's functions available in the caller's namespace. So, for example, when you invoke use YAML 'freeze' in your module, Exporter's import method gets control and puts YAML's "freeze" function into your module's namespace. But that is not what we are doing here. Instead, Help has its own import method:

sub import {
my ($selfclass, @classes) = @_; for my$class (@classes) {
push @{"$class\::ISA"},$selfclass;
}
}
The $selfclass variable becomes Help and @classes becomes ('Foo'). Then the module does its first tricky thing. It puts itself into the @ISA list of another class. The push line adds Help to @Foo::ISA. @Foo::ISA is the array that is searched whenever a method call on a Foo objects fails because the method doesn't exist. Perl will search the classes named in @Foo::ISA, in order. It will search the Help class last. That's important, because we don't want Help to interfere with Foo's ordinary inheritance. Notice the way the variable name Foo::ISA is generated dynamically by concatenating the value of$class with the literal string ::ISA. This is how you access a variable whose name is not known at compile time in Perl. We will see this technique over and over again in this module.

The backslash in @{"$class\::ISA"} is necessary, because if we wrote @{"$class::ISA"} instead, Perl would try to interpolate the value of $ISA variable from the package named class. We could get around this by writing something like @{$class . '::ISA'}, but the backslash is easier to read.

So what happens when the program calls $foo->nosuchmethod? If one of Foo's base classes includes a method with that name, it will be called as usual. But when method search fails, Perl doesn't give up right away. Instead, it tries the method search a second time, this time looking for a method named AUTOLOAD. If it finds one, it calls it. It only throws an exception of there is no AUTOLOAD. The Help class doesn't have a nosuchmethod method either, but it does have AUTOLOAD. If Foo or one of its other parent classes defines an AUTOLOAD, one of those will be called instead. But if there's no other AUTOLOAD, then Help's AUTOLOAD will be called as a last resort. $AUTOLOAD

When Perl calls an AUTOLOAD function, it sets the value of $AUTOLOAD to include the full name of the method it was trying to call, the one that didn't exist. In our example,$AUTOLOAD is set to "Foo::nosuchmethod".

This pattern match dismantles the contents of $AUTOLOAD into a class name and a method name: sub AUTOLOAD { my ($bottom_class, $method) =$AUTOLOAD =~ /(.*)::(.*)/;
The $bottom_class variable contains Foo, and the$method variable contains nosuchmethod.

The AUTOLOAD function is now going to accumulate a table of all the methods that could have been called on the target object, print out a report, and throw a fatal exception.

The accumulated table will reside in the private hash %known_method. Keys in this hash will be method names. Values will be the classes in which the names were found.

Accumulating the table of method names

The AUTOLOAD function accumulates this hash by doing a depth-first search on the @ISA tree, just like Perl's method resolution does internally. The @classes variable is a stack of classes that need to be searched for methods but that have not yet been searched. Initially, it includes only the class on which the method was actually called, Foo in this case:

my @classes = ($bottom_class); As long as some class remains unsearched, this loop will continue to look for more methods. It begins by grabbing the next class off the stack: while (@classes) { my$class = shift @classes;
Foo inherits from Help too, but we don't want our error message to mention that, so the search skips Help:

next if $class eq __PACKAGE__; (__PACKAGE__ expands at compile time to the name of the current package.) Before the loop actually looks at the methods in the current class it's searching, it looks to see if the class has any base classes. If there are any, it pushes them onto the stack to be searched next: unshift @classes, @{"$class\::ISA"};
Now the real meat of the loop: there is a class name in $class, say Foo, and we want the program to find all the methods in that class. Perl makes the symbol table for the Foo package available in the hash %Foo::. Keys in this hash are variable, subroutine, and filehandle names. To find out if a name denotes a subroutine, we use defined(&{subroutine_name}) for each name in the package symbol table. If there is a subroutine by that name, the program inserts it and the class name into %known_method. Otherwise, the name is a variable or filehandle name and is ignored: for my$name (keys %{"$class\::"}) { next unless defined &{"$class\::$name"};$known_method{$name} ||=$class;
}
}
The ||= sets a new value for $name in the hash only if there was not one already. If a method name appears in more than one class, it is recorded as being in the first one found in the search. Since the search is proceeding in the same order that Perl uses, the one recorded is the one that Perl will actually find. For example, if Foo inherits from Bar, and both classes define a this method, the search will find Foo::this before Bar::this, and that is what will be recorded in the hash. This is correct, because Foo's this method overrides Bar's. If you have any clever techniques for identifying other stuff that should be omitted from the output, this is where you would put them. For example, many authors use the convention that functions whose names have a leading underscore are private to the implementation, and should not be called by outsiders. We might omit such items from the output by adding a line here: next if$name =~ /^_/;
After the loop finishes searching all the base classes, the %known_method hash looks something like this:

(
this => Foo,
that => Foo,
new => Base,
blookus => Mixin::Blookus,
other => Foo
)
This means that methods this, that, and other were defined in Foo itself, but that new is inherited from Base and that blookus was inherited from Mixin::Blookus.

Printing the report

The AUTOLOAD function then prints out some error messages:

warn "Unknown method '$method' called on object of class$bottom_class\n";
warn "Perhaps try:\n";
And at last the payoff: It prints out the list of methods that the programmer could have called:

for my $name (sort keys %known_method) { warn "$name " .
($known_method{$name} eq $bottom_class ? "" : "(inherited from$known_method{$name})") . "\n"; } croak "Aborting"; } Each method name is printed. If the class in which the method was found is not the bottom class, the name is annotated with the message (inherited from wherever). The output for my example would look like this: Unknown method 'nosuchmethod' called on object of class Foo: Perhaps try: blookus (inherited from Mixin::Blookus) new (inherited from Base) other that this Aborting at YourErroneousModule.pm line 679 Finally the function throws a fatal exception. If we had used die here, the fatal error message would look like Aborting at Help.pm line 34, which is extremely unhelpful. Using croak instead of die makes the message look like Aborting at test.pl line 5 instead. That is, it reports the error as coming from the place where the erroneous method was actually called. Synthetic calls Suppose you want to force the help message to come out. One way is to call$object->fgsfds, since probably the object does not provide a fgsfds method. But this is ugly, and it might not work, because the object might provide a fgsfds method. So Help.pm provides another way.

You can always force the help message by calling $object->Help::help. This calls a method named help, and it starts the inheritance search in the Help package. Control is transferred to the following help method: sub help {$AUTOLOAD = ref($_[0]) . '::(none)'; goto &AUTOLOAD; } The Help::help method sets up a fake$AUTOLOAD value and then uses "magic goto" to transfer control to the real AUTOLOAD function. "Magic goto" is not the evil bad goto that is Considered Harmful. It is more like a function call. But unlike a regular function call, it erases the calling function (help) from the control stack, so that to subsequently executed code it appears that AUTOLOAD was called directly in the first place.

Calling AUTOLOAD in the normal way, without goto, would have worked also. I did it this way just to be a fusspot.

DESTROY

Whenever a Perl object is destroyed, its DESTROY method is called, if it has one. If not, method search looks for an AUTOLOAD method, if there is one, as usual. If this lookup fails, no fatal exception is thrown; the object is sliently destroyed and execution continues.

It is very common for objects to lack a DESTROY method; usually nothing additional needs to be done when the object's lifetime is over. But we do not want the Help::AUTOLOAD function to be invoked automatically whenever such an object is destroyed! So Help defines a last-resort DESTROY method that is called instead; this prevents Perl from trying the AUTOLOAD search when an object with no DESTROY method is destroyed:

sub DESTROY {}
This DESTROY method restores the default behavior, which is to do nothing.

Living dangerously

Perl has a special package, called UNIVERSAL. Every class inherits from UNIVERSAL. If you want to apply Help to every class at once, you can try:

use Help 'UNIVERSAL';
but don't blame me if something weird happens.

Whenever I present code like this, I always get questions (or are they complaints?) from readers about why I omitted "use strict". "Always use strict!" they say.

Well, this code will not run with "use strict". It does a lot of stuff on purpose that "strict" was put in specifically to keep you from doing by accident.

At some point you have to take off the training wheels, kiddies.

Share and enjoy.

Tue, 08 Jan 2008

Clubbing someone to death with a loaded Uzi
I once had an intern who wrote wrote the following code to process a web survey form. The form input widgets were named q1, q2, and so forth:

foreach $k (keys %in) { if ($k eq q1) {
if ($in{$k} eq agree) {
$count{q10} =$count{q10} + 1;
}
if ($in{$k} eq disaagree) {
$count{q11} =$count{q11} + 1;
}
}
if ($k eq q2) { @q2split = split(/\0/,$in{$k}); foreach (@q2split) {$count{$_} =$count{$_} + 1; } } if ($k eq q3) {
$count{$in{$k}} =$count{$in{$k}} + 1;
}

...
}
There is a lot wrong with this code, but it's all trivial compared with the one big problem, which is the wholly unnecessary loop and tests. The whole thing could be (and should be, and was) rewritten as:

if ($in{q1} eq agree) {$count{q10} = $count{q10} + 1; } if ($in{q1} eq disaagree) {
$count{q11} =$count{q11} + 1;
}

@q2split = split(/\0/, $in{q2}); foreach (@q2split) {$count{$_} =$count{$_} + 1; }$count{$in{q3}} =$count{$in{q3}} + 1; ... After which one could start addressing the smaller problems, like the fact that "disagree" is misspelled. This is the sort of mistake you expect from an intern. I chuckled and corrected him. But I've seen it several times since from non-interns. Here's another example. I am not making this up. Whether it's more or less odious than the intern code is up to you to decide: foreach$location_name (%LOCATION ) {
$location_code =$LOCATION{$location_name}; if ($location_name eq $location ) { printf FILE "$location_code\,";
printf FILE "%4s", "$min3\,"; printf FILE "%4s", "$max3\,";
printf FILE "%1s", "$wx3\n"; } } It could have been written like this: printf FILE "$LOCATION{$location}\,"; printf FILE "%4s", "$min3\,";
printf FILE "%4s", "$max3\,"; printf FILE "%1s", "$wx3\n";
I started using this problem as an interview question. I'll present the subject with trivial code like this:

for my $k (keys %hash) { if ($k eq "name") {
$hash{$k}++;
}
}
and then ask if they have any comments about it. One nice thing about the question is that it translates naturally into whatever imperative language they claim expertise in.

It's appalling how many supposedly professional programmers see nothing wrong here. They squint at the code, and say "I think you need parentheses around %hash there", or they criticize the choice of variable names.

I first used this as an interview question because the Python code sample submitted by a job applicant contained an example of it. "Weird," I thought, "but maybe she's outgrown that." Since she claimed to be an expert Perl user, I asked her about it in Perl, using code like the example above. After she made a syntactic suggestion, I said "It's not a syntax problem, and it's not a trick question." She criticized the syntax some more. Finally I told her the answer: "Couldn't you just use $hash{name}++?" "Oh, yeah, I guess so," she said. A few minutes later we were going over her Python code sample and I pointed out the place where she had done the exact same thing, and asked if she was happy with that loop and wanted to change it. No, she thought it was just fine. "Doesn't this look like the example I showed you on the whiteboard a little while ago?" "Oh, I guess it does." We didn't hire her. Larry Wall once said that iterating over the keys of a hash is like clubbing someone to death with a loaded Uzi. I had already realized that you could, in principle, commit this error with a regular array instead of with a hash, but I had never seen an example until today's episode of the Daily WTF. The Daily WTF code is so awful, all the way through, that I was afraid that people might miss this slightly-more subtle gem lurking in the middle, and that was what motivated me to write this article in the first place. Here's the gem: // Java for (int a=1;a<=params.size();a++) switch (a) { case 1 : if (params.get(0) != null) this.one=params.get(0).toString(); break; case 2 : if (params.get(1) != null) this.two=params.get(1).toString(); break; ... case 14 : if (params.get(13) != null) this.fourteen=params.get(13).toString(); break; } } Wow, that is just, uh, stunning. [ Addendum 20080201: A bit more. ] [ Addendum 20090213: A counterexample. ] Thu, 03 Jan 2008 Note on point-free programming style This old comp.lang.functional article by Albert Y. C. Lai, makes the point that Unix shell pipeline programming is done in an essentially "point-free" style, using the shell example: grep '^X-Spam-Level' | sort | uniq | wc -l and the analogous Haskell code: length . nub . sort . filter (isPrefixOf "X-Spam-Level") Neither one explicitly mentions its argument, which is why this is "point-free". In "point-free" programming, instead of defining a function in terms of its effect on its arguments, one defines it by composing the component functions themselves, directly, with higher-order operators. For example, instead of: foo x y = 2 * x + y one has, in point-free style: foo = (+) . (2 *) where (2 *) is the function that doubles its argument, and (+) is the (curried) addition function. The two definitions of foo are entirely equivalent. As the two examples should make clear, point-free style is sometimes natural, and sometimes not, and the example chosen by M. Lai was carefully selected to bias the argument in favor of point-free style. Often, after writing a function in pointful style, I get the computer to convert it automatically to point-free style, just to see what it looks like. This is usually educational, and sometimes I use the computed point-free definition instead. As I get better at understanding point-free programming style in Haskell, I am more and more likely to write certain functions point-free in the first place. For example, I recently wrote: soln = int 1 (srt (add one (neg (sqr soln)))) and then scratched my head, erased it, and replaced it with the equivalent: soln = int 1 ((srt . (add one) . neg . sqr) soln) I could have factored out the int 1 too: soln = (int 1 . srt . add one . neg . sqr) soln I could even have removed soln from the right-hand side: soln = fix (int 1 . srt . add one . neg . sqr) but I am not yet a perfect sage. Sometimes I opt for an intermediate form, one in which some of the arguments are explicit and some are implicit. For example, as an exercise I wrote a function numOccurrences which takes a value and a list and counts the number of times the value occurs in the list. A straightforward and conventional implementation is: numOccurrences x [] = 0 numOccurrences x (y:ys) = if (x == y) then 1 + rest else rest where rest = numOccurrences x ys but the partially point-free version I wrote was much better: numOccurrences x = length . filter (== x) Once you see this, it's easy to go back to a fully pointful version: numOccurrences x y = length (filter (== x) y) Or you can go the other way, to a point-free version: numOccurrences = (length .) . filter . (==) which I find confusing. Anyway, the point of this note is not to argue that the point-free style is better or worse than the pointful style. Sometimes I use the one, and sometimes the other. I just want to point out that the argument made by M. Lai is deceptive, because of the choice of examples. As an equally biased counterexample, consider: bar x = x*x + 2*x + 1 which the automatic converter informs me can be written in point-free style as: bar = (1 +) . ap ((+) . join (*)) (2 *) Perusal of this example will reveal much to the attentive reader, including the definitions of join and ap. But I don't think many people would argue that it is an improvement on the original. (Maybe I'm wrong, and people would argue that it was an improvement. I won't know for sure until I have more experience.) For some sort of balance, here is another example where I think the point-free version is at least as good as the pointful version: a recent comment on Reddit suggested a >>> operator that composes functions just like the . operator, but in the other order, so that: f >>> g = g . f or, if you prefer: >>> f g x = g(f(x)) The point-free definition of >>> is: (>>>) = flip (.) where the flip operator takes a function of two arguments and makes a new function that does the same thing, but with the arguments in the opposite order. Whatever your feelings about point-free style, it is undeniable that the point-free definition makes perfectly clear that >>> is nothing but . with its arguments in reverse order. Mon, 31 Dec 2007 Welcome to my ~/bin In the previous article I mentioned "a conference tutorial about the contents of my ~/bin directory". Usually I have a web page about each tutorial, with a description, and some sample slides, and I wanted to link to the page about this tutorial. But I found to my surprise that I had forgotten to make the page about this one. So I went to fix that, and then I couldn't decide which sample slides to show. And I haven't given the tutorial for a couple of years, and I have an upcoming project that will prevent me from giving it for another couple of years. Eh, figuring out what to put online is more trouble than it's worth. I decided it would be a lot less toil to just put the whole thing online. The materials are copyright © 2004 Mark Jason Dominus, and are not under any sort of free license. But please enjoy them anyway. I think the title is an accidental ripoff of an earlier class by Damian Conway. I totally forgot that he had done a class on the same subject, and I think he used the same title. But that just makes us even, because for the past few years he has been making money going around giving talks on "Conference Presentation Aikido", which is a blatant (and deliberate) ripoff of my 2002 Perl conference talk on Conference Presentation Judo. So I don't feel as bad as I might have. Welcome to my ~/bin complete slides and other materials. I hereby wish you a happy new year, unless you don't want one, in which case I wish you a crappy new year instead. Fri, 21 Dec 2007 Another trivial utility: accumulate As usual, whenever I write one of these things, I wonder why it took me so long to get off my butt and put in the five minutes of work that were actually required. I've wanted something like this for years. It's called accumulate. It reads an input of this form: k1 v1 k1 v2 k2 v3 k1 v4 k2 v5 k3 v6 and writes it out in this format: k1 v1 v2 v4 k2 v3 v5 k3 v6 I wanted it this time because I had a bunch of files that included some duplicates, and wanted to get rid of the duplicates. So: md5sum * | accumulate | perl -lane 'unlink @F[2..$#F]'
(Incidentally, people sometimes argue that Perl's .. operator should count backwards when the left operand exceeds the right one. These people are wrong. There is only one argument that needs to be made to refute this idea; maybe it is the only argument that can be made. And examples of it abound. The code above is one such example.)

I'm afraid of insulting you by showing the source code for accumulate, because of course it is so very trivial, and you could write it in five minutes, as I did. But who knows; maybe seeing the source has some value:

#!/usr/bin/perl

use Getopt::Std;
my %opt = (k => 1, v => 2);
getopts('k:v:', \%opt) or usage();
for (qw(k v)) {
$opt{$_} -= 1 if $opt{$_} > 0;
}

while (<>) {
chomp;
my @F = split;
push @{$K{$F[$opt{k}]}},$F[$opt{v}]; } for my$k (keys %K) {
print "$k @{$K{$k}}\n"; } It's tempting to add a -F option to tell it that the input is not delimited by white space, or an option to change the output format, or blah blah blah, but I managed to restrain myself, mostly. Several years ago I wrote a conference tutorial about the contents of my ~/bin directory. The clearest conclusion that transpired from my analysis was that the utilities I write have too many features that I don't use. The second-clearest was that I waste too much time writing custom argument-parsing code instead of using Getopt::Std. I've tried to learn from this. One thing I found later is that a good way to sublimate the urge to put in some feature is to put in the option to enable it, and to document it, but to leave the feature itself unimplemented. This might work for you too if you have the same problem. I did put in -k and -v options to control which input columns are accumulated. These default to the first and second columns, naturally. Maybe this was a waste of time, since it occurs to me now that accumulate -k k -v v could be replaced by cut -fk,v | accumulate, if only cut didn't suck quite so badly. Of course one could use awk {print "$k $v" } | accumulate to escape cut's suckage. And some solution of this type obviates the need for accumulate's putative -F option also. Well, I digress. The accumulate program itself reminds me of a much more ambitious project I worked on for a while between 1998 and 2001, as does the yucky line: push @{$K{$F[$opt{k}]}}, $F[$opt{v}];
The ambitious project was tentatively named "twingler".

Beginning Perl programmers often have trouble with compound data structures because Perl's syntax for the nested structures is so horrendous. Suppose, for example, that you have a reference to a two-dimensional array $aref, and you want to produce a hash, such that each value in the array appears as a key in the hash, associated with a list of strings in the form "m,n" indicating where in the array that value appeared. Well, of course it is obviously nothing more than: for my$a1 (0 .. $#$aref) {
for my $a2 (0 ..$#{$aref->[$a1]}) {
push @{$hash{$aref->[$a1][$a2]}}, "$a1,$a2";
}
}
Obviously. <sarcasm>Geez, a child could see that.</sarcasm>

The idea of twingler was that you would specify the transformation you wanted declaratively, and it would then write the appropriate Perl code to perform the transformation. The interesting part of this project is figuring out the language for specifying the transformation. It must be complex enough to be able to express most of the interesting transformations that people commonly want, but if it isn't at the same time much simpler than Perl itself, it isn't worth using. Nobody will see any point in learning a new declarative language for expressing Perl data transformations unless it is itself simpler to use than just writing the Perl would have been.

[ Addendum 20150508: I dumped all my Twingler notes on the blog last year. ]

There are some hard problems here: What do people need? What subset of this can be expressed simply? How can we design a simple, limited language that people can use to express their needs? Can the language actually be compiled to Perl?

I had to face similar sorts of problems when I was writing linogram, but in the case of linogram I was more successful. I tinkered with twingler for some time and made several pages of (typed) notes but never came up with anything I was really happy with.

[ Addendum 20150508: I dumped all my Twingler notes on the blog last year. ]

At one point I abandoned the idea of a declarative language, in favor of just having the program take a sample input and a corresponding sample output, and deduce the appropriate transformation from there. For example, you would put in:

[ [ A, B ],
[ C, B ],
[ D, E ] ]
and
{ B => [A, C],
E => [D],
}
and it would generate:
for my $a1 (@$input) {
my ($e1,$e2) = @$a1; push @{$output{$e2}},$e1;
}
And then presumably you could eyeball this, and if what you really wanted was @{$a1}[0, -1] instead of @$a1 you could tinker it into the form you needed without too much extra trouble. This is much nicer from a user-experience point of view, but at the same time it seems more difficult to implement.

I had some ideas. One idea was to have it generate a bunch of expressions for mapping single elements from the input to the output, and then to try to unify those expressions. But as I said, I never did figure it out.

It's a shame, because it would have been pretty cool if I had gotten it to work.

The MIT CS grad students' handbook used to say something about how you always need to have several projects going on at once, because two-thirds of all research projects end in failure. The people you see who seem to have one success after another actually have three projects going on all the time, and you only see the successes. This is a nice example of that.

Mon, 29 Oct 2007

Undefined behavior in Perl and other languages
Miles Gould wrote what I thought was an interesting article on implementation-defined languages, and cited Perl as an example. One of his points was that a language that is defined by its implementation, as Perl is, rather than by a standards document, cannot have any "undefined behavior".

Undefined behavior

For people unfamiliar with this concept, I should explain briefly. The C standard is full of places that say "if the program contains x, the behavior is undefined", which really means "C programs do not contain x, so If the program contains x, it is not written in C, and, as this standard only defines the meaning of programs in C, it has nothing to say about the meaning of your program." There are around a couple of hundred of these phrases, and a larger number of places where it is implied.

For example, everyone knows that it means when you write x = 4;, but what does it mean if you write 4 = x;? According to clause 6.3.2.1[#1], it means nothing, and this is not a C program. The non-guarantee in this case is extremely strong. The C compiler, upon encountering this locution, is allowed to abort and spontaneously erase all your files, and in doing so it is not violating the requirements of the standard, because the standard does not require any particular behavior in this case.

The memorable phrase that the comp.lang.c folks use is that using that construction might cause demons to fly out of your nose.

[ Addendum 20071030: I am informed that I misread the standard here, and that the behavior of this particular line is not undefined, but requires a compiler diagnostic. Perhaps a better example would have been x = *(char *)0. ]

I mentioned this in passing in one of my recent articles about a C program I wrote:

unsigned strinc(char *s)
{
char *p = strchr(s, '\0') - 1;
while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
if (p < s) return 0;
(*p)++;
return 1;
}
Here the pointer p starts at the end of the string s, and the loop might stop when p points to the position just before s. Except no, that is forbidden, and the program might at that moment cause demons to fly out of your nose. You are allowed to have a pointer that points to the position just after an object, but not one that points just before.

Well anyway, I seem to have digressed. My point was that M. Gould says that one advantage of languages like Perl that are defined wholly by their (one) implementation is that you never have "undefined behavior". If you want to know what some locution does, you type it in and see what it does. Poof, instant definition.

Although I think this is a sound point, it occurred to me that that is not entirely correct. The manual is a specification of sorts, and even if the implementation does X in situation Y, the manual might say "The implementation does X in situation Y, but this is unsupported and may change without warning in the future." Then what you have is not so different from Y being undefined behavior. Because the manual is (presumably) a statement of official policy from the maintainers, and, as a communiqué from the people with the ultimate authority to define the future meaning of the language, it has some of the same status that a formal specification would.

Perl: the static variable hack

Such disclaimers do appear in the Perl documentation. Probably the most significant example of this is the static variable hack. For various implementation reasons, the locution my $static if 0 has a strange and interesting effect: sub foo { my$static = 42 if 0;
print "static is now $static\n";$static++;
}

foo() for 1..5;
This makes $static behave as a "static" variable, and persist from call to call of foo(). Without the ... if 0, the code would print "static is now 42" five times. But with ... if 0, it prints: static is now static is now 1 static is now 2 static is now 3 static is now 4 This was never an intentional feature. It arose accidentally, and then people discovered it and started using it. Since the behavior was the result of a strange quirk of the implementation, caused by the surprising interaction of several internal details, it was officially decided by the support group that this behavior would not be supported in future versions. The manual was amended to say that this behavior was explicitly undefined, and might change in the future. It can be used in one-off programs, but not in any important program, one that might have a long life and need to be run under several different versions of Perl. Programs that use pointers that point outside the bounds of allocated storage in C are in a similar position. It might work on today's system, with today's compiler, today, but you can't do that in any larger context. Having the "undefined behavior" be determined by the manual, instead of by a language standard, has its drawbacks. The language standard is fretted over by experts for months. When the C standard says that behavior is undefined, it is because someone like Clive Feather or Doug Gwyn or P.J. Plauger, someone who knows more about C than you ever will, knows that there is some machine somewhere on which the behavior is unsupported and unsupportable. When the Perl manual says that some behavior is undefined, you might be hearing from the Perl equivalent of Doug Gwyn, someone like Nick Clark or Chip Salzenberg or Gurusamy Sarathy. Or you might be hearing from a mere nervous-nellie who got their patch into the manual on a night when the release manager had stayed up too late. Perl: modifying a hash in a loop Here is an example of this that has bothered me for a long time. One can use the each() operator to loop lazily over the contents of a hash: while (my$key = each %hash) {
# do something with $key and$hash{$key} } What happens if you modify the hash in the middle of the loop? For various implementation reasons, the manual forbids this. For example, suppose the loop code adds a new key to the hash. The hash might overflow as a result, and this would trigger a reorganization that would move everything around, destroying the ordering information. The subsequent calls to each() would continue from the same element of the hash, but in the new order, making it likely that the loop would visit some keys more than once, or some not at all. So the prohibition in that case makes sense: The each() operator normally guarantees to produce each key exactly once, and adding elements to a hash in the middle of the loop might cause that guarantee to be broken in an unpredictable way. Moreover, there is no obvious way to fix this without potentially wrecking the performance of hashes. But the manual also forbids deleting keys inside the loop, and there the issue does not come up, because in Perl, hashes are never reorganized as the result of a deletion. The behavior is easily described: Deleting a key that has already been visited will not affect the each() loop, and deleting one that has not yet been visited will just cause it to be skipped when the time comes. Some people might find this general case confusing, I suppose. But the following code also runs afoul of the "do not modify a hash inside of an each loop" prohibition, and I don't think anyone would find it confusing: while (my$key = each %hash) {
delete $hash{$key} if is_bad($hash{$key});
}
Here we want to delete all the bad items from the hash. We do this by scanning the hash and deleting the current item whenever it is bad. Since each key is deleted only after it is scanned by each, we should expect this to visit every key in the hash, as indeed it does. And this appears to be a useful thing to write. The only alternative is to make two passes, constructing a list of bad keys on the first pass, and deleting them on the second pass. The code would be more complicated and the time and memory performance would be much worse.

There is a potential implementation problem, though. The way that each() works is to take the current item and follow a "next" pointer from it to find the next item. (I am omitting some unimportant details here.) But if we have deleted the current item, the implementation cannot follow the "next" pointer. So what happens?

In fact, the implementation has always contained a bunch of code, written by Larry Wall, to ensure that deleting the current key will work properly, and that it will not spoil the each(). This is nontrivial. When you delete an item, the delete() operator looks to see if it is the current item of an each() loop, and if so, it marks the item with a special flag instead of deleting it. Later on, the next time each() is invoked, it sees the flag and deletes the item after following the "next" pointer.

So the implementation takes some pains to make this work. But someone came along later and forbade all modifications of a hash inside an each loop, throwing the baby out with the bathwater. Larry and perl paid a price for this feature, in performance and memory and code size, and I think it was a feature well bought. But then someone patched the manual and spoiled the value of the feature. (Some years later, I patched the manual again to add an exception for this case. Score!)

Perl: modifying an array in a loop

Another example is the question of what happens when you modify an array inside a loop over the array, as with:

@a = (1..3);
for (@a) {
print;
push @a, $_ + 3 if$_ % 2 == 1;
}
(This prints 12346.) The internals are simple, and the semantics are well-defined by the implementation, and straightforward, but the manual has the heebie-jeebies about it, and most of the Perl community is extremely superstitious about this, claiming that it is "entirely unpredictable". I would like to support this with a quotation from the manual, but I can't find it in the enormous and disorganized mass that is the Perl documentation.

[ Addendum: Tom Boutell found it. The perlsyn page says "If any part of LIST is an array, foreach will get very confused if you add or remove elements within the loop body, for example with splice. So don't do that." ]

The behavior, for the record, is quite straightforward: On the first iteration, the loop processes the first element in the array. On the second iteration, the loop processes the second element in the array, whatever that element is at the time the second iteration starts, whether or not that was the second element before. On the third iteration, the loop processes the third element in the array, whatever it is at that moment. And so the loop continues, terminating the first time it is called upon to process an element that is past the end of the array. We might imagine the following pseudocode:

index = 0;
while (index < array.length()) {
process element array[index];
index += 1;
}
There is nothing subtle or difficult about this, and claims that the behavior is "entirely unpredictable" are probably superstitious confessions of ignorance and fear.

Let's try to predict the "entirely unpredictable" behavior of the example above:

@a = (1..3);
for (@a) {
print;
push @a, $_ + 3 if$_ % 2 == 1;
}
Initially the array contains (1, 2, 3), and so the first iteration processes the first element, which is 1. This prints 1, and, since 1 is odd, pushes 4 onto the end of the array.

The array now contains (1, 2, 3, 4), and the loop processes the second element, which is 2. 2 is printed. The loop then processes the third element, printing 3 and pushing 6 onto the end. The array now contains (1, 2, 3, 4, 6).

On the fourth iteration, the fourth element (4) is printed, and on the fifth iteration, the fifth element (6) is printed. That is the last element, so the loop is finished. What was so hard about that?

My blog was recently inserted into the feed for planet.haskell.org, and of course I immediately started my first streak of posting code-heavy articles about C and Perl. This is distressing not just because the articles were off-topic for Planet Haskell—I wouldn't give the matter two thoughts if I were posting my usual mix of abstract math and stuff—but it's so off-topic that it feels weird to see it sitting there on the front page of Planet Haskell. So I thought I'd make an effort to talk about Haskell, as a friendly attempt to promote good relations between tribes. I'm not sure what tribe I'm in, actually, but what the heck. I thought about Haskell a bit, and a Haskell example came to mind.

Here is a definition of the factorial function in Haskell:

fact 0 = 1
fact n = n * fact (n-1)
I don't need to explain this to anyone, right?

Okay, now here is another definition:

fact 0     = 1
fact (n+1) = (n+1) * fact n
Also fine, and indeed this is legal Haskell. The pattern n+1 is allowed to match an integer that is at least 1, say 7, and doing so binds n to the value 6. This is by a rather peculiar special case in the specification of Haskell's pattern-matcher. (It is section 3.17.2#8 of Haskell 98 Language and Libraries: The Revised Report, should you want to look it up.) This peculiar special case is known sometimes as a "successor pattern" but more often as an "n+k pattern".

The spec explicitly deprecates this feature:

Many people feel that n+k patterns should not be used. These patterns may be removed or changed in future versions of Haskell.

(Page 33.) One wonders why they put it in at all, if they were going to go ahead and tell you not to use it. The Haskell committee is usually smarter than this.

I have a vague recollection that there was an argument between people who wanted to use Haskell as a language for teaching undergraduate programming, and those who didn't care about that, and that this was the compromise result. Like many compromises, it is inferior to both of the alternatives that it interpolates between. Putting the feature in complicates the syntax and the semantics of the language, disrupts its conceptual purity, and bloats the spec—see the Perlesque yikkity-yak on pages 57–58 about how x + 1 = ... binds a meaning to +, but (x + 1) = ... binds a meaning to x. Such complication is worth while only if there is a corresponding payoff in terms of increased functionality and usability in the language. In this case, the payoff is a feature that can only be used in one-off programs. Serious programs must avoid it, since the patterns "may be removed or changed in future versions of Haskell". The Haskell committee purchased this feature at a certain cost, and it is debatable whether they got their money's worth. I'm not sure which side of that issue I fall on. But having purchased the feature, the committee then threw it in the garbage, squandering their sunk costs. Oh well. Not even the Haskell committee is perfect.

I think it might be worth pointing out that the version of the program with the n+k pattern is technically superior to the other version. Given a negative integer argument, the first version recurses forever, possibly taking a long time to fail and perhaps taking out the rest of the system on which it is running. But the n+k version fails immediately, because the n+1 pattern will only match an integer that is at least 1.

XML screws up

The "nasal demons" of the C standard are a joke, but a serious one. The C standard defines what C compilers must do when presented with C programs; it does not define what they do when presented with other inputs, nor what other software does when presented with C programs. The authors of C standard clearly understood the standard's role in the world.

Earlier versions of the XML standard were less clear. There was a particularly laughable clause in the first edition of the XML 1,0 standard:

XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

...

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification.

(Emphasis is mine.) The XML 1.0 spec is just a document. It has no power, except to declare that certain files are XML 1.0 and certain files are not. A file that complies with the requirements of the spec is XML 1.0; all other files are not XML 1.0. But in the emphasized clause, the spec says that certain behavior "is an error" if it is exhibited by documents that do not conform to the spec. That is, it is declaring certain non-XML-1.0 documents "erroneous". But within the meaning of the spec, "erroneous" simply means that the documents are not XML 1.0. So the clause is completely redundant. Documents that do not conform to the spec are erroneous by definition, whether or not they use the value "1.0".

It's as if the Catholic Church issued an edict forbidding all rabbis from wearing cassocks, on pain of excommunication.

I am happy to discover that this dumb error has been removed from the most recent edition of the XML 1.0 spec.

Mon, 15 Oct 2007

Van der Waerden's problem: programs 3 and 4
In this series of articles I'm analyzing five versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code. (I said before that it was four versions, but apparently I'm not so good at counting to five.)

If you don't remember what the program does, here's an explanation.

Here is program 1, which was an earlier attempt to do the same thing. Here's program 2.

Program 3

Complete source code for this version.

I said of the previous program:

The problem is all in the implementation. You see, this program actually constructs the entire tree in memory.

Somewhere along the line it dawned on me that constructing the tree was unnecessary, so I took that machinery out, and the result was version 3.

Consequently, this program is easy to explain once you have seen the previous version: almost all I have to do is list the stuff that I took out.

Since this program does not construct a tree of node structures, it omits the definition of the node structure and the macro for manufacturing nodes. Since it gets rid of the node allocation, it also gets rid of the memory leak of the previous version, and so omits the customized memory allocation functions Malloc and Free that performed memory tracking.

The previous program had a compiled-in limit on the number of colors it would handle, because at the time I didn't know how to do a dynamic array. In this program, I got rid of the node structures, so there was no array of node structures, so no need for a limit on the number of node structures in the array. And all the code that enforced the limit is gone.

The apchk function, which checks to see if a string is good, remains unchanged from the previous version.

The makenodes function, which was the principal function in the previous program, remains, but has lost a lot of code. It is simpler to call, too; the node argument is gone:

makenodes(maxlen,"");
I got rid of the silly !howfar test in favor of a more easily-understood howfar == 0 test. There are lots of times when ! is appropriate, but testing whether a non-negative integer has reached zero is not one of them. I was going to comment earlier about what a novice error this is, and I'm glad to see that I fixed it.

The main use of apchk in the previous program had if (!apchk(...)) { ... }. That was okay, because apchk returns a Boolean result. But the negation is annoying. It suggests that apchk's return value is backward. (Instead of returning true for a bad string, it should return true for a good string.) This is not very much a big deal, and I only brought it up so that I could diffidently confess that these days I would probably have done:

#define unless(c)       if(!(c))
...
}
There are a lot of stories of doofus Pascal programmers who do:

#define begin {
#define end }
and Fortran programmers who do:

#define GT >
#define GE >=
#define LT <
#define LE <=
and I find, to my shame, that I have become one of them. Anyone seeing #define unless(c) if(!(c)) would snort and say "Oh, this was obviously written by a Perl programmer."

But at least I was a C programmer first.

Actually I was a Fortran programmer first. But I was never a big enough doofus to #define GE >=.

The big flaw in the current program is the string argument to makenodes. Each call to makenodes copies this string so that it can append a character to the end. I discussed this at some length in the previous article, so I don't want to make too much of it now; I'll just say that a better technique would have reused the string buffer from call to call. This obviously saves a little memory, and since most of the contents of the string doesn't change, it also saves a lot of time.

This might be worth seeing, since it seems to me now to be a marvel of wasted code:

ls = strlen(s);
newarg = STRING(ls + 1);
if (!newarg)
{
fprintf(stderr,"Couldn't get %d bytes for newarg in makenodes\n",ls+2);
fprintf(stderr,"Total get was %d.\n",gotten);
fprintf(stderr,"P\n L\n  O\n   P\n    !\n");
abort();
}
strcpy(newarg,s);
newarg[ls+1] = '\0';
newarg[ls] = 'A' + i;
makenodes(howfar-1,newarg);
free(newarg);
The repeated strlen, for example, when ls could be calculated as maxlen - howfar. The excessively verbose failure message, which should be inside the STRING macro anyway. (The code that maintains gotten has gone away with the debugging allocation routines, so the second fprintf is superfluous.) And why did I think abort was the right thing to call on an out-of-memory condition?

Oh well, you live and learn.

Program 4

Complete source code for this version.

The fourth version of the program is even more trimmed-down. In this version of the program I did get the idea to reuse the string buffer instead of copying the string on every recursive call. But I also got an even better idea, and eliminated the recursive call. The makenodes function is now down to one argument, which tells it how deep a tree to search.

void
makenodes(maxdepth)
int maxdepth;
{
int apchk(), depth = 0;
char curlet, *curstring = STRING(maxdepth);

curstring[0] = '\0';
curlet = 'A';

while (depth >= 0)
{
while (curlet <= 'A' - 1 + colors)
{
#ifdef DIAG
printf("%s makenoding with string %s%c, depth %d.\n",
TABS+12-depth,curstring,curlet,depth);
#endif
if (apchk(curstring,curlet))
curlet++;
else
if (depth < maxdepth)
{
curstring[depth] = curlet;
curstring[depth+1] = '\0';
depth += 1;
curlet = 'A';
}
else
{
printf("%s%c\n",curstring,curlet);
curlet++;
}
}
depth -= 1;
curlet = curstring[depth] + 1;
curstring[depth] = '\0';
}
}
This is a better job all around, and not very different from what I wrote last month to do the same thing. I was going to title this series of articles "I have become a better programmer!", and now that I see this version, I'm glad I didn't, because there's no evidence here that I am much better. This version of the program gets a solid A from my older self.

The value depth scans forward in the string when the search is going well, and is decremented again when the search needs to backtrack. If depth == maxdepth, a witness of the desired length has been found, and is printed out.

The curlet ("current letter") variable tracks which branch of the current tree node we are "recursing" down. After the function recurses down, by incrementing depth, curlet is set to 'A' to visit the first sub-node of the new current node. The curstring buffer tracks the path through the tree to the current node. When the function needs to backtrack, it restores the state of curlet from the last character in the buffer and then trims that character off the end of the path.

I'd only want to make two changes to this code. One would be to make depth a pointer into the curstring buffer instead of an index into it. Then again, the compiler may well have optimized it into one anyway. But it would also allow me to eliminate curlet in favor of just using *depth everywhere.

The other change would address a more serious defect: the contents of curstring are kept properly zero-terminated at all times, whenever depth is advanced or retracted. This zero-termination is unnecessary, since curstring is never used as a string except when depth == maxdepth. When printfing curstring, I could have used something like:

printf("%.*s%c\n",curstring,maxlen,curlet);
which prints exactly maxlen characters from the buffer, regardless of whether it is zero-terminated.

It would, however, have required that I know about %.*s, which I'm sure I did not. Was %.*s even available in 1988? I forget, and my copy of K&R First Edition is in a box somewhere since my recent move. Anyway, if %.*s was unavailable for whatever reason, the code could have had a single curstring[maxdepth] = 0 up front, which would have been quite sufficient for the one printf it needed to do.

Coming next: one very different program to solve the same problem, and a comparison with last month's effort.

Fri, 05 Oct 2007

Van der Waerden's problem: program 2
In this series of articles I'm going to analyze four versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code.

If you don't remember what the program does, here's an explanation.

Here is program 1, which was an earlier attempt to do the same thing.

Program 2

In yesterday's article I wrote about a crappy program to search for "good" strings in van der Waerden's problem. It was crappy because it searched the entire space of all 327 strings, with no pruning.

I can't remember whether I expected this to be practical at the time. Did I really think it would work? Well, there was some sense to it. It does work just fine for the 29 case. I think probably my idea was to do the simplest thing that could possibly work, and get as much information out of it as I could. On my current machine, this method proves that V(3,3) > 19 by finding a witness (RRBRRBBYYRRBRRBBYYB) in under 10 seconds. If we estimate that the computer I had then was 10,000 times slower, then I could have produced the same result in about 28 hours. I was at college, and there was plenty of free computing power available, so running a program for 28 hours was easily done. While I was waiting for it to finish, I could work on a better program.

Excerpts of the better program follow. The complete source code is here.

The idea behind this program is that the strings of length less than V form a tree, with the empty string as the root, and the children of string s are obtained from s by appending a single character to the end of s. If the string at a node is bad, so will be all the strings under it, and we can prune the entire branch at that node. This leaves us with a tree of all the good strings. The ones farthest from the root will be the witnesses we seek for the values of V(n, C), and we can find these by doing depth-first search on the tree,

There is nothing wrong with this idea in principle; that's the way my current program works too. The problem is all in the implementation. You see, this program actually constructs the entire tree in memory:

#define NEWN		((struct tree *) Malloc(sizeof(struct tree)));\
printf("*")
struct tree {
struct tree *away[MAXCOLORS];
} *root;
struct tree is a tree node structure. It represents a string s, and has a flag to record whether s is bad. It also has pointers to its subnodes, which will represents strings sA, sB, and so on.

MAXCOLORS is a compiled-in limit on the number of different symbols the strings can contain, an upper bound on C. Apparently I didn't know the standard technique for avoiding this inflexibility. You declare the array as having length 1, but then when you allocate the structure, you allocate enough space for the array you are actually planning to use. Even though the declared size of the array is 1, you are allowed to refer to node->away[37] as long as there is actually enough space in the allocated chunk. The implementation would look like this:

struct tree {
struct tree *away[1];
} ;

struct tree *make_tree_node(char bad, unsigned n_subnodes)
{
struct tree *t;
unsigned i;

t =  malloc(sizeof(struct tree)
+ (n_subnodes-1) * sizeof(struct tree *));

if (t == NULL) return NULL;

for (i=0; i < n_subnodes; i++) t->away[i] = NULL;

return t;
}
(Note for those who are not advanced C programmers: I give you my solemn word of honor that I am not doing anything dodgy or bizarre here; it is a standard, widely-used, supported technique, guaranteed to work everywhere.)

(As before, this code is in a pink box to indicate that it is not actually part of the program I am discussing.)

Another thing I notice is that the NEWN macro is very weird. Note that it may not work as expected in a context like this:

for(i=0; i<10; i++)
s[i] = NEWN;
This allocates ten nodes but prints only one star, because it expands to:

for(i=0; i<10; i++)
s[i] = ((struct tree *) Malloc(sizeof(struct tree)));
printf("*");
and the for loop does not control the printf. The usual fix for multiline macros like this is to wrap them in do...while(0), but that is not appropriate here. Had I been writing this today, I would have made NEWN a function, not a macro. Clevermacroitis is a common disorder of beginning C programmers, and I was no exception.

The main business of the program is in the makenodes function; the main routine does some argument processing and then calls makenodes. The arguments to the makenodes function are the current tree node, the current string that that node represents, and an integer howfar that says how deep a tree to construct under the current node.

There's a base case, for when nothing needs to be constructed:

if (!howfar)
{
for (i=0; i<colors; i++)
n->away[i] = NULL;
return;
}
But in general the function calls itself recursively:

for (i=0; i<colors; i++)
{
n->away[i] = NEWN;
if (apchk(s,'A'+i))
{
}
else
...
Recall that apchk checks a string for an arithmetic progression of equal characters. That is, it checks to see if a string is good or bad. If the string is bad, the function prunes the tree at the current node, and doesn't recurse further.

Unlike the one in the previous program, this apchk doesn't bother checking all the possible arithmetic progressions. It only checks the new ones: that is, the ones involving the last character. That's why it has two arguments. One is the old string s and the other is the new symbol that we want to append to s.

If s would still be good with symbol 'A'+i appended to the end, the function recurses:

...
else
{
ls = strlen(s);
newarg = STRING(ls + 1);
strcpy(newarg,s);
newarg[ls+1] = '\0';
newarg[ls] = 'A' + i;
makenodes(n->away[i],howfar-1,newarg);
Free(newarg,ls+2);
Free(n->away[i],sizeof(struct tree));
}
}
}
The entire string is copied here into a new buffer. A better technique sould have been to allocate a single buffer back up in main, and to reuse that buffer over again on each call to makenodes. It would have looked something like this:

char *s = String(maxlen);
memset(s, 0, maxlen+1);
makenodes(s, s, maxlen);

void
makenodes(char *start, char *end, unsigned howfar)
{
...
for (i=0; i<colors; i++) {
*end = 'A' + i;
makenodes(start, end+1, howfar-1);
}
*end = '\0';
...
}
This would have saved a lot of consing, ahem, I mean a lot of mallocing. Also a lot of string copying. We could avoid the end pointer by using start+maxlen-howfar instead, but this way is easier to understand.

I was thinking this afternoon how it's intersting the way I wrote this. It's written the way it would have been done, had I been using a functional programming language. In a functional language, you would never mutate the same string for each function call; you always copy the old structure and construct a new one, just as I did in this program. This is why C programmers abominate functional languages.

Had I been writing makenodes today, I would probably have eliminated the other argument. Instead of passing it a node and having it fill in the children, I would have had it construct and return a complete node. The recursive call would then have looked like this:

struct tree *new = NEWN;
...
for (i=0; i<colors; i++) {
new->away[i] = makenodes(...);
...
}
return new;
One thing I left out of all this was the diagnostic printfs; you can see them in the complete code if you want. But there's one I thought was worth mentioning anyway:

#define TABS	"                                        "
....

#ifdef DIAG
printf("%s makenoding with string %s, depth %d.\n",
TABS+12-maxlen+howfar,s,maxlen-howfar);
#endif
The interesting thing here is the TABS+12-maxlen+howfar argument, which indents the display depending on how far the recursion has progressed. In Perl, which has nonaddressable strings, I usually do something like this:

my $TABS = " " x (maxlen - howfar); print$TABS, "....";
The TABS trick here is pretty clever, and I'm a bit surprised that I thought of it in 1988, when I had been programming in C for only about a year. It makes an interesting contrast to my failure to reuse the string buffer in makenodes earlier.

(Peeking ahead, I see that in the next version of the program, I did reuse the string buffer in this way.)

TABS is actually forty spaces, not tabs. I suspect I used tabs when I tested it with V(2, 3), where maxlen was only 9, and then changed it to spaces for calculating V(3, 3), where maxlen was 27.

The apchk function checks to see if a string is good. Actually it gets a string, qq, and a character, q, and checks to see if the concatenation of qq and q would be good. This reduces its running time to O(|qq|) rather than O(|qq|2).

int
apchk(qq,q)
char *qq ,q;
{
int lqq, f, s, t;

t = lqq = strlen(qq);
if (lqq < 2) return NO;

for (f=lqq % 2; f <= lqq - 2; f += 2)
{
s = (f + t) / 2;
if ((qq[f] == qq[s]) && (qq[s] == q))
return YES;
}
return NO;
}
It's funny that it didn't occur to me to include an extra parameter to avoid the strlen, or to use q instead of qq[s] in the first == test. Also, as in the previous program, I seem unaware of the relative precedences of && and ==. This is probably a hangover from my experience with Pascal, where the parentheses are required.

It seems I hadn't learned yet that predicate functions like apchk should be named something like is_bad, so that you can understand code like if (is_bad(s)) { ... } without having to study the code of is_bad to figure out what it returns.

I was going to write that I hated this function, and that I could do it a lot better now. But then I tried to replace it, and wasn't as successful as I expected I would be. My replacement was:

unsigned
{
size_t qql = strlen(qq);
char *f = qq + qql%2;
char *s = f + qql/2;
while (f < s) {
if (*f == q && *s == q) return 1;
f += 2; s += 1;
}
return 0;
}
I could simplify the initializations of f and s, which are the parts I dislike most here, by making the pointers move backward instead of forward, but then the termination test becomes more complicated:
unsigned
{
char *s = strchr(qq, '\0')-1;
char *f = s-1;
while (1) {
if (*f == q && *s == q) return 1;
if (f - qq < 2) break;
f -= 2; s -= 1;
}
return 0;
}
Anyway, I thought I could improve it, but I'm not sure I did. On the one hand, I like the f -= 2; s -= 1;, which I think is pretty clear. On the other hand, s = (f + t) / 2 is pretty clear too; s is midway between f and t. I'm willing to give teenage Dominus a passing grade on this one.

Someone probably wants to replace the while loop here with a for loop. That person is not me.

The Malloc and Free functions track memory usage and were presumably introduced when I discovered that my program used up way too much memory and crashed—I think I remember that the original version omitted the calls to free. They aren't particularly noteworthy, except perhaps for this bit, in Malloc:

if (p == NULL)
{
fprintf(stderr,"Couldn't get %d bytes.\n",c);
fprintf(stderr,"Total get was %d.\n",gotten);
fprintf(stderr,"P\n L\n  O\n   P\n    !\n");
abort();
}
Plop!

It strikes me as odd that I was using void in 1988 (this is before the C90 standard) but still K&R-style function declarations. I don't know what to make of that.

Behavior

This program works, almost. On my current machine, it can find the length-26 witnesses for V(3, 3) in no time. (In 1998, it took several days to run on a Sequent Balance 21000.) The major problem is that it gobbles memory: the if (!howfar) base case in makenodes forgets to release the memory that was allocated for the new node. I wonder if the Malloc and Free functions were written in an unsuccessful attempt to track this down.

Sometime after I wrote this program, while I was waiting for it to complete, it occurred to me that it never actually used the tree for anything, and I could take it out.

I have this idea that one of the principal symptoms of novice programmers is that they take the data structures too literally, and always want to represent data the way it will appear when it's printed out. I haven't developed the idea well enough to write an article about it, but I hope it will show up here sometime in the next three years. This program, which constructs an entirely unnecessary tree structure, may be one of the examples of this idea.

I'll show the third version sometime in the next few days, I hope.

[ Addendum 20071014: Here is part 3. ]

Thu, 04 Oct 2007

The world's worst macro preprocessor: postmortem
I see that the world's worst macro processor, subject of a previous article, is a little over a year old. A year ago I said that it was a huge success. I think it's time for a postmortem analysis.

My overall assessment is that it has been a huge success, and that if I were doing it over I would do it the same way.

A recent article contained a bunch of red and blue dots:

Well, clearly you can do four: . And then you can add another red one on the end: . And then another that could be either red or blue: . And then the next can be either color, say blue: .

I typed this using these macros:

#define R* <span style="color: red">&bull;</span>
#define B* <span style="color: blue">&bull;</span>
#define Y* <span style="color: yellow">&bull;</span>
Without the macro processor, I would have had to suffer a lot. Then, a little while later, I needed to prepare this display:

No problem; the lines just look like R*R*B*B*R*R*B*Y*B*Y*Y*R*Y*R*R*B*R*B*B*Y*R*Y*Y*B*Y*B*.

Some time later I realized that this display would be totally illegible to the blind, the color-blind, and people using text-only browsers. So I just changed the macros:

#define R* <span style="color: red">R</span>
#define B* <span style="color: blue">B</span>
#define Y* <span style="color: yellow">Y</span>
Problem solved. instantly becomes R R B B R B B. And a good thing, too, because I discovered afterward that a lot of aggregators, like bloglines and feedburner, discard the color information.

I find that I've used the macro feature 114 times so far. The most common use has been:

#define ^2 <sup>2</sup>
But I also have files with:

That last one appears in three files. Clearly, making the macros local to files was a good decision.

Those uses are pretty typical. A less typical one is:

#define <OVL> <span style="text-decoration: overline">
#define </OVL> </span>
This is the sort of thing that you can get away with on a one-time basis, but which you wouldn't want to make a convention of. Since the purpose of the macro processor is to enable such hacks for the duration of a single article, it's all good.

I did run into at least one problem: I was writing an article in which I had defined ^i to abbreviate <sup><i>i</i></sup>. And then several paragraphs later I had a TeX formula that contained the ^i sequence in its TeX meaning. This was being replaced with a bunch of HTML, which was then passed to TeX, which then produced the wrong output.

One can solve this by reordering the plugins. If I had put the TeX plugin before the macro plugin, the problem would have gone away, because the TeX plugin would have replaced the TeX formula with an image element before the macro plugin ever saw the ^i.

This approach has many drawbacks. One is that it would no longer have been possible to use Blosxom macros in a TeX formula. I wasn't willing to foreclose this possibility, and I also wasn't sure that I hadn't done it somewhere. If I had, the TeX formula that depended on the macro expansion would have broken. And this is a risk whenever you move the macro plugin: if you move it from before plugin X to after plugin X, you have to worry that maybe something in some article depended on the text passed to X having been macro-processed.

When I installed the macro processor, I placed it first in plugin order for precisely this reason. Moving the macro substitution later would have required me to remember which plugins would be affected by the macro substitutions and which not. With the macro processing first, the question has a simple answer: all of them are affected.

Also, I didn't ever want to have to worry that some macro definition might mangle the output of some plugin. What if you are hacking on some plugin, and you change it to return <span style="Foo"> instead of <span style="foo">, and then discover that three articles you wrote back in 1997 are now totally garbled because they contained #define Foo >WUGGA<? It's just too unpredictable. Having the macro processing occur first means that you can always see in the original article file just what might be macro-replaced.

So I didn't reorder the plugins.

Another way to solve the TeX ^i problem would have been to do something like this:

#define ^i <sup><i>i</i></sup>
#define ^*i ^i
with the idea that I could write ^*i in the TeX formula, and the macro processor would replace it with ^i after it was done replacing all the ^i's.

At present the macro processor does not define any order to macro replacements, but it does guarantee to replace each string only once. That is, the results of macro replacement are not themselves searched for macro replacement. This limits the power of the macro system, but I think that is a good thing. One of the powers that is thus proscribed is the power to get stuck in an infinite loop.

It occurs to me now that although I call it the world's worst macro system, perhaps that doesn't give me enough credit for doing good design that might not have been obvious. I had forgotten about my choice of single-substituion behavior, but looking back on it a year later, I feel pleased with myself for it, and imagine that a lot of people would have made the wrong choice instead.

(A brief digression: unlimited, repeated substitution is a bad move here because it is complex—much more complex than it appears. A macro system with single substitution is nothing much, but a macro system with repeated substitution is a programming language. The semantics of the λ-calculus is nothing more than simple substitution, repeated as necessary, and the λ-calculus is a maximally complex computational engine. Term-rewriting systems are a more obvious theoretical example, and TeX is a better-known practical example of this phenomenon. I was sure I did not want my macro system to be a programming language, so I avoided repeated substitution.)

Because each input text is substituted at most once, the processor's refusal to define the order of the replacements is not something you have to think about, as long as your macros are prefix-unique. (That is, as long as none is a prefix of another.) So you shouldn't define:

#define foo   bar
#define fool  idiot
because then you don't know if foolish turns into barlish or idiotish. This is not a big deal in practice.

Well, anyway, I did not solve the problem with #define ^*i ^i. I took a much worse solution, which was to hack a #undefall directive into the macro processor. In my original article, I boasted that the macro processor "has exactly one feature". Now it has two, and it's not an improvement. I disliked the new feature at the time, and now that I'm reviewing the decision, I think I'm going to take it out.

I see that I did use the double-macro solution elsewhere. In the article about Gödel and the U.S. Constitution, I macroed an abbreviation for the umlaut:

#define Godel G&ouml;del
But this sequence also ocurred in the URLs in the link elements, and the substitution broke the links. I should probably have changed this to:

#define Go:del G&ouml;del

#define GODEL Godel
and then used GODEL in the URLs. Oh well, whatever works, I guess.

Perhaps my favorite use so far is in an (unfinished) article about prosopagnosia. I got tired of writing about prosopagnosia and prosopagnosiacs, so

#define PAa prosopagnosia
#define PAic prosopagnosiac
Note that with these definitions, I get PAa's, and PAics for free. I could use PAac instead of defining PAic, but that would prevent me from deciding later that prosopagnosiac should be spelled "prosopagnosic".

Wed, 03 Oct 2007

Van der Waerden's problem: program 1
In this series of articles I'm going to analyze four versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code.

If you don't remember what the program does, here's an explanation.

Program 1

I'm going to discuss the program a bit at a time. The complete program is here.

This program does an unpruned exhaustive search of the string space. Since for V(3, 3) the string space contains 327 = 7,625,597,484,987 strings, it takes a pretty long time to finish. I quickly realized that I was wasting my time with this program.

The program is invoked with a length argument and an optional colors argument, which defaults to 2. It then looks for good strings of the specified length, printing those it finds. If there are none, one then knows that V(3, colors) > length. Otherwise, one knows that V(3, colors) ≤ length, and has witness strings to prove it.

I don't want to spend a lot of time on it because there are plenty of C programming style guides you can read if you care for that. But already on lines 4–5 we have something I wouldn't write today:

#define NO	0
#define YES	!NO
Oh well.

The program wants to iterate through all Cn strings. How does it know when it's done? It's not easy to make a program as slow as this one even slower, but I found a way to do it.

last = STRING(length);
stuff(last,'A' - 1 + colors);

for (i=0; i<colors; i++)
last[i] = 'A' + i;

for (; strcmp(seq,last); strinc(seq))
...
It manufactures the string ABCDDDDDDDDD....D and compares the current string to that one every time through the loop. A much simpler method is to detect completion while incrementing the target string. The function that does the increment looks like this:

void
strinc(s)
char *s;
{
int i;

for (i= length - 1; i>=0; i--)
{
if (s[i] != 'A' - 1 + colors)
{
s[i]++;
return;
}
s[i] = 'A';
}
return;
}
Had I been writing it today, it would have looked more like this:

unsigned strinc(char *s)
{
char *p = strchr(s, '\0') - 1;
while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
if (p < s) return 0;
(*p)++;
return 1;
}
(This code is in a pink box to show that it is not actually part of the program I am discussing in this article.)

The function returns true on success and false on failure. A false return can be taken by the caller as the signal to terminate the program.

This replacement function invokes undefined behavior, because there is no guarantee that p is allowed to run off the beginning of the string in the way that it does. But there is no need to check the strings in lexicographic order. Instead of scanning the strings in the order AAA, AAB, ABA, ABB, BAA, etc., one can scan them in reverse lexicographic order: AAA, BAA, ABA, BBA, AAB, etc. Then instead of running off the beginning of the string, p runs off the end, which is allowed. This fixes the undefined behavior problem and also eliminates the call to strchr that finds the end of the string. This is likely to produce a significant speedup:

unsigned strinc(char *s)
{
while (*s == 'A' + colors - 1) *s++ = 'A';
if (!*s) return 0;
(*s)++;
return 1;
}
Here we're depending on the optimizer to avoid recomputing the value of 'A' + colors - 1 every time through the loop.

The heart of the program is the apchk() function, which checks whether a string q contains an arithmetic progression of length 3:

int
apchk(q)
char *q;
{
int f, s, t;

for (f=0; f <= length - 3; f++)
for (s=f+1; s <= length - 2; s++)
{
t = s+s-f;
if (t >= length) break;
if ((q[f] == q[s]) && (q[s] == q[t])) return YES;
}
return NO;
}
I hesitate to say that this is the biggest waste of time in the whole program, since after all it is a program whose job is to examine 7,625,597,484,987 strings. But look. 2/3 of the calls to this function are asking it to check a string that differs from the previous string in the final character only. Nevertheless, it still checks all 49 possible arithmetic progressions, even the ones that didn't change.

The t ≥ length test is superfluous, or if it isn't, it should be.

Also notice that I wasn't sure of the precendence in the final test.

It didn't take me long to figure out that this program was not going to finish in time. I wrote a series of others, which I hope to post here in coming days. The next one sucks too, but in a completely different way.

[ Addendum 20071005: Here is part 2. ]

[ Addendum 20071014: Here is part 3. ]

Tue, 02 Oct 2007

Van der Waerden's problem
In this series of articles I'm going to analyze four versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code.

First I'll explain what the programs are about.

Van der Waerden's problem

Color each of a row of dots red or blue, so that no three evenly-spaced dots are the same color. (That is, if dots n and n+i are the same color, dot n+2i must be a different color.) How many dots can you do?

Well, clearly you can do four: R R B B. And then you can add another red one on the end: R R B B R. And then another that could be either red or blue: R R B B R B. And then the next can be either color, say blue: R R B B R B B.

But now you are at the end, because if you make the next dot red, then dots 2, 5, and 8 will all be red (R R B B R B B R), and if you make the next dot blue then dots 6, 7, and 8 will be blue (R R B B R B B B).

But maybe we made a mistake somewhere earlier, and if the first seven dots were colored differently, we could have made a row of more than 7 that obeyed the no-three-evenly-spaced-dots requirement. In fact, this is so: R R B B R R B B is an example.

But this is the end of the line. Any coloring of a row of 9 dots contains three evenly-spaced dots of the same color. (I don't know a good way to prove this, short of an enumeration of all 512 possible arrangements of dots. Well, of course it is sufficient to enumerate the 256 that begin with R, but that is pretty much the same thing.)

[Addendum 20141208: In this post I give a simple argument that !!V(3,2)\le 9!!.]

Van der Waerden's theorem says that for any number of colors, say C, a sufficiently-long row of colored dots will contain n evenly-spaced same-color dots for any n. Or, put another way, if you partition the integers into C disjoint classes, at least one class will contain arbitrarily long arithmetic progressions.

The proof of van der Waerden's theorem works by taking C and n and producing a number V such that a row of V dots, colored with C colors, is guaranteed to contain n evenly-spaced dots of a single color. The smallest such V is denoted V(n, C). For example V(3, 2) is 9, because any row of 9 dots of 2 colors is guaranteed to contain 3 evenly-spaced dots of the same color, but this is not true of such row of only 8 dots.

Van der Waerden's theorem does not tell you what V(n, C) actually is; it provides only an upper bound. And here's the funny thing about van der Waerden's theorem: the upper bound is incredibly bad.

For V(3, 2), the theorem tells you only that V(3, 2) ≤ 325. That is, it tells you that any row of 325 red and blue dots must contain three evenly spaced dots of the same color. This is true, but oh, so sloppy, since the same is true of any row of 9 dots.

For V(3, 3), the question is how many red, yellow, and blue dots do you need to guarantee three evenly-spaced same-colored dots. The theorem helpfully suggests that:

$$V(3,3) \leq 7(2\cdot3^7+1)(2\cdot3^{7(2\cdot3^7+1)}+1)$$

This is approximately 5.79·1014613. But what is the actual value of V(3, 3)? It's 27. Urgggh.

In fact, there is a rather large cash prize available to be won by the first person who comes up with a general upper bound for V(n, C) that is smaller than a tower of 2's of height n. (That's 222... with n 2's.)

In the rest of this series, a string which does not contain three evenly-spaced equal symbols will be called good, and one which does contain three such symbols will be called bad. Then a special case of Van der Waerden's theorem, with n=3, says that, for any fixed number of symbols, all sufficiently long strings are bad.

In college I wanted to investigate this a little more. In particular, I wanted to calculate V(3, 3). These days you can just look it up on Wikipedia, but in those benighted times such information was hard to come by. I also wanted to construct the longest possible good strings, witnesses of length V(3, 3)-1. Although I did not know it at the time, V(3, 3) = 27, so a witness should have length 26. It turns out that there are exactly 48 witnesses of length 26. Here are the 1/6 of them that begin with RB or RRB:

RRBBRRBYBYYRYRRBRBBYRYYBYB
RRBBYRRYRYBBYYBBYRYRRYBBRR
RRBYBRRYRYBBYYBBYRYRRBYBRR
RBRRBRBYYBBYYBRBRRBYYRRYRY
RBRBBRRYBBYBYRRYYRRYBYBBYR
RBRBBRRYBBYBYRRYYRRYBYBBYB
RBRBBYBRRYRYYBYBBRBRYYRRYY
RBYYBYBRRBBRRBYBYYBRRYYRYR

The rest of the witnesses may be obtained by permuting the colors in these eight.

I wrote a series of C programs around 1988 to exhaustively search for good strings. Last month I was in a meeting and I decided to write the program again for some reason. I wrote a much better program. This series of articles will compare the five programs. I will post the first one tomorrow.

[ Addendum 20071003: Here is part 1. ]

[ Addendum 20071005: Here is part 2. ]

[ Addendum 20071005: I made a mistake in the expression I gave for the upper bound on V(3,3) and left out a factor of 7 in the exponent on the last 3. I had said that the upper bound was around 102092, but actually it is more like the seventh power of this. ]

[ Addendum 20071014: Here is part 3. ]

Sat, 28 Jul 2007

Lightweight Database Strategies for Perl
Several years ago I got what I thought was a great idea for a three-hour conference tutorial: lightweight data storage techniques. When you don't have enough data to be bothered using a high-performance database, or when your data is simple enough that you don't want to bother with a relational database, you stick it in a flat file and hack up some file code to read it. This is the sort of thing that people do all the time in Perl, and I thought it would be a big seller. I was wrong.

I don't know why. I tried giving the class a snappier title, but that didn't help. I'm really bad at titles. Maybe people are embarrassed to think about all the lightweight data storage hackery they do in Perl, and feel that they "should" be using a relational database, and don't want to commit more resources to lightweight database techniques. Or maybe they just don't think there is very much to know about it.

But there is a lot to know; with a little bit of technique you can postpone the day when you need to go to an RDB, often for quite a long time, and often forever. Many of the techniques fall into the why-didn't-I-think-of-that category, stuff that isn't too weird to write or maintain, but that you might not have thought to try.

I think it's a good class, but since it never sold well, I've decided it would do more good (for me and for everyone else) if I just gave away the materials for free.

The class is in three sections. The first section is about using plain text files and talks about a bunch of useful techniques, such as how to do binary search on sorted text files (this is nontrivial) and how to replace records in-place, when they might not fit.

The second section is about the Tie::File module, which associates a flat text file with a Perl array.

The third section is about DBM files, with a comparison of the five major implementations. It finishes up with a discussion of some of Berkeley DB's lesser-known useful features, such as its DB_BTREE file type, which offers fast access like a hash but keeps the records in sorted order

• Text Files
• Rotating log file; deleting a user
• Copy the File
• -i.bak
• Using -i inside a program
• Problems with -i
• Atomicity issues
• Essential problem with files; fundamental operations; seeking
• Sorted files
• In-place modification of records
• Overwriting records
• Bytes vs. positions
• Gappy Files
• Fixed-length records
• Numeric indices
• Case study: lastlog
• Indexing
• Void fields
• Generic text indices
• Packed offsets
• Tie::File
• Tie::File Examples
• delete_user revisited
• Rotating log file revisited
• Most important thing to know about Tie::File
• Indexing with Tie::File
• Tie::File Internals
• Caching
• Record modification
• Immediate vs. Deferred Writing
• Autodeferring
• Miscellaneous Features
• DBM
• Common DBM Implementations
• What DBM Does
• Small DBMs: ODBM, NDBM, and SDBM
• GDBM
• DB_File
• Indexing revisited
• Ordered hashes
• Partial matching
• Sequential access
• Multiple values
• Filters
• BerkeleyDB

Online materials

Fri, 20 Jul 2007

"More intuitive" programming language syntax
Chromatic wrote an article today about The Broken Metric of "Intuitive to the Uneducated" Language Syntax in which he addresses the very common argument that some language syntax is better than some other because it is "more intuitive" or "easier for beginners to understand".

Chromatic says that these arguments are bunk because programming language syntax is much less important than programming language semantics. But I think that is straining at a gnat and swallowing a camel.

To argue that a certain programming language feature is bad because it is confusing to beginners, you have to do two things. You have to successfully argue that being confusing to beginners is an important metric. Chromatic's article tries to refute this, saying that it is not an important metric.

But before you even get to that stage, you first have to show that the programming language feature actually is confusing to beginners.

But these arguments are never presented with any evidence at all, because no such evidence exists. They are complete fabrications, pulled out of the asses of their propounders, and made of equal parts wishful thinking and bullshit.

 To support my assertion that nobody knows what makes programming hard for beginners, I wanted to cite this paper, The camel has two humps, by Dehnadi and Bornat, which I was rereading recently, but I couldn't find my copy and couldn't remember the title or authors. Happily, I eventually remembered. The abstract begins: Learning to program is notoriously difficult. A substantial minority of students fails in every introductory programming course in every UK university. Despite heroic academic effort, the proportion has increased rather than decreased over the years. Despite a great deal of research into teaching methods and student responses, we have no idea of the cause. But the situation isn't completely hopeless; the abstract also says: We have found a test for programming aptitude, of which we give details. We can predict success or failure even before students have had any contact with any programming language with very high accuracy, and by testing with the same instrument after a few weeks of exposure, with extreme accuracy. We present experimental evidence to support our claim. certain to succeed. What's the secret? Read and learn.

http://retractionwatch.com/2014/07/18/the-camel-doesnt-have-two-humps-programming-aptitude-test-canned-for-overzealous-conclusion/ Addendum 20160518: Bornat has retracted the paper mentioned above, which was never published. He says:

In 2006 I wrote an intemperate description of the results of an experiment carried out by Saeed Dehnadi. Many of the extravagant claims I made were insupportable, and I retract them. I continue to believe, however, that Dehnadi had uncovered the first evidence of an important phenomenon in programming learners. Later research seems to confirm that belief.
In particular, Bornat says “There wasn’t and still isn’t an aptitude test for programming based on Dehnadi’s work.” This retracts the specific claim that I quoted above. The entire retraction is worth reading.

Thu, 12 Jul 2007

Another useful utility
Every couple of years I get a good idea for a simple utility that will make my life easier. Last time it was the following triviality, which I call f:

#!/usr/bin/perl

my $field = shift or usage();$field -= 1 if $field > 0;$|=1;

while (<>) {
chomp;
my @f = split;
print $f[$field], "\n";
}

sub usage {
print STDERR "$0 fieldnumber\n"; exit 1; } I got tired of writing awk '{print$11}' when I wanted to extract the 11th field of some stream of data in a Unix pipeline, which is something I do about six thousand times a day. So I wrote this tiny thing. It was probably the most useful piece of software I wrote in that calendar year, and as you can see from the length, it certainly had the best cost-to-benefit ratio. I use it every day.

The point here is that you can replace awk '{print $11}' with just f 11. For example, f 11 access_log finds out the referrer URLs from my Apache httpd log. I also frequently use f -1, which prints the last field in each line. ls -l | grep '^l' | f -1 prints out the targets of all the symbolic links in the current directory. Programs like this won't win me any prizes, but they certainly are useful. Anyway, today's post was inspired by another similarly tiny utility that I expect will be similarly useful that I just finished. It's called runN: #!/usr/bin/perl use Getopt::Std; my %opt; getopts('r:n:c:v', \%opt) or usage();$opt{n} or usage();
$opt{c} or usage(); @ARGV = shuffle(@ARGV) if$opt{r};

my $N =$opt{n};
my %pid;
while (@ARGV) {
if (keys(%pid) < $N) {$pid{spawn($opt{c}, split /\s+/, shift @ARGV)} = 1; } else { delete$pid{wait()};
}
}

1 while wait() >= 0;

sub spawn {
my $pid = fork; die "fork:$!" unless defined $pid; return$pid if $pid; exec @_; die "exec:$!";
}
You can tell I just finished it because the shuffle() and usage() functions are unimplemented.

The idea is that you execute the program like this:

runN -n 3 -c foo arg1 arg2 arg3 arg4...
and it runs the commands foo arg1, foo arg2, foo arg3, foo arg4, etc., simultaneously, but with no more than 3 running at a time.

The -n option says how many commands to run simultaneously; after running that many the main control waits until one has exited before starting another.

If I had implemented shuffle(), then -r would run the commands in random order, instead of in the order specified. Probably I should get rid of -c and just have the program take the first argument as the command name, so that the invocation above would become runN -n 3 foo arg1 arg2 arg3 arg4.... The -v flag, had I implemented it, would put the program into verbose mode.

I find that it's best to defer the implementation of features like -r and -v until I actually need them, which might be never. In the past I've done post-analyses of the contents of ~mjd/bin, and what I found was that my tendency was to implement a lot more features than I needed or used.

In the original implementation, the -n is mandatory, because I couldn't immediately think of a reasonable default. The only obvious choice is 1, but since the point of the program was to run programs concurrently, 1 is not reasonable. But it occurs to me now that if I let -n default to 1, then this command would replace many of my current invocations of:

for i in ...; do
cmd $i done which I do quite a lot. Typing runN cmd ... would be a lot quicker and easier. As I've written before, when a feature you put in turns out to have unanticipated uses, it's a sign of a good, modular design. The code itself makes me happy for two reasons. One is that the program worked properly on the first try, which does not happen very often for me. When I was in elementary school, my teachers always complained that although I was very bright, I made a lot of careless mistakes because I was not methodical enough. They tried hard to fix this personality flaw. They did not succeed. The other thing I like about the code is that it's so very brief. Not to say that it is any briefer than it should be; I think it's just about perfect. One of the recurring themes of my study of programming for the last few years is that beginner programmers use way more code than is necessary, just like beginning writers use way too many words. The process and concurrency management turned out to be a lot easier than I thought they would be: the default Unix behavior was just exactly what I needed. I am particularly pleased with delete$pid{wait()}. Sometimes these things just come together.

The 1 while wait() >= 0 line is a non-obfuscated version of something I wrote in my prize-winning obfuscated program, of all places. Sometimes the line between the sublime and the ridiculous is very fine indeed.

Despite my wariness of adding unnecessary features, there is at least one that I will put in before I deploy this to ~mjd/bin and start using it. I'll implement usage(), since experience has shown that I tend to forget how to invoke these things, and reading the usage message is a quicker way to figure it out than is rereading the source code. In the past, usage messages have been good investments.

I'm tempted to replace the cut-rate use of split here with something more robust. The problem I foresee is that I might want to run a command with an argument that contains a space. Consider:

runN -n 2 -c ls foo bar "-l baz"
This runs ls foo, then ls bar, then ls -l baz. Without the split() or something like it, the third command would be equivalent to ls "-l baz" and would fail with something like -l baz: no such file or directory. (Actually it tries to interpret the space as an option flag, and fails for that reason instead.) So I put the split in to enable this usage. (Maybe this was a you-ain't-gonna-need-it moment; I'm not sure.) But this design makes it difficult or impossible to apply the command to an argument with a space in it. Suppose I'm trying to do ls on three directories, one of which is called old stuff. The natural thing to try is:

runN -n 2 -c ls foo bar "old stuff"
But the third command turns into ls old stuff and produces:

ls: old: No such file or directory
ls: stuff: No such file or directory
If the split() were omitted, it would just work, but then the ls -l baz example above would fail. If the split() were replaced by the correct logic, I would be able to get what I wanted by writing something like this:

runN -n 2 -c ls foo bar "'old stuff'"
But as it is this just produces another error:

ls: 'old: No such file or directory
ls: stuff': No such file or directory
Perl comes standard with a library called ShellWords that is probably close to what I want here. I didn't use it because I wasn't sure I'd actually need it—only time will tell—and because shell parsing is very complicated and error-prone, more so when it is done synthetically rather than by the shell, and even more so when it is done multiple times; you end up with horrible monstrosities like this:

s='q=echo "$s" | sed -e '"'"'s/'"'"'"'"'"'"'"'"'/'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'/g'"'"'; echo "s='"'"'"$q"'"'"'"; echo $s' q=echo "$s" | sed -e 's/'"'"'/'"'"'"'"'"'"'"'"'/g'; echo "s='"$q"'"; echo$s
So my fear was that by introducing a double set of shell-like interpretation, I'd be opening a horrible can of escape character worms and weird errors, and my hope was that if I ignored the issue the problems might be simpler, and might never arise in practice. We'll see.

[ Addendum 20080712: Aaron Crane wrote a thoughtful followup. Thank you, M. Crane. ]

Wed, 21 Feb 2007

A bug in HTML generation
A few days ago I hacked on the TeX plugin I wrote for Blosxom so that it would put the TeX source code into the ALT attributes of the image elements it generated.

But then I started to see requests in the HTTP error log for URLs like this:

/pictures/blog/tex/total-die-rolls.gif$${6/choose%20k}k!{N!/over%20/prod%20{i!}^{n_i}{n_i}!}/qquad%20/hbox{/rm%20where%20k%20=%20/sum%20n_i}$$.gif
Someone must be referring people to these incorrect URLs, and it is presumably me. The HTML version of the blog looked okay, so I checked the RSS and Atom files, and found that, indeed, they were malformed. Instead of <img src="foo.gif" alt="$TeX$">, they contained codes for <img src="foo.gif$TeX$">.

I tracked down and fixed the problem. Usually when I get a bug like this, I ask myself what I could learn from it. This one is unusual. I can't think of much. Here's the bug.

The <img> element is generated by a function called imglink. The arguments to imglink are the filename that contains the image (for use in the SRC attribute) and the text for the ALT attribute. The ALT text is optional. If it is omitted, the function tries to locate the TeX source code and fetch it. If this attempt fails, it continues anyway, and omits the ALT attribute. Then it generates and returns the HTML:

my $file = shift; ... my$alt = shift || fetch_tex($file); ...$alt = qq{alt="$alt"} if$alt;

qq{<img $alt border=0 src="$url">};
}
This function is called from several places in the plugin. Sometimes the TeX source code is available at the place from which the call comes, and the code has return imglink($file,$tex); sometimes it isn't and the code has return imglink($file) and hopes that the imglink function can retrieve the TeX. One such place is the branch that handles generation of tags for every type of output except HTML. When generating the HTML output, the plugin actually tries to run TeX and generate the resulting image file. For other types of output, it assumes that the image file is already prepared, and just calls imglink to refer to an image that it presumes already exists: return imglink($file, $tex) unless$blosxom::flavour eq "html";

return imglink($file.$tex) unless $blosxom::flavour eq "html"; The . here is a string concatenation operator. It's a bit surprising that I don't make more errors like this than I do. I am a very inaccurate typist. Stronger type checking would not have saved me here. Both arguments are strings, concatenation of strings is perfectly well-defined, and the imglink function was designed and implemented to accept either one or two arguments. The function did note the omission of the$tex argument, attempted to locate the TeX source code for the bizarrely-named file, and failed, but I had opted to have it recover and continue silently. I still think that was the right design. But I need to think about that some more.

The only lesson I have been able to extract from this so far is that I need a way of previewing the RSS and Atom outputs before publishing them. I do preview the HTML output, but in this case it was perfectly correct.

Wed, 14 Feb 2007

Subtlety or sawed-off shotgun?

 1 1 1 2 1 1 1 2 1 3 1 1 1 1 1 2 3 3 2 4 1 1 1 1 1 1 1 2 6 2 2 3 3 1 8 4 6 5 1 1 1 1 1 1 2 1 1 1 10 2 2 1 15 3 1 1 20 3 2 20 4 1 30 5 24 6 1 1 1 1 1 1 1 2 1 1 1 1 15 2 2 1 1 45 2 2 2 15 3 1 1 1 40 3 2 1 120 3 3 40 4 1 1 90 4 2 90 5 1 144 6 120

There's a line in one of William Gibson's short stories about how some situations call for a subtle and high-tech approach, and others call for a sawed-off shotgun. I think my success as a programmer, insofar as I have any, comes from knowing when to deploy each kind of approach.

In a recent article I needed to produce the table that appears at left.

This was generated by a small computer program. I learned a long time ago that although it it tempting to hack up something like this by hand, you should usually write a computer program to do it instead. It takes a little extra time up front, and that time is almost always amply paid back when you inevitably decide that that table should have three columns instead of two, or the lines should alternate light and dark gray, or that you forgot to align the right-hand column on the decimal points, or whatever, and then all you have to do is change two lines of code and rerun the program, instead of hand-editing all 34 lines of the output and screwing up two of them and hand-editing them again. And again. And again.

When I was making up the seating chart for my wedding, I used this approach. I wrote a raw data file, and then a Perl program to read the data file and generate LaTeX output. The whole thing was driven by make. I felt like a bit of an ass as I wrote the program, wondering if I wasn't indulging in an excessive use of technology, and whether I was really going to run the program more than once or twice. How often does the seating chart need to change, anyway?

Gentle readers, that seating chart changed approximately one million and six times.

 Order Higher-Order Perl from Powell's
The Nth main division of the table at left contains one line for every partition of the integer N. The right-hand entry in each line (say 144) is calculated by a function permcount, which takes the left-hand entry (say [5, 1]) as input. The permcount function in turn calls upon fact to calculate factorials and choose to calculate binomial coefficients.

But how is the left-hand column generated? In my book, I spent quite a lot of time discussing generation of partitions of an integer, as an example of iterator techniques. Some of these techniques are very clever and highly scalable. Which of these clever partition-generating techniques did I use to generate the left-hand column of the table?

Why, none of them, of course! The left-hand column is hard-wired into the program:

while (<DATA>) {
chomp;
my @p = split //;
...
}

...
__DATA__
1
11
2
111
12
3
...
51
6
I guessed that it would take a lot longer to write code to generate partitions, or even to find it already written and use it, than it would just to generate the partitions out of my head and type them in. This guess was correct. The only thing wrong with my approach is that it doesn't scale. But it doesn't need to scale.

The sawed-off shotgun wins!

Tue, 03 Oct 2006

Ralph Johnson on design patterns
Last month I wrote an article about design patterns which attracted a lot of favorable attention in blog world. I started by paraphrasing Peter Norvig's observation that:

"Patterns" that are used recurringly in one language may be invisible or trivial in a different language.

and ended by concluding:

Patterns are signs of weakness in programming languages.

When we identify and document one, that should not be the end of the story. Rather, we should have the long-term goal of trying to understand how to improve the language so that the pattern becomes invisible or unnecessary.

Ralph Johnson, one of the four authors of the famous book Design Patterns, took note of my article and responded. I found Johnson's response really interesting, and curious in a number of ways. I think everyone who was interested in my article should read his too.

[ Addendum 20070127: The link above to Ralph Johnson's response is correct, but your client will be rejected if you are referred from here. To see his blog page, visit the page without clicking on the link. ]

Johnson raises several points. First there is a meta-issue to deal with. Johnson says:

He clearly thinks that what he says is surprising. And other people think it is surprising, too. That is surprising to me.
I did think that what I had to say was interesting and worth saying, of course, or I would not have said it. And I was not surprised to find that other people agreed with me.

One thing that I did find surprising is the uniformity of other people's surprise and interest. There were dozens of blog posts and comments in the following two weeks, all pretty much saying what a great article I had written and how right I was. I tracked the responses as carefully as I could, and I did not see any articles that called me a dumbass; I did not see any except for Johnson's that suggested that what I was saying was unsurprising.

We can't conclude from this that I am right, of course; people agree with all sorts of stupid crap. But we can conclude that that what I said was surprising and interesting, since people were surprised and interested by it, even people who already have some knowledge of this topic. Johnson is right to be surprised by this, because he thought this was obvious and well-known, and that it was clearly laid out in his book, and he was mistaken. Many or most of the readers of his book have completely missed this point. I didn't miss it, but I didn't get it from the book, either.

Johnson and his three co-authors wrote this book, Design Patterns, which has had a huge influence on the way that programming is practiced. I think a lot of that influence has been malign. Any practice can be corrupted, of course, by being reduced to its formal aspects and applied in a rote fashion. (There's a really superb discussion of this in A. Ya. Khinchin's essay On the Teaching of Mathematics, and a shorter discussion in Polya's How to Solve It, in the section on "Pedantry and Mastery".) That will happen to any successful movement, and the Gang of Four can't take all the blame for that.

But if they really intended that everyone should understand that each design pattern is a demonstration of a weakness in its target language, then they blew it, because it appears that hardly anyone understood that.

Let's pause for a moment to imagine an alternate universe in which the subtitle of the Design Patterns book was not "Elements of Reusable Object-Oriented Software" but "Solutions for Recurring Problems in Object-Oriented Languages". And let's imagine that in each section, after "Pattern name", "Intent", "Motivation", "Applicability", and so forth, there was another subsection titled "Prophylaxis" that went something like this: "The need for the Iterator pattern in C++ appears to be due partly to its inflexible type system and partly to its lack of abstract iteration structures. The iterator pattern is unnecessary in the Python language, which avoids these defects as follows: ... at the expense of ... . In Common Lisp, on the other hand, ... (etc.)".

I would have liked to have seen that universe, but I suppose it's too late now. Oh well.

Anyway, moving on from meta-issues to the issues themselves, Johnson continues:

At the very end, he says that patterns are signs of weakness in programming languages. This is wrong.
This is interesting, and I was going to address it later, but I now think that it's the first evidence of a conceptual mistake that Johnson has made that underlies his entire response to my article, so I'll take it up now.

At the very end of his response, Johnson says:

No matter how complicated your language will be, there will always be things that are not in the language. These things will have to be patterns. So, we can eliminate one set of patterns by moving them into the language, but then we'll just have to focus on other patterns. We don't know what patterns will be important 50 years from now, but it is a safe bet that programmers will still be using patterns of some sort.
Here we are in complete agreement. So, to echo Johnson, I was surprised that he would think this was surprising. But how can we be in complete agrement if what I said was "wrong"? There must be a misunderstanding somewhere.

I think I know where it is. When I said "[Design] Patterns are signs of weakness in programming languages," what I meant was something like "Each design pattern is a sign of a weakness in the programming language to which it applies." But it seems that Johnson thinks that I meant that the very existence of design patterns, at all, is a sign of weakness in all programming languages everywhere.

If I thought that the existence of design patterns, at all, was a sign that current programming languages are defective, as a group, I would see an endpoint to programming language development: someday, we would have a perfect überlanguage in which it would be unnecessary to use patterns because all possible patterns would have been built in already.

I think Johnson thinks this was my point. In the passage quoted above, I think he is addressing the idea of the überlanguage that incorporates all patterns everywhere at all levels of abstraction. And similarly:

Some people like languages with a lot of features. . . . I prefer simple languages.

And again:

No matter how complicated your language will be, there will always be things that are not in the language.

But no, I don't imagine that someday we will have the ultimate language, into which every conceivable pattern has been absorbed. So a lot of what Johnson has to say is only knocking down a straw man.

What I imagine is that when pattern P applies to language L, then, to the extent that some programmer on some project finds themselves needing to use P in their project, the use of P indicates a deficiency in language L for that project.

The absence of a convenient and simple way to do P in language L is not always a problem. You might do a project in language L that does not require the use of pattern P. Then the problem does not manifest, and, whatever L's deficiencies might be for other projects, it is not deficient in that way for your project.

This should not be difficult for anyone to understand. Perl might be a very nice language for writing a program to compile a bioinformatic data file into a more reasonable form; it might be a terrible language for writing a real-time missile guidance system. Its deficiencies operate in the missile guidance project in a way that they may not in the data munging project.

But to the extent that some deficiency does come up in your project, it is a problem, because you are implementing the same design over and over, the same arrangement of objects and classes, to accomplish the same purpose. If the language provided more support for solving this recurring design problem, you wouldn't need to use a "pattern". Consider again the example of the "subroutine" pattern in assembly language: don't you have anything better to do than redesign and re-implement the process of saving the register values in a stack frame, over and over? Well, yes, you do. And that is why you use a language that has that built in. Consider again the example of the "object-oriented class" pattern in C: don't you have anything better to do than redesign and re-implement object-oriented method dispatch with inheritance, over and over? Yes, you do. And that is why you use a language that has that built in, if that is what you need.

By Gamma, Helm, Johnson, and Vlissides' own definition, the problems solved by patterns are recurring problems, and programmers must address them recurringly.

If these problems recurred in every language, we might conclude that they were endemic to programming itself. We might not, but it's hard to say, since if there are any such problems, they have not yet been brought to my attention. Every pattern discovered so far seems to be specific to only a small subset of the world's languages.

So it seems a small step to conclude that these recurring, language-specific problems are actually problems with the languages themselves. No problem is a problem in every language, but rather each problem is a red arrow, pointing at a design flaw in the language in which it appears.

Johnson continues:

Patterns might be a sign of weakness, but they might be a sign of simplicity. . . .
I think this argument fails, in light of the examples I brought up in my original article. The argument is loaded by the use of the word "simplicity". As Einstein said, things should be as simple as possible, but no simpler. In assembly language, "subroutine call" is a pattern. Does Johnson or anyone seriously think that C++ or Smalltalk or Common Lisp or Java would be improved by having the "subroutine call" pattern omitted? The languages might be "simpler", but would they be better?

The alternative, remember, is to require the programmer to use a "pattern": to make them consult a manual of "patterns" to implement a "general arrangement of objects and classes" to solve the subroutine-call problem every time it comes up.

I guess you could interpret that as a sign of "simplicity", but it's the wrong kind of simplicity. Language designers have a hard problem to solve. If they don't put enough stuff into the language, it'll be too hard to use. But if they put in too much stuff, it'll be confusing and hard to program, like C++. One reason it's hard to be a language designer is that it's hard to know what to put in and what to leave out. There is an extremely complex tradeoff between simplicity and functionality.

But in the case of "patterns", it's much easier to understand the tradeoff. A pattern, remember, is a general method for solving "a recurring design problem". Patterns might be a sign of "simplicity", but if so, they are a sign of simplicity in the wrong place, a place where the language needs to be less simple and more featureful. Because patterns are solutions to recurring design problems.

If you're a language designer, and a "pattern" comes to your attention, then you have a great opportunity. The programmers using your language have a recurring problem. They have to implement the same solution to it, over and over. Clearly, this is a good place to try to expend some design effort; perhaps you can trade off a little simplicity for some functionality and fix the language so that the problem is a problem no longer.

Getting rid of one recurring design problem might create new ones. But if the new problems are operating at a higher level of abstraction, you may have a win. Getting rid of the need for the "subroutine call" pattern in assembly language opened up all sorts of new problems: when and how do I do recursion? When and how do I do coroutines?

Getting rid of the "object-oriented class" pattern in C created a need for higher-level patterns, including the ones described in the Design Patterns book. When people didn't have to worry about implementing inheritance themselves, a lot of their attention was freed up, and they could notice patterns like Façade.

As Alfred North Whitehead says, civilization advances by extending the number of important operations which we can perform without thinking about them. The Design Patterns approach seems to be to identify the important operations and then to think about them over and over and over and over and over.

Or so it seems to me. Johnson's next paragraph makes me wonder if I've completely missed his point, because it seems completely senseless to me:

There is a trade-off between putting something in your programming language and making it be a convention, or perhaps putting it in the library. Smalltalk makes "constructor" be a convention. Arithmetic is in the library, not in the language. Control structures and exception handling are from the library, not in the language.
Huh? Why does "library" matter? Unless I have missed something essential, whether something is in the "language" or the "library" is entirely an implementation matter, to be left to the discretion of the compiler writer. Is printf part of the C language, or its library? The library, everyone knows that. Oh, well, except that its behavior is completely standardized by the language standard, and it is completely permissible for the compiler writer to implement printf by putting a special case into the compiler that is enabled when the compiler happens to see the directive #include <stdio.h>. There is absolutely no requirement that printf be loaded from a separate file or anything like that.

Or consider Perl's dbmopen function. Prior to version 5.000, it was part of the "language", in some sense; in 5.000 and later, it became part of the "library". But what's the difference, really? I can't find any.

Is Johnson talking about some syntactic or semantic difference here? Maybe if I knew more about Smalltalk, I would understand his point. As it is, it seems completely daft, which I interpret to mean that there's something that went completely over my head.

Well, the whole article leaves me wondering if maybe I missed his point, because Johnson is presumably a smart guy, but his argument about the built-in features vs. libraries makes no sense to me, his argument about simplicity seems so clearly and obviously dismantled by his own definition of patterns, and his apparent attack on a straw man seems so obviously erroneous.

But I can take some consolation in the thought that if I did miss his point, I'm not the only one, because the one thing I can be sure of in all of this is that a lot of other people have been missing his point for years.

Johnson says at the beginning that he "wasn't sure whether to be happy or unhappy". If I had written a book as successful and widely read as Design Patterns and then I found out that everyone had completely misunderstood it, I think I would be unhappy. But perhaps that's just my own grumpy personality.

[ Addendum 20080303: Miles Gould wrote a pleasant and insightful article on Johnson's point about libraries vs. language features. As I surmised, there was indeed a valuable point that went over my head. I said I couldn't find any difference between "language" and "library", but, as M. Gould explains, there is an important difference that I did not appreciate in this context. ]

Really real examples of HOP techniques in action
I recently stopped working for the University of Pennsylvannia's Informations Systems and Computing group, which is the organization that provides computer services to everyone on campus that doesn't provide it for themselves.

I used HOP stuff less than I might have if I hadn't written the HOP book myself. There's always a tradeoff with the use of any advanced techniques: it might provide some technical benefit, like making the source code smaller, but the drawback is that the other people you work with might not be able to maintain it. Since I'm the author of the book, I can be expected to be biased in favor of the techniques. So I tried to compensate the other way, and to use them only when I was absolutely sure it was the best thing to do.

There were two interesting uses of HOP techniques. One was in the username generator for new accounts. The other was in a generic server module I wrote.

Name generation

The name generator is used to offer account names to incoming students and faculty. It is given the user's full name, and optionally some additional information of the same sort. It then generates a bunch of usernames to offer the user. For example, if the user's name is "George Franklin Bauer, Jr.", it might generate usernames like:

george    bauer     georgef   fgeorge   fbauer    bauerf
gf        georgeb   fg        fb        bauerg    bf
georgefb  georgebf  fgeorgeb  fbauerg   bauergf   bauerfg
ge        ba        gef       gbauer    fge       fba
bgeorge   baf       gfbauer   gbauerf   fgbauer   fbgeorge
bgeorgef  bfgeorge  geo       bau       geof      georgeba
fgeo      fbau      bauerge   bauf      fbauerge  bauergef
bauerfge  geor      baue      georf     gb        fgeor
fbaue     bg        bauef     gfb       gbf       fgb
fbg       bgf       bfg       georg     georgf    gebauer
fgeorg    bageorge  gefbauer  gebauerf  fgebauer
The code that did this, before I got to it, was extremely long and convoluted. It was also extremely slow. It would generate a zillion names (slowly) and then truncate the list to the required length.

It was convoluted because people kept asking that the generation algorithm be tweaked in various ways. Each tweak was accompanied by someone hacking on the code to get it to do things a little differently.

I threw it all away and replaced it with a lazy generator based on the lazy stream stuff of Chapter 6. The underlying stream library was basically the same as the one in Chapter 6. Atop this, I built some functions that generated streams of names. For example, one requirement was that if the name generator ran out of names like the examples above, it should proceed by generating names that ended with digits. So:

sub suffix {
my ($s,$suffix) = @_;
smap { "$_$suffix" } $s; } # Given (a, b, c), produce a1, b1, c1, a2, b2, c2, a3... sub enumerate { my$s = shift;
lazyappend(smap { suffix($s,$_) } iota());
}

# Given (a, b, c), produce a, b, c, a1, b1, c1, a2, b2, c2, a3...
sub and_enumerate {
my $s = shift; append($s, enumerate($s)); } # Throw away names that are already used sub available_filter { my ($s, $pn) = @_;$pn ||= PennNames::Generate::InUse->new;
sgrep { $pn->available($_) } $s; } The use of the stream approach was strongly indicated here for two reasons. First, the number of names to generate wasn't known in advance. It was convenient for the generation module to pass back a data structure that encapsulated an unlimited number of names, and let the caller mine it for as many names as were necessary. Second, the frequent changes and tinkerings to the name generation algorithm in the past suggested that an extremely modular approach would be a benefit. In fact, the requirements for the generation algorithm chanced several times as I was writing the code, and the stream approach made it really easy to tinker with the order in which names were generated, by plugging together the prefabricated stream modules. Generic server For a different project, I wrote a generic forking server module. The module would manage a listening socket. When a new connection was made to the socket, the module would fork. The parent would go back to listening; the child would execute a callback function, and exit when the callback returned. The callback was responsible for communicating with the client. It was passed the client socket: sub child_callback { my$socket = shift;
# ... read and write the socket ...
return;   # child process exits
}
But typically, you don't want to have to manage the socket manually. For example, the protocol might be conversational: read a request from the client, reply to it, and so forth:

# typical client callback:
sub child_callback {
my $socket = shift; while (my$request = <$socket>) { # generate response to request print$socket $response; } } The code to handle the loop and the reading and writing was nontrivial, but was going to be the same for most client functions. So I provided a callback generator. The input to the callback generator is a function that takes requests and returns appropriate responses: sub child_behavior { my$request = shift;
if ($request =~ /^LOOKUP (\w+)/) { my$input = $1; if (my$result = lookup($input)) { return "OK$input $result"; } else { return "NOK$input";
}
} elsif ($request =~ /^QUIT/) { return; } elsif ($request =~ /^LIST/) {
my $N = my @N = all_names(); return join "\n", "OK$N", @N, ".";
} else {
return "HUH?";
}
}
This child_behavior function is not suitable as a callback, because the argument to the callback is the socket handle. But the child_behavior function can be turned into a callback:

$server->run(CALLBACK => make_callback(\&child_behavior)); make_callback() takes a function like child_behavior() and wraps it up in an I/O loop to turn it into a callback function. make_callback() looks something like this: sub make_callback { my$behavior = shift;
return sub {
my $socket = shift; while (my$request = <$socket>) { chomp$request;
my $response =$behavior->($request); return unless defined$response;
print $socket$response;
}
};
}
I think this was the right design; it kept the design modular and flexible, but also simple.

Wed, 20 Sep 2006

The world's worst macro preprocessor
Last week I added another plugin to my Blosxom installation. As I wrote before, the sole benefit of Blosxom is that it's incredibly simple and lightweight. So when I write plugins for it, I try to keep them incredibly simple and lightweight, lest I spoil the single major benefit of Blosxom. Sometimes I'm more successful, sometimes less so. This time I think I did a good job.

The goal last time was a macro processor. I write a lot of math articles. I get tired of writing <sup>2</sup> every time I want a superscript 2. Even if I bind a function key to that sequence of characters, it's hard to read. But now, with my new Blosxom macro processor, I just insert a line into my article that says:

#define ^2 <sup>2</sup>

and for the rest of the article, ^2 is expanded to <sup>2</sup>.

This has turned out really well, and I'm using it for all sorts of stuff. I use it for math notations, such as for making -> an abbreviation for &rarr; (→), and for making ~ an abbreviation for &not; (¬).

But I've also used it to #define Godel G&ouml;del. I've used it to #define KK <b>K</b> and #define SS <b>S</b>, which makes an article I'm writing about combinatory logic readable, where it wasn't readable before. In my recent article about job hunting, I used it to #define CV r&eacute;sum&eacute;, which saved me from having to interrupt my train of thought several times in the article.

There are some important points about the design that I think I got right on the first try. Whenever you write a macro system, you have to ask about escape sequences: what do you do if you don't want a macro expanded? For example, in the combinatory logic article I defined a macro SS. This meant that if I had written MOUSSE in the article somewhere, it would have turned into MOUSE. How should I prevent that kind of error?

Answer: I don't. I'm unlikely to do that. But if I do, I'll pick it up during the article proofreading phase. If I can't avoid writing MOUSSE, I have two choices: I can change the name of the SS macro to something easier to avoid—like S*, say, or I can define a second macro: #define !MOUSSE MOUSSE. But so far, it hasn't come up.

One alternative solution is to say that macros are expanded only in certain contexts. For example, SS might only be expanded when it is a complete word, not when it is in the middle of a word, as MOUSSE. I resisted this solution. It is much simpler to remember that every macro is expanded everywhere. And it it is much easier to fix the problem of a macro being expanded when I don't want it than it is to fix the problem of a macro not being expanded when I do want it. So every macro is expanded no matter where it appears.

Related to the unintentional-expansion issue is that each article has its own private macro set. I don't have to worry that by defining a macro named -> in one article that I might be sabotaging my opportunity to actually write -> in some unknown future article. Each set of macros can be totally ad hoc. I don't have to worry about global tradeoffs. Do I #define --- &mdash;, knowing that that will foreclose my opportunity to use --- in any other way? I can make the decision based on simple, local information.

It would have been tempting to over-engineer the system and add all sorts of complex escape facilities. I think I made the right choice here by not doing any of that.

Another escaping issue: What if I want to write something that looks like a definition but isn't? Here I avoided the problem by choosing a definition syntax that I was unlikely to write in any other context: #define in the leftmost column indicates a definition. In this article, I had to write some similar text. It was no trouble to indent it a couple of spaces, disabling the special meaning. But HTML is already full of escape mechanisms, and it would have been no trouble to write &#35;define instead of #define if for some reason I had really needed it to appear in the leftmost column. (Unlikely anyway, since HTML has no column semantics.)

Another right choice I think I made was not to parametrize the macros. An article on algebra might well have:

#define ^2 <sup>2</sup>
#define ^3 <sup>3</sup>
and it might be oh-so-tempting to try to eliminate the duplication à la C:

#define ^(\w+) <sup>$1</sup> I did not do this. It would have complicated the processing substantially. It would also have complicated the use of the package substantially: I would have to worry a lot more than I do about invoking macros unintentionally. And it is not needed. Not so far, anyway. Because macro definitions only last for the duration of the article, there is no pressure to make a complete or consistent set of definitions. If an article happens to use the notations 2, i, and N, I can define macros for those and only those notations. Also tempting is to extend the macro system to support something like this: #define BF(.*) <b>$1</b>
I have so far resisted this. My feeling is that if I want to do anything like this, I should take it as a sign that I should be writing the articles in some markup system other than HTML. Choice of that markup system should be made carefully, and not organically as an ad-hoc overburdening of the macro system.

I did run into one trouble with the macro system. Originally, it was invoked before some of my other plugins and after others. The earlier plugins automatically inserted certain text into the article that sometimes accidentally triggered my macros. I have not had any trouble with this since I changed the plugin order to invoke the macro processor before any of the other plugins.

The macro-processing code is about 19 lines long, of which three are diagnostic. It is the world's worst macro system. It has exactly one feature. It is, I think the simplest thing that could possibly work, and so a good companion to Blosxom. For this application, the world's worst macro system is the world's best.

[ Addendum 20071004: There's now a one-year retrospective analysis. ]

Mon, 11 Sep 2006

Design patterns of 1972
"Patterns" that are used recurringly in one language may be invisible or trivial in a different language.

Extended Example: "object-oriented class"

C programmers have a pattern that might be called "Object-oriented class". In this pattern, an object is an instance of a C struct.

struct st_employee_object *emp;
Or, given a suitable typedef:
EMPLOYEE emp;
Some of the struct members are function pointers. If "emp" is an object, then one calls a method on the object by looking up the appropriate function pointer and calling the pointed-to function:

emp->method(emp, args...);
Each struct definition defines a class; objects in the same class have the same member data and support the same methods. If the structure definition is defined by a header file, the layout of the structure can change; methods and fields can be added, and none of the code that uses the objects needs to know.

There are a bunch of variations on this. For example, you can get opaque implementation by defining two header files for each class. One defines the implementation:

struct st_employee_object {
unsigned salary;
struct st_manager_object *boss;
METHOD fire, transfer, competence;
};
The other defines only the interface:

struct st_employee_object {
char __SECRET_MEMBER_DATA_DO_NOT_TOUCH[4];
struct st_manager_object *boss;
METHOD fire, transfer, competence;
};
And then files include one or the other as appropriate. Here "boss" is public data but "salary" is private.

You get abstract classes by defining a constructor function that sets all the methods to NULL or to:

void _abstract() { abort(); }
If you want inheritance, you let one of the structs be a prefix of another:

struct st_manager_object;   /* forward declaration */

#define EMPLOYEE_FIELDS \
unsigned salary; \
struct st_manager_object *boss; \
METHOD fire, transfer, competence;

struct st_employee_object {
EMPLOYEE_FIELDS
};

struct st_manager_object {
EMPLOYEE_FIELDS
unsigned num_subordinates;
struct st_employee_object **subordinate;
};
And if obj is a manager object, you can still treat it like an employee and call employee methods on it.

This may seem weird or contrived, but the technique is widely used. The C standard contains guarantees that the common fields of struct st_manager_object and struct st_employee_object will be laid out identically in memory, specifically so that this object-oriented class technique can work. The code of the X window system has this structure. The code of the Athena widget toolkit has this structure. The code of the Linux kernel filesystem has this structure.

Rob Pike, one of the primary architects of the Plan 9 operating system (the Bell Labs successor to Unix) and co-author (with Brian Kernighan) of The Unix Programming Environment, recommends this technique in his article "Notes on Programming in C".

This is a pattern

There's only one way in which this technique doesn't qualify as a pattern according to the definition of Gamma, Helm, Johnson, and Vlissides. They say:

A design pattern systematically names, motivates, and explains a general design that addresses a recurring design problem in object-oriented systems. It describes the problem, the solution, when to apply the solution, and its consequences. It also gives implementation hints and examples. The solution is a general arrangement of objects and classes that solve the problem. The solution is customized and implemented to solve the problem in a particular context.

Their definition arbitrarily restricts "design patterns" to addressing recurring design problems "in object-oriented systems", and to being general arrangements of "objects and classes". If we ignore this arbitrary restriction, the "object-oriented class" pattern fits the description exactly.

The definition in Wikipedia is:

In software engineering, a design pattern is a general solution to a common problem in software design. A design pattern isn't a finished design that can be transformed directly into code; it is a description or template for how to solve a problem that can be used in many different situations.

And the "object-oriented class" solution certainly qualifies.

Codification of patterns

Peter Norvig's presentation on "Design Patterns in Dynamic Languages" describes three "levels of implementation of a pattern":

Invisible
So much a part of language that you don't notice

Formal
Implement pattern itself within the language
Instantiate/call it for each use
Usually implemented with macros

Informal
Design pattern in prose; refer to by name, but Must be reimplemented from scratch for each use

In C, the "object-oriented class" pattern is informal. It must be reimplemented from scratch for each use. If you want inheritance, you have to set it up manually. If you want abstraction, you have to set it up manually.

The single major driver for the invention of C++ was to codify this pattern into the language so that it was "invisible". In C++, you don't have to think about the structs and you don't have to worry about keeping data and methods private. You just declare a "class" (using syntax that looks almost exactly like a struct declaration) and annotate the items with "public" and "private" as appropriate.

But underneath, it's doing the same thing. The earliest C++ compilers simply translated the C++ code into the equivalent C code and invoked the C compiler on it. There's a reason why the C++ method call syntax is object->method(args...): it's almost exactly the same as the equivalent code when the pattern is implemented in plain C. The only difference is that the object is passed implicitly, rather than explicitly.

In C, you have to make a conscious decision to use OO style and to implement each feature of your OOP system as you go. If a program has fifty modules, you need to decide, fifty times, whether you will make the next module an OO-style module. In C++, you don't have to make a decision about whether or not you want OO programming and you don't have to implement it; it's built into the language.

Sherman, set the wayback machine for 1957

If we dig back into history, we can find all sorts of patterns. For example:

Recurring problem: Two or more parts of a machine language program need to perform the same complex operation. Duplicating the code to perform the operation wherever it is needed creates maintenance problems when one copy is updated and another is not.

Solution: Put the code for the operation at the end of the program. Reserve some extra memory (a "frame") for its exclusive use. When other code (the "caller") wants to perform the operation, it should store the current values of the machine registers, including the program counter, into the frame, and transfer control to the operation. The last thing the operation does is to restore the register values from the values saved in the frame and jump back to the instruction just after the saved PC value.

This is a "pattern"-style description of the pattern we now know as "subroutine". It addresses a recurring design problem. It is a general arrangement of machine instructions that solve the problem. And the solution is customized and implemented to solve the problem in a particular context. Variations abound: "subroutine with passed parameters". "subroutine call with returned value". "Re-entrant subroutine".

For machine language programmers of the 1950s and early 1960's, this was a pattern, reimplemented from scratch for each use. As assemblers improved, the pattern became formal, implemented by assembly-language macros. Shortly thereafter, the pattern was absorbed into Fortran and Lisp and their successors, and is now invisible. You don't have to think about the implementation any more; you just call the functions.

Iterators and model-view-controller

The last time I wrote about design patterns, it was to point out that although the movement was inspired by the "pattern language" work of Christopher Alexander, it isn't very much like anything that Alexander suggested, and that in fact what Alexander did suggest is more interesting and would probably be more useful for programmers than what the design patterns movement chose to take.

One of the things I pointed out was essentially what Norvig does: that many patterns aren't really addressing recurring design problems in object-oriented programs; they are actually addressing deficiencies in object-oriented programming languages, and that in better languages, these problems simply don't come up, or are solved so trivially and so easily that the solution doesn't require a "pattern". In assembly language, "subroutine call" may be a pattern; in C, the solution is to write result = function(args...), which is too simple to qualify as a pattern. In a language like Lisp or Haskell or even Perl, with a good list type and powerful primitives for operating on list values, the Iterator pattern is to a great degree obviated or rendered invisible. Henry G. Baker took up this same point in his paper "Iterators: Signs of Weakness in Object-Oriented Languages".

I received many messages about this, and curiously, some made the same point in the same way: they said that although I was right about Iterator, it was a poor example because it was a very simple pattern, but that it was impossible to imagine a more complex pattern like Model-View-Controller being absorbed and made invisible in this way.

This remark is striking for several reasons. It is an example of what is perhaps the most common philosophical fallacy: the writer cannot imagine something, so it must therefore be impossible. Well, perhaps it is impossible—or perhaps the writer just doesn't have enough imagination. It is worth remembering that when Edgar Allan Poe was motivated to investigate and expose Johann Maelzel's fraudulent chess-playing automaton, it was because he "knew" it had to be fraudulent because it was inconceivable that a machine could actually exist that could play chess. Not merely impossible, but inconceivable! Poe was mistaken, and the people who asserted that MVC could not be absorbed into a programming language were mistaken too. Since I gave my talk in 2002, several programming systems, such as Ruby on Rails and Subway have come forward that attempt to codify and integrate MVC in exactly the way that I suggested.

Progress in programming languages

Had the "Design Patterns" movement been popular in 1960, its goal would have been to train programmers to recognize situations in which the "subroutine" pattern was applicable, and to implement it habitually when necessary. While this would have been a great improvement over not using subroutines at all, it would have been vastly inferior to what really happened, which was that the "subroutine" pattern was codified and embedded into subsequent languages.

Identification of patterns is an important driver of progress in programming languages. As in all programming, the idea is to notice when the same solution is appearing repeatedly in different contexts and to understand the commonalities. This is admirable and valuable. The problem with the "Design Patterns" movement is the use to which the patterns are put afterward: programmers are trained to identify and apply the patterns when possible. Instead, the patterns should be used as signposts to the failures of the programming language. As in all programming, the identification of commonalities should be followed by an abstraction step in which the common parts are merged into a single solution.

Multiple implementations of the same idea are almost always a mistake in programming. The correct place to implement a common solution to a recurring design problem is in the programming language, if that is possible.

The stance of the "Design Patterns" movement seems to be that it is somehow inevitable that programmers will need to implement Visitors, Abstract Factories, Decorators, and Façades. But these are no more inevitable than the need to implement Subroutine Calls or Object-Oriented Classes in the source language. These patterns should be seen as defects or missing features in Java and C++. The best response to identification of these patterns is to ask what defects in those languages cause the patterns to be necessary, and how the languages might provide better support for solving these kinds of problems.

With Design Patterns as usually understood, you never stop thinking about the patterns after you find them. Every time you write a Subroutine Call, you must think about the way the registers are saved and the return value is communicated. Every time you build an Object-Oriented Class, you must think about the implementation of inheritance.

People say that it's all right that Design Patterns teaches people to do this, because the world is full of programmers who are forced to use C++ and Java, and they need all the help they can get to work around the defects of those languages. If those people need help, that's fine. The problem is with the philosophical stance of the movement. Helping hapless C++ and Java programmers is admirable, but it shouldn't be the end goal. Instead of seeing the use of design patterns as valuable in itself, it should be widely recognized that each design pattern is an expression of the failure of the source language.

If the Design Patterns movement had been popular in the 1980's, we wouldn't even have C++ or Java; we would still be implementing Object-Oriented Classes in C with structs, and the argument would go that since programmers were forced to use C anyway, we should at least help them as much as possible. But the way to provide as much help as possible was not to train people to habitually implement Object-Oriented Classes when necessary; it was to develop languages like C++ and Java that had this pattern built in, so that programmers could concentrate on using OOP style instead of on implementing it.

Summary

Patterns are signs of weakness in programming languages.

When we identify and document one, that should not be the end of the story. Rather, we should have the long-term goal of trying to understand how to improve the language so that the pattern becomes invisible or unnecessary.

[ Thanks to Garrett Rooney for pointing out some minor errors that I have since corrected. - MJD ]

[ Addendum 20061003: There is a followup article to this one, replying to a response by Ralph Johnson, one of the authors of the "Design Patterns" book. This link URL is correct, but Johnson's website will refuse it if you come from here. ]

Sat, 08 Jul 2006

A while back, I wrote an article in which I mentioned a programmer who had a problem, tried to solve it with weak references, and, as a result, had two problems. I said that weak references work unusually well in that little formula.

Yesterday I was about to make the same mistake. I had a problem, and weak references seemed like the solution. Fortunately, it was time to go home, which is a two-mile walk. Taking a two-mile walk is a great way to fix mistakes, especially the ones you haven't made yet. On this particular walk, I came to my senses and avoided the weak references.

The problem concerns the following classes and methods. You have a database object $db. You can call @rec =$db->lookup, which may return some record objects that represent records. You then call methods on the records, say $rec[3]->get_color, to extract data from them, or$rec[3]->set_color("purple"), to modify the data in the records. The updating is done in-memory only, and a later call to $db->flush writes all the updates back to the database. The database object needs to store the changes that have been made but not yet written out. The easy way to do this is to have it store a change log of the modified record objects. So set_color first makes its change to the target record object, and then calls an internal _update method on the original database object to attach the record to the change log. Later on, flush will process this array, writing out the indicated changes. In order for set_color to know which database to direct the _update call to, each record object must have a pointer back to the database that created it. This is convenient for other purposes too. Fine. But then if the record object is stored in the change log inside the database object, we now have a reference loop: the database contains a change log with a pointer to the record, which contains a pointer back to the database itself. This means that neither the database nor the record will ever be garbage collected. (This problem is common in complex Perl programs, and would simply vanish if Perl had even a slightly less awful garbage collector. Improvement is unlikely to occur before the release of Perl 6, now scheduled for October 28, 2073.) My first reaction when faced with a problem like this one is to gurgle contentedly in my sleep, turn over, and pull the blankets over my head. This strategy is the primary contributor to my success as a programmer; it is somewhat superior to the typical programmer's response, which is to swing into action, overthink the problem, and come up with an elaborate solution. Aron Nimzovitch once said that the problem chess novices have is the irrepressible urge to always be doing something. Programmers are similar. They are all very bright people, very good at solving problems, and they solve problems all the time, even the ones that don't need to be solved. I seem to be digressing. How unusual. In any case, this problem really did have to be solved. One wants the database object to flush out its pending changes at the time it becomes inacessible. If the object is never garbage collected, then the programmer must always remember to flush out the changes manually. Miss one call to flush, and your updates are lost. This is unacceptable. The primary purpose of a database is to record the updates. So I had to take my head out from under the covers, like it or not. I thought about several solutions, and even tried one out, but it was too complicated and got me into a horrible tar pit, so I threw it away and started over. (That is another superior strategy that programmers don't exercise as often as they should. As Erik Naggum says, they will drive a hundred miles through a forest, stopping every five feet to cut down another tree, instead of pausing to wonder if maybe they shouldn't have driven off the road in the first place.) Then I got the bright idea to use weak references, which seemed like just the thing. That's what weak references are for: breaking dependency loops so that things that need to be garbage collected can be. Fortunately, it was time to go, so I walked home instead of diving into the chyme-filled swimming pool of weak references. With the weak references, you need to decide which reference to weaken. There is a reference to the record object, in the change log inside the database object. And there is a reference to the database object, in the record object. Which do you weaken? If you weaken the reference to the record, you get a disaster: { my ($rec) = $db->lookup(...);$rec->set_color("purple");
}
$db->flush; When the block is exited, the last strong reference to the record goes away, and the modified record evaporates, leaving nothing inside the database object. The flush method can see by the lingering ghost that there was something there it was supposed to deal with, but it no longer knows what. So that choice is doomed. What if you weaken the reference inside the record, the one that points back to the database? That is hardly any better: my$rec;
{
my $db = FlatFile->new(...); ($rec) = $db->lookup(...); }$rec->set_color("purple");
We would like the database object to hang around as long as there are still some extant records from it. But because we weakened the references from the records to the database, it doesn't; it evaporates at the end of the block, leaving the record orphaned. The set_color method then fails, because the database to which it is supposed to write changes has evaporated.

Conclusion: I've heard it before, and it wasn't funny the first time.

On the walk home, I realized something else: actually storing the database data inside the record objects is a bad move. The general advice under which this is a bad move is something like Don't store the same data in two places. The specific problems in this instance are exemplified by this:

my ($a) =$db->lookup(unique_id => "142857");
my ($b) =$db->lookup(unique_id => "142857");
$a->set_color("red");$b->set_color("purple");
$a->color eq "purple"; # True or false? Since$a and $b represent the same record, the answer should be true. But in the implementation I had (and still have, actually; I haven't fixed this yet) it is false. The set_color method on$b updates the data that is cached in object $b, but has no idea that it should also update the data cached in$a.

To work properly, $a and$b should be identical objects. One way to do this is to store an object in memory for every record in the database, and hand out these preconstructed objects as needed; then both calls to lookup return the same object. This is time- and memory-intensive. Another way to do this is to cache the record objects as they are constructed, and arrange for lookup to return the cached objects when appropriate. This is more complicated.

A simpler solution is not to store the data in memory at all. Record objects are always created as needed, but contain nothing but a database handle and some sort of locator information that says how to get the record data, should it be asked for. ("Any problem can be solved by another layer of indirection," they say, although it's not really true. Still, there are several classes of problems that can be solved by adding another layer of indirection, and this particular object identity problem could serve well as an exemplar of one of those classes.) Then modifications don't go into the record objects themselves. Instead, they go into the database object as an instruction to modify a certain record in a certain way.

This solution, however, presupposes that there is a good way to build locator information for a flat file and update it as needed. Fortunately, there is. I did a really good job of solving this problem a few years ago when I wrote the Tie::File module. It represents a text file as a Perl array, so a record locator can simply be an index into the array, and a record object then becomes something like:

{
db => $db, recno => 37, } The change log inside the database object looks something like: { 0 => no change, 1 => no change, 2 => "color" field was set to "purple", 3 => no change, 4 => "size" field was set to "unusually large", ... } This happily gets rid of the garbage collection problem I had been trying to solve in the first place. Using Tie::File also eliminates a lot of I/O issues that I had solved before, and gets all the I/O code out of the database module. I had already been thinking about getting rid of the explicit I/O and having the database module depend on Tie::File, and when I recognized the lurking record object identity problem, I was convinced that it had to happen sooner rather than later. Having done it, I'm really pleased with the outcome. Fri, 07 Jul 2006 On design I'm writing this Perl module called FlatFile, which is supposed to provide lightweight simple access to flat-file databases, such as the Unix password file. An interesting design issue came up, and since I think that understanding is usually best served by minuscule examination of specific examples, that's what I'm going to do. The basic usage of the module is as follows: You create a database object that represents the entire database: my$db = FlatFile->new(FILE => "/etc/passwd",
'gecos', 'homedir', 'shell'],
FIELDSEP => ':',
) or die ...;
Then you can do queries on the database:

my @roots = $db->lookup(uid => 0); This returns a list of Record objects. (Actually it returns a list of FlatFile::Record::A objects, where FlatFile::Record::A is a dynamically-generated class that was manufactured at the time you did the new call, and which inherits from FlatFile::Record, but we can ignore that here.) Once we have the Record objects, we can query them or modify them: for my$root (@roots) {
if ($root->username eq 'root') {$root->set_shell('/bin/false');
} else {
$root->delete; } } This loops over the records that were selected in the earlier call and examines the username field in each one. if the username is root, the program sets the shell in the record to /bin/false; otherwise it deletes the record entirely. Since lookup returns all the matching records, there is the question of what this should do: my$root = $db->lookup(uid => 0); Here we have provided enough room for at most one root user. What if there is more than one? Every Perl function needs to make a decision about this issue. The function could be called in list context or in scalar context, and you need to choose the two behaviors sensibly. Here are some possibilities for what lookup might do if called in scalar context: 1. die unconditionally 2. return the number of matching records, analogous to the builtin grep function or the @array syntax 3. return the single matching record, if there is only one, and die if there is more than one. 4. return the first matching record, and discard the others 5. return a reference to an array of all matching records 6. return an iterator object which can be used to access all the matching records There are probably some other reasonable possibilities. How to decide on the best behavior? This is the kind of problem that I really enjoy. What will people expect? What will they want? What do they need? Two important criteria are: 1. Difficulty: Whatever I provide should be something that's not easy to get any other way. 2. Usefulness: Whatever I provide should be something that people will use a lot. The difficulty criterion argues strongly against behavior #5 (return an array), because it's too much like the current list context behavior. No matter what the method does in scalar context, no matter what design decision I make, the programmer will always be able to get behavior #5 very easily: my$ref = [ $db->lookup(...) ]; Or they can subclass the Record module and add a new one-line method that does the same: sub lookup_ref { my$self = shift;
[ $self->lookup(@_) ]; } Similarly, behavior #2 (return a count) is so easy to get that supporting it directly would probably not be a good use of my code or my precious interface space: my$N_recs = () = $db->lookup(...); I had originally planned to do #3 (require that the query produce a single record, on pain of death), and here's why: in my first forays into programming with this module, I frequently found myself writing things like my$rec = $db->lookup(...) without meaning to, and in spite of the fact that I had documented the behavior in scalar context as being undefined. I kept doing it unintentionally in cases where I expected only one record to be returned. So each time I wrote this code, I was putting in an implicit assumption that there would be only one match. I would have been quite surprised in each case if there had actually been multiple matches. That's the sort of assumption that you might like to have automatically checked. I ran the question by the folks on IRC, and reaction against this design was generally negative. Folks said that it's not the module's job to try to discern the programmer's intention and enforce this inference by committing suicide. I can certainly get behind that point of view. I once wrote an article complaining bitterly about modules that call die. I said it was like when you're having tea and crumpets on your 112-piece Spode china set, and you accidentally chip the teacup, and the butler comes running in, crying "Don't worry, Master! I'll take care of that for you!" and then he whips out a hammer and smashes all 112 pieces of china to tiny bits. I don't think the point applies here, though. I had mentioned it in connection with the Text::ParseWords module, which would throw an exception if the input string was unparseable, hardly an uncommon occurrence, and one that was entirely unavoidable: if I knew that the string would be unparseable, I wouldn't be calling Text::ParseWords to parse it. Folks on IRC said that when the method might call die, you have to wrap every call to it in an exception handler, which I certainly agree is a pain in the ass. But in this example, you do not have to do that. Here, to prevent the function from dying is very easy: just call it in list context; then it will never die. If what you want is behavior #4, to have it discard all the records but the first one, that is easy to get, regardless of the design I adopt for scalar context behavior: my ($rec) = $db->lookup(...); This argues against #4 (return the first matching record) in the same way that we argued against #2 and #5 already: it's so very easy to do already, maybe we don't need an even easier way to do it. But if so, couldn't the programmer just: sub lookup_first { my$self = shift;
my ($rec) =$self->lookup(@_);
return $rec; } A counterargument in favor of #4 might be based on the usefulness criterion: perhaps this behavior is so commonly wanted that we really do need an even easier way to do it. I was almost persuaded by the strong opinion in favor of #4, but then Roderick Schertler spoke up in favor of #3, for basically the reasons I set forth. I consider M. Schertler to have higher-than-normal reliability on matters of this type, so his opinion counterbalances several of the counteropinions on the other side. #3 is not too difficult to get, but still scores higher than most of the others on the difficulty scale. There doesn't seem to be a trivial inline expression of it, as there was with #2, #4, and #5. You would have to actually write a method, or else do something nasty like: (my ($rec) = $db->lookup(...)) < 2 or die ...; What about the other proposed behaviors? #1 (unconditional fatality) is simple, but both criteria seem to argue against it. It does, however, have the benefit of being a good temporary solution since it is easy to change without breaking backward compatibility. Were I to adopt it, it would be very unlikely (although not impossible) that anyone would write a program that would depend on that behavior; I would then be able to change it later on. #6 (return an iterator object) is very tempting, because it is the only one that scores high on the difficulty criterion scale: it is difficult or impossible to do this any other way, so by providing it, I am providing a real service to users of the module, rather than yet another way to do the same thing. The module's user cannot implement a good iterator interface as a wrapper around lookup, because lookup always searches the entire database before it returns, and allocates enough memory to store every returned record, whereas a good iterator interface will search only as far as is necessary to find the next matching record, and will store only one record at a time. This performance argument would be more important if we expected the databases to be very large. But since this is a module for manipulating plain text files, we can expect that they will not be too big, and perhaps the time and memory costs of searching them will be relatively small, so perhaps this design will score fairly low on the usefulness scale. I still haven't made up my mind, although writing this article has pushed me strongly toward #6. I would be glad to receive email on the matter. Mon, 15 May 2006 Creeping featurism and the ratchet effect "Creeping featurism" is a well-known phenomenon in the software world. It refers to the tendency of software to acquire more and more features, to the ultimate detriment of its usability. Software with more and more features is harder to learn to use; it's harder to document effectively. Perhaps most important, it is harder to maintain; the more complicated software is, the more likely it is to have bugs. Partly this is because the different features interact with one another in unanticipated ways; partly it is just that there is more stuff to spend the maintenance budget on. But the concept of "creeping featurism" his wider applicability than just to program features. We can recognize it in other contexts. For example, someone is reading the Perl manual. They read the section on the unpack function and they find it confusing. So they propose a documentation patch to add a couple of sentences, explicating the confusing point in more detail. It seems like a good idea at the time. But if you do it over and over—and we have—you end up with a 2,000 page manual—and we did. The real problem is that it's easy to see the benefit of any proposed addition. But it is much harder to see the cost of the proposed addition, that the manual is now 0.002% larger. The benefit has a poster child, an obvious beneficiary. You can imagine a confused person in your head, someone who happens to be confused in exactly the right way, and who is miraculously helped out by the presence of the right two sentences in the exact right place. The cost has no poster child. Or rather, the poster child is much harder to imagine. This is the person who is looking for something unrelated to the two-sentence addition. They are going to spend a certain amount of time looking for it. If the two-sentence addition hadn't been in there, they would have found what they were looking for. But the addition slowed them down just enough that they gave up without finding what they needed. Although you can grant that such a person might exist, they really aren't as compelling as the confused person who is magically assisted by timely advice. Even harder to imagine is the person who's kinda confused, and for whom the extra two sentences, clarifying some obscure point about some feature he wasn't planning to use in the first place, are just more confusion. It's really hard to understand the cost of that. But the benefit, such as it is, comes in one big lump, whereas the cost is distributed in tiny increments over a very large population. The benefit is clear, and the cost is obscure. It's easy to make a specific argument in favor of any particular addition ("people might be confused by X, so I'm going to explain it in more detail") and it's hard to make such an argument against the addition. And conversely: it's easy to make the argument that any particular bit of text should stay in, hard to argue that it should be removed. As a result, there's what I call a "ratchet effect": you can make the manual bigger, one tiny notch at a time, and people do. But having done so, you can't make it smaller again; someone will object to almost any proposed deletion. The manual gets bigger and bigger, worse and worse organized, more and more unusable, until finally it collapses under its own weight and all you can do is start over again. You see the same thing happen in software, of course. I maintain the Text::Template Perl module, and I frequently get messages from people saying that it should have some feature or other. And these people sometimes get quite angry when I tell them I'm not going to put in the feature they want. They're angry because it's easy to see the benefit of adding another feature, but hard to see the cost. "If other people don't like it," goes the argument, "they don't have to use it." True, but even if they don't use it, they still pay the costs of slightly longer download times, slightly longer compile times, a slightly longer and more confusing manual, slightly less frequent maintenance updates, slightly less prompt bug fix deliveries, and so on. It is so hard to make this argument, because the cost to any one person is so very small! But we all know where the software will end up if I don't make this argument every step of the way: on the slag heap. This has been on my mind on and off for years. But I just ran into it in a new context. Lately I've been working on a book about code style and refactoring in Perl. One thing you see a lot in Perl programs written by beginners is superfluous parentheses. For example: next if ($file =~ /^\./);
next if !($file =~ (/[0-9]/)); next if !($file =~ (/txt/));
Or:

die $usage if ($#ARGV < 0);
There are a number of points I want to make about this. First, I'd like to express my sympathy for Perl programmers, because Perl has something like 95 different operators at something like 17 different levels of precedence, and so nobody knows what all the precedences are and whether parentheses are required in all circumstances. Does the ** operator have higher or lower precedence than the <<= operator? I really have no idea.

So the situation is impossible, at least in principle, and yet people have to deal with it somehow. But the advice you often hear is "if you're not sure of the precedence, just put in the parentheses." I think that's really bad advice. I think better advice would be "if you're not sure of the precedence, look it up."

Because Perl's Byzantine operator table is not responsible for all the problems. Notice in the examples above, which are real examples, taken from real code written by other people: Many of the parentheses there are entirely superfluous, and are not disambiguating the precedence of any operators. In particular, notice the inner parentheses in:

next if !($file =~ (/txt/)); Inside the inner parentheses, there are no operators! So they cannot be disambiguating any precedence, and they are completely unnecessary: next if !($file =~ /txt/);
People sometimes say "well, I like to put them in anyway, just to be sure." This is pure superstition, and we should not tolerate it in people who purport to be engineers. Engineers should be capable of making informed choices, based on technical realities, not on some creepy feeling in their guts that perhaps a failure to sprinkle enough parentheses over their program will invite the wrath the Moon God.

By saying "if you're not sure, just avoid the problem" we are encouraging this kind of fearful, superstitious approach to the issue. That approach would be appropriate if it were the only way to deal with the issue, but fortunately it is not. There is a more rational approach: you can look it up, or even try an experiment, and then you will know whether the parentheses are required in a particular case. Then you can make an informed decision about whether to put them in.

But when I teach classes on this topic, people sometimes want to take the argument even further: they want to argue that even if you know the precedence, and even if you know that the parentheses are not required, you should put them in anyway, because the next person to see the code might not know that.

And there we see the creeping featurism argument again. It's easy to see the potential benefit of the superfluous parentheses: some hapless novice maintenance programmer might misunderstand the expression if I don't put them in. It's much harder to see the cost: The code is fractionally harder for everyone to read and understand, novice or not. And again, the cost of the extra parentheses to any particular person is so small, so very small, that it is really hard to make the argument against it convincingly. But I think the argument must be made, or else the code will end up on the slag heap much faster than it would have otherwise.

Programming cannot be run on the convoy system, with the program code written to address the most ignorant, uneducated programmer. I think you have to assume that the next maintenance programmer will be competent, and that if they do not know what the expression means, they will look up the operator precedence in the manual. That assumption may be false, of course; the world is full of incompetent programmers. But no amount of parentheses are really going to help this person anyway. And even if they were, you do not have to give in, you do not have to cater to incompetence. If an incompetent programmer has trouble understanding your code, that is not your fault; it is their fault for being incompetent. You do not have to take special steps to make your code understandable even by incompetents, and you certainly should not do so at the expense of making it harder for competent programmers to read and understand, no, not to the tiniest degree.

The advice that one should always put in the parentheses seems to me to be going in the wrong direction. We should be struggling for higher standards, both for ourselves and for our associates. The conventional advice, it seems to me, is to give up.

Sat, 04 Mar 2006

Structured BASIC
Aristotle Pagaltzis reminisces about programming microcomputers in BASIC in the 1980s:

That's what I started with, on the Acorn Electron. And I remember being excited about finding and understanding DEF FN. I also remember my disappointment about how limited it was. I remember my frustration whenever BASIC forced me into writing messy code.

I remember my frustration with this too. I realized fairly early on that it was important to organize one's code in a modular fashion. My clearest memory of this was in developing an Adventure-style program. Each of the locations in the world was assigned a sequence number. Location #23 was handled by lines 2300--2399 of the program. Lines 2300--2319 would print the description of the location. Line 2320 would set the variables that recorded the player's location, and called the subroutine to print the descriptions of the other objects at that location. Line 2380 would call the subroutine that prompted the user for their next command. Other lines in between would provide the implementation of whatever special effects were required for that location.

All the important utility subroutines were at mnemonic line numbers; the main loop was at line 50000, and the command processing was at 51000. Special handling for objects was in the 40000 range, with one hundred statement numbers reserved for each object.

After each user command was processed, control was dispatched back to the appropriate part of the program, depending on where the player was now. Microsoft BASIC didn't have a computed GOTO, so the dispatch was performed by a jump table. I was unhappy with the jump table, recognizing that it didn't scale well.

Object sizes and descriptions were stored in a table. I don't know why I didn't store the location descriptions in the table in the same way, but I suspect that I tried and found that my microcomputer didn't have enough string memory. I also discovered that the algorithm that mapped statement numbers to code did not scale well to programs with a lot of numbered statements; editing the program grew intolerably slow once the world contained more than about fifty locations.

Still, I was pleased with the outcome. My goal (at the tender age of sixteen, or whatever) had been to adopt conventions that made it easy to extend or modify the world and to add new locations or objects, and I felt at the time that I had achieved that.

M. Pagaltzis says:

I guess I have a natural penchant for structured code. Penchant? Instinct.

I think anyone who is really interested in writing programs in BASIC and who reflects on the results of his projects is going to come to the conclusion that BASIC is a very poor tool for the job. These problems force themselves on everyone, and if you are thoughtful you will see the problems and try to come up with some techniques to solve them.

I really wish I could see those old programs again. I'm sure I would learn a lot from them.

I do have some code I wrote in C as long ago as 1987. I remember that shortly after that I got sick of programming and took a vacation from it for a year.

One day the following year I was reading netnews, and I overheard a colleague complaining about his CS homework. He had to write a program in C to count the number of occurrences of each word in its input, using a binary tree to store the words. I said he was complaining about nothing and that I, a math major, could turn out such a program in two hours. I don't know why I said this, since I hadn't done any C programming in a year, and I didn't have any significant experience with C, but I was inspired, and I did finish it quickly, and it worked. I have been programming regularly ever since. I still have the source code for that program.

Here's the funny thing about the programs from that time: when I look at the pre-vacation programs, they look to me as though they were written by someone else. When I look at the tree-sort program or any other program I have written since then, I recognize it as my own code.

I don't know what happened in my brain during my one-year vacation, but my current programming style first emerged in that tree-sort program, and the code from after the break has all been a lot better than the code I wrote before.

I'd like to take another vacation, but I can't now, because I have to earn a living.

Mon, 30 Jan 2006

Rotten code in a ProFTPD plugin module
One of my work colleagues asked me to look at a piece of C source code today. He was tracking down a bug in the FTP server. He thought he had traced it to this spot, and wanted to know if I concurred and if I agreed with his suggested change.

Here's the (exceptionally putrid) (relevant portion of the) code:

static int gss_netio_write_cb(pr_netio_stream_t *nstrm, char *buf,size_t buflen) {

int     count=0;
int     total_count=0;
char    *p;

OM_uint32   maj_stat, min_stat;
OM_uint32   max_buf_size;

...
/* max_buf_size = maximal input buffer size */
p=buf;
while ( buflen > total_count ) {
/* */
if ( buflen - total_count > max_buf_size ) {
if ((count = gss_write(nstrm,p,max_buf_size)) != max_buf_size )
return -1;
} else {
if ((count = gss_write(nstrm,p,buflen-total_count)) != buflen-total_count )
return -1;
}
total_count = buflen - total_count > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
}

return buflen;
}
(You know there's something wrong when the comment says "maximal input buffer size", but the buffer is for performing output. I have not looked at any of the other code in this module, which is 2,800 lines long, so I do not know if this chunk is typical.) Mr. Colleague suggested that p=p+total_count was wrong, and should be replaced with p=p+max_buf_size. I agreed that it was wrong, and that his change would fix the problem, although I suggested that p += count would be a better change. Mr. Colleague's change, although it would no longer manifest the bug, was still "wrong" in the sense that it would leave p pointing to a garbage location (and incidentally invokes behavior not defined by the C language standard) whereas my change would leave p pointing to the end of the buffer, as one would expect.

Since this is a maintenance programming task, I recommended that we not touch anything not directly related to fixing the bug at hand. But I couldn't stop myself from pointing out that the code here is remarkably badly written. Did I say "exceptionally putrid" yet? Oh, I did.

Good. It stinks like a week-old fish.

The first thing to notice is that the expression buflen - total_count appears four times in only nine lines of code—five if you count the buflen > total_count comparison. This strongly suggests that the algorithm would be more clearly expressed in terms of whatever buflen - total_count really is. Since buflen is the total number of characters to be written, and total_count is the number of characters that have been written, buflen - total_count is just the number of characters remaining. Rather than computing the same expression four times, we should rewrite the loop in terms of the number of characters remaining.

size_t left_to_write = buflen;
while ( left_to_write > 0 ) {
/* */
if ( left_to_write > max_buf_size ) {
if ((count = gss_write(nstrm,p,max_buf_size)) != max_buf_size )
return -1;
} else {
if ((count = gss_write(nstrm,p,left_to_write)) != left_to_write )
return -1;
}
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
left_to_write -= count;
}
Now we should notice that the two calls to gss_write are almost exactly the same. Duplicated code like this can almost always be eliminated, and eliminating it almost always produces a favorable result. In this case, it's just a matter of introducing an auxiliary variable to record the amount that should be written:

size_t left_to_write = buflen, write_size;
while ( left_to_write > 0 ) {
write_size = left_to_write > max_buf_size ? max_buf_size : left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
left_to_write -= count;
}
At this point we can see that write_size is going to be max_buf_size for every write except possibly the last one, so we can simplify the logic the maintains it:

size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
left_to_write -= count;
}
Even if we weren't here to fix a bug, we might notice something fishy: left_to_write is being decremented by count, but p, the buffer position, is being incremented by total_count instead. In fact, this is exactly the bug that was discovered by Mr. Colleague. Let's fix it:

size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p += count;
left_to_write -= count;
}
We could fix up the line the maintains the total_count variable so that it would be correct, but since total_count isn't used anywhere else, let's just delete it.

size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
p += count;
left_to_write -= count;
}
Finally, if we change the != write_size test to < 0, the function will correctly handle partial writes, should gss_write be modified in the future to perform them:
size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) < 0 )
return -1;
p += count;
left_to_write -= count;
}
We could trim one more line of code and one more state change by eliminating the modification of p:

size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p+buflen-left_to_write,write_size)) < 0 )
return -1;
left_to_write -= count;
}
I'm not sure I think that is an improvement. (My idea is that if we do this, it would be better to create a p_end variable up front, set to p+buflen, and then use p_end - left_to_write in place of p+buflen-left_to_write. But that adds back another variable, although it's a constant one, and the backward logic in the calculation might be more confusing than the thing we were replacing. Like I said, I'm not sure. What do you think?)

Anyway, I am sure that the final code is a big improvement on the original in every way. It has fewer bugs, both active and latent. It has the same number of variables. It has six lines of logic instead of eight, and they are simpler lines. I suspect that it will be a bit more efficient, since it's doing the same thing in the same way but without the redundant computations, although you never know what the compiler will be able to optimize away.

Right now I'm engaged in writing a book about this sort of cleanup and renovation for Perl programs. I've long suspected that the same sort of processes could be applied to C programs, but this is the first time I've actually done it.

 Order Advanced Unix Programming from Powell's
The funny thing about this code is that it's performing a task that I thought every C programmer would already have known how to do: block-writing of a bufferfull of data. Examples of the right way to do this are all over the place. I first saw it done in Marc J. Rochkind's superb book Advanced Unix Programming around 1989. (I learned from the first edition, but the link to the right is for the much-expanded second edition that came out in 2004.) I'm sure it must pop up all over the Stevens books.

But the really exciting thing I've learned about code like this is that it doesn't matter if you don't already know how to do it right, because you can turn the wrong code into the right code, as we did here, by noticing a few common problems, like duplicate tests and repeated subexpressions, and applying a few simple refactorizations to get rid of them. That's what my book will be about.

(I am also very pleased that it has taken me 37 blog entries to work around to discussing any programming-related matters.)