The Universe of Discourse


Thu, 21 Jul 2016

A hack for getting the email address Git will use for a commit

Today I invented a pretty good hack.

Suppose I have branch topic checked out. It often happens that I want to

    git push origin topic:mjd/topic

which pushes the topic branch to the origin repository, but on origin it is named mjd/topic instead of topic. This is a good practice when many people share the same repository. I wanted to write a program that would do this automatically.

So the question arose, how should the program figure out the mjd part? Almost any answer would be good here: use some selection of environment variables, the current username, a hard-wired default, and the local part of Git's user.email configuration setting, in some order. Getting user.email is easy (git config get user.email) but it might not be set and then you get nothing. If you make a commit but have no user.email, Git doesn't mind. It invents an address somehow. I decided that I would like my program to to do exactly what Git does when it makes a commit.

But what does Git use for the committer's email address if there is no user.email set? This turns out to be complicated. It consults several environment variables in some order, as I suggested before. (It is documented in git-commit-tree if you are interested.) I did not want to duplicate Git's complicated procedure, because it might change, and because duplicating code is a sin. But there seemed to be no way to get Git to disgorge this value, short of actually making a commit and examining it.

So I wrote this command, which makes a commit and examines it:

    git log -1 --format=%ce $(git-commit-tree HEAD^{tree} < /dev/null)

This is extremely weird, but aside from that it seems to have no concrete drawbacks. It is pure hack, but it is a hack that works flawlessly.

What is going on here? First, the $(…) part:

    git-commit-tree HEAD^{tree} < /dev/null

The git-commit-tree command is what git-commit uses to actually create a commit. It takes a tree object, reads a commit message from standard input, writes a new commit object, and prints its SHA1 hash on standard output. Unlike git-commit, it doesn't modify the index (git-commit would use git-write-tree to turn the index into a tree object) and it doesn't change any of the refs (git-commit would update the HEAD ref to point to the new commit.) It just creates the commit.

Here we could use any tree, but the tree of the HEAD commit is convenient, and HEAD^{tree} is its name. We supply an empty commit message from /dev/null.

Then the outer command runs:

    git log -1 --format=%ce $(…)

The $(…) part is replaced by the SHA1 hash of the commit we just created with git-commit-tree. The -1 flag to git-log gets the log information for just this one commit, and the --format=%ce tells git-log to print out just the committer's email address, whatever it is.

This is fast—nearly instantaneous—and cheap. It doesn't change the state of the repository, except to write a new object, which typically takes up 125 bytes. The new commit object is not attached to any refs and so will be garbage collected in due course. You can do it in the middle of a rebase. You can do it in the middle of a merge. You can do it with a dirty index or a dirty working tree. It always works.

(Well, not quite. It will fail if run in an empty repository, because there is no HEAD^{tree} yet. Probably there are some other similarly obscure failure modes.)

I called the shortcut git-push program git-pusho but I dropped the email-address-finder into git-get, which is my storehouse of weird “How do I find out X” tricks.

I wish my best work of the day had been a little bit more significant, but I'll take what I can get.

[ Addendum: Twitter user @shachaf has reminded me that the right way to do this is

    git var GIT_COMMITTER_IDENT

which prints out something like

    Mark Jason Dominus (陶敏修) <mjd@plover.com> 1469102546 -0400

which you can then parse. @shachaf also points out that a Stack Overflow discussion of this very question contains a comment suggesting the same weird hack! ]


[Other articles in category /prog] permanent link

Thu, 14 Jul 2016

Surprising reasons to use a syntax-coloring editor

[ Danielle Sucher reminded me of this article I wrote in 1998, before I had a blog, and I thought I'd repatriate it here. It should be interesting as a historical artifact, if nothing else. Thanks Danielle! ]

I avoided syntax coloring for years, because it seemed like a pretty stupid idea, and when I tried it, I didn't see any benefit. But recently I gave it another try, with Ilya Zakharevich's `cperl-mode' for Emacs. I discovered that I liked it a lot, but for surprising reasons that I wasn't expecting.

I'm not trying to start an argument about whether syntax coloring is good or bad. I've heard those arguments already and they bore me to death. Also, I agree with most of the arguments about why syntax coloring is a bad idea. So I'm not trying to argue one way or the other; I'm just relating my experiences with syntax coloring. I used to be someone who didn't like it, but I changed my mind.

When people argue about whether syntax coloring is a good idea or not, they tend to pull out the same old arguments and dust them off. The reasons I found for using syntax coloring were new to me; I'd never seen anyone mention them before. So I thought maybe I'd post them here.

Syntax coloring is when the editor understands something about the syntax of your program and displays different language constructs in different fonts. For example, cperl-mode displays strings in reddish brown, comments in a sort of brick color, declared variables (in my) in gold, builtin function names (defined) in green, subroutine names in blue, labels in teal, and keywords (like my and foreach) in purple.

The first thing that I noticed about this was that it was easier to recognize what part of my program I was looking at, because each screenful of the program had its own color signature. I found that I was having an easier time remembering where I was or finding that parts I was looking for when I scrolled around in the file. I wasn't doing this consciously; I couldn't describe the color scheme any particular part of the program was, but having red, gold, and purple blotches all over made it easier to tell parts of the program apart.

The other surprise I got was that I was having more fun programming. I felt better about my programs, and at the end of the day, I felt better about the work I had done, just because I'd spent the day looking at a scoop of rainbow sherbet instead of black and white. It was just more cheerful to work with varicolored text than monochrome text. The reason I had never noticed this before was that the other coloring editors I used had ugly, drab color schemes. Ilya's scheme won here by using many different hues.

I haven't found many of the other benefits that people say they get from syntax coloring. For example, I can tell at a glance whether or not I failed to close a string properly—unless the editor has screwed up the syntax coloring, which it does often enough to ruin the benefit for me. And the coloring also slows down the editor. But the two benefits I've described more than outweigh the drawbacks for me. Syntax coloring isn't a huge win, but it's definitely a win.

If there's a lesson to learn from this, I guess it's that it can be valuable to revisit tools that you rejected, to see if you've changed your mind. Nothing anyone said about it was persuasive to me, but when I tried it I found that there were reasons to do it that nobody had mentioned. Of course, these reasons might not be compelling for anyone else.

Addenda 2016

Looking back on this from a distance of 18 years, I am struck by the following thoughts:

  1. Syntax higlighting used to make the editor really slow. You had to make a real commitment to using it or not. I had forgotten about that. Another victory for Moore’s law!

  2. Programmers used to argue about it. Apparently programmers will argue about anything, no matter how ridiculous. Well okay, this is not a new observation. Anyway, this argument is now finished. Whether people use it or not, they no longer find the need to argue about it. This is a nice example that sometimes these ridiculous arguments eventually go away.

  3. I don't remember why I said that syntax highlighting “seemed like a pretty stupid idea”, but I suspect that I was thinking that the wrong things get highlighted. Highlighters usually highlight the language keywords, because they're easy to recognize. But this is like highlighting all the generic filler words in a natural language text. The words you want to see are exactly the opposite of what is typically highlighted.

    Syntax highlighters should be highlighting the semantic content like expression boundaries, implied parentheses, boolean subexpressions, interpolated variables and other non-apparent semantic features. I think there is probably a lot of interesting work to be done here. Often you hear programmers say things like “Oh, I didn't see the that the trailing comma was actually a period.” That, in my opinion, is the kind of thing the syntax highlighter should call out. How often have you heard someone say “Oh, I didn't see that while there”?

  4. I have been misspelling “arguments” as “argmuents” for at least 18 years.


[Other articles in category /prog] permanent link

Tue, 12 Jul 2016

A simple but difficult arithmetic puzzle

Lately my kids have been interested in puzzles of this type: You are given a sequence of four digits, say 1,2,3,4, and your job is to combine them with ordinary arithmetic operations (+, -, ×, and ÷) in any order to make a target number, typically 24. For example, with 1,2,3,4, you can go with $$((1+2)+3)×4 = 24$$ or with $$4×((2×3)×1) = 24.$$

We were stumped trying to make 6,6,5,2 total 24, so I hacked up a solver; then we felt a little foolish when we saw the solutions, because it is not that hard. But in the course of testing the solver, I found the most challenging puzzle of this type that I've ever seen. It is:

Given 6,6,5,2, make 17.

There are no underhanded tricks. For example, you may not concatenate 2 and 5 to make 25; you may not say !!6÷6=1!! and !!5+2=7!! and concatenate 1 and 7 to make !!17!!; you may not interpret the 17 as a base 12 numeral, etc.

I hope to write a longer article about solvers in the next week or so.


[Other articles in category /math] permanent link

Mon, 11 Jul 2016

Addenda to recent articles 201607

Here are some notes on posts from the last couple of months that I couldn't find better places for.

  • I wrote a long article about tracking down a system bug. At some point I determined that the problem was related to Perl, and asked Frew Schmidt for advice. He wrote up the details of his own investigation, which pick up where mine ended. Check it out. I 100% endorse his lament about ltrace.

  • There was a Hacker News discussion about that article. One participant asked a very pertinent question:

    I read this, but seemed to skip over the part where he explains why this changed suddenly, when the behavior was documented?

    What changed to make the perl become capable whereas previously it lacked the low port capability?

    So far, we don't know! Frew told me recently that he thinks the TMPDIR-losing has been going on for months and that whatever precipitated my problem is something else.

  • In my article on the Greek clock, I guessed a method for calculating the (approximate) maximum length of the day from the latitude: $$ A = 360 \text{ min}\cdot(1-\cos L).$$

    Sean Santos of UCAR points out that this is inaccurate close to the poles. For places like Philadelphia (40° latitude) it is pretty close, but it fails completely for locations north of the Arctic Circle. M. Santos advises instead:

    $$ A = 360 \text{ min}\cdot \frac{2}{\pi}\cdot \sin^{-1}(\tan L\cdot \tan\epsilon)$$

    where ε is the axial tilt of the Earth, approximately 23.4°. Observe that when !!L!! is above the Arctic Circle (or below the Antarctic) we have !!\tan L \cdot \tan \epsilon > 1!! (because !!\frac1{\tan x} = \tan(90^\circ - x)!!) so the arcsine is undefined, and we get no answer.


[Other articles in category /addenda] permanent link

Fri, 01 Jul 2016

Don't tug on that, you never know what it might be attached to

This is a story about a very interesting bug that I tracked down yesterday. It was causing a bad effect very far from where the bug actually was.

emacsclient

The emacs text editor comes with a separate utility, called emacsclient, which can communicate with the main editor process and tell it to open files for editing. You have your main emacs running. Then somewhere else you run the command

     emacsclient some-files...

and it sends the main emacs a message that you want to edit some-files. Emacs gets the message and pops up new windows for editing those files. When you're done editing some-files you tell Emacs, by typing C-# or something, it it communicates back to emacsclient that the editing is done, and emacsclient exits.

This was more important in the olden days when Emacs was big and bloated and took a long time to start up. (They used to joke that “Emacs” was an abbreviation for “Eight Megs And Constantly Swapping”. Eight megs!) But even today it's still useful, say from shell scripts that need to run an editor.

Here's the reason I was running it. I have a very nice shell script, called also, that does something like this:

  • Interpret command-line arguments as patterns
  • Find files matching those patterns
  • Present a menu of the files
  • Wait for me to select files of interest
  • Run emacsclient on the selected files

It is essentially a wrapper around menupick, a menu-picking utility I wrote which has seen use as a component of several other tools. I can type

    also Wizard

in the shell and get a menu of the files related to the wizard, select the ones I actually want to edit, and they show up in Emacs. This is more convenient than using Emacs itself to find and open them. I use it many times a day.

Or rather, I did until this week, when it suddenly stopped working. Everything ran fine until the execution of emacsclient, which would fail, saying:

 emacsclient: can't find socket; have you started the server?

(A socket is a facility that enables interprocess communication, in this case between emacs and emacsclient.)

This message is familiar. It usually means that I have forgotten to tell Emacs to start listening for emacsclient, by running M-x server-start. (I should have Emacs do this when it starts up, but I don't. Why not? I'm not sure.) So the first time it happened I went to Emacs and ran M-x server-start. Emacs announced that it had started the server, so I reran also. And the same thing happened.

 emacsclient: can't find socket; have you started the server?

Finding the socket

So the first question is: why can't emacsclient find the socket? And this resolves naturally into two subquestions: where is the socket, and where is emacsclient looking?

The second one is easily answered; I ran strace emacsclient (hi Julia!) and saw that the last interesting thing emacsclient did before emitting the error message was

    stat("/mnt/tmp/emacs2017/server", 0x7ffd90ec4d40) = -1 ENOENT (No such file or directory)

which means it's looking for the socket at /mnt/tmp/emacs2017/server but didn't find it there.

The question of where Emacs actually put the socket file was a little trickier. I did not run Emacs under strace because I felt sure that the output would be voluminous and it would be tedious to grovel over it.

I don't exactly remember now how I figured this out, but I think now that I probably made an educated guess, something like: emacsclient is looking in /mnt/tmp; this seems unusual. I would expect the socket to be under /tmp. Maybe it is under /tmp? So I looked under /tmp and there it was, in /tmp/emacs2017/server:

    srwx------ 1 mjd mjd 0 Jun 27 11:43 /tmp/emacs2017/server

(The s at the beginning there means that the file is a “Unix-domain socket”. A socket is an endpoint for interprocess communication. The most familiar sort is a TCP socket, which has a TCP address, and which enables communication over the internet. But since ancient times Unix has also supported Unix-domain sockets, which enable communication between two processes on the same machine. Instead of TCP addresses, such sockets are addressed using paths in the filesystem, in this case /tmp/emacs2017/server. When the server creates such a socket, it appears in the filesystem as a special type of file, as here.)

I confirmed that this was the correct file by typing M-x server-force-delete in Emacs; this immediately caused /tmp/emacs2017/server to disappear. Similarly M-x server-start made it reappear.

Why the disagreement?

Now the question is: Why is emacsclient looking for the socket under /mnt/tmp when Emacs is putting it in /tmp? They used to rendezvous properly; what has gone wrong? I recalled that there was some environment variable for controlling where temporary files are put, so I did

       env | grep mnt

to see if anything relevant turned up. And sure enough there was:

       TMPDIR=/mnt/tmp

When programs want to create tmporary files and directories, they normally do it in /tmp. But if there is a TMPDIR setting, they use that directory instead. This explained why emacsclient was looking for /mnt/tmp/emacs2017/socket. And the explanation for why Emacs itself was creating the socket in /tmp seemed clear: Emacs was failing to honor the TMPDIR setting.

With this clear explanation in hand, I began to report the bug in Emacs, using M-x report-emacs-bug. (The folks in the #emacs IRC channel on Freenode suggested this. I had a bad experience last time I tried #emacs, and then people mocked me for even trying to get useful information out of IRC. But this time it went pretty well.)

Emacs popped up a buffer with full version information and invited me to write down the steps to reproduce the problem. So I wrote down

     % export TMPDIR=/mnt/tmp
     % emacs

and as I did that I ran those commands in the shell.

Then I wrote

     In Emacs:
     M-x getenv TMPDIR
     (emacs claims there is no such variable)

and I did that in Emacs also. But instead of claiming there was no such variable, Emacs cheerfully informed me that the value of TMPDIR was /mnt/tmp.

(There is an important lesson here! To submit a bug report, you find a minimal demonstration. But then you also try the minimal demonstration exactly as you reported it. Because of what just happened! Had I sent off that bug report, I would have wasted everyone else's time, and even worse, I would have looked like a fool.)

My minimal demonstration did not demonstrate. Something else was going on.

Why no TMPDIR?

This was a head-scratcher. All I could think of was that emacsclient and Emacs were somehow getting different environments, one with the TMPDIR setting and one without. Maybe I had run them from different shells, and only one of the shells had the setting?

I got on a sidetrack at this point to find out why TMPDIR was set in the first place; I didn't think I had set it. I looked for it in /etc/profile, which is the default Bash startup instructions, but it wasn't there. But I also noticed an /etc/profile.d which seemed relevant. (I saw later that the /etc/profile contained instructions to load everything under /etc/profile.d.) And when I grepped for TMPDIR in the profile.d files, I found that it was being set by /etc/profile.d/ziprecruiter_environment.sh, which the sysadmins had installed. So that mystery at least was cleared up.

That got me on a second sidetrack, looking through our Git history for recent changes involving TMPDIR. There weren't any, so that was a dead end.

I was still puzzled about why Emacs sometimes got the TMPDIR setting and sometimes not. That's when I realized that my original Emacs process, the one that had failed to rendezvous with emacsclient, had not been started in the usual way. Instead of simply running emacs, I had run

    git re-edit

which invokes Git, which then runs

    /home/mjd/bin/git-re-edit

which is a Perl program I wrote that does a bunch of stuff to figure out which files I was editing recently and then execs emacs to edit them some more. So there are several programs here that could be tampering with the environment and removing the TMPDIR setting.

To more accurately point the finger of blame, I put some diagnostics into the git-re-edit program to have it print out the value of TMPDIR. Indeed, git-re-edit reported that TMPDIR was unset. Clearly, the culprit was Git, which must have been removing TMPDIR from the environment before invoking my Perl program.

Who is stripping the environment?

To confirm this conclusion, I created a tiny shell script, /home/mjd/bin/git-env, which simply printed out the environment, and then I ran git env, which tells Git to find git-env and run it. If the environment it printed were to omit TMPDIR, I would know Git was the culprit. But TMPDIR was in the output.

So I created a Perl version of git-env, called git-perlenv, which did the same thing, and I ran it via git perlenv. And this time TMPDIR was not in the output. I ran diff on the outputs of git env and git perlenv and they were identical—except that git perlenv was missing TMPDIR.

So it was Perl's fault! And I verified this by running perl /home/mjd/bin/git-re-edit directly, without involving Git at all. The diagnostics I had put in reported that TMPDIR was unset.

WTF Perl?

At this point I tried getting rid of get-re-edit itself, and ran the one-line program

    perl -le 'print $ENV{TMPDIR}'

which simply runs Perl and tells it to print out the value of the TMPDIR environment variable. It should print /mnt/tmp, but instead it printed the empty string. This is a smoking gun, and Perl no longer has anywhere to hide.

The mystery is not cleared up, however. Why was Perl doing this? Surely not a bug; someone else would have noticed such an obvious bug sometime in the past 25 years. And it only failed for TMPDIR, not for other variables. For example

    FOO=bar perl -le 'print $ENV{FOO}'

printed out bar as one would expect. This was weird: how could Perl's environment handling be broken for just the TMPDIR variable?

At this point I got Rik Signes and Frew Schmidt to look at it with me. They confirmed that the problem was not in Perl generally, but just in this Perl. Perl on other systems did not display this behavior.

I looked in the output of perl -V, which says what version of Perl you are using and which patches have been applied, and wasted a lot of time looking into CVE-2016-2381, which seemed relevant. But it turned out to be a red herring.

Working around the problem, 1.

While all this was going on I was looking for a workaround. Finding one is at least as important as actually tracking down the problem because ultimately I am paid to do something other than figure out why Perl is losing TMPDIR. Having a workaround in hand means that when I get sick and tired of looking into the underlying problem I can abandon it instantly instead of having to push onward.

The first workaround I found was to not use the Unix-domain socket. Emacs has an option to use a TCP socket instead, which is useful on systems that do not support Unix-domain sockets, such as non-Unix systems. (I am told that some do still exist.)

You set the server-use-tcp variable to a true value, and when you start the server, Emacs creates a TCP socket and writes a description of it into a “server file”, usually ~/.emacs.d/server/server. Then when you run emacsclient you tell it to connect to the socket that is described in the file, with

    emacsclient --server-file=~/.emacs.d/server/server

or by setting the EMACS_SERVER_FILE environment variable. I tried this, and it worked, once I figured out the thing about server-use-tcp and what a “server file” was. (I had misunderstood at first, and thought that “server file” meant the Unix-domain socket itself, and I tried to get emacsclient to use the right one by setting EMACS_SERVER_FILE, which didn't work at all. The resulting error message was obscure enough to lead me to IRC to ask about it.)

Working around the problem, 2.

I spent quite a while looking for an environment variable analogous to EMACS_SERVER_FILE to tell emacsclient where the Unix-domain socket was. But while there is a --socket-name command-line argument to control this, there is inexplicably no environment variable. I hacked my also command (responsible for running emacsclient) to look for an environment variable named EMACS_SERVER_SOCKET, and to pass its value to emacsclient --socket-name if there was one. (It probably would have been better to write a wrapper for emacsclient, but I didn't.) Then I put

    EMACS_SERVER_SOCKET=$TMPDIR/emacs$(id -u)/server

in my Bash profile, which effectively solved the problem. This set EMACS_SERVER_SOCKET to /mnt/tmp/emacs2017/server whenever I started a new shell. When I ran also it would notice the setting and pass it along to emacsclient with --socket-name, to tell emacsclient to look in the right place. Having set this up I could forget all about the original problem if I wanted to.

But but but WHY?

But why was Perl removing TMPDIR from the environment? I didn't figure out the answer to this; Frew took it to the #p5p IRC channel on perl.org, where the answer was eventually tracked down by Matthew Horsfall and Zefrem.

The answer turned out to be quite subtle. One of the classic attacks that can be mounted against a process with elevated privileges is as follows. Suppose you know that the program is going to write to a temporary file. So you set TMPDIR beforehand and trick it into writing in the wrong place, possibly overwriting or destroying something important.

When a program is loaded into a process, the dynamic loader does the loading. To protect against this attack, the loader checks to see if the program it is going to run has elevated privileges, say because it is setuid, and if so it sanitizes the process’ environment to prevent the attack. Among other things, it removes TMPDIR from the environment.

I hadn't thought of exactly this, but I had thought of something like it: If Perl detects that it is running setuid, it enables a secure mode which, among other things, sanitizes the environment. For example, it ignores the PERL5LIB environment variable that normally tells it where to look for loadable modules, and instead loads modules only from a few compiled-in trustworthy directories. I had checked early on to see if this was causing the TMPDIR problem, but the perl executable was not setuid and Perl was not running in secure mode.

But Linux supports a feature called “capabilities”, which is a sort of partial superuser privilege. You can give a program some of the superuser's capabilities without giving away the keys to the whole kingdom. Our systems were configured to give perl one extra capability, of binding to low-numbered TCP ports, which is normally permitted only to the superuser. And when the dynamic loader ran perl, it saw this additional capability and removed TMPDIR from the environment for safety.

This is why Emacs had the TMPDIR setting when run from the command line, but not when run via git-re-edit.

Until this came up, I had not even been aware that the “capabilities” feature existed.

A red herring

There was one more delightful confusion on the way to this happy ending. When Frew found out that it was just the Perl on my development machine that was misbehaving, he tried logging into his own, nearly identical development machine to see if it misbehaved in the same way. It did, but when he ran a system update to update Perl, the problem went away. He told me this would fix the problem on my machine. But I reported that I had updated my system a few hours before, so there was nothing to update!

The elevated capabilities theory explained this also. When Frew updated his system, the new Perl was installed without the elevated capability feature, so the dynamic loader did not remove TMPDIR from the environment.

When I had updated my system earlier, the same thing happened. But as soon as the update was complete, I reloaded my system configuration, which reinstated the capability setting. Frew hadn't done this.

Summary

  • The system configuration gave perl a special capability
  • so the dynamic loader sanitized its environment
  • so that when perl ran emacs,
  • the Emacs process didn't have the TMPDIR environment setting
  • which caused Emacs to create its listening socket in the usual place
  • but because emacsclient did get the setting, it looked in the wrong place

Conclusion

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.

[ Addendum 20160709: Frew Schmidt has written up the same incident, but covers different ground than I do. ]

[ Addendum 20160709: A Hacker News comment asks what changed to cause the problem? Why was Perl losing TMPDIR this week but not the week before? Frew and I don't know! ]


[Other articles in category /tech] permanent link

Wed, 22 Jun 2016

The Greek clock
In former times, the day was divided into twenty-four hours, but they were not of equal length. During the day, an hour was one-twelfth of the time from sunrise to sunset; during the night, it was one-twelfth of the time from sunset to sunrise. So the daytime hours were all equal, and the nighttime hours were all equal, but the daytime hours were not equal to the nighttime hours, except on the equinoxes, or at the equator. In the summer, the day hours were longer and the night hours shorter, and in the winter, vice versa.

Some years ago I suggested, as part of the Perl Quiz of the Week, that people write a greektime program that printed out the time according to a clock that divided the hours in this way. You can, of course, spend a lot of time and effort downloading and installing CPAN astronomical modules to calculate the time of sunrise and sunset, and reading manuals and doing a whole lot of stuff. But if you are content with approximate times, you can use some delightful shortcuts.

First, let's establish what the problem is. We're going to take the conventional time labels ("12:35" and so forth) and adjust them so that half of them take up the time from sunrise to sunset and the other half go from sunset to sunrise. Some will be stretched, and some squeezed. 01:00 in this new system will no longer mean "3600 seconds after midnight", but rather "exactly 7/12 of the way between sunset and sunrise".

To do this, we'll introduce a new daily calendar with the following labels:

Midnight Sunrise Noon Sunset Midnight
00:00 06:00 12:00 18:00 24:00

We'll assume that noon (when the sun is directly overhead) occurs at 12:00 and that midnight occurs at 00:00. (Or 24:00, which is the same thing.) This is pretty close to the truth anyway, although it is screwed up by such oddities as time zones and the like.

On the equinoxes, the sun rises around 06:00 and sets around 18:00, again ignoring time zones and the like. (If you live at the edge of a time zone, especially a large one like U.S. Central Time, local civil noon does not occur at solar noon, so these calculations require adjustments.) On the equinoxes the normal calendar corresponds to the Greek one, because the day and the night are each exactly twelve standard hours long. (The day from 06:00 to 18:00, and the night from 18:00 to 06:00 the following day.) In the winter, the sun rises later and sets earlier; in the summer it rises earlier and sets later. So let's take 06:00 to be the label for the time of sunrise in the Greek clock all year round; 18:00 is similarly the time of sunset in the Greek clock all year round.

With these conventions, it turns out that it's rather easy to calculate the approximate time of sunrise for any day of the year. You need two magic numbers, A and d. The number d is the number of days that have elapsed since the vernal equinox, which is around 19 March (or 19 September, if you live in the southern hemisphere.) The number A is a bit trickier, and I will return to it shortly.

Once you have the two numbers, you just plug into the formula:

$$\text{Sunrise} = \text{06:00} - A \sin {2\pi d\over 365.2422}$$

The tricky part is the magic number A; it depends on your latitude. At the equator, it is 0. And you can probably calculate it directly from the latitude, if you happen to know your latitude. I do know my latitude (Philadelphia is conveniently located at almost exactly 40° N) but I failed observational astronomy classes twice, so I don't know how to do the necessary calculation.

(Actually it occurs to me now that !!A = 360 \text{ min}\times (1-\cos L)!!, should work, where L is the absolute latitude. For the equator (!!L = 90^\circ!!), this gives 0, as it should, and for Philadelphia it gives !!360\text{ min}\cdot (1- \cos 40^\circ) \approx 84.22\text{ min}!!, which is just about right.)

However, there's another trick you can use even if you don't know your latitude. If you know the time of sunset on the summer solstice, you can calculate A quite easily:

$$A = {\text{ Sunset on summer solstice}} - \text{18:00}$$

Does that really help? If it were October, it might not. But the summer solstice is today. So all you have to do is to look out the window in the evening and notice when the sun seems to be going down. Then plug the time into the formula. (Or you can remember what happened yesterday, or wait until tomorrow; the time of sunset hardly changes at all this time of year, by only a few seconds per day. Or you could look at the front page of a daily newspaper, which will also tell you the time of sunset.)

The sun went down here around 20:30 today, but that is really 19:30 because of daylight saving time, so we get A = 19:30 - 18:00 = 90 minutes, which happily agrees with the 84.22 we got earlier by a different method. Then the time of sunrise in Philadelphia d days after the vernal equinox is $$\text{Sunrise} = \text{06:00} - 90\text{ min}\cdot \sin {2\pi d\over 365.2422}$$ Today is June 21, which is (counts on fingers) about 31+30+31 = 92 days after the vernal equinox which was around March 21. So notice that the formula above involves !!\sin{2\pi\cdot 92\over 365.2422} \approx \sin{\frac\pi 2} = 1!! because 92 is just about one-fourth of 365.2422—that is, today is just about a quarter of a year after the vernal equinox. So the formula says that sunrise ought to be about 04:30, or, because of daylight saving time, that's 05:30 local civil time. This time of year the night is only 9 standard hours long, so the Greek nighttime hour is !!\frac9{12}!! standard hours long, or 45 minutes. Right now it's 22:43 daylight time, which is 133 standard minutes past sundown, or just about 3 Greek nighttime hours. So the Greek time is close to 9 PM. In another 2:15 standard hours another 3 Greek hours will have elapsed and it will be Greek midnight; this coincides with standard midnight, which is 01:00 local civil time because of daylight saving.

Here's code for greektime that you can run where you to find out the current Greek time. I hereby place this program in the public domain.

#!/usr/bin/perl
#
# Calculate local time in fictitious Greek clock
# http://blog.plover.com/calendar/Greek-clock.html
# Author: Mark Jason Dominus (mjd@plover.com)
# This program is in the public domain.
#

my $PI = atan2(0, -1);

use Getopt::Std;
my %opt;
getopts('l:s:', \%opt) or usage();
my $A; 
if ($opt{l} =~ /\d/) {
  $A = 360 * 60 * (1-cos(radians($opt{l})));
} elsif ($opt{s} =~ /:/) {
  my ($hr, $mn) = split /:/, $opt{s};
  $A = (($hr - 18) * 60 + $mn) * 60;
} else {
  usage();
}

my $time = time;
my $days_since_equinox = ($time - 1047950185)/86400;
my $days_per_year = 365.2422;

my $sunrise_adj = $A * sin($days_since_equinox / $days_per_year 
                                   * 2 * $PI );

my $length_of_daytime   = 12 * 3600 + 2 * $sunrise_adj;
my $length_of_nighttime = 12 * 3600 - 2 * $sunrise_adj;

my $time_of_sunrise =  6 * 3600 - $sunrise_adj;
my $time_of_sunset  = 18 * 3600 + $sunrise_adj;

my ($gh, $gm) = time_to_greek($time);
my ($h, $m) = (localtime($time))[2,1];

printf "Standard: %2d:%02d\n",  $h,  $m;
printf "   Greek: %2d:%02d\n", $gh, $gm;

sub time_to_greek {
  my ($epoch_time) = shift;
  my $time_of_day;
  { my ($h, $m, $s, $dst) = (localtime($epoch_time))[2,1,0,8];
    $time_of_day = ($h-$dst) * 3600 + $m * 60 + $s;
  }
  my ($greek, $hour, $min);
  if ($time_of_day < $time_of_sunrise) {
    # change early morning into night
    $time_of_day += 24 * 3600;
  }
  if ($time_of_day < $time_of_sunset) {
    # day
    my $diff = $time_of_day - $time_of_sunrise;
    $greek = 6 + ($diff / $length_of_daytime) * 12;
  } else {
    # night
    my $diff = $time_of_day - $time_of_sunset;
    $greek = 18 + ($diff / $length_of_nighttime) * 12;
  }

  $hour = int($greek);
  $min = int(60 * ($greek - $hour));
  ($hour, $min);
}

sub radians {
  my ($deg) = @_;
  return $deg * 2 * $PI / 360;
}

sub usage {
  print STDERR "Usage: greektime [ -l latitude ] [ -s summer_solstice_sunset ]

One of latitude or sunset time must be given.
Latitude should be in degrees north of the equator.
  (Negative for southern hemisphere)
Sunset time should be given in the form '19:37' in local STANDARD time.
  (Southern hemisphere should use the WINTER solstice.)
";
  exit 2;
}

This article has been in the works since January of 2007, but I missed the deadline on 18 consecutive solstices. The 19th time is the charm!

[ Addendum 20160711: Sean Santos has some corrections to my formula for A. ]


[Other articles in category /calendar] permanent link

Sun, 15 May 2016

My Favorite NP-Complete Problem at !!Con 2016

Back in 2006 when this blog was new I observed that the problem of planning Elmo’s World video releases was NP-complete.

This spring I turned the post into a talk, which I gave at !!Con 2016 last week.

Talk materials are online.


[Other articles in category /talk] permanent link

Sun, 01 May 2016

Typewriters

It will suprise nobody to learn that when I was a child, computers were almost unknown, but it may be more surprising that typewriters were unusual.

Probably the first typewriter I was familiar with was my grandmother’s IBM “Executive” model C. At first I was not allowed to touch this fascinating device, because it was very fancy and expensive and my grandmother used it for her work as an editor of medical journals.

The “Executive” was very advanced: it had proportional spacing. It had two space bars, for different widths of spaces. Characters varied between two and five ticks wide, and my grandmother had typed up a little chart giving the width of each character in ticks, which she pasted to the top panel of the typewriter. The font was sans-serif, and I remember being a little puzzled when I first noticed that the lowercase j had no hook: it looked just like the lowercase i, except longer.

The little chart was important, I later learned, when I became old enough to use the typewriter and was taught its mysteries. Press only one key at a time, or the type bars will collide. Don't use the (extremely satisfying) auto-repeat feature on the hyphen or underscore, or the platen might be damaged. Don't touch any of the special controls; Grandma has them adjusted the way she wants. (As a concession, I was allowed to use the “expand” switch, which could be easily switched off again.)

The little chart was part of the procedure for correcting errors. You would backspace over the character you wanted to erase—each press of the backspace key would move the carriage back by one tick, and the chart told you how many times to press—and then place a slip of correction paper between the ribbon and the paper, and retype the character you wanted to erase. The dark ribbon impression would go onto the front of the correction slip, which was always covered with a pleasing jumble of random letters, and the correction slip impression, in white, would exactly overprint the letter you wanted to erase. Except sometimes it didn't quite: the ribbon ink would have spread a bit, and the corrected version would be a ghostly white letter with a hair-thin black outline. Or if you were a small child, as I was, you would sometimes put the correction slip in backwards, and the white ink would be transferred uselessly to the back of the ribbon instead of to the paper. Or you would select a partly-used portion of the slip and the missing bit of white ink would leave a fragment of the corrected letter on the page, like the broken-off leg of a dead bug.

Later I was introduced to the use of Liquid Paper (don't brush on a big glob, dot it on a bit at a time with the tip of the brush) and carbon paper, another thing you had to be careful not to put in backward, although if you did you got a wonderful result: the typewriter printed mirror images.

From typing alphabets, random letters, my name, and of course qwertyuiops I soon moved on to little poems, stories, and other miscellanea, and when my family saw that I was using the typewriter for writing, they presented me with one of my own, a Royal manual (model HHE maybe?) with a two-color ribbon, and I was at last free to explore the mysteries of the TAB SET and TAB CLEAR buttons. The front panel had a control for a three-color ribbon, which forever remained an unattainable mystery. Later I graduated to a Smith-Corona electric, on which I wrote my high school term papers. The personal computer arrived while I was in high school, but available printers were either expensive or looked like crap.

When I was in first grade our classroom had acquired a cheap manual typewriter, which as I have said, was an unusual novelty, and I used it whenever I could. I remember my teacher, Ms. Juanita Adams, complaining that I spent too much time on the typewriter. “You should work more on your handwriting, Jason. You might need to write something while you’re out on the street, and you won't just be able to pull a typewriter out of your pocket.”

She was wrong.


[Other articles in category /tech] permanent link

Sun, 24 Apr 2016

Steph Curry: fluke or breakthrough?

[ Disclaimer: I know very little about basketball. I think there's a good chance this article contains at least one basketball-related howler, but I'm too ignorant to know where it is. ]

Randy Olson recently tweeted a link to a New York Times article about Steph Curry's new 3-point record. Here is Olson’s snapshot of a portion of the Times’ clever and attractive interactive chart:

(Skip this paragraph if you know anything about basketball. The object of the sport is to throw a ball through a “basket” suspended ten feet (3 meters) above the court. Normally a player's team is awarded two points for doing this. But if the player is sufficiently far from the basket—the distance varies but is around 23 feet (7 meters)—three points are awarded instead. Carry on!)


Stephen Curry

The chart demonstrates that Curry this year has shattered the single-season record for three-point field goals. The previous record, set last year, is 286, also by Curry; the new record is 406. A comment by the authors of the chart says

The record is an outlier that defies most comparisons, but here is one: It is the equivalent of hitting 103 home runs in a Major League Baseball season.

(The current single-season home run record is 73, and !!\frac{406}{286}·73 \approx 103!!.)

I found this remark striking, because I don't think the record is an outlier that defies most comparisons. In fact, it doesn't even defy the comparison they make, to the baseball single-season home run record.



Babe Ruth

In 1919, the record for home runs in a single season was 29, hit by Babe Ruth. The 1920 record, also by Ruth, was 54. To make the same comparison as the authors of the Times article, that is the equivalent of hitting !!\frac{54}{29}·73 \approx 136!! home runs in a Major League Baseball season.

No, far from being an outlier that defies most comparisons, I think what we're seeing here is something that has happened over and over in sport, a fundamental shift in the way they game is played; in short, a breakthrough. In baseball, Ruth's 1920 season was the end of what is now known as the dead-ball era. The end of the dead-ball era was the caused by the confluence of several trends (shrinking ballparks), rule changes (the spitball), and one-off events (Ray Chapman, the Black Sox). But an important cause was simply that Ruth realized that he could play the game in a better way by hitting a crapload of home runs.

The new record was the end of a sudden and sharp upward trend. Prior to Ruth's 29 home runs in 1919, the record had been 27, a weird fluke set way back in 1887 when the rules were drastically different. Typical single-season home run records in the intervening years were in the 11 to 16 range; the record exceeded 20 in only four of the intervening 25 years.

Ruth's innovation was promptly imitated. In 1920, the #2 hitter hit 19 home runs and the #10 hitter hit 11, typical numbers for the nineteen-teens. By 1929, the #10 hitter hit 31 home runs, which would have been record-setting in 1919. It was a different game.



Takeru Kobayashi

For another example of a breakthrough, let's consider competitive hot dog eating. Between 1980 and 1990, champion hot-dog eaters consumed between 9 and 16 hot dogs in 10 minutes. In 1991 the time was extended to 12 minutes and Frank Dellarosa set a new record, 21½ hot dogs, which was not too far out of line with previous records, and which was repeatedly approached in the following decade: through 1999 five different champions ate between 19 and 24½ hot dogs in 12 minutes, in every year except 1993.

But in 2000 Takeru Kobayashi (小林 尊) changed the sport forever, eating an unbelievably disgusting 50 hot dogs in 12 minutes. (50. Not a misprint. Fifty. Roman numeral Ⅼ.) To make the Times’ comparison again, that is the equivalent of hitting !!\frac{50}{24\frac12}·73 \approx 149!! home runs in a Major League Baseball season.

At that point it was a different game. Did the record represent a fundamental shift in hot dog gobbling technique? Yes. Kobayashi won all five of the next five contests, eating between 44½ and 53¾ each time. By 2005 the second- and third-place finishers were eating 35 or more hot dogs each; had they done this in 1995 they would have demolished the old records. A new generation of champions emerged, following Kobayashi's lead. The current record is 69 hot dogs in 10 minutes. The record-setters of the 1990s would not even be in contention in a modern hot dog eating contest.



Bob Beamon

It is instructive to compare these breakthroughs with a different sort of astonishing sports record, the bizarre fluke. In 1967, the world record distance for the long jump was 8.35 meters. In 1968, Bob Beamon shattered this record, jumping 8.90 meters. To put this in perspective, consider that in one jump, Beamon advanced the record by 55 cm, the same amount that it had advanced (in 13 stages) between 1925 and 1967.


Progression of the world long jump record
The cliff at 1968 is Bob Beamon

Did Beamon's new record represent a fundamental shift in long jump technique? No: Beamon never again jumped more than 8.22m. Did other jumpers promptly imitate it? No, Beamon's record was approached only a few times in the following quarter-century, and surpassed only once. Beamon had the benefit of high altitude, a tail wind, and fabulous luck.



Joe DiMaggio

Another bizarre fluke is Joe DiMaggio's hitting streak: in the 1941 baseball season, DiMaggio achieved hits in 56 consecutive games. For extensive discussion of just how bizarre this is, see The Streak of Streaks by Stephen J. Gould. (“DiMaggio’s streak is the most extraordinary thing that ever happened in American sports.”) Did DiMaggio’s hitting streak represent a fundamental shift in the way the game of baseball was played, toward high-average hitting? Did other players promptly imitate it? No. DiMaggio's streak has never been seriously challenged, and has been approached only a few times. (The modern runner-up is Pete Rose, who hit in 44 consecutive games in 1978.) DiMaggio also had the benefit of fabulous luck.


Is Curry’s new record a fluke or a breakthrough?

I think what we're seeing in basketball is a breakthrough, a shift in the way the game is played analogous to the arrival of baseball’s home run era in the 1920s. Unless the league tinkers with the rules to prevent it, we might expect the next generation of players to regularly lead the league with 300 or 400 three-point shots in a season. Here's why I think so.

  1. Curry's record wasn't unprecedented. He's been setting three-point records for years. (Compare Ruth’s 1920 home run record, foreshadowed in 1919.) He's continuing a trend that he began years ago.

  2. Curry’s record, unlike DiMaggio’s streak, does not appear to depend on fabulous luck. His 402 field goals this year are on 886 attempts, a 45.4% success rate. This is in line with his success rate every year since 2009; last year he had a 44.3% success rate. Curry didn't get lucky this year; he had 40% more field goals because he made almost 40% more attempts. There seems to be no reason to think he couldn't make the same number of attempts next year with equal success, if he wants to.

  3. Does he want to? Probably. Curry’s new three-point strategy seems to be extremely effective. In his previous three seasons he scored 1786, 1873, and 1900 points; this season, he scored 2375, an increase of 475, three-quarters of which is due to his three-point field goals. So we can suppose that he will continue to attempt a large number of three-point shots.

  4. Is this something unique to Curry or is it something that other players might learn to emulate? Curry’s three-point field goal rate is high, but not exceptionally so. He's not the most accurate of all three-point shooters; he holds the 62nd–64th-highest season percentages for three-point success rate. There are at least a few other players in the league who must have seen what Curry did and thought “I could do that”. (Kyle Korver maybe? I'm on very shaky ground; I don't even know how old he is.) Some of those players are going to give it a try, as are some we haven’t seen yet, and there seems to be no reason why some shouldn't succeed.

A number of things could sabotage this analysis. For example, the league might take steps to reduce the number of three-point field goals, specifically in response to Curry’s new record, say by moving the three-point line farther from the basket. But if nothing like that happens, I think it's likely that we'll see basketball enter a new era of higher offense with more three-point shots, and that future sport historians will look back on this season as a watershed.

[ Addendum 20160425: As I feared, my Korver suggestion was ridiculous. Thanks to the folks who explained why. Reason #1: He is 35 years old. ]


[Other articles in category /games] permanent link

Wed, 20 Apr 2016

The sage and the seven horses

A classic puzzle of mathematics goes like this:

A father dies and his will states that his elder daughter should receive half his horses, the son should receive one-quarter of the horses, and the younger daughter should receive one-eighth of the horses. Unfortunately, there are seven horses. The siblings are arguing about how to divide the seven horses when a passing sage hears them. The siblings beg the sage for help. The sage donates his own horse to the estate, which now has eight. It is now easy to portion out the half, quarter, and eighth shares, and having done so, the sage's horse is unaccounted for. The three heirs return the surplus horse to the sage, who rides off, leaving the matter settled fairly.

(The puzzle is, what just happened?)

It's not hard to come up with variations on this. For example, picking three fractions at random, suppose the will says that the eldest child receives half the horses, the middle child receives one-fifth, and the youngest receives one-seventh. But the estate has only 59 horses and an argument ensues. All that is required for the sage to solve the problem is to lend the estate eleven horses. There are now 70, and after taking out the three bequests, !!70 - 35 - 14 - 10 = 11!! horses remain and the estate settles its debt to the sage.

But here's a variation I've never seen before. This time there are 13 horses and the will says that the three children should receive shares of !!\frac12, \frac13,!! and !!\frac14!!. respectively. Now the problem seems impossible, because !!\frac12 + \frac13 + \frac14 \gt 1!!. But the sage is equal to the challenge! She leaps into the saddle of one of the horses and rides out of sight before the astonished heirs can react. After a day of searching the heirs write off the lost horse and proceed with executing the will. There are now only 12 horses, and the eldest takes half, or six, while the middle sibling takes one-third, or 4. The youngest heir should get three, but only two remain. She has just opened her mouth to complain at her unfair treatment when the sage rides up from nowhere and hands her the reins to her last horse.


[Other articles in category /math] permanent link

Tue, 19 Apr 2016

Thackeray's illustrations for Vanity Fair

Last month I finished reading Thackeray’s novel Vanity Fair. (Related blog post.) Thackeray originally did illustrations for the novel, but my edition did not have them. When I went to find them online, I was disappointed: they were hard to find and the few I did find were poor quality and low resolution.

Before
After

(click to enlarge)

The illustrations are narratively important. Jos Osborne dies suspiciously; the text implies that Becky has something to do with it. Thackeray's caption for the accompanying illustration is “Becky’s Second Appearance in the Character of Clytemnestra”. Thackeray’s depiction of Miss Swartz, who is mixed-race, may be of interest to scholars.

I bought a worn-out copy of Vanity Fair that did have the illustrations and scanned them. These illustrations, originally made around 1848 by William Makepeace Thackeray, are in the public domain. In the printing I have (George Routeledge and Sons, New York, 1886) the illustrations were 9½cm × 12½ cm. I have scanned them at 600 dpi.

Large thumbails

(ZIP file .tgz file)

Unfortunately, I was only able to find Thackeray’s full-page illustrations. He also did some spot illustrations, chapter capitals, and so forth, which I have not been able to locate.

Share and enjoy.


[Other articles in category /book] permanent link

Sat, 16 Apr 2016

How to recover lost files added to Git but not committed

A few days ago, I wrote:

If you lose something [in Git], don't panic. There's a good chance that you can find someone who will be able to hunt it down again.

I was not expecting to have a demonstration ready so soon. But today I finished working on a project, I had all the files staged in the index but not committed, and for some reason I no longer remember I chose that moment to do git reset --hard, which throws away the working tree and the staged files. I may have thought I had committed the changes. I hadn't.

If the files had only been in the working tree, there would have been nothing to do but to start over. Git does not track the working tree. But I had added the files to the index. When a file is added to the Git index, Git stores it in the repository. Later on, when the index is committed, Git creates a commit that refers to the files already stored. If you know how to look, you can find the stored files even before they are part of a commit.

(If they are part of a commit, the problem is much easier. Typically the answer is simply “use git-reflog to find the commit again and check it out”. The git-reflog command is probably the first thing anyone should learn on the path from being a Git beginner to becoming an intermediate Git user.)

Each file added to the Git index is stored as a “blob object”. Git stores objects in two ways. When it's fetching a lot of objects from a remote repository, it gets a big zip file with an attached table of contents; this is called a pack. Getting objects from a pack can be a pain. Fortunately, not all objects are in packs. When when you just use git-add to add a file to the index, git makes a single object, called a “loose” object. The loose object is basically the file contents, gzipped, with a header attached. At some point Git will decide there are too many loose objects and assemble them into a pack.

To make a loose object from a file, the contents of the file are checksummed, and the checksum is used as the name of the object file in the repository and as an identifier for the object, exactly the same as the way git uses the checksum of a commit as the commit's identifier. If the checksum is 0123456789abcdef0123456789abcdef01234567, the object is stored in

    .git/objects/01/23456789abcdef0123456789abcdef01234567

The pack files are elsewhere, in .git/objects/pack.

So the first thing I did was to get a list of the loose objects in the repository:

    cd .git/objects
    find ?? -type f  | perl -lpe 's#/##' > /tmp/OBJ

This produces a list of the object IDs of all the loose objects in the repository:

    00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a
    0093a412d3fe23dd9acb9320156f20195040a063
    01f3a6946197d93f8edba2c49d1bb6fc291797b0
    …
    ffd505d2da2e4aac813122d8e469312fd03a3669
    fff732422ed8d82ceff4f406cdc2b12b09d81c2e

There were 500 loose objects in my repository. The goal was to find the eight I wanted.

There are several kinds of objects in a Git repository. In addition to blobs, which represent file contents, there are commit objects, which represent commits, and tree objects, which represent directories. These are usually constructed at the time the commit is done. Since my files hadn't been committed, I knew I wasn't interested in these types of objects. The command git cat-file -t will tell you what type an object is. I made a file that related each object to its type:

    for i in $(cat /tmp/OBJ); do
      echo -n "$i ";
      git type $i;
    done > /tmp/OBJTYPE

The git type command is just an alias for git cat-file -t. (Funny thing about that: I created that alias years ago when I first started using Git, thinking it would be useful, but I never used it, and just last week I was wondering why I still bothered to have it around.) The OBJTYPE file output by this loop looks like this:

    00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a blob
    0093a412d3fe23dd9acb9320156f20195040a063 tree
    01f3a6946197d93f8edba2c49d1bb6fc291797b0 commit
    …
    fed6767ff7fa921601299d9a28545aa69364f87b tree
    ffd505d2da2e4aac813122d8e469312fd03a3669 tree
    fff732422ed8d82ceff4f406cdc2b12b09d81c2e blob

Then I just grepped out the blob objects:

    grep blob /tmp/OBJTYPE | f 1 > /tmp/OBJBLOB

The f 1 command throws away the types and keeps the object IDs. At this point I had filtered the original 500 objects down to just 108 blobs.

Now it was time to grep through the blobs to find the ones I was looking for. Fortunately, I knew that each of my lost files would contain the string org-service-currency, which was my name for the project I was working on. I couldn't grep the object files directly, because they're gzipped, but the command git cat-file disgorges the contents of an object:

    for i in $(cat /tmp/OBJBLOB ) ; do
      git cat-file blob $i |
        grep -q org-service-curr
          && echo $i;
    done > /tmp/MATCHES

The git cat-file blob $i produces the contents of the blob whose ID is in $i. The grep searches the contents for the magic string. Normally grep would print the matching lines, but this behavior is disabled by the -q flag—the q is for “quiet”—and tells grep instead that it is being used only as part of a test: it yields true if it finds the magic string, and false if not. The && is the test; it runs echo $i to print out the object ID $i only if the grep yields true because its input contained the magic string.

So this loop fills the file MATCHES with the list of IDs of the blobs that contain the magic string. This worked, and I found that there were only 18 matching blobs, so I wrote a very similar loop to extract their contents from the repository and save them in a directory:

    for i in $(cat /tmp/OBJBLOB ) ; do
      git cat-file blob $i | 
         grep -q org-service-curr
           && git cat-file blob $i > /tmp/rescue/$i;
    done

Instead of printing out the matching blob ID number, this loop passes it to git cat-file again to extract the contents into a file in /tmp/rescue.

The rest was simple. I made 8 subdirectories under /tmp/rescue representing the 8 different files I was expecting to find. I eyeballed each of the 18 blobs, decided what each one was, and sorted them into the 8 subdirectories. Some of the subdirectories had only 1 blob, some had up to 5. I looked at the blobs in each subdirectory to decide in each case which one I wanted to keep, using diff when it wasn't obvious what the differences were between two versions of the same file. When I found one I liked, I copied it back to its correct place in the working tree.

Finally, I went back to the working tree and added and committed the rescued files.

It seemed longer, but it only took about twenty minutes. To recreate the eight files from scratch might have taken about the same amount of time, or maybe longer (although it never takes as long as I think it will), and would have been tedious.

But let's suppose that it had taken much longer, say forty minutes instead of twenty, to rescue the lost blobs from the repository. Would that extra twenty minutes have been time wasted? No! The twenty minutes spent to recreate the files from scratch is a dead loss. But the forty minutes to rescue the blobs is time spent learning something that might be useful in the future. The Git rescue might have cost twenty extra minutes, but if so it was paid back with forty minutes of additional Git expertise, and time spent to gain expertise is well spent! Spending time to gain expertise is how you become an expert!

Git is a core tool, something I use every day. For a long time I have been prepared for the day when I would try to rescue someone's lost blobs, but until now I had never done it. Now, if that day comes, I will be able to say “Oh, it's no problem, I have done this before!”

So if you lose something in Git, don't panic. There's a good chance that you can find someone who will be able to hunt it down again.


[Other articles in category /prog] permanent link