The Universe of Discourse
           
Sat, 28 Jul 2007

Lightweight Database Strategies for Perl
Several years ago I got what I thought was a great idea for a three-hour conference tutorial: lightweight data storage techniques. When you don't have enough data to be bothered using a high-performance database, or when your data is simple enough that you don't want to bother with a relational database, you stick it in a flat file and hack up some file code to read it. This is the sort of thing that people do all the time in Perl, and I thought it would be a big seller. I was wrong.

I don't know why. I tried giving the class a snappier title, but that didn't help. I'm really bad at titles. Maybe people are embarrassed to think about all the lightweight data storage hackery they do in Perl, and feel that they "should" be using a relational database, and don't want to commit more resources to lightweight database techniques. Or maybe they just don't think there is very much to know about it.

But there is a lot to know; with a little bit of technique you can postpone the day when you need to go to an RDB, often for quite a long time, and often forever. Many of the techniques fall into the why-didn't-I-think-of-that category, stuff that isn't too weird to write or maintain, but that you might not have thought to try.

I think it's a good class, but since it never sold well, I've decided it would do more good (for me and for everyone else) if I just gave away the materials for free.

Table of Contents

The class is in three sections. The first section is about using plain text files and talks about a bunch of useful techniques, such as how to do binary search on sorted text files (this is nontrivial) and how to replace records in-place, when they might not fit.
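To give a flavor of why binary search on a sorted text file is nontrivial: a byte offset chosen by the search usually lands in the middle of a line, so every probe has to skip forward to the next line boundary before comparing. Here is a minimal sketch of that idea, written in Python rather than Perl purely for illustration. The function names are hypothetical, and the sketch assumes the lines are sorted in plain byte order and end in newlines; it is not taken from the class materials.

```python
import os


def _line_at_or_after(fh, pos):
    """Return the first complete line starting at byte offset >= pos (b"" at EOF)."""
    if pos == 0:
        fh.seek(0)
    else:
        # Back up one byte so a line starting exactly at pos is not skipped,
        # then discard everything through the next newline.
        fh.seek(pos - 1)
        fh.readline()
    return fh.readline()


def find_first_at_or_after(path, key):
    """Return the first line that sorts >= key, or None if there is none."""
    target = key.encode()
    with open(path, "rb") as fh:
        lo, hi = 0, os.path.getsize(path)
        # Binary search for the smallest offset whose following line is >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(fh, mid)
            if line == b"" or line.rstrip(b"\n") >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(fh, lo)
    return line.rstrip(b"\n").decode() if line else None
```

A caller looking up a record by key would typically check that the returned line actually starts with the key, since the function returns the nearest following line when there is no exact match.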

The second section is about the Tie::File module, which associates a flat text file with a Perl array.

The third section is about DBM files, with a comparison of the five major implementations. It finishes up with a discussion of some of Berkeley DB's lesser-known useful features, such as its DB_BTREE file type, which offers fast access like a hash but keeps the records in sorted order.

  • Text Files
    • Rotating log file; deleting a user
    • Copy the File
      • -i.bak
      • Using -i inside a program
      • Problems with -i
      • Atomicity issues
    • Essential problem with files; fundamental operations; seeking
    • Sorted files
    • In-place modification of records
      • Overwriting records
      • Bytes vs. positions
      • Gappy Files
      • Fixed-length records
      • Numeric indices
      • Case study: lastlog
    • Indexing
      • Void fields
      • Generic text indices
      • Packed offsets
  • Tie::File
    • Tie::File Examples
    • delete_user revisited
    • uppercase_username revisited
    • Rotating log file revisited
    • Most important thing to know about Tie::File
    • Indexing with Tie::File
    • Tie::File Internals
      • Caching
      • Record modification
      • Immediate vs. Deferred Writing
      • Autodeferring
    • Miscellaneous Features
  • DBM
    • Common DBM Implementations
    • What DBM Does
    • Small DBMs: ODBM, NDBM, and SDBM
    • GDBM
    • DB_File
      • Indexing revisited
      • Ordered hashes
      • Partial matching
      • Sequential access
      • Multiple values
      • Filters
      • BerkeleyDB

Online materials


[Other articles in category /prog/perl] permanent link

Fri, 27 Jul 2007

Conference talk brochure descriptions
I just got back from doing some tutorials at OSCON, which were generally well-received. Sometimes it goes better than other times; this time it went pretty well, I thought, except that I was seven minutes late to the Tuesday morning one, through a tremendous series of fuckups beginning with the conference hotel not being able to find my reservation on Saturday night, continuing with my barely missing two unrelated streetcars on Tuesday morning, and, let's not leave out the most important part, my forgetting that the class started at 8:30 and not at 9:00 until about 8:00.

I've written before about the general worthlessness of the attendee evaluations, so maybe I won't go into detail about them again. What I want to complain about here is the descriptions of the classes that appear in the conference brochure and on the web site.

One of the things that Nat (the program committee chair) and I have commiserated about in the past is that no matter how hard you try to make a clear, concise, accurate description of the class, you are doomed, because people do not use the descriptions in a rational way. For example, suppose I happen to be giving the same class two years in a row. The class title is the same both years. The 250-word description in the brochure and on the web site is word-for-word identical both years. Nevertheless, you can be sure that someone will hand in an evaluation the second year that complains bitterly that the class was a waste of time, because they took the class the year before and there was no new material. I vented about this to Nat once, and the look of exhausted disgust on his face was something to see. Because I only have to read my own stupid evaluations, but Nat has to read all the stupid evaluations, and he probably sees that same idiotic complaint ten times a year.

Here's one I was afraid I'd get this year, and, who knows. It may yet happen. I sent the program committee seven proposals. They accepted three. One was for the Advanced techniques for Parsing class; one for Higher-Order Perl. There was significant overlap between these two classes; the last third of the Higher-Order Perl class is about higher-order parser combinators, which are the principal subject of the advanced parsing class. This puts me in a difficult position. The program committee has accepted two classes that overlap. I have to deliver the material that I promised in the brochure, which people paid money to hear. I cannot unilaterally eliminate the overlap, say by substituting a different topic into Higher-Order Perl, because then someone in that class might quite rightly complain that they had been promised a section on parsing techniques, had paid for a section on parsing techniques, but had not been delivered a section on parsing techniques. But some people will sign up for both classes, and then will inevitably complain about the overlap, even though it should have been clear from the brochure that the classes would overlap.

The only way out for me is to try to get the program committee to agree beforehand to let me change around one of the classes to remove the overlap, write one-third of a new class, and document the change in the brochure description before it is published. That is a lot of work to do in a short time. Some people write their class slides the night before they give the class. I don't; I take weeks over it, revising extensively, and then I give a practice session, and then I revise again. So the classes overlapped, and I'm sure there were complaints about it that I haven't seen yet.

My favorite complaint of all time was from the guy who took Tricks of the Wizards and then complained that the material was too advanced.

This year I had the opposite problem. I gave a class on Advanced techniques for Parsing, and the following day I read a blog article from someone who had been disappointed that it was insufficiently advanced. This is a fair and legitimate criticism, and deserves a reasonable response. The response is not, however, to change the class content, because I think I have a pretty good idea of how sophisticated the conference attendees are, and of what is useful, and if I made the class a lot more advanced than it is, hardly anyone would understand it. But I did feel bad that this blogger had mistakenly wasted hours in my class and gotten nothing out of it. That should have been avoidable.

The first thing I did was to check the brochure description, to see if perhaps it was misleading, or if it promised extra-advanced material that I then didn't deliver. This sometimes happens. The deadline for proposals is far in advance of the deadline for the class materials themselves. So what happens is that you write up a proposal for a class you think you can do, that people will like, and that will appeal to the program committee, and you send it in. A few months later, it is accepted, and you start work on the class. Then sometimes you discover that even though you proposed a class about A, B, and C, there is only enough time to do A and B properly, and to cover all three in a three-hour class would just be a mess. So you write a class that covers A and B properly, and has an abbreviated discussion of C. But then there will be some people who came to the class specifically for the discussion of C, and who are disappointed. It is a tough problem.

Anyway, I thought this time I had done a reasonably good job of writing a class that actually matched the brochure description. So I wrote to the blogger to ask how the description could have been better: what would I have needed to say in it that would have tipped him off that the class would not have had whatever it was he was looking for?

The answer: nothing. He had not read the description. He attended the class solely because of the title, Advanced techniques for Parsing, and then after two hours figured out that it was not as advanced as he wanted it to be.

Not my fault! Not my fault!


[Other articles in category /talk] permanent link

Sat, 21 Jul 2007

Homosexuality is not hereditary
I just read a big pile of blog comments that all said that homosexuality couldn't be hereditary, because if it were, natural selection would have gotten rid of it by now.

But natural selection is more interesting than that. This article will ignore the obvious notion of homosexuals who breed anyway. Here is one way in which homosexuality could be entirely hereditary and still be favored by natural selection.

Suppose that human sexuality is extremely complicated, which should not be controversial. Suppose, just for concreteness, that there are 137 different genes that can affect whether an individual turns out heterosexual or homosexual. Say that each of these can be either in state Q or in state S, and that any individual will turn out homosexual if at least 93 of the 137 genes are in state Q, heterosexual otherwise.

The over-simplistic argument from natural selection says that the Q states will be bred out of the population, and that S will be increasingly predominant over time.

Now let's consider an individual, X, whose family members tend to carry a lot of Q genes.

Suppose X's parents have a lot of Q genes, around 87 or 90. X's parents' siblings, who resemble them, will also have a lot of Q genes, and have a high probability of being homosexual. Having no children of their own, they may contribute to X's welfare, maybe by caring for X or by finding food for X.

In short, for every gay uncle X has, that is one additional set of cousins with whom X does not have to compete for scarce resources.

This could well turn out to be a survival advantage for X over someone from a family of people without a lot of Q genes, someone who is competing for food with a passel of cousins, none of whom ever really get enough to eat, someone whose aunt might even try to kill them in order to benefit her own children.

Perhaps X turns out to be homosexual and never breeds, but X probably has some siblings, in which case X might be an advantageous gay uncle or lesbian aunt to one of his or her own nieces or nephews, who, remember, are carrying a lot of the same genes, including the Q genes.

It might not actually work this way, of course, and in most ways it probably doesn't. The only point here is to show that natural selection does not necessarily rule out the idea of inherited homosexuality; people who think it must, have not exercised enough imagination.

(Now that I have finished writing this article, it occurs to me that the same argument applies to bees and ants; most individuals in a bee or ant colony are sterile. Who would be foolish enough to argue that this trait will soon be bred out of the colony?)

The moral of this story:

Time and time again, biologists baffled by some apparently futile or maladroit bit of bad design in nature have eventually come to see that they have underestimated the ingenuity, the sheer brilliance, the depth of insight to be discovered in one of Mother Nature's creations. Francis Crick has mischievously baptized this trend in the name of his colleague Leslie Orgel, speaking of what he calls "Orgel's Second Rule: Evolution is cleverer than you are."
Daniel Dennett, Darwin's Dangerous Idea, p. 74.


[Other articles in category /bio] permanent link

Fri, 20 Jul 2007

Tough questions
It's easy to recognize a good question: a good question is one that takes a lot longer to answer than it does to ask. Chip Buchholtz's example is "what is a byte?" To answer that you have to get into the nitty gritty of computer architecture and how, although the information in the computer is stored by the bit, the memory bus can only address it by the byte.

One of the biology interns asked me a good one a couple of weeks ago: he asked how, if Perl runs Perl scripts, and the OS is running Perl, what is running the OS? Now that is a tough question to answer. I explained about logic gates, and how the logic gates are built into trivial arithmetic and memory circuits, how these are then built up into ALUs and memories, and how these in turn are controlled by microcode, and finally how the logical parts are assembled into a computer. I don't know how understandable it was, but it was the best I could do in five minutes, and I think I got some of the idea across. But I started and finished by saying that it was basically miraculous.

My daughter Iris asks a ton of questions, some better than others. On any given evening she is likely to ask "Daddy, what are you doing?" about fifteen times, and "why?" about fifteen million times. "Why" can be a great question, but sometimes it's not so great; Iris asks both kinds. Sometimes it's in response to "I'm eating a sandwich." Then the inevitable "why?" is rather annoying.

Some of the "why" questions are nearly impossible to answer. For example, we see a lady coming up the street toward us. "Is that Susanna?" "No." "Why is it not Susanna?"

I think what's happening here is that having discovered this magic word that often produces interesting information, Iris is employing it whenever possible, even when it doesn't make sense, because she hasn't yet learned when it works and when not. Why is that not Susanna? Hey, you never know when you might get an interesting answer. But there might be something else going on that I don't appreciate.

But the nice thing about Iris's incessant questions is that she listens to and remembers the answers, ponders them deeply, and then is likely to come back with an insightful followup when you least expect it.

This weekend we went to visit my parents in New York, and as we drove down the Henry Hudson Parkway, we passed the North River wastewater treatment plant. Three-year-olds are fascinated with poop, so I took the opportunity to point out the plant to Iris. I said that although it had a park with trees on the roof, the inside was a giant machine for turning poop into garden soil; they cleaned it and mixed it with wood chips and it composted like the stuff in our composter. (I later found that some of these details were not quite accurate, but the general idea is correct. See the official site for the official story. My wife provided the helpful analogy with the composter.) As I expected, Iris was interested, and thought this over; she confirmed that they turned poop into soil, and then asked what they made pee into. I was not prepared for that one, and I had to promise her I would find out later. It took me some Internet research time to find out about denitrogenation.

Speaking of poop, last month Iris asked a puzzler: why don't birds use toilets? I think this was motivated by our earlier discussion of bird poop on our car.

In Make Way for Ducklings there's a picture of the friendly policeman Michael, running back to his police box to order a police escort to help the ducklings across Beacon Street. He's holding his billy club. Iris asked what that was for. I thought a moment, and then said "It's for hitting people with." Later I wondered if I had given an inaccurate or incomplete answer, so I asked around, and did some reading. It appears I got that one right. Some folks I know suggested that I should have said it was for hitting bad people, but I'd rather stick to the plain facts, and leave out the editorializing.


Anyway, lately I've been rereading The Defeat of the Spanish Armada, by Garrett Mattingly, which is a really good book; it won a special Pulitzer Prize when it was published. It's about the attempt by Spain to invade England in 1588. The invasion was a failure, and the Spanish got clobbered. Most interesting minor detail: Francis Drake went to St. Vincent the year before the Armada sailed and captured a bunch of merchant ships that were carrying seasoned barrel-staves, which he burnt. As a result, when the mighty Armada sailed, many of the ships had to carry casks made of green wood, and they leaked; whenever the Spanish opened a cask that should have contained food or water, they were as likely as not to find it full of green slime instead.

So I was reading the Mattingly book this evening, and Iris was eating and playing with Play-Doh on the kitchen floor. After the eleventh repetition of "Daddy, what are you doing?" "Reading." I decided to tell Iris what I was reading about. I said that I was reading about ships, that ships are big boats; they carry lots of men and guns. Iris asked why they carried guns, and I explained that often the ships carried treasure, like spices or gold or jewels or cloth, and that pirates tried to steal it. Iris asked if the cloth was like a wash cloth, and I said no, it was more like the kind of cloth that Mommy makes quilts from, or like the silk that her silk dress is made of. I explained about the pirates, which she seemed to understand, because toddlers know all about people who try to take stuff that isn't theirs. And then she asked the question I couldn't answer: Why were there men on the ships, but no women?

I was totally stumped; I don't even know where to begin explaining to a three-year-old why there are no women on ships in 1588. The only answers I could think of had to do with women's traditional roles, with European mores, social constructions of gender, and so on, all stuff that wouldn't help. Sometimes women were smuggled aboard ship, but I wasn't going to say that either.

I don't usually give up, but this time I gave up. This is a tough question of the first order, easy to ask, hard to answer. It's a lot easier to explain wastewater treatment.


[Other articles in category /misc] permanent link

"More intuitive" programming language syntax
Chromatic wrote an article today about The Broken Metric of "Intuitive to the Uneducated" Language Syntax in which he addresses the very common argument that some language syntax is better than some other because it is "more intuitive" or "easier for beginners to understand".

Chromatic says that these arguments are bunk because programming language syntax is much less important than programming language semantics. But I think that is straining at a gnat and swallowing a camel.

To argue that a certain programming language feature is bad because it is confusing to beginners, you have to do two things. You have to successfully argue that being confusing to beginners is an important metric. Chromatic's article tries to refute this, saying that it is not an important metric.

But before you even get to that stage, you first have to show that the programming language feature actually is confusing to beginners.

But these arguments are never presented with any evidence at all, because no such evidence exists. They are complete fabrications, pulled out of the asses of their propounders, and made of equal parts wishful thinking and bullshit.

Addendum 20070720:
To support my assertion that nobody knows what makes programming hard for beginners, I wanted to cite this paper, The camel has two humps, by Dehnadi and Bornat, which I was rereading recently, but I couldn't find my copy and couldn't remember the title or authors. Happily, I eventually remembered.

The abstract begins:

Learning to program is notoriously difficult. A substantial minority of students fails in every introductory programming course in every UK university. Despite heroic academic effort, the proportion has increased rather than decreased over the years. Despite a great deal of research into teaching methods and student responses, we have no idea of the cause.
But the situation isn't completely hopeless; the abstract also says:

We have found a test for programming aptitude, of which we give details. We can predict success or failure even before students have had any contact with any programming language with very high accuracy, and by testing with the same instrument after a few weeks of exposure, with extreme accuracy. We present experimental evidence to support our claim.
What's the secret? Read and learn.


[Other articles in category /prog] permanent link

Thu, 19 Jul 2007

More about fixed points and attractors
A while back I talked about a technique for calculating √2 where you pick a function that has √2 as a fixed point (that is, f(√2) = √2) and then see what happens when you consider the sequence x, f(x), f(f(x)), ..., for various initial values of x. For some such functions the sequence diverges, but often it converges to √2.

I picked a few example functions, some of which worked and some of which didn't.

One glaring omission from the article was that I forgot to mention the so-called "Babylonian method" for calculating square roots. The Babylonian method for calculating √n is simply to iterate the function x → ½(x + n/x). (This is a special case of the Newton-Raphson method for finding the zeroes of a function. In this case the function whose zeroes are being found is x → x² - n.) The Babylonian method converges quickly for almost all initial values of x. As I was writing the article, at 3 AM, I had the nagging feeling that I was leaving out an important example function, and then later on realized what it was. Oops.
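As a concrete illustration of the iteration x → ½(x + n/x), here is a minimal sketch in Python. The function names, the starting value 1.0, and the choice of five steps are arbitrary assumptions made for the example.

```python
import math


def babylonian_step(x, n):
    """One step of the Babylonian method for sqrt(n): average x and n/x."""
    return 0.5 * (x + n / x)


def iterate(f, x, steps):
    """Return the list [f(x), f(f(x)), ...] with `steps` entries."""
    values = []
    for _ in range(steps):
        x = f(x)
        values.append(x)
    return values


# Iterate toward sqrt(2), starting from 1.0.
values = iterate(lambda x: babylonian_step(x, 2), 1.0, 5)
for v in values:
    print(v, abs(v - math.sqrt(2)))
```

The second column printed is the distance from √2, which should shrink very quickly from one line to the next.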

But there's a happy outcome, which is that the Babylonian method points the way to a nice general extension of this general technique. Suppose you've found a function f that has your target value, say √2, as a fixed point, but you find that iterating f doesn't work for some reason. For example, one of the functions I considered in the article was x → 2/x. No matter what initial value you start with (other than √2 and -√2) iterating the function gets you nowhere; the values just hop back and forth between x and 2/x forever.

But as I said in the original article, functions that have √2 as a fixed point are easy to find. Suppose we have such a function, f, which is badly-behaved because the fixed point repels, or because of the hopping-back-and-forth problem. Then we can perturb the function by trying instead x → ½(x + f(x)), which has the same fixed points, but which might be better-behaved. (More generally, x → (ax + bf(x)) / (a + b) has the same fixed points as f for any nonzero a and b with a + b ≠ 0, but in this article we'll leave a = b = 1.) Applying this transformation to the function x → 2/x gives us the Babylonian method.
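The perturbation x → ½(x + f(x)) can be written as a small higher-order function. The following Python sketch is only an illustration; the names `perturb` and `hop` and the starting value 3.0 are assumptions made for the example.

```python
import math


def perturb(f):
    """Return the function x -> (x + f(x)) / 2, which has the same fixed points as f."""
    return lambda x: 0.5 * (x + f(x))


def hop(x):
    """x -> 2/x: has sqrt(2) as a fixed point, but iterating it just hops back and forth."""
    return 2 / x


start = 3.0
once = hop(start)
twice = hop(once)
print(start, once, twice)  # the third value is back where the first one started

# Perturbing x -> 2/x yields x -> (x + 2/x) / 2, the Babylonian method for sqrt(2).
babylonian = perturb(hop)
x = start
for _ in range(6):
    x = babylonian(x)
print(x, math.sqrt(2))
```

Changing `hop` to some other function with √2 as a fixed point is an easy way to experiment with whether the perturbed version behaves better.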

I tried applying this transform to the other example I used in the original article, which was x → x² + x - 2. This has √2 as a fixed point, but the √2 is a repelling fixed point. To first order in ε, √2 ± ε → √2 ± (1 + 2√2)ε, so the error gets bigger instead of smaller. I hoped that perturbing this function might improve its behavior, and at first it seemed that it didn't. The transformed version is x → ½(x + x² + x - 2) = x²/2 + x - 1. That comes to pretty much the same thing. It takes √2 ± ε → √2 ± (1 + √2)ε, which has the same problem. So that didn't work; oh well.

But actually things had improved a bit. The original function also has -√2 as a fixed point, and again it's one that repels from both sides, because -√2 ± ε → -√2 ± (1 - 2√2)ε, and |1 - 2√2| > 1. But the transformed function, unlike the original, has -√2 as an attractor, since it takes -√2 ± ε → -√2 ± (1 - √2)ε and |1 - √2| < 1.
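The error multipliers quoted here are just the slopes of each function at its fixed points: a fixed point attracts when the slope has absolute value less than 1 and repels when it is greater than 1. A minimal Python sketch that checks this numerically follows; the helper `slope` and its step size are assumptions made for the example, and a central difference is used in place of the exact derivative.

```python
import math


def slope(f, x, h=1e-6):
    """Estimate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def original(x):
    return x**2 + x - 2


def perturbed(x):
    return x**2 / 2 + x - 1


r = math.sqrt(2)
for name, f in (("original", original), ("perturbed", perturbed)):
    for point in (r, -r):
        m = slope(f, point)
        kind = "attracts" if abs(m) < 1 else "repels"
        print(f"{name:9s} at {point:+.4f}: slope {m:+.4f} ({kind})")
```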

So the perturbed function works for calculating √2, in a slightly backwards way; you pick a value close to -√2 and iterate the function, and the iterated values get increasingly close to -√2. Or you can get rid of the minus signs entirely by transforming the function again, and considering -f(-x) instead of f(x). This turns x²/2 + x - 1 into -x²/2 + x + 1. The fixed points change places, so now √2 is the attractor, and -√2 is the repeller, since √2 ± ε → √2 ± (1 - √2)ε. Starting with x = 1, we get:

1.5
1.375
1.4296875
1.40768433
1.41689675
1.41309855
1.41467479
1.41402241
1.41429272
1.41418077
1.41422714
1.41420794
1.41421589
1.41421260
1.41421396
1.41421340
1.41421363
So that worked out pretty well. One might even make the argument that the method is simpler than the Babylonian method, since the division is a simple x/2 instead of a complex 2/x. I have not yet looked into the convergence properties; I expect it will turn out that the iterated polynomial converges more slowly than the Babylonian method.
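For anyone who wants to experiment, here is a minimal Python sketch that iterates x → -x²/2 + x + 1 from x = 1 and then counts how many steps each method needs to get within a chosen tolerance of √2. The tolerance of 1e-6, the helper names, and the step limit are arbitrary assumptions made for the example; this is a quick numerical check, not an analysis of the convergence rate.

```python
import math


def polynomial(x):
    """The iterated polynomial x -> -x^2/2 + x + 1."""
    return -x * x / 2 + x + 1


def babylonian(x):
    """The Babylonian method for sqrt(2): x -> (x + 2/x) / 2."""
    return 0.5 * (x + 2 / x)


def steps_until_close(f, x, target, tol=1e-6, max_steps=1000):
    """Count iterations of f, starting from x, until |x - target| < tol."""
    for step in range(1, max_steps + 1):
        x = f(x)
        if abs(x - target) < tol:
            return step
    return None


# Reproduce the sequence of iterates, printed to eight decimal places.
x = 1.0
first_values = []
for _ in range(17):
    x = polynomial(x)
    first_values.append(x)
    print(f"{x:.8f}")

poly_steps = steps_until_close(polynomial, 1.0, math.sqrt(2))
bab_steps = steps_until_close(babylonian, 1.0, math.sqrt(2))
print("polynomial steps:", poly_steps, "Babylonian steps:", bab_steps)
```

The printed iterates should agree with the list above up to rounding and formatting.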

I had meant to write about Möbius transformations, but that will have to wait until next week, I think.


[Other articles in category /math] permanent link

Wed, 18 Jul 2007

God Plays Dice
Lately my favorite blog is God Plays Dice. If you like my blog, I think you will probably like that one too.


[Other articles in category ] permanent link

Sat, 14 Jul 2007

Evaporation
I work for the Penn Genomics Institute, mostly doing software work, but the Institute is run by biologists and also does biology projects. Last month I taught some Perl classes for the four summer interns; this month they are doing some lab work. Since part of my job involves dealing with biologists, I thought this would be a good opportunity to get into the lab, and I got permission from Adam, the research scientist who was supervising the interns, to let me come along.

Since my knowledge of biology is practically nil, Adam was not entirely sure what to do with me while the interns prepared to grow yeasts or whatever it is that they are doing. He set me up with a scale, a set of pipettes, and a beaker of water, with instructions to practice pipetting the water from the beaker onto the scale.

The pipettes came in three sizes. Shown at right is the largest of the ones I used; it can dispense liquid in quantities between 10 and 100 μl, with a precision of 0.1 μl. I used each of the three pipettes in three settings, pipetting water in quantities ranging from 1 ml down to 5 μl. I think the idea here is that I would be able to see if I was doing it right by watching the weight change on the scale, which had a display precision of 1 mg. If I pipette 20 μl of water onto the scale, the measured weight should go up by just about 20 mg.

Sometimes it didn't. For a while my technique was bad, and I didn't always pick up the exact right amount of water. With the small pipette, which had a capacity range of 2–20 μl, you have to suck up the water slowly and carefully, or the pipette tip gets air bubbles in it, and does not pick up the full amount.

With a scale that measures in milligrams, you have to wait around for a while for the scale to settle down after you drop a few μl of water onto it, because the water bounces up and down and the last digit of the scale readout oscillates a bit. Milligrams are much smaller than I had realized.

It turned out that it was pretty much impossible to see if I was picking up the full amount with the smallest pipette. After measuring out some water, I would wait a few seconds for the scale display to stabilize. But if I waited a little longer, it would tick down by a milligram. After another twenty or thirty seconds it would tick down by another milligram. This would continue indefinitely.

I thought about this quietly for a while, and realized that what I was seeing was the water evaporating from the scale pan. The water I had in the scale pan had a very small surface area, only a few square centimeters. But it was evaporating at a measurable rate, around 2 or 3 milligrams per minute.

So it was essentially impossible to measure out five pipette-fuls of 10 μl of water each and end up with 50 mg of water on the scale. By the time I got it done, around 15% of it would have evaporated.

The temperature here was around 27°C, with about 35% relative humidity. So nothing out of the ordinary.

I am used to the idea that if I leave a glass of water on the kitchen counter overnight, it will all be gone in the morning; this was amply demonstrated to me in nursery school when I was about three years old. But to actually see it happening as I watched was a new experience.

I had no idea evaporation was so speedy.


[Other articles in category /physics] permanent link

Fri, 13 Jul 2007

New York tourism
Anil Dash recently blogged about touristy stuff in New York that you should skip. I grew up in New York, so I know something about this.

Top of Anil's list: the Statue of Liberty. He advises taking the Staten Island Ferry instead. I couldn't agree more. The Statue is great, but it's just as great seen from a distance, and you get a superb view of it from the Ferry. The Ferry is cheap (Anil says it's free; it was fifty cents last time I took it) and the view of lower Manhattan is unbeatable.

Similarly, you should avoid the Circle Line, which is a boat trip all the way around Manhattan Island. That sounds good, but it takes all day and you spend a lot of it cruising the not-so-scenic Harlem River. The high point of the trip is the view of lower Manhattan and the harbor. You can get the best parts of the Circle Line trip by taking the Staten Island Ferry, which is much cheaper and omits the dull bits.

Ten years ago I would have said to skip the World Trade Center in favor of the Empire State Building. Well, so much for that suggestion.

Anil says to skip Katz's and the Carnegie Deli, that they're tourist traps. I've never been to Katz's. I would not have advised skipping the Carnegie. I have not been there since 1995, so my view may be out of date, and the place may have changed. But in 1995 I would have said that although it is indeed a tourist trap, the pastrami sandwich is superb nevertheless. At no time, however, would I have advised anyone to eat anything else from there. Get the sandwich and eat it in the comfort of your hotel room, perhaps. But quickly, before it gets cold.

Also in the "go there but only eat one thing" department is Junior's Restaurant, at (I think) Atlantic and De Kalb avenues in Brooklyn. Now here's the thing about Junior's: their cheesecake is justly famous. They guarantee it. It is not your usual guarantee. A typical guarantee would be that if you are not happy with the cheesecake, they will refund your money. That is not Junior's guarantee. No. Junior's guarantees your money back unless their cheesecake is the best you have ever eaten.

Lorrie and I once ordered a cheesecake from Junior's. They ship it overnight, packed in dry ice. Our order was delayed in transit; we called the next day to ask where it was. They apologized and immediately overnighted us a second cheesecake, free, with no further discussion. The next day the two cheesecakes arrived in the mail. Both of them were the best cheesecake I have ever eaten.

But I once went to have dinner at Junior's. This was a mistake. Their cheesecake is so stupendous, I thought, how could their other food possibly fail? As usual, the cheesecake was the best I have ever eaten. But dinner? Not so hot. Do go to Junior's. You don't even have to schlep out to Atlantic Avenue, since they have opened restaurants in Times Square and at Grand Central Station. Get the cheesecake. But eat dinner somewhere else.

Anil says not to eat in the goddamn Olive Garden, and of course he is right. What on earth is the point of going to New York, food capital of this half of the Earth, and eating in the goddamn Olive Garden? You could have done that in Dubuque or Tallahassee or whatever crappy Olive-Garden-loving burg you came from.

If you don't know where to eat in New York, here's my advice: Take the subway to 42nd Street, get out, and walk to 9th Avenue. Choose a side of the street by coin flip. Walk north on 9th Avenue. Make a note of every interesting-seeming restaurant you pass. After three blocks, you will have passed at least ten interesting-seeming restaurants. Walk back to the most interesting-seeming one and go in, or select one at random. I promise you will have a win, probably a big win. That stretch of 9th Avenue is a paradise of inexpensive but superb restaurants.

I have played the 9th Avenue game many times and it has never failed.

Speaking of "things to skip", I suggest skipping the giant Times Square New Year's Eve celebration, unless you are a pickpocket, in which case you should get there early. Instead, have dinner on 9th Avenue. As you pass each cross-street walking down 9th Avenue, you will be able to see the Times Square crowd two blocks east, and you can pause a moment to think how clever you are not to be part of it; feeling smugly superior to the writhing mass of humanity is an authentically New York experience. Then have an awesome dinner on 9th Avenue, and take the subway home.

Anil's whole series is pretty good, and as a native New Yorker I found little to disagree with. But I think he may be a little misleading when he says "the natives are friendly and helpful." I would say not. Neither are they unfriendly or unhelpful. What they mostly are, in my experience, is brusque and in a hurry. They will not go out of their way to abuse, harass, or ridicule you; nor will they go out of their way to advise or assist you. The New Yorkers' outlook on the world is that they have important business to attend to, and so, presumably, do you, and everything will run smoothly as long as everyone just stays out of each other's way and attends to their own important business.

In Boston, people will take you personally. I was once thrown out of a liquor store in Boston for daring to ask for a bottle of rye in a manner that the proprietor found offensive. This would never happen in New York. New Yorkers don't have time to be offended by your stupid demands, and they will not throw you out, because they want your money, and if dealing with your stupid demands is what they have to do to get it, well, they will just deal with your stupid demands as quickly as possible. A New York liquor store owner is not in the business of getting offended, and he has more important things to do than to throw you out. He is in the business of taking your money, and if he throws you out, it is because you are getting in the way of his next customer and preventing him from taking his money. Most likely, if you ask for rye, the New York liquor store owner will take your money and give you the rye.

There is a story about Hitler and Goebbels having an argument, with Hitler arguing that the Jews were too inferior to pose any sort of threat, and Goebbels disputing with him, saying that Jews are devious and cunning. To prove his point, Goebbels takes Hitler to a Jewish-run hardware and sundries store and asks the proprietor for a left-handed teapot. The proprietor hesitates a moment, says "let me check in the back room," and returns carrying a teapot in his left hand. "Yes," he says, "I had just one left." As Goebbels and Hitler leave the shop with their left-handed teapot, Goebbels says "I told you the Jews were cunning." Hitler replies "What's so cunning about having one left?"

A Bostonian would have told those two assholes where they could stick their left-handed teapot. That Jew emigrated from Germany, and he did not go to Boston. He went to New York, as did his fifty devious cousins.

But I digress.

In some cities I have visited, there is no convention about which side of the subway stairs is for going up and which is for going down. People just go up whichever side they feel like. In New York, you always travel on the right-hand side of the stairs. Everyone does this, because everyone knows that if they don't they will just get in the way and hold everyone up, including themselves. They have no time for this disorganized nonsense in which people go up whatever side of the stairs suits them.

New Yorkers do not stop and stand in doorways. When New Yorkers need to open their umbrellas, they step aside, and do it out of the way.

New Yorkers are orderly queuers. Disorganized queuing just wastes everyone's time. You don't want to waste everyone's time, do you? So get in line and shut the hell up!

Here in Philadelphia, we waste a lot of time trying to flag down cabs that turn out to be full. New Yorkers would never tolerate such slack management. In New York, taxicabs have a lamp on top that is wired to the taximeter; it lights up when the taxi is empty. That is good business for drivers, for riders, for everyone. I like Philadelphia well enough to have lived here for seventeen years, but it's no New York, let me tell you.

Hong Kong, on the other hand, is a very satisfactory New York. A few years back I visited Hong Kong, food capital of the other half of the Earth, on business, and loved it there. Not least because of the food. The Cantonese are the best cooks in the world, cooks so gifted and brilliant that people all over the world line up on the weekends to eat Cantonese-style garbage, and then come back next weekend to eat it again, because Cantonese garbage is more delicious than other cuisines' delicacies. (They call it dim sum, but if you think about it for a minute you will realize that dim sum is the week's leftovers, served up in a not-too-subtle disguise.) And Hong Kong has the best Cantonese food in the world.

People had warned me beforehand that the Hongkongians were known for being brusque and rude. And that is what I found. Several times in Hong Kong I called up someone or other to try to get something done, and the conversation went roughly like this: I would start my detailed explanation of what I wanted, and why, and the person on the other end of the phone would cut me off mid-sentence, saying something like "You need x; I do y. OK? OK! <click>" and that was the end of it.

As a New Yorker, I recognized immediately what was going on. Brusque, yes, but not rude. I knew that the person on the other end of the phone was thinking that their time was valuable, that I presumably considered my own time valuable, and that we would both be best served if each of us wasted as little of our valuable time as possible in idle chitchat. New Yorkers are just like that too. I gather some people are offended by this behavior, and want the person on the phone to be polite and friendly. I just want them to shut up and do the thing I want done, and in Hong Kong that is what I got.

So if you are a tourist in New York, please try to remember: New Yorkers may appear to be trying to get rid of you as quickly as they can, and if it seems that way, it is probably because they are trying to get rid of you as quickly as they can. But they are doing it because they are trying to help, because they have your best interests at heart. And also because they want to get rid of you as quickly as they can.



Thu, 12 Jul 2007

Another useful utility
Every couple of years I get a good idea for a simple utility that will make my life easier. Last time it was the following triviality, which I call f:

	#!/usr/bin/perl

	my $field = shift or usage();
	$field -= 1 if $field > 0;
	$|=1;

	while (<>) {
		chomp;
		my @f = split;
		print $f[$field], "\n";
	}

	sub usage {
		print STDERR "$0 fieldnumber\n"; 
		exit 1;
	}
I got tired of writing awk '{print $11}' when I wanted to extract the 11th field of some stream of data in a Unix pipeline, which is something I do about six thousand times a day. So I wrote this tiny thing. It was probably the most useful piece of software I wrote in that calendar year, and as you can see from the length, it certainly had the best cost-to-benefit ratio. I use it every day.

The point here is that you can replace awk '{print $11}' with just f 11. For example, f 11 access_log finds out the referrer URLs from my Apache httpd log. I also frequently use f -1, which prints the last field in each line. ls -l | grep '^l' | f -1 prints out the targets of all the symbolic links in the current directory.
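The only subtle part of f is the index arithmetic that lets the same program accept both f 11 and f -1. A minimal sketch of that logic as a standalone function follows; the name pick_field is hypothetical and is not part of the original program.

```perl
use strict;
use warnings;

# pick_field is a hypothetical helper, not part of the original f.
# Positive field numbers count from 1, as in awk; negative numbers
# count back from the end of the line, as Perl array indices do.
sub pick_field {
    my ($line, $n) = @_;
    my @f = split ' ', $line;          # awk-style whitespace splitting
    my $i = $n > 0 ? $n - 1 : $n;      # convert 1-based to 0-based
    return $f[$i];
}

print pick_field("alpha beta gamma", 2), "\n";    # second field
print pick_field("alpha beta gamma", -1), "\n";   # last field
```

Negative numbers need no adjustment because Perl already treats $f[-1] as the last element.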

Programs like this won't win me any prizes, but they certainly are useful.

Anyway, today's post was inspired by another similarly tiny utility that I just finished, and that I expect will be similarly useful. It's called runN:

	#!/usr/bin/perl

	use Getopt::Std;
	my %opt;
	getopts('rn:c:v', \%opt) or usage();
	$opt{n} or usage();
	$opt{c} or usage();

	@ARGV = shuffle(@ARGV) if $opt{r};

	my $N = $opt{n};
	my %pid;
	while (@ARGV) {
	  if (keys(%pid) < $N) {
	    $pid{spawn($opt{c}, split /\s+/, shift @ARGV)} = 1;
	  } else {
	    delete $pid{wait()};
	  }
	}

	1 while wait() >= 0;

	sub spawn {
	  my $pid = fork;
	  die "fork: $!" unless defined $pid;
	  return $pid if $pid;
	  exec @_;
	  die "exec: $!";
	}
You can tell I just finished it because the shuffle() and usage() functions are unimplemented.

The idea is that you execute the program like this:

	runN -n 3 -c foo arg1 arg2 arg3 arg4...
and it runs the commands foo arg1, foo arg2, foo arg3, foo arg4, etc., simultaneously, but with no more than 3 running at a time.

The -n option says how many commands to run simultaneously; after running that many the main control waits until one has exited before starting another.

If I had implemented shuffle(), then -r would run the commands in random order, instead of in the order specified. Probably I should get rid of -c and just have the program take the first argument as the command name, so that the invocation above would become runN -n 3 foo arg1 arg2 arg3 arg4.... The -v flag, had I implemented it, would put the program into verbose mode.
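The missing shuffle() would not need to be written by hand, since the List::Util module that ships with Perl exports a function of that name. Here is a sketch of how -r might be wired up. The %opt hash and argument list are filled in by hand as stand-ins for what getopts and the command line would supply.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Stand-ins for what getopts and the command line would provide.
my %opt  = (r => 1);
my @args = qw(arg1 arg2 arg3 arg4);

# Randomize the order only when -r was given.
@args = shuffle(@args) if $opt{r};

print "$_\n" for @args;
```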

I find that it's best to defer the implementation of features like -r and -v until I actually need them, which might be never. In the past I've done post-analyses of the contents of ~mjd/bin, and what I found was that my tendency was to implement a lot more features than I needed or used.

In the original implementation, the -n is mandatory, because I couldn't immediately think of a reasonable default. The only obvious choice is 1, but since the point of the program was to run programs concurrently, 1 is not reasonable. But it occurs to me now that if I let -n default to 1, then this command would replace many of my current invocations of:

	for i in ...; do
	  cmd $i
	done
which I do quite a lot. Typing runN cmd ... would be a lot quicker and easier. As I've written before, when a feature you put in turns out to have unanticipated uses, it's a sign of a good, modular design.
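A sketch of what that default might look like, assuming -c has also been dropped so that the first non-option argument is the command. The function name parse_n is made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# parse_n is a hypothetical wrapper, so the defaulting can be shown
# in isolation from the rest of runN.
sub parse_n {
    local @ARGV = @_;
    my %opt;
    getopts('n:', \%opt) or die "usage: $0 [-n N] command args...\n";
    # Fall back to one job at a time when -n is absent.
    my $n = defined $opt{n} ? $opt{n} : 1;
    return ($n, @ARGV);
}

my ($n, @cmd) = parse_n(qw(cmd arg1 arg2));
print "n=$n, command and args: @cmd\n";
```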

The code itself makes me happy for two reasons. One is that the program worked properly on the first try, which does not happen very often for me. When I was in elementary school, my teachers always complained that although I was very bright, I made a lot of careless mistakes because I was not methodical enough. They tried hard to fix this personality flaw. They did not succeed.

The other thing I like about the code is that it's so very brief. Not to say that it is any briefer than it should be; I think it's just about perfect. One of the recurring themes of my study of programming for the last few years is that beginner programmers use way more code than is necessary, just like beginning writers use way too many words. The process and concurrency management turned out to be a lot easier than I thought they would be: the default Unix behavior was just exactly what I needed. I am particularly pleased with delete $pid{wait()}. Sometimes these things just come together.
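To see why delete $pid{wait()} is all the bookkeeping the loop needs, here is a self-contained sketch of the same pattern. It runs Perl code refs in child processes instead of exec'ing commands, and it counts the children it reaps. The name run_limited and the counter are assumptions added for illustration.

```perl
use strict;
use warnings;

$| = 1;   # flush output before forking so children don't re-print it

# run_limited is a hypothetical name. It runs each code ref in its own
# child process, never more than $limit at a time, and returns the
# number of children reaped.
sub run_limited {
    my ($limit, @jobs) = @_;
    my %pid;
    my $reaped = 0;
    while (@jobs) {
        if (keys(%pid) < $limit) {
            my $job = shift @jobs;
            my $pid = fork;
            die "fork: $!" unless defined $pid;
            if ($pid == 0) { $job->(); exit 0 }   # child does the work
            $pid{$pid} = 1;                       # parent records the child
        } else {
            # wait() blocks until some child exits and returns its PID;
            # deleting that key frees one slot in the table.
            delete $pid{wait()};
            $reaped++;
        }
    }
    # Reap whatever is still running; wait() returns -1 when none are left.
    $reaped++ while wait() >= 0;
    return $reaped;
}

my @jobs;
push @jobs, sub { 1 } for 1 .. 5;
my $n = run_limited(2, @jobs);
print "reaped $n children\n";
```

The hash never needs to know which child finished first. Whichever PID wait() returns is the slot that has just opened up.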

The 1 while wait() >= 0 line is a non-obfuscated version of something I wrote in my prize-winning obfuscated program, of all places. Sometimes the line between the sublime and the ridiculous is very fine indeed.

Despite my wariness of adding unnecessary features, there is at least one that I will put in before I deploy this to ~mjd/bin and start using it. I'll implement usage(), since experience has shown that I tend to forget how to invoke these things, and reading the usage message is a quicker way to figure it out than is rereading the source code. In the past, usage messages have been good investments.
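A usage() along those lines could be as small as the following sketch. The option descriptions are taken from the text above, but the wording of the message and the usage_text helper are assumptions.

```perl
use strict;
use warnings;

# usage_text is a hypothetical helper; keeping the message in its own
# function makes it easy to reuse or to check.
sub usage_text {
    return <<"END";
Usage: $0 -n N -c command [-r] [-v] arg...
  -n N        run at most N commands at once
  -c command  command to run once for each arg
  -r          run the commands in random order
  -v          verbose mode
END
}

# Print the message on standard error and exit with a failure status.
sub usage {
    print STDERR usage_text();
    exit 1;
}
```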

I'm tempted to replace the cut-rate use of split here with something more robust. The problem I foresee is that I might want to run a command with an argument that contains a space. Consider:

	runN -n 2 -c ls foo bar "-l baz"
This runs ls foo, then ls bar, then ls -l baz. Without the split() or something like it, the third command would be equivalent to ls "-l baz" and would fail with something like -l baz: no such file or directory. (Actually it tries to interpret the space as an option flag, and fails for that reason instead.) So I put the split in to enable this usage. (Maybe this was a you-ain't-gonna-need-it moment; I'm not sure.) But this design makes it difficult or impossible to apply the command to an argument with a space in it. Suppose I'm trying to do ls on three directories, one of which is called old stuff. The natural thing to try is:

	runN -n 2 -c ls foo bar "old stuff"
But the third command turns into ls old stuff and produces:

	ls: old: No such file or directory
	ls: stuff: No such file or directory
If the split() were omitted, it would just work, but then the ls -l baz example above would fail. If the split() were replaced by the correct logic, I would be able to get what I wanted by writing something like this:

	runN -n 2 -c ls foo bar "'old stuff'"
But as it is this just produces another error:

	ls: 'old: No such file or directory
	ls: stuff': No such file or directory
Perl comes standard with a library called Text::ParseWords, whose shellwords function is probably close to what I want here. I didn't use it because I wasn't sure I'd actually need it (only time will tell) and because shell parsing is very complicated and error-prone, more so when it is done synthetically rather than by the shell, and even more so when it is done multiple times; you end up with horrible monstrosities like this:

	s='q=`echo "$s" | sed -e '"'"'s/'"'"'"'"'"'"'"'"'/'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'/g'"'"'`; echo "s='"'"'"$q"'"'"'"; echo $s'
	q=`echo "$s" | sed -e 's/'"'"'/'"'"'"'"'"'"'"'"'/g'`; echo "s='"$q"'"; echo $s
So my fear was that by introducing a double set of shell-like interpretation, I'd be opening a horrible can of escape character worms and weird errors, and my hope was that if I ignored the issue the problems might be simpler, and might never arise in practice. We'll see.
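For comparison, here is a sketch of what the more robust splitting might look like, assuming the shellwords function from the standard Text::ParseWords module is used in place of split /\s+/. It only shows how the three troublesome arguments from the examples above would be broken into words. It is not a claim about how runN should finally handle them.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Split each runN argument into words using shell-like quoting rules
# rather than a bare split on whitespace.
for my $arg ('-l baz', 'old stuff', q{'old stuff'}) {
    my @words = shellwords($arg);
    print scalar(@words), " word(s): ", join(' | ', @words), "\n";
}
```

With this splitting, the inner quotes in "'old stuff'" keep the directory name together as a single word, while "-l baz" still splits into an option and a filename.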

[ Addendum 20080712: Aaron Crane wrote a thoughtful followup. Thank you, M. Crane. ]

