| The Universe of Discourse | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
12 recent entries Archive:
In this section: Comments disabled |
Fri, 27 Aug 2010
A dummy generator for mock objects
Suppose you are debugging some method, say someMethod, which accepts as one of its arguments complicated, annoying objects $annoying that you either can't or don't want to instantiate. This might be because $annoying is very complicated, with many sub-objects to set up, or perhaps you simply don't know how to build $annoying and don't care to find out. That is okay, because you can get someMethod to run without the full behavior of $annoying. Say for example someMethod calls $annoying->foo_manager->get_foo(...)->get_user_id. You don't understand or care about the details because for debugging someMethod it is enough to suppose that the end result is the user ID 3. You could supply a mock object, or several, that implement the various methods, but that requires some work up front. Instead, use this canned Dummy class. Instead of instantiating a real $annoying (which is difficult) or using a bespoke mock object, use Dummy->new("annoying"):
package Dummy;
use Data::Dumper;
$Data::Dumper::Terse = 1;
our $METHOD;
my @names = qw(bottle corncob euphonium octopus potato slide);
my $NAME = "aaa";
sub new {
my ($class, $name) = @_;
$name ||= $METHOD || shift(@names) || $NAME++;
bless { N => $name } => $class;
}
The call Dummy->new("annoying") will generate an ad-hoc mock
object; whenever any method is called on this dummy object, the call
will be caught by an AUTOLOAD that will prompt you for the
return value you want it to produce:
sub AUTOLOAD {
my ($self, @args) = @_;
my ($p, $m) = $AUTOLOAD =~ /(.*)::(.*)/;
local $METHOD = $m;
print STDERR "<< $_[0]{N}\->$m >>\n";
print STDERR "Arguments: " . Dumper(\@args) . "\n";
my $v;
do {
print STDERR "Value? ";
chomp($v = <STDIN>);
} until eval "$v; 1";
return(eval $v);
}
sub DESTROY { }
1;
The prompt looks like this:
<< annoying->foo_manager >> Arguments: [] Value?If the returned value should be a sub-object, no problem: just put in new Dummy and it will make a new Dummy object named foo_manager, and the next prompt will be:
<< foo_manager->get_foo >> Arguments: ... ... Value?Now you can put in new Dummy "(Fred's foo)" or whatever. Eventually it will ask you for a value for (Fred's foo)->id and you can have it return 4. It's tempting to add caching, so that it won't ask you twice for the results of the same method call. But that would foreclose the option to have the call return different results twice. Better, I think, is for the user to cache the results themselves if they plan to use them again; there is nothing stopping the user from entering a value expression like $::val = .... This may turn out to be one of those things that is mildly useful, but not useful enough to actually use; we'll see.
[Other articles in category /prog/perl] permanent link Thu, 26 Aug 2010
Monad terminology problem
The most serious problem here is #4, that people refer to individual values of monadic types as "monads". Even when they don't do this, they are hampered by the lack of a good term for it. As I know no good alternative has been proposed. People often say "monadic value" (I think), which is accurate, but something of a mouthful. One thing I have discovered in my writing life is that the clarity of a confusing document can sometimes be improved merely by replacing a polysyllabic noun phrase with a monosyllable. For example, chapter 3 of Higher-Order Perl discussed the technique of memoizing a function by generating an anonymous replacement for it that maintains a cache and calls the real function on a cache miss. Early drafts were hard to understand, and improved greatly when I replaced the phrase "anonymous replacement function" with "stub". The Perl documentation was significantly improved merely by replacing "associative array" everywhere with "hash" and "funny punctuation character" with "sigil". I think a monosyllabic replacement for "monadic value" would be a similar boon to discussion of monads, not just for beginners but for everyone else too. The drawback, of introducing yet another jargon term, would in this case be outweighed by the benefits. Jargon can obscure, but sometimes it can clarify. The replacement word should be euphonious, clear but not overly specific, and not easily confused with similar jargon words. It would probably be good for it to begin with the letter "m". I suggest: So return takes a value and returns a mote. The >>= function similarly lifts a function on pure values to a function on motes; when the mote is a container one may think of >>= as applying the function to the values in the container. [] is a monad, so lists are motes. The expression on the right-hand side of a var ← expr in a do-block must have mote type; it binds the mote on the right to the name on the left, using the >>= operator. I have been using this term privately for several months, and it has been a small but noticeable success. Writing and debugging monadic programs is easier because I have a simple name for the motes that the program manipulates, which I can use when I mumble to myself: "What is the type error here? Oh, commit should be returning a mote." And then I insert return in the right place. I'm don't want to oversell the importance of this invention. But there is clearly a gap in the current terminology, and I think it is well-filled by "mote". (While this article was in progress I discovered that What a Monad is not uses the nonceword "mobit". I still prefer "mote".)
[Other articles in category /prog/haskell] permanent link Sun, 03 Jan 2010
A short bibliography of probability monads
I did not imagine that my idea was a new one. I arrived at it by thinking about List as a representation of non-deterministic computation. But if you think of it that way, the natural interpretation is that every list element represents an equally likely outcome, and so annotating the list elements with probabilities is the obvious next step. So the existence of the Erwig library was not a big surprise. A little more surprising though, were the references in the Erwig paper. Specifically, the idea dates back to at least 1981; Erwig cites a paper that describes the probability monad in a pure-mathematics context. Nobody responded to my taunting complaint about Haskell's failure to provide support a good monad of sets. It may be that this is because they all agree with me. (For example, the documentation of the Erwig package says "Unfortunately we cannot use a more efficient data structure because the key type must be of class Ord, but the Monad class does not allow constraints for result types.") But a number of years ago I said that the C++ macro processor blows goat dick. I would not have put it so strongly had I not naïvely believed that this was a universally-held opinion. But no, plenty of hapless C++ programmers wrote me indignant messages defending their macro system. So my being right is no guarantee that language partisans will not dispute with me, and the Haskell community's failure to do so in this case reflects well on them, I think.
[Other articles in category /prog/haskell] permanent link Thu, 31 Dec 2009
A monad for probability and provenance
Suppose a monad value represents all the possible outcomes of an event, each with a probability of occurrence. For concreteness, let's suppose all our probability distributions are discrete. Then we might have: data ProbDist p a = ProbDist [(a,p)] deriving (Eq, Show) unpd (ProbDist ps) = psEach a is an outcome, and each p is the probability of that outcome occurring. For example, biased and unbiased coins:
unbiasedCoin = ProbDist [ ("heads", 0.5),
("tails", 0.5) ];
biasedCoin = ProbDist [ ("heads", 0.6),
("tails", 0.4) ];
Or a couple of simple functions for making dice:
import Data.Ratio
d sides = ProbDist [(i, 1 % sides) | i <- [1 .. sides]]
die = d 6
d n is an n-sided die. The Functor instance is straightforward:
instance Functor (ProbDist p) where
fmap f (ProbDist pas) = ProbDist $ map (\(a,p) -> (f a, p)) pas
The Monad instance requires return and
>>=. The return function merely takes an event and
turns it into a distribution where that event occurs with probability
1. I find join easier to think about than >>=.
The join function takes a nested distribution, where each
outcome of the outer distribution specifies an inner distribution for
the actual events, and collapses it into a regular, overall
distribution. For example, suppose you put a biased coin and an
unbiased coin in a bag, then pull one out and flip it:
bag :: ProbDist Double (ProbDist Double String)
bag = ProbDist [ (biasedCoin, 0.5),
(unbiasedCoin, 0.5) ]
The join operator collapses this into a single ProbDist
Double String:
ProbDist [("heads",0.3),
("tails",0.2),
("heads",0.25),
("tails",0.25)]
It would be nice if join could combine the duplicate
heads into a single ("heads", 0.55) entry. But that
would force an Eq a constraint on the event type, which isn't
allowed, because (>>=) must work for all data types, not
just for instances of Eq. This is a problem with Haskell,
not with the monad itself. It's the same problem that prevents one
from making a good set monad in Haskell, even though categorially sets
are a perfectly good monad. (The return function constructs
singletons, and the join function is simply set union.)
Maybe in the next language.Perhaps someone else will find the >>= operator easier to understand than join? I don't know. Anyway, it's simple enough to derive once you understand join; here's the code:
instance (Num p) => Monad (ProbDist p) where return a = ProbDist [(a, 1)] (ProbDist pas) >>= f = ProbDist $ do (a, p) <- pas let (ProbDist pbs) = f a (b, q) <- pbs return (b, p*q)So now we can do some straightforward experiments:
liftM2 (+) (d 6) (d 6) ProbDist [(2,1 % 36),(3,1 % 36),(4,1 % 36),(5,1 % 36),(6,1 % 36),(7,1 % 36),(3,1 % 36),(4,1 % 36),(5,1 % 36),(6,1 % 36),(7,1 % 36),(8,1 % 36),(4,1 % 36),(5,1 % 36),(6,1 % 36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(5,1 % 36),(6,1 % 36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(10,1 % 36),(6,1 % 36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(10,1 % 36),(11,1 % 36),(7,1 % 36),(8,1 % 36),(9,1 % 36),(10,1 % 36),(11,1 % 36),(12,1 % 36)]This is nasty-looking; we really need to merge the multiple listings of the same event. Here is a function to do that:
agglomerate :: (Num p, Eq b) => (a -> b) -> ProbDist p a -> ProbDist p b
agglomerate f pd = ProbDist $ foldr insert [] (unpd (fmap f pd)) where
insert (k, p) [] = [(k, p)]
insert (k, p) ((k', p'):kps) | k == k' = (k, p+p'):kps
| otherwise = (k', p'):(insert (k,p) kps)
agg :: (Num p, Eq a) => ProbDist p a -> ProbDist p a
agg = agglomerate id
Then agg $ liftM2 (+) (d 6) (d 6) produces:
ProbDist [(12,1 % 36),(11,1 % 18),(10,1 % 12),(9,1 % 9),
(8,5 % 36),(7,1 % 6),(6,5 % 36),(5,1 % 9),
(4,1 % 12),(3,1 % 18),(2,1 % 36)]
Hey, that's correct.There must be a shorter way to write insert. It really bothers me, because it looks look it should be possible to do it as a fold. But I couldn't make it look any better. You are not limited to calculating probabilities. The monad actually will count things. For example, let us throw three dice and count how many ways there are to throw various numbers of sixes:
eq6 n = if n == 6 then 1 else 0
agg $ liftM3 (\a b c -> eq6 a + eq6 b + eq6 c) die die die
ProbDist [(3,1),(2,15),(1,75),(0,125)]
There is one way to throw three sixes, 15 ways to throw two sixes, 75
ways to throw one six, and 125 ways to throw no sixes. So
ProbDist is a misnomer. It's easy to convert counts to probabilities:
probMap :: (p -> q) -> ProbDist p a -> ProbDist q a
probMap f (ProbDist pds) = ProbDist $ (map (\(a,p) -> (a, f p))) pds
normalize :: (Fractional p) => ProbDist p a -> ProbDist p a
normalize pd@(ProbDist pas) = probMap (/ total) pd where
total = sum . (map snd) $ pas
normalize $ agg $ probMap toRational $
liftM3 (\a b c -> eq6 a + eq6 b + eq6 c) die die die
ProbDist [(3,1 % 216),(2,5 % 72),(1,25 % 72),(0,125 % 216)]
I think this is the first time I've gotten to write die die
die in a computer program.The do notation is very nice. Here we calculate the distribution where we roll four dice and discard the smallest:
stat = do
a <- d 6
b <- d 6
c <- d 6
d <- d 6
return (a+b+c+d - minimum [a,b,c,d])
probMap fromRational $ agg stat
ProbDist [(18,1.6203703703703703e-2),
(17,4.1666666666666664e-2), (16,7.253086419753087e-2),
(15,0.10108024691358025), (14,0.12345679012345678),
(13,0.13271604938271606), (12,0.12885802469135801),
(11,0.11419753086419752), (10,9.41358024691358e-2),
(9,7.021604938271606e-2), (8,4.7839506172839504e-2),
(7,2.9320987654320986e-2), (6,1.6203703703703703e-2),
(5,7.716049382716049e-3), (4,3.0864197530864196e-3),
(3,7.716049382716049e-4)]
One thing I was hoping to get didn't work out. I had this idea that
I'd be able to calculate the outcome of a game of craps like this:
dice = liftM2 (+) (d 6) (d 6)
point n = do
roll <- dice
case roll of 7 -> return "lose"
_ | roll == n = "win"
_ | otherwise = point n
craps = do
roll <- dice
case roll of 2 -> return "lose"
3 -> return "lose"
4 -> point 4
5 -> point 5
6 -> point 6
7 -> return "win"
8 -> point 8
9 -> point 9
10 -> point 10
11 -> return "win"
12 -> return "lose"
This doesn't work at all; point is an infinite loop because
the first value of dice, namely 2, causes a recursive call.
I might be able to do something about this, but I'll have to think
about it more.It also occurred to me that the use of * in the definition of >>= / join could be generalized. A couple of years back I mentioned a paper of Green, Karvounarakis, and Tannen that discusses "provenance semirings". The idea is that each item in a database is annotated with some "provenance" information about why it is there, and you want to calculate the provenance for items in tables that are computed from table joins. My earlier explanation is here. One special case of provenance information is that the provenances are probabilities that the database information is correct, and then the probabilities are calculated correctly for the joins, by multiplication and addition of probabilities. But in the general case the provenances are opaque symbols, and the multiplication and addition construct regular expressions over these symbols. One could generalize ProbDist similarly, and the ProbDist monad (even more of a misnomer this time) would calculate the provenance automatically. It occurs to me now that there's probably a natural way to view a database table join as a sort of Kleisli composition, but this article has gone on too long already. Happy new year, everyone. [ Addendum 20100103: unsurprisingly, this is not a new idea. Several readers wrote in with references to previous discussion of this monad, and related monads. It turns out that the idea goes back at least to 1981. ]
My thanks to Graham Hunter for his donation.
[Other articles in category /prog/haskell] permanent link Tue, 15 Dec 2009
Monads are like burritos
At first I thought the choice of burritos was only a facetious reference to the peculiar and sometimes strained analogies these tutorials make. But then I realized that monads are like burritos. I will explain. A monad is a special kind of a functor. A functor F takes each type T and maps it to a new type FT. A burrito is like a functor: it takes a type, like meat or beans, and turns it into a new type, like beef burrito or bean burrito. A functor must also be equipped with a map function that lifts functions over the original type into functions over the new type. For example, you can add chopped jalapeños or shredded cheese to any type, like meat or beans; the lifted version of this function adds chopped jalapeños or shredded cheese to the corresponding burrito. A monad must also possess a unit function that takes a regular value, such as a particular batch of meat, and turns it into a burrito. The unit function for burritos is obviously a tortilla. Finally, a monad must possess a join function that takes a ridiculous burrito of burritos and turns them into a regular burrito. Here the obvious join function is to remove the outer tortilla, then unwrap the inner burritos and transfer their fillings into the outer tortilla, and throw away the inner wrappings. The map, join, and unit functions must satisfy certain laws. For example, if B is already a burrito, and not merely a filling for a burrito, then join(unit(B)) must be the same as B. This means says that if you have a burrito, and you wrap it in a second tortilla, and then unwrap the contents into the outer tortilla, the result is the same as what you started with. This is true because tortillas are indistinguishable. I know you are going to point out that some tortillas have the face of Jesus. But those have been toasted, and so are unsuitable for burrito-making, and do not concern us here. So monads are indeed like burritos. I asked Brent if this was actually what he had in mind when he first suggested the idea of tutorials explaining monads in terms of burritos, and if everyone else had understood this right away. But he said no, I was the lone genius.
[Other articles in category /prog] permanent link Fri, 31 Jul 2009
Dijkstra was not insane
I wrote a response, explaining where Dijkstra was coming from, and I am very happy with how it came out, so I'm reposting it here. The list subscriber said, in part:
On a side note, I never read anything by Dijkstra that wasn't noticeably out of touch with the reality of programming, which qualifies them as screeds to me.A lot of people bring up the premature-loop-exit prohibition without understanding why Dijkstra suggested it; it wasn't just that he was a tightassed Dutchman. Dijkstra's idea was this: suppose you want to prove, mathematically, that your program does what it is supposed to do. Please, everyone, suspend your judgment of this issue for a few paragraphs, and bear with me. Let's really suppose that we want to do this. Dijkstra's idea is that the program is essentially a concatenation of blocks, each of which is trying to accomplish something or other, and each of which does not make sense to run unless some part of the program state is set up for it ahead of time. For example, the program might be to print a sorted list of links from a web page. Then the obvious blocks are:
We say that the "precondition" for C is that the array be populated with URLs, and the "postcondition" is that the array be in sorted order. What you would want to prove about C is that if the precondition holds—that is, if the array is properly populated before C begins—then the postcondition will hold too—that is, the array will be in sorted order when C completes. It occurs to me that calling this a "proof" is probably biasing everyone's thinking. Let's forget about mathematical proofs and just think about ordinary programmers trying to understand if the program is correct. If the intern in the next cubicle handed you his code for this program, and you were looking it over, you would probably think in very much this way: you would identify block C (maybe it's a subroutine, or maybe not) and then you would try to understand if C, given an array of URLs, would produce a properly sorted array by the time it was done. C itself might depend on some sub-blocks or subroutines that performed sub-parts of the task; you could try to understand them similarly. Having proved (or convinced yourself) that C will produce the postcondition "array contains sorted list of URLs", you are in an excellent position to prove (or convince yourself) that block D prints out a sorted array of URLs, which is what you want. Without that belief about C, you are building on sand; you have almost nothing to go on, and you can conclude hardly anything useful about the behavior of D. Now consider a more complex block, one of the form:
if (q) { E; }
else { F; }
Suppose you believe that code E, given precondition x, is
guaranteed to produce postcondition y. And suppose you believe
the same thing about F. Then you can conclude the same thing about
the entire if-else block: if x was true before it began
executing, then y will be true when it is done.[2] So you can build up proofs (or beliefs)
about small bits of code into proofs (or beliefs) about larger ones.We can understand while loops similarly. Suppose we know that condition p is true prior to the commencement of some loop, and that if p is true before G executes, then p will also be true when G finishes. Then what can we say about this loop?
while (q) { G; }
We can conclude that if p was true before the loop began, then p will
still be true, and q will be false, when the loop ends.BUT BUT BUT BUT if your language has break, then that guarantee goes out the window and you can conclude nothing. Or at the very least your conclusions will become much more difficult. You can no longer treat G atomically; you have to understand its contents in detail. So this is where Dijkstra is coming from: features like break[3] tend to sabotage the benefits of structured programming, and prevent the programmer from understanding the program as a composition of independent units. The other subscriber made a seemingly disparaging reference to "Dijkstra's idea of structure", but I hope it is clear that it was not an arbitrary idea. Dijkstra's idea of structure is what will allow you to understand a large program as a collection of modules. Regardless of your opinion about formal verification methods, or correctness proofs, or the practicality of omitting break from your language, it should at least be clear that Dijkstra was not being doctrinaire just for the sake of doctrine.
Some additional notesHere are some interesting peripheral points that I left out of my main discussion because I wanted to stick to the main point, which was: "Dijkstra was not insane".
Further reading
[Other articles in category /prog] permanent link Tue, 16 Jun 2009
Haskell logo fail
[Other articles in category /prog/haskell] permanent link Thu, 14 May 2009
Product types in Java
class Persons2 {
Person personA, personB;
Persons2(Person a, Person b) {
personA = a; personB = b;
}
Person getPersonA() { return personA; }
...
}
Java is loathsome in its verbosity, and this sort of monkey code is
Java's verbosity at its most loathsome. So I did not do this.Haskell functions return only one value also, but this is no limitation, because Haskell has product types. And starting in Java 5, the Java type system is a sort of dented, bolted-on version of the type systems that eventually evolved into the Haskell type system. But product types are pretty simple. I can make a generic product type in Java:
class Pair<A,B> {
A a; B b;
Pair(A a, B b) { this.a = a; this.b = b; }
A fst() { return a; }
B snd() { return b; }
}
Then I can declare my function to return a
Pair<Person,Person>:
Pair<Person,Person> findMatch() {
...
return new Pair
Okay, that worked just fine. The
boilerplate is still there, but you only have to do it once. This
trick seems sufficiently useful that I can imagine that I will use it
again, and that someone else reading this will want to use it too.I've been saying for a while that up through version 1.4, Java was a throwback to the languages of the 1970s, but that with the introduction of generics in Java 5, it took a giant step forward into the 1980s. I think this is a point of evidence in favor of that claim. I wonder why this class isn't in the standard library. I was not the first person to think of doing this; web search turns up several others, who also wonder why this class isn't in the standard library. I wrote a long, irrelevant coda regarding my use of the identifiers husband and wife in the example, but, contrary to my usual practice, I will publish it another day. [ Addendum 20090517: Here's the long, irrelevant coda. ]
I gratefully acknowledge the gift of Petr Kiryakov. Thank you!
[Other articles in category /prog/java] permanent link Sun, 22 Mar 2009
Worst error messages this month
Line 319 in XML document from class path resource [applicationContext-standalone.xml] is invalid; nested exception is org.xml.sax.SAXParseException: cvc-complex-type.2.3: Element 'beans' cannot have character [children], because the type's content type is element-only.Experienced technicians will of course want to look at line 319. Silly! If looking at line 319 were any help, this would not be this month's lucky winner. Line 319 is the last line of the document, and says, in whole, "</beans>". What this actually means is that there is a stray plus sign at the end of line 54. Well, that is the ultimate cause. The Fregean Bedeutung, as it were. What it really means (the Sinn) is that the <beans>...</beans> element is allowed to contain sub-elements, but not naked text ("content type is element-only") and the stray plus sign is naked text. The mixture of weird jargon ("cvc-complex-type.2.3") and obscure anaphora ("character [children]" for "plus sign") got this message nominated for the competition. The totally wrong line number is a bonus. But what won this message the prize is that even if you somehow understand what it means, it doesn't help you find the actual problem! You get to grovel over the 319-line XML file line-by-line, looking for the extra character. Come on, folks, it's a SAX parser, so how hard is it to complain about the plus sign as soon as it shows up? What do we have for the lucky winner, Johnny?
You'll be flown to lovely Centralia, Pennsylvania, where you'll enjoy four days and three nights of solitude in an abandoned coal mine being flogged with holly branches and CAT-5 ethernet cable by the cast of "The Hills"!Thank you, Johnny. And there is a runner-up! The badblocks utility that is distributed as part of the Linux e2fsprogs package, produces the following extremely useful error message:
% badblocks /home badblocks: invalid starting block (0): must be less than 0Apparently this is Linux-speak for "This program needs the name of a device file, and the programmer was too lazy to have it detect that you supplied the name of the mount point instead". Happy spring, everyone!
[Other articles in category /prog] permanent link Thu, 12 Feb 2009
More Uzi-clubbing: a counterexample
I ended the article by saying:
I had already realized that you could, in principle, commit this error with a regular array instead of with a hash, but I had never seen an example until...Just recently I saw another example, which I think is interesting because it seems to be a counterexample. It's part of a somewhat longer Java program. The crucial section is:
...
LINE: while ( ( line = in.readLine()) != null ) {
String[] fields = line.split("\t");
...
for ( int i = 0; i < fields.length; i++ ) {
if ( ! isEmpty(fields[i]) ) {
switch(i) {
case 0: citation.setCitationType(fields[i]); break;
case 1: setAuthors(citation,fields[i],personHome,false); break;
case 2: citation.setPublishYear(Integer.parseInt(fields[i])); break;
case 3: citation.setTitle(fields[i]); break;
...
case 19: citation.setURL(fields[i]); break;
case 20: citation.setDoi(fields[i]); break;
default: warn("Empty field expected, found: " + fields[i] + " for line: " + line); break;
}
}
}
}
...
The Perlishness of this Java code might lead you to think that I wrote
it, but I did not.My temptation here was to replace the loop and the switch with code like this:
citation.setCitationType(fields[0]);
setAuthors(citation,fields[1],personHome,false);
citation.setPublishYear(Integer.parseInt(fields[2]));
citation.setTitle(fields[3]);
...
citation.setURL(fields[19]);
citation.setDoi(fields[20]);
We lost the warnings, but there were only 4 of those, so we can add
them back explicitly:
if (! isEmpty(fields[13])) warn("Empty field expected...");
This might have been an improvement, except that we also lost the
isEmpty tests on the nonempty fields. To get them back we
must spend at least all our gains, possibly more:
if (! isEmpty(fields[0])) citation.setCitationType(fields[0]);
if (! isEmpty(fields[1])) setAuthors(citation,fields[1],personHome,false);
if (! isEmpty(fields[2])) citation.setPublishYear(Integer.parseInt(fields[2]));
if (! isEmpty(fields[3])) citation.setTitle(fields[3]);
...
if (! isEmpty(fields[13])) warn("Empty field expected...");
...
if (! isEmpty(fields[19])) citation.setURL(fields[19]);
if (! isEmpty(fields[20])) citation.setDoi(fields[20]);
So at least in this case, my instinct to eliminate the loop-switch was
not helpful. There are plenty of Java-esque techniques for cutting up
the complexity and sweeping each little piece underneath its own
little carpet ("Replace fields with an object! Or with a
series of 20 objects!") but nothing that actually reduces the entia
multiplicantis. There may be ways to easily improve this code, but
I have not been able to think of any.
[Other articles in category /prog] permanent link Sat, 24 Jan 2009
Higher-Order Perl: nonmemoizing streams
sub tail {
my $s = shift;
if (is_promise($s->[1])) {
return $s->[1]->(); # Force promise
} else {
return $s->[1];
}
}
But this is soon replaced with a version that caches the value
returned by the promise:
sub tail {
my $s = shift;
if (is_promise($s->[1])) {
$s->[1] = $s->[1]->(); # Force and save promise
}
return $s->[1];
}
The reason that I give for this in the book is a performance reason.
It's accompanied by an extremely bad explanation. But I couldn't do
any better at the time.There are much stronger reasons for the memoizing version, also much easier to explain. Why use streams at all instead of the iterators of chapter 4? The most important reason, which I omitted from the book, is that the streams are rewindable. With the chapter 4 iterators, once the data comes out, there is no easy way to get it back in. For example, suppose we want to process the next bit of data from the stream if there is a carrot coming up soon, and a different way if not. Consider:
# Chapter 4 iterators
my $data = $iterator->();
if (carrot_coming_soon($iterator)) {
# X
} else {
# Y
}
sub carrot_coming_soon {
my $it = shift;
my $soon = shift || 3;
while ($soon-- > 0) {
my $next = $it->();
return 1 if is_carrot($next);
}
return; # No carrot
}
Well, this probably doesn't work, because the carrot_coming_soon()
function extracts and discards the upcoming data from the iterator,
including the carrot itself, and now that data is lost.One can build a rewindable iterator:
sub make_rewindable {
my $it = shift;
my @saved; # upcoming values in LIFO order
return sub {
my $action = shift || "next";
if ($action eq "put back") {
push @saved, @_;
} elsif ($action eq "next") {
if (@saved) { return pop @saved; }
else { return $it->(); }
}
};
}
But it's kind of a pain in the butt to use:
sub carrot_coming_soon {
my $it = shift;
my $soon = shift || 3;
my @saved;
my $saw_carrot;
while ($soon-- > 0) {
push @saved, $it->();
$saw_carrot = 1, last if is_carrot($saved[-1]);
}
$it->("put back", @saved);
return $saw_carrot;
}
Because you have to explicitly restore the data you extracted.With the streams, it's all much easier:
sub carrot_coming_soon {
my $s = shift;
my $soon = shift || 3;
while ($seen-- > 0) {
return 1 if is_carrot($s->head);
drop($s);
}
return;
}
The working version of carrot_coming_soon() for streams looks just like
the non-working version for iterators.But this version of carrot_coming_soon() only works for memoizing streams, or for streams whose promise functions are pure. Let's consider a counterexample:
my $bad = filehandle_stream(\*DATA);
sub filehandle_stream {
my $fh = shift;
return node(scalar <$fh>,
promise { filehandle_stream($fh) });
}
__DATA__
fish
dog
carrot
goat rectum
Now consider what happens if I do this:
$carrot_soon = carrot_coming_soon($bad);
print "A carrot appears soon after item ", head($bad), "\n"
if $carrot_soon;
It says "A carrot appears soon after item fish". Fine.
That's because $bad is a node whose head contains
"fish". Now let's see what's after the fish:
print "After ", head($bad), " is ", head(tail($bad)), "\n";
This should print After fish is dog, and for the memoizing
streams I used in the book, it does. But a non-memoizing stream will
print "After fish is goat rectum". Because
tail($bad) invokes the promise function, which, since the
next() was not saved after carrot_coming_soon()
examined it, builds a new node, which reads the next item from the
filehandle, which is "goat rectum".I wish I had explained the rewinding property of the streams in the book. It's one of the most significant omissions I know about. And I wish I'd appreciated sooner that the rewinding property only works if the tail() function autosaves the tail node returned from the promise.
[Other articles in category /prog/perl] permanent link Wed, 12 Nov 2008
Flag variables in Bourne shell programs
Suppose you want to set a flag variable, and then later you want to test it. You probably do something like this:
if some condition; then
IS_NAKED=1
fi
...
if [ "$IS_NAKED" == "1" ]; then
flag is set
else
flag is not set
fi
Or maybe you use ${IS_NAKED:-0} or some such
instead of "$IN_NAKED". Whatever.Today I invented a different technique. Try this on instead:
IS_NAKED=false
if some condition; then
IS_NAKED=true
fi
...
if $IS_NAKED; then
flag is set
else
flag is not set
fi
The arguments both for and against it seem to be obvious, so I won't
make them.I have never seen this done before, but, as I concluded and R.J.B. Signes independently agreed, it is obvious once you see it. [ Addendum 20090107: some followup notes ]
[Other articles in category /prog] permanent link Thu, 18 Sep 2008
Return return
There were two different pieces of code in this paper that wowed me. When I started this article, I was planning to write about #2. I decided that I would throw in a couple of paragraphs about #1 first, just to get it out of the way. This article is that couple of paragraphs. [ Addendum 20080917: Here's the article about #2. ] Suppose you have a type that represents terms over some type v of variable names. The v type is probably strings but could possibly be something else:
data Term v = TVar v -- Type variable | TInt -- Integer type | TString -- String type | Fun (Term v) (Term v) -- Function typeThere's a natural way to make the Term type constructor into an instance of Monad:
instance Monad Term where
return v = TVar v
TVar v >>= f = f v
TInt >>= f = TInt
TString >>= f = TString
Fun d r >>= f = Fun (d >>= f) (r >>= f)
That is, the return operation just lifts a variable name to
the term that consists of just that variable, and the bind
operation just maps its argument function over the variable names in
the term, leaving everything else alone.Jones wants to write a function, unify, which performs a unification algorithm over these terms. Unification answers the question of whether, given two terms, there is a third term that is an instance of both. For example, consider the two terms a → Int and String → b, which are represented by Fun (TVar "a") TInt and Fun TString (TVar "b"), respectively. These terms can be unified, since the term String → Int is an instance of both; one can assign a = TString and b = TInt to turn both terms into Fun TString TInt. The result of the unification algorithm should be a set of these bindings, in this example saying that the input terms can be unified by replacing the variable "a" with the term TString, and the variable "b" with the term TInt. This set of bindings can be represented by a function that takes a variable name and returns the term to which it should be bound. The function will have type v → Term v. For the example above, the result is a function which takes "a" and returns TString, and which takes "b" and returns TInt. What should this function do with variable names other than "a" and "b"? It should say that the variable named "c" is "replaced" by the term TVar "c", and similarly other variables. Given any other variable name x, it should say that the variable x is "replaced" by the term TVar x. The unify function will take two terms and return one of these substitutions, where the substition is a function of type v → Term v. So the unify function has type:
unify :: Term v → Term v → (v → Term v)
Oh, but not quite. Because unification can also fail. For example,
if you try to unify the terms a → b and Int,
represented by Fun (TVar "a") (TVar "b") and TInt
respectively, the unfication should fail, because there is no term
that is an instance of both of those; one represents a function and
the other represents an integer. So unify does not actually
return a substitution of type v → Term v. Rather, it
returns a monad value, which might contain a substitution, if the
unification is successful, and otherwise contains an error value. To handle
the example above, the unify function will contain a case
like this:
unify TInt (Fun _ _) = fail ("Cannot unify" ....)
It will fail because it is not possible to unify functions and
integers.If unification is successful, then instead of using fail, the unify function will construct a substitution and then return it with return. Let's consider the result of unifying TInt with TInt. This unification succeeds, and produces a trivial substitition with no bindings. Or more precisely, every variable x should be "replaced" by the term TVar x. So in this case the substitution returned by unify should be the trivial one, a function which takes x and returns TVar x for all variable names x. But we already have such a function. This is just what we decided that Term's return function should do, when we were making Term into a monad. So in this case the code for unify is:
unify TInt TInt = return returnYep, in this case the unify function returns the return function. Wheee! At this point in the paper I was skimming, but when I saw return return, I boggled. I went back and read it more carefully after that, you betcha. That's my couple of paragraphs. I was planning to get to this point and then say "But that's not what I was planning to discuss. What I really wanted to talk about was...". But I think I'll break with my usual practice and leave the other thing for tomorrow. Happy Diada Nacional de Catalunya, everyone! [ Addendum 20080917: Here's the article about the other thing. ]
[Other articles in category /prog] permanent link
data Mu f = In (f (Mu f))
data Mu f = In (f (Mu f)) -- (???)
I bet a bunch of people reading this on Planet Haskell are nodding and
saying "Oh, that!"When I first saw this I couldn't figure out what it was saying at all. It was totally opaque. I still have trouble recognizing in Haskell what tokens are types, what tokens are type constructors, and what tokens are value constructors. Code like (???) is unusually confusing in this regard. Normally, one sees something like this instead:
data Maybe f = Nothing | Just f
Here f is a type variable; that is, a variable that ranges over
types. Maybe is a type constructor, which is like a function
that you can apply to a type to get another type. The most familiar
example of a type constructor is List:
data List e = Nil | Cons e (List e)
Given any type f, you can apply the type constructor
List to f to get a new type List f.
For example, you can apply List to Int to get the
type List Int. (The Haskell built-in list type constructor
goes by the funny name of [], but works the same way. The
type [Int] is a synonym for ([] Int).)Actually, type names are type constructors also; they're argumentless type constructors. So we have type constructors like Int, which take no arguments, and type constructors like List, which take one argument. Haskell also has type constructors that take more than one argument. For example, Haskell has a standard type constructor called Either for making union types:
data Either a b = Left a | Right b;
Then the type Either Int String contains values like Left
37 and Right "Cotton Mather".To keep track of how many arguments a type constructor has, one can consider the, ahem, type, of the type constructor. But to avoid the obvious looming terminological confusion, the experts use the word "kind" to refer to the type of a type constructor. The kind of List is * → *, which means that it takes a type and gives you back a type. The kind of Either is * → * → *, which means that it takes two types and gives you back a type. Well, actually, it is curried, just like regular functions are, so that Either Int is itself a type constructor of kind * → * which takes a type a and returns a type which could be either an Int or an a. The nullary type constructor Int has kind *. Continuing the "Maybe" example above, f is a type, or a constructor of kind *, if you prefer. Just is a value constructor, of type f → Maybe f. It takes a value of type f and produces a value of type Maybe f. Now here is a crucial point. In declarations of type constructors, such as these:
data Either a b = ...
data List e = ...
data Maybe f = ...
the type variables a, b, e, and f actually
range over type constructors, not over types. Haskell can infer the
kinds of the type constructors Either, List, and
Maybe, and also the kinds of the type variables, from the
definitions on the right of the = signs. In this case, it
concludes that all four variables must have kind *, and so really do
represent types, and not higher-order type constructors. So you can't
ask for Either Int List because List is known to
have kind * → *, and Haskell needs a type constructor of kind * to
serve as an argument to Either. But with a different definition, Haskell might infer that a type variable has a higher-order kind. Here is a contrived example, which might be good for something, perhaps. I'm not sure:
data TyCon f = ValCon (f Int)
This defines a type constructor TyCon with kind (* → *) → *,
which can be applied to any type constuctor f that has kind *
→ *, to yield a type. What new type? The new type TyCon
f is isomorphic to the type f of Int. For
example, TyCon List is basically the same as List
Int. The value Just 37 has type Maybe Int,
and the value ValCon (Just 37) has type TyCon
Maybe.Similarly, the value [1, 2, 3] has type [Int], which, you remember, is a synonym for [] Int. And the value ValCon [1, 2, 3] has type TyCon []. Now that the jargon is laid out, let's look at (???) again:
data Mu f = In (f (Mu f)) -- (???)
When I was first trying to get my head around this, I had trouble
seeing what the values were going to be. It looks at first like it
has no bottom. The token f here, like in the TyCon
example, is a variable that ranges over type constructors with kind *
→ *, so could be List or Maybe or [],
something that takes a type and yields a new type. Mu itself
has kind (* → *) → *, taking something like f and yielding a
type. But what's an actual value? You need to apply the value
constructor In to a value of type f (Mu
f), and it's not immediately clear where to get such a
thing.I asked on #haskell, and Cale Gibbard explained it very clearly. To do anything useful you first have to fix f. Let's take f = Maybe. In that particular case, (???) becomes:
data Mu Maybe = In (Maybe (Mu Maybe))
So the In value constructor will take a value of type
Maybe (Mu Maybe) and return a value of type
Mu Maybe. Where do we get a value of type
Maybe (Mu Maybe)? Oh, no problem: the value Nothing
is polymorphic, and has type Maybe a for all a,
so in particular it has type Maybe (Mu Maybe). Whatever
Maybe (Mu Maybe) is, it is a Maybe-type, so it has a
Nothing value. So we do have something to get started
with.Since Nothing is a Maybe (Mu Maybe) value, we can apply the In constructor to it, yielding the value In Nothing, which has type Mu Maybe. Then applying Just, of type a → Maybe a, to In Nothing, of type Mu Maybe, produces Just (In Nothing), of type Maybe (Mu Maybe) again. We can repeat the process as much as we want and produce as many values of type Mu Maybe as we want; they look like these:
In Nothing
In (Just (In Nothing))
In (Just (In (Just (In Nothing))))
In (Just (In (Just (In (Just (In Nothing))))))
...
And that's it, that's the type Mu Maybe, the set of those
values. It will look a little simpler if we omit the In
markers, which don't really add much value. We can just agree to omit
them, or we can get rid of them in the code by defining some semantic
sugar:
nothing = In Nothing
just = In . Just
Then the values of Mu Maybe look like this:
nothing
just nothing
just (just nothing)
just (just (just nothing))
...
It becomes evident that what the Mu operator does is to close
the type under repeated application. This is analogous to the way the
fixpoint combinator works on values. Consider the usual definition of
the fixpoint combinator:
Y f = f (Y f)
Here f is a function of type a → a. Y f
is a fixed point of f. That is, it is a value x of type
a such that f x = x. (Put x = Y
f in the definition to see this.)The fixed point of a function f can be computed by considering the limit of the following sequence of values:
⊥ This actually finds the least fixed point of f, for a certain definition of "least". For many functions f, like x → x + 1, this finds the uninteresting fixed point ⊥, but for many f, like x → λ n. if n = 0 then 1 else n * x(n - 1), it's something better. Mu is analogous to Y. Instead of operating on a function f from values to values, and producing a single fixed-point value, it operates on a type constructor f from types to types, and produces a fixed-point type. The resulting type T is the least fixed point of the type constructor f, the smallest set of values such that f T = T. Consider the example of f = Maybe again. We want to find a type T such that T = Maybe T. Consider the following sequence:
{ ⊥ } The first item is the set that contains nothing but the bottom value, which we might call t0. But t0 is not a fixed point of Maybe, because Maybe { ⊥ } also contains Nothing. So Maybe { ⊥ } is a different type from t0, which we can call t1 = { Nothing, ⊥ }. The type t1 is not a fixed point of Maybe either, because Maybe t1 evidently contains both Nothing and Just Nothing. Repeating this process, we find that the limit of the sequence is the type Mu Maybe = { ⊥, Nothing, Just Nothing, Just (Just Nothing), Just (Just (Just Nothing)), ... }. This type is fixed under Maybe. It might be worth pointing out that this is not the only such fixed point, but is is the least fixed point. One can easily find larger types that are fixed under Maybe. For example, postulate a special value Q which has the property that Q = Just Q. Then Mu Maybe ∪ { Q } is also a fixed point of Maybe. But it's easy to see (and to show, by induction) that any such fixed point must be a superset of Mu Maybe. Further consideration of this point might take me off to co-induction, paraconsistent logic, Peter Aczel's nonstandard set theory, and I'd never get back again. So let's leave this for now. So that's what Mu really is: a fixed-point operator for type constructors. And having realized this, one can go back and look at the definition and see that oh, that's precisely what the definition says, how obvious:
Y f = f (Y f) -- ordinary fixed-point operator
data Mu f = In (f (Mu f)) -- (???)
Given f, a function from values to values, Y(f)
calculates a value x such that x = f(x).
Given f, a function from types to types, Mu(f) calculates
a type T such that f(T) = T. That's why the
definitions are identical. (Except for that annoying In
constructor, which really oughtn't to be there.)You can use this technique to construct various recursive datatypes. For example, Mu Maybe turns out to be equivalent to the following definition of the natural numbers:
data Number = Zero | Succ Number;
Notice the structural similarity with the definition of Maybe:
data Maybe a = Nothing | Just a;
One can similarly define lists:
data Mu f = In (f (Mu f))
data ListX a b = Nil | Cons a b deriving Show
type List a = Mu (ListX a)
-- syntactic sugar
nil :: List a
nil = In Nil
cons :: a → List a → List a
cons x y = In (Cons x y)
-- for example
ls = cons 3 (cons 4 (cons 5 nil)) -- :: List Integer
lt = (cons 'p' (cons 'y' (cons 'x' nil))) -- :: List Char
Or you could similarly do trees, or whatever. Why one might want to
do this is a totally separate article, which I am not going to write
today.Here's the point of today's article: I find it amazing that Haskell's type system is powerful enough to allow one to defined a fixed-point operator for functions over types. We've come a long way since FORTRAN, that's for sure. A couple of final, tangential notes: Google search for "Mu f = In (f (Mu f))" turns up relatively few hits, but each hit is extremely interesting. If you're trying to preload your laptop with good stuff to read on a plane ride, downloading these papers might be a good move. The Peter Aczel thing seems to be less well-known that it should be. It is a version of set theory that allows coinductive definitions of sets instead of inductive definitions. In particular, it allows one to have a set S = { S }, which standard set theory forbids. If you are interested in co-induction you should take a look at this. You can find a clear explanation of it in Barwise and Etchemendy's book The Liar (which I have read) and possibly also in Aczel's book Non Well-Founded Sets (which I haven't read).
[Other articles in category /prog] permanent link Sat, 12 Jul 2008
runN revisited
Check it out. Thank you, M. Crane.
[Other articles in category /prog] permanent link Tue, 17 Jun 2008
Defunctionalization and Java
Consider, for example, the following Haskell program:
-- Haskell
aux f = f 1 + f 10
res x = aux (λz -> z + x)
The defunctionalization of this example is:
-- Haskell
data Hold = HOLD Int
fake_apply (HOLD a) b = a + b
aux held = fake_apply held 1 + fake_apply held 10
res x = aux (HOLD x)
I hope this will make the idea clear.M. Knowles cites the paper Defunctionalization at work by Olivier Danvy and Lasse R. Nielsen, which was lots of fun. (My Haskell example above is a simplification of the example from page 5 of Danvy and Nielsen.) Among other things, Danvy and Nielsen point out that this defunctionalization transformation is in a certain sense dual to the transformation that turns ordinary data structures into λ-terms in Church encoding. Church encloding turns data items like pairs or booleans into higher-order functions; defunctionalization turns them back again. Section 1.4 of the Danvy and Nielsen paper lists a whole bunch of contexts in which this technique has been studied and used, but one thing I didn't think I saw there is that this is essentially the transformation that Java programmers use when they want to use closures. For example, suppose a Java programmer wants to write something like aux in:
-- Haskell
aux f = f 1 + f 10
res x = aux (λz -> z + x)
But they can't, because Java doesn't have closures.So instead, they do this:
/* Java */
class Hold {
private int a;
public Hold(int a) {
this.a = a;
}
public int fake_apply(int b) {
return this.a + b;
}
}
private static int aux(Hold h) {
return h.fake_apply(1) + h.fake_apply(10);
}
static int res(int x) {
Hold h = new Hold(x);
return aux(h);
}
Where the class Hold corresponds directly to the
data type Hold in the defunctionalized Haskell code.Here is a real example. Consider GNU Emacs. When I enter text-mode in Emacs, I want a bunch of subsystems to be notified. Emacs has a text-mode-hook variable, which is basically a list of functions, and when an Emacs buffer is put into text-mode, Emacs invokes the hooks. Any subsystem that wants to be notified puts its own hook function into that variable. If I wanted to accomplish something similar in Haskell or SML, I would similarly use a list of functions. In Java, the corresponding facility is called java.util.Observable. Were one implementing Emacs in Java (perish the thought!) the mode object would inherit from Observable, and so would provide an addObserver method for adding a hook to a list somewhere. When the mode was switched to text-mode, the mode object would call notifyObservers, which would loop over the hook list, calling the hooks. So far this is just like Emacs Lisp. But in Java the hooks are not functions, as they are in Emacs, because in Java functions are not first-class entities. Instead, the hooks are objects which conform to the Observer interface specification, and instead of invoking functions directly, the notifyObservers method calls the update method on each hook object. Here's another example. I wrote a recursive descent parser in Java a while back. An ActionParser is just like a Parser, except that if its parse succeeds, it invokes a callback. If I were programming in SML or Haskell or Perl, an ActionParser would be nothing but a Parser with an associated closure, something like this:
# Perl
package ActionParser;
sub new {
my ($class, $parser, $action) = @_;
bless { Parser => $parser,
Action => $action } => $class;
}
# Just like the embedded parser, but invoke the action on success
sub parse {
my $self = shift;
my $input = shift;
my $result = $self->{Parser}->parse($input);
if ($result->success)
$self->{Action}->($result); # Invoke action
}
return $result;
}
Here the Action member is expected to be a closure, which is
automatically invoked if the parse succeeds. To use this, I would
write something like this:
# Perl
my $missiles;
...
my $parser = ActionParser->new($otherParser,
sub { $missiles->launch() }
);
$parser->parse($input);
And then if the input parses correctly, the parser launches the
missiles from the anonymous closure, which has captured the local
$missiles object.But in Java, you have no closures. Instead, you defunctionalize, and represent closures with objects:
/* Java */
abstract class Action {
void invoke(ParseResults results) {}
}
class ActionParser extends Parser {
Action action;
Parser parser;
ActionParser(Parser p, Action a) {
action = a;
parser = p;
}
ParseResults Parse(Input input) {
ParseResults res = this.parser.Parse(input);
if (res.isSuccess) {
this.action.invoke(res);
}
return res;
}
}
To use this, one writes something like this:
/* Java */
class LaunchMissilesAction extends Action {
Missiles m;
LaunchMissilesAction(Missiles m) { this.m = m; }
void invoke(ParseResults results) {
m.launch();
}
}
...
Action a = new LaunchMissilesAction(missiles);
Parser p = new ActionParser(otherParser, a);
p.parse(input);
The constructor argument missiles takes the place of a free
variable in a closure. The closure itself has been replaced with
an object from an ad hoc class, just as in Danvy and Nielsen's
formulation, the closure is replaced with a synthetic data object that
holds the values of the free variables. The invoke method
plays the role of fake_apply.Now, it's not a particularly interesting observation that this can be done. The interesting part, I think, is that this is what Java programmers actually do. And also, perhaps, that Danvy and Nielsen didn't mention it in their paper, because I think the technique is pretty widespread.
[Other articles in category /prog] permanent link Fri, 30 May 2008
Glade
Glade lets you design a window interface, by positioning buttons and sliders and things, and then does something or other. At the time I didn't know what it would do, but I knew I could mock up the window I wanted, and I thought maybe I could screenshot the mockup for the blog article. The Glade thing was so easy to use that the easiest way to get a mockup of the dialog was to have Glade generate a complete, working windowing application, compile and run the application, and then screenshot the application. I got this done in about fifteen minutes. The application I made doesn't actually do anything, but it does compile, run, and pop up the dialog box I designed. I'm confident that I could get it to do something pretty easily, if I wanted. The auto-generated code, and some of the Glade controls, are very suggestive. I give Glade a big gold star. I went from having never heard of it to a working (although trivial) window application in one two-minute session and one fifteen-minute session. Maybe two big gold stars and a "Good work!" sticker. [ Addendum 20080530: I went ahead with making an application that actually does something. It worked. ]
[Other articles in category /prog] permanent link
More Glade
The outcome: big success. The application has a window with two input fields, a "+" button, and an output field that shows the sum of the input fields when you press the "+" button. It took about half an hour from start to finish, and the only thing I had to look up in the manual was the names of the functions that read and write the values of the text fields. Everything else I got through bricolage and tinkering with the autogenerated monkey code. The biggest problem that I encountered was that the application didn't exit when I clicked the close box, although the window disappeared. I figured out that the close box was sending a "delete" event and not a "destroy" event and fixed it up right quick. Gtk+ and Glade Interface Designer get at least two gold stars. Maybe three. Maybe fifty-three.
[Other articles in category /prog] permanent link Fri, 28 Mar 2008
Suffering from "make install"
After building DBI, for example, how do you get it to install itself into X/lib instead of the default system-wide location, which only the super-user has permission to modify? There are at least five solutions to this common problem. Uh-oh. If solution #1 had worked, people would not have needed to invent solution #2. If solution #2 had worked, people would not have needed to invent solution #3. Since there are five solutions, there is a good chance that none of them work. You can, I am informed:
Some of these items fail because they just plain fail. For example, the first thing everyone says is that you can just set PREFIX to X. No, because then the module Foo does not go into X/lib/Foo.pm. It goes into X/Foo/lib/perl5/site_perl/5.12.23/Foo.pm. Which means that if X does use lib 'X/lib'; it will not be able to find Foo.
It turns out there is a simple solution that doesn't involve travelling to New Jersey. The first thing you have to do is give up entirely on trying to use make install to install the modules. It is completely broken for this application, because even if the destination could somehow be forced to be what you wanted—and, after all, why would you expect that make install would let you configure the destination directory in a simple fashion?—it would still install not only the contents of MODULE/lib, but also the contents of MODULE/bin, MODULE/man, MODULE/share, MODULE/pus, MODULE/dork, MODULE/felch, and MODULE/scrotum, some of which you probably didn't want. So no. But the solution is actually simple. The normal module build process (as distinct from the install process) puts all this crap under MODULE/blib. The test suite is run against the blib installation. So the test programs have the same problem that X has. If they can find the stuff under blib, so can X, by replicating the layout under blib and then doing what the test suite does. In fact, the modules are installed into the proper subdirectories of MODULE/blib/lib. So the simple solution is just to build the module and then, instead of trying to get the installer to put the right stuff in the right place, use cp -pr MODULE/blib/lib/* X/lib. Problem solved. For modules with a shared library, you need to copy MODULE/blib/arch/auto/* into X/lib/auto also. I remember suffering over this at least ten years ago, when a student in a class I was teaching asked me how to do it and I let ExtUtils::MakeMaker make a monkey of me. I was amazed to find myself suffering over it once again. I am relieved to have found the right answer. This is one of those days when I am not happy with software. It sometimes surprises me how many of those days involve make. Dennis Ritchie once said that "make is like Pascal. Everybody likes it, so they go in and change it." I never really thought about this before, but it now occurs to me that probably Ritchie meant that they like make in about the same way that they like bladder stones. Because Dennis Ritchie probably does not like Pascal, and actually nobody else likes Pascal either. They may say they do, and they may even think they do, but if you look a little closer it always turns out that the thing they like is not actually Pascal, but some language that more or less resembles Pascal. Unfortunately, the changes people make to make tend to make it bigger and wartier, and this improves make about as much as it would improve a bladder stone. I would like to end this article on a positive note. If you haven't already, please read Recursive make Considered Harmful and be prepared to be blinded by the Glorious Truth therein.
[Other articles in category /prog] permanent link Fri, 21 Mar 2008
Closed file descriptors: the answer
my $command = shift;
for my $file (@ARGV) {
if ($file =~ /\.gz$/) {
my $fh;
unless (open $fh, "<", $file) {
warn "Couldn't open $file: $!; skipping\n";
next;
}
my $fd = fileno $fh;
$file = "/proc/self/fd/$fd";
}
}
exec $command, @ARGV;
die "Couldn't run command '$command': $!\n";
When the loop exits, $fh is out of scope, and the
filehandle it contains is garbage-collected, closing the file."Duh." Several people suggested that it was because open files are not preserved across an exec, or because the meaning of /proc/self would change after an exec, perhaps because the command was being run in a separate process; this is mistaken. There is only one process here. The exec call does not create a new process; it reuses the same one, and it does not affect open files, unless they have been flagged with FD_CLOEXEC. Abhijit Menon-Sen ran a slightly different test than I did:
% z cat foo.gz bar.gz
cat: /proc/self/fd/3: No such file or directory
cat: /proc/self/fd/3: No such file or directory
As he said, this makes it completely obvious what is wrong, since
the two files are both represented by the same file descriptor.
[Other articles in category /prog/perl] permanent link Thu, 20 Mar 2008
Closed file descriptors
my $command = shift;
for my $file (@ARGV) {
if ($file =~ /\.gz$/) {
my $fh;
unless (open $fh, "<", $file) {
warn "Couldn't open $file: $!; skipping\n";
next;
}
my $fd = fileno $fh;
$file = "/proc/self/fd/$fd";
}
}
exec $command, @ARGV;
die "Couldn't run command '$command': $!\n";
The idea here is that this program, called z, will preprocess
the arguments of some command, and then run the command
with the modified arguments. For some of the command-line arguments,
here the ones named *.gz, the original file will be replaced
by the output of some file descriptor. In the example above, the
descriptor is attached to the original file, which is pointless. But
once this part of the program was working, I planned to change the code
so that the descriptor would be
attached to a pipe instead.Having written something like this, I then ran a test, which failed:
% z cat foo.gz cat: /proc/self/fd/3: No such file or directory"Aha," I said instantly. "I know what is wrong. Perl set the close-on-exec flag on file descriptor 3." You see, after a successful exec, the kernel will automatically close all file descriptors that have the close-on-exec flag set, before the exec'ed image starts running. Perl normally sets the close-on-exec flag on all open files except for standard input, standard output, and standard error. Actually it sets it on all open files whose file descriptor is greater than the value of $^F, but $^F defaults to 2. So there is an easy fix for the problem: I just set $^F = 100000 at the top of the program. That is not the best solution, but it can be replaced with a better one once the program is working properly. Which I expected it would be:
% z cat foo.gz cat: /proc/self/fd/3: No such file or directoryHuh, something is still wrong. Maybe I misspelled /proc/self/fd? No, it is there, and contains the special files that I expected to find. Maybe $^F did not work the way I thought it did? I checked the manual, but it looked okay. Nevertheless I put in use Fcntl and used the fcntl function to remove the close-on-exec flags explicitly. The code to do that looks something like this:
use Fcntl;
....
my $flags = fcntl($fh, F_GETFD, 0);
fcntl($fh, F_SETFD, $flags & ~FD_CLOEXEC);
And try it again:
% z cat foo.gz cat: /proc/self/fd/3: No such file or directoryHuh. I then wasted a lot of time trying to figure out an easy way to tell if the file descriptor was actually open after the exec call. (The answer turns out to be something like this: perl -MPOSIX=fstat -le 'print "file descriptor 3 is ", fstat(3) ? "open" : "closed"'.) This told me whether the error from cat meant what I thought it meant. It did: descriptor 3 was indeed closed after the exec. Now your job is to figure out what is wrong. It took me a shockingly long time. No need to email me about it; I have it working now. I expect that you will figure it out faster than I did, but I will also post the answer on the blog tomorrow. Sometime on Friday, 21 March 2008, this link will start working and will point to the answer. [ Addendum 20080321: I posted the answer. ]
[Other articles in category /prog/perl] permanent link Fri, 14 Mar 2008
Drawing lines
As part of my job I had to produce the following display:
If you wanted to hear more about phylogeny, Java programming, or tree algorithms, you are about to be disappointed. The subject of my article today is those fat black lines. The first draft of the page did not have the fat black lines. It had some incredibly awful ASCII-art that was not even properly aligned. Really it was terrible; it would have been better to have left it out completely. I will not make you look at it. I needed the lines, so I popped down the "graphics" menu on my computer and looked for something suitable. I tried the Gimp first. It seems that the Gimp has no tool for drawing straight lines. If someone wants to claim that it does, I will not dispute the claim. The Gimp has a huge and complex control panel covered with all sorts of gizmos, and maybe one of those gizmos draws a straight line. I did not find one. I gave up after a few minutes. Next I tried Dia. It kept selecting the "move the line around on the page" tool when I thought I had selected the "draw another line" tool. The lines were not constrained to a grid by default, and there was no obvious way to tell it that I wanted to draw a diagram smaller than a whole page. I would have had to turn the thing into a bitmap and then crop the bitmap. "By Zeus's Beard," I cried, "does this have to be so difficult?" Except that the oath I actually uttered was somewhat coarser and less erudite than I have indicated. I won't repeat it, but it started with "fuck" and ended with "this". Here's what I did instead. I wrote a program that would read an input like this:
>-v-<
'-+-`
and produce a jpeg file that looks like this:
.---,
| >--,
'---` '-
Becomes this:
Now I know some of you are just itching to write to me and ask "why didn't you just use...?", so before you do that, let me remind you of two things. First, I had already wasted ten or fifteen minutes on "just use..." that didn't work. And second, this program only took twenty minutes to write. The program depends on one key insight, which is that it is very, very easy to write a Perl program that generates a graphic output in "PBM" ("portable bitmap") format. Here is a typical PBM file:
P1
10 10
1111111111
1000000001
1000000001
1001111001
1001111001
1001111001
1001111001
1000000001
1000000001
1111111111
The P1 is a magic number that identifies the file format; it
is always the same. The 10 10 warns the processor that the
upcoming bitmap is 10 pixels wide and 10 pixels high. The following
characters are the bitmap data.
I'm not going to insult you by showing the 10×10 bitmap image
that this represents.PBM was invented about twenty years ago by Jef Poskanzer. It was intended to be an interchange format: say you want to convert images from format X to format Y, but you don't have a converter. You might, however, have a converter that turns X into PBM and then one that turns PBM into Y. Or if not, it might not be too hard to produce such converters. It is, in the words of the Extreme Programming guys, the Simplest Thing that Could Possibly Work. There are also PGM (portable graymap) and PPM (portable pixmap) formats for grayscale and 24-bit color images as well. They are only fractionally more complicated. Because these formats are so very, very simple, they have been widely adopted. For example, the JPEG reference implementation includes a sample cjpeg program, for converting an input to a JPEG file. The input it expects is a PGM or PPM file. Writing a Perl program to generate a P?M file, and then feeding the output to pbmtoxbm or ppmtogif or cjpeg is a good trick, and I have used it many times. For example, I used this technique to generate a zillion little colored squares in this article about the Pólya-Burnside counting lemma. Sure, I could have drawn them one at a time by hand, and probably gone insane and run amuck with an axe immediately after, but the PPM technique was certainly much easier. It always wins big, and this time was no exception. The program may be interesting as an example of this technique, and possibly also as a reminder of something else. The Perl community luminaries invest a lot of effort in demonstrating that not every Perl program looks like a garbage heap, that Perl can be as bland and aseptic as Java, that Perl is not necessarily the language that most closely resembles quick-drying shit in a tube, from which you can squirt out the contents into any shape you want and get your complete, finished artifact in only twenty minutes and only slightly smelly. No, sorry, folks. Not everything we do is a brilliant, diamond-like jewel, polished to a luminous gloss with pages torn from one of Donald Knuth's books. This line-drawing program was squirted out of a tube, and a fine brown piece of engineering it is.
#!/usr/bin/perl
my ($S) = shift || 50;
$S here is "size". The default is to turn every character in
the input into a 50×50 pixel tile. Here's the previous example
with $S=10: ![]()
my ($h, $w);
my $output = [];
while (<>) {
chomp;
$w ||= length();
$h++;
push @$output, convert($_);
}
The biggest defect in the program is right here: it assumes that each
line will have the same width $w. Lines all must be
space-padded to the same width. Fixing this is left as an easy
exercise, but it wasn't as easy as padding the inputs, so I didn't do it.The magic happens here:
open STDOUT, "| pnmscale 1 | cjpeg" or die $!;
print "P1\n", $w * $S, " ", $h * $S, "\n";
print $_, "\n" for @$output;
exit;
The output is run through cjpeg to convert the PBM data to
JPEG. For some reason cjpeg doesn't accept PBM data, only
PGM or PPM, however, so the output first goes through
pnmscale, which resizes a P?M input. Here the scale factor
is 1, which is a no-op, except that pnmscale happens to turn
a PBM input into a PGM output. This is what is known in the business
as a "trick". (There is a pbmtopgm program, but it does
something different.)If we wanted gif output, we could have used "| ppmtogif" instead. If we wanted output in Symbolics Lisp Machine format, we could have used "| pgmtolispm" instead. Ah, the glories of interchange formats. I'm going to omit the details of convert, which just breaks each line into characters, calls convert_ch on each character, and assembles the results. (The complete source code is here if you want to see it anyway.) The business end of the program is convert_ch:
#
sub convert_ch {
my @rows;
my $ch = shift;
my $up = $ch =~ /[<|>^'`+]/i;
my $dn = $ch =~ /[<|>V.,+]/i;
my $lt = $ch =~ /[-<V^,`+]/i;
my $rt = $ch =~ /[->V^.'+]/i;
These last four variables record whether the tile has a line from its
center going up, down, left, or right respectively. For example,
"|" produces a tile with lines coming up and down from the
center, but not left or right. The /i in the regexes is
because I kept writing v instead of V in the
inputs.
my $top = int($S * 0.4);
my $mid = int($S * 0.2);
my $bot = int($S * 0.4);
The tile is divided into three bands, of the indicated widths. This
probably looks bad, or fails utterly, unless $S is a multiple
of 5. I haven't tried it. Do you think I care? Hint: I haven't tried
it.
my $v0 = "0" x $S;
my $v1 = "0" x $top . "1" x $mid . "0" x $bot;
push @rows, ($up ? $v1 : $v0) x $top;
This assembles the top portion of the tile, including the "up" line,
if there is one. Note that despite their names, $top also
determines the width of the left portion of the tile, and
$bot determines the width of the right portion. The letter
"v" here is for "vertical".Perhaps I should explain for the benefit of the readers of Planet Haskell (if any of them have read this far and not yet fainted with disgust) that "$a x $b" in Perl is like concat (replicate b a) in the better sorts of languages.
my $ls = $lt ? "1" : "0";
my $ms = ($lt || $rt || $up || $dn) ? "1" : "0";
my $rs = $rt ? "1" : "0";
push @rows, ($ls x $top . $ms x $mid . $rs x $bot) x $mid;
This assembles the middle section, including the "left" and "right"
lines.
push @rows, ($dn ? $v1 : $v0) x $bot;
This does the bottom section.
return @rows;
}
And we are done.
Nothing to it. Adding diagonal lines would be a fairly simple matter.Download the complete source code if you haven't seen enough yet. There is no part of this program of which I am proud. Rather, I am proud of the thing as a whole. It did the job I needed, and it did it by 5 PM. Larry Wall once said that "a Perl script is correct if it's halfway readable and gets the job done before your boss fires you." Thank you, Larry. No, that is not quite true. There is one line in this program that I'm proud of. I noticed after I finished that there is exactly one comment in this program, and it is blank. I don't know how that got in there, but I decided to leave it in. Who says program code can't be funny?
[Other articles in category /prog/perl] permanent link Thu, 24 Jan 2008
Emacs and alists
Yesterday I upgraded Emacs, and since it was an upgrade, something that had been working for me for fifteen years stopped working, because that's what "upgrade" means. My .emacs file contains:
(aput 'auto-mode-alist "\\.pl\\'" (function cperl-mode))
(aput 'auto-mode-alist "\\.t\\'" (function cperl-mode))
(aput 'auto-mode-alist "\\.cgi\\'" (function cperl-mode))
(aput 'auto-mode-alist "\\.pm\\'" (function cperl-mode))
(aput 'auto-mode-alist "\\.blog\\'" (function text-mode))
(aput 'auto-mode-alist "\\.sml\\'" (function sml-mode))
I should explain this, since I imagine that most readers of this blog
are like me in that they touch Emacs Lisp only once a year on Saint
Vibrissa's Day. An alist ("association list") is a common data
structure in Lisp programs. It is a list of pairs; the first element
of each pair is a key, and the second element is an associated value.
The pairs in the special auto-mode-alist variable have regexes
as their keys and functions as their values. Whenever Emacs opens a
new file, it scans this alist, until it finds a regex that matches the
name of the file. It then executes the associated function. Thus the
effect of the first line above is to have Emacs enable the
cperl-mode function on any file whose name ends in ".pl".
The aput function is for maintaining alists. It takes an alist, a key, and a value, scans the alist looking for a matching key, and then if it finds it, it amends the corresponding value. Otherwise, it appends a new association onto the front of the alist. When I upgraded emacs, this broke. The aput function was moved into a separate package, which I now had to load with (require 'assoc). I asked about this on IRC, and was told that the correct way to do this, if I did not want to (require 'assoc), was to use the following abomination:
(mapc (lambda (x) (when (eq 'perl-mode (cdr x)) (setcdr x 'cperl-mode)))
(append auto-mode-alist interpreter-mode-alist))
The effect of this is to scan over auto-mode-alist (and also
interpreter-mode-alist, a related variable) looking for any
association whose value was the perl-mode function, and
using setcdr to replace perl-mode with
cperl-mode.(This does not address the issue of what to do with .t files or .blog files, for which no association exists yet, presumably, but I did not ask about those specifically on IRC.) I was totally boggled. Choosing the right editing mode for a file is a basic function of emacs. I could not believe that the best and simplest way to add or change associations was to use mapc lambda gobhorn oleo potatopudding quote potrzebie. I was assured that this was indeed the only correct method. Struck almost speechless, I managed to come up with "Bullshit." Apparently the issue was that if auto-mode-alist already contains an association for ".pl", there is no guarantee that my new association will be found and preferred to the old one, unless I somehow remove the old one, or edit it to be the way I want. This seemed very unlikely to me. You see, an alist is a list. This means that it is searched from head to tail, because this is the only way a list can be searched. So in particular, if you cons a second association to the front of the list, which has the same key as a later (older) association, the search will find the new one first, and the older one becomes inoperative. I asked if there was not a guarantee that the alist would be searched from front to back. I was told that there is not. I looked in the manual, and reported that the assoc function, which is the getter that corresponds to aput, taking an alist and a key, and returning the corresponding value, is expressly guaranteed to return the first matching item. I was told that there was no guarantee that assoc would be used. I pondered the manual some more and found this passage:
However, association lists have their own advantages. Depending on your application, it may be faster to add an association to the front of an association list than to update a property.That is, it is expressly endorsing the technique of adding a new item to the front of an alist in order to override any later item that might have the same key. After finding that the add-to-the-front technique really did work, I reasoned that if someday Emacs stopped searching alists sequentially, I would not be in any more trouble than I had been today when they removed the aput function. So I did not take the advice I was given. Instead, I left it pretty much the way it was. I did take the opportunity to clean up the code a bit:
(push '("\\.pl\\'" . cperl-mode) auto-mode-alist)
(push '("\\.t\\'" . cperl-mode) auto-mode-alist)
(push '("\\.cgi\\'" . cperl-mode) auto-mode-alist)
(push '("\\.pm\\'" . cperl-mode) auto-mode-alist)
(push '("\\.blog\\'" . text-mode) auto-mode-alist)
(push '("\\.sml\\'" . sml-mode) auto-mode-alist)
The push function simply appends an element to the front of a
list, modifying the list in-place.But wow, the advice I got was phenomenally bad. It was bad in a really interesting way, too. It reminded me of the advice people get on the #math channel, where some guy comes in with some question about triangles and gets the category-theoretic viewpoint on triangles as natural transformations of something or other. The advice was bad because although it was correct, it was completely devoid of common sense. [ Addendum 20080124: It has been brought to my attention that the Emacs FAQ endorses my solution, which makes the category-theoretic advice proposed by the #emacs blockheads even less defensible. ] [ Addendum 20080201: Steve Vinoski suggests replacing the aput function. ]
[Other articles in category /prog] permanent link Fri, 11 Jan 2008
Help, help!
Przemek Klosowski wrote to offer me physics help, and also to ask about introspection on Perl objects. Specifically, he said that if you called a nonexistent method on a TCL object, the error message would include the names of all the methods that would have worked. He wanted to know if there was a way to get Perl to do something similar. There isn't, precisely, because Perl has only a conventional distinction between methods and subroutines, and you Just Have To Know which is which, and avoid calling the subroutines as methods, because the Perl interpreter has no idea which is which. But it does have enough introspection features that you can get something like what you want. This article will explain how to do that. Here is a trivial program that invokes an undefined method on an object:
use YAML;
my $obj = YAML->new;
$obj->nosuchmethod;
When run, this produces the fatal error:
Can't locate object method "nosuchmethod" via package "YAML" at test.pl line 4.
(YAML in this article is just an example; you don't have to
know what it does. In fact, I don't know what it does.)Now consider the following program instead:
use YAML;
use Help 'YAML';
my $obj = YAML->new;
$obj->nosuchmethod;
Now any failed method calls to YAML objects, or objects of
YAML's subclasses, will produce a more detailed error
message:
Unknown method 'nosuchmethod' called on object of class YAML
Perhaps try:
Bless
Blessed
Dump
DumpFile
Load
LoadFile
VALUE
XXX
as_heavy (inherited from Exporter)
die (inherited from YAML::Base)
dumper_class
dumper_object
export (inherited from Exporter)
export_fail (inherited from Exporter)
export_ok_tags (inherited from Exporter)
export_tags (inherited from Exporter)
export_to_level (inherited from Exporter)
field
freeze
global_object
import (inherited from Exporter)
init_action_object
loader_class
loader_object
new (inherited from YAML::Base)
node_info (inherited from YAML::Base)
require_version (inherited from Exporter)
thaw
warn (inherited from YAML::Base)
ynode
Aborting at test.pl line 5
Some of the methods in this list are bogus. For example, the stuff
inherited from Exporter should almost certainly not be
called on a YAML object.Some of the items may be intended to be called as functions, and not as methods. Some may be functions imported from some other module. A common offender here is Carp, which places a carp function into another module's namespace; this function will show up in a list like the one above, without even an "inherited from" note, even though it is not a method and it does not make sense to call it on an object at all. Even when the items in the list really are methods, they may be undocumented, internal-use-only methods, and may disappear in future versions of the YAML module. But even with all these warnings, Help is at least a partial solution to the problem. The real reason for this article is to present the code for Help.pm, not because the module is so intrinsically useful itself, but because it is almost a catalog of weird-but-useful Perl module hackery techniques. A full and detailed tour of this module's 30 lines of code would probably make a decent 60- or 90-minute class for intermediate Perl programmers who want to become wizards. (I have given many classes on exactly that topic.) Here's the code:
package Help;
use Carp 'croak';
sub import {
my ($selfclass, @classes) = @_;
for my $class (@classes) {
push @{"$class\::ISA"}, $selfclass;
}
}
sub AUTOLOAD {
my ($bottom_class, $method) = $AUTOLOAD =~ /(.*)::(.*)/;
my %known_method;
my @classes = ($bottom_class);
while (@classes) {
my $class = shift @classes;
next if $class eq __PACKAGE__;
unshift @classes, @{"$class\::ISA"};
for my $name (keys %{"$class\::"}) {
next unless defined &{"$class\::$name"};
$known_method{$name} ||= $class;
}
}
warn "Unknown method '$method' called on object of class $bottom_class\n";
warn "Perhaps try:\n";
for my $name (sort keys %known_method) {
warn " $name " .
($known_method{$name} eq $bottom_class
? ""
: "(inherited from $known_method{$name})") .
"\n";
}
croak "Aborting";
}
sub help {
$AUTOLOAD = ref($_[0]) . '::(none)';
goto &AUTOLOAD;
}
sub DESTROY {}
1;
use Help 'Foo'When any part of the program invokes use Help 'Foo', this does two things. First, it locates Help.pm, loads it in, and compiles it, if that has not been done already. And then it immediately calls Help->import('Foo').Typically, a module's import method is inherited from Exporter, which gets control at this point and arranges to make some of the module's functions available in the caller's namespace. So, for example, when you invoke use YAML 'freeze' in your module, Exporter's import method gets control and puts YAML's "freeze" function into your module's namespace. But that is not what we are doing here. Instead, Help has its own import method:
sub import {
my ($selfclass, @classes) = @_;
for my $class (@classes) {
push @{"$class\::ISA"}, $selfclass;
}
}
The $selfclass variable becomes Help and @classes
becomes ('Foo'). Then the module does its first tricky
thing. It puts itself into the @ISA list of another class.
The push line adds Help to @Foo::ISA.@Foo::ISA is the array that is searched whenever a method call on a Foo objects fails because the method doesn't exist. Perl will search the classes named in @Foo::ISA, in order. It will search the Help class last. That's important, because we don't want Help to interfere with Foo's ordinary inheritance. Notice the way the variable name Foo::ISA is generated dynamically by concatenating the value of $class with the literal string ::ISA. This is how you access a variable whose name is not known at compile time in Perl. We will see this technique over and over again in this module. The backslash in @{"$class\::ISA"} is necessary, because if we wrote @{"$class::ISA"} instead, Perl would try to interpolate the value of $ISA variable from the package named class. We could get around this by writing something like @{$class . '::ISA'}, but the backslash is easier to read.
AUTOLOADSo what happens when the program calls $foo->nosuchmethod? If one of Foo's base classes includes a method with that name, it will be called as usual.But when method search fails, Perl doesn't give up right away. Instead, it tries the method search a second time, this time looking for a method named AUTOLOAD. If it finds one, it calls it. It only throws an exception of there is no AUTOLOAD. The Help class doesn't have a nosuchmethod method either, but it does have AUTOLOAD. If Foo or one of its other parent classes defines an AUTOLOAD, one of those will be called instead. But if there's no other AUTOLOAD, then Help's AUTOLOAD will be called as a last resort.
$AUTOLOADWhen Perl calls an AUTOLOAD function, it sets the value of $AUTOLOAD to include the full name of the method it was trying to call, the one that didn't exist. In our example, $AUTOLOAD is set to "Foo::nosuchmethod".This pattern match dismantles the contents of $AUTOLOAD into a class name and a method name:
sub AUTOLOAD {
my ($bottom_class, $method) = $AUTOLOAD =~ /(.*)::(.*)/;
The $bottom_class variable contains Foo, and the
$method variable contains nosuchmethod.The AUTOLOAD function is now going to accumulate a table of all the methods that could have been called on the target object, print out a report, and throw a fatal exception. The accumulated table will reside in the private hash %known_method. Keys in this hash will be method names. Values will be the classes in which the names were found.
Accumulating the table of method namesThe AUTOLOAD function accumulates this hash by doing a depth-first search on the @ISA tree, just like Perl's method resolution does internally. The @classes variable is a stack of classes that need to be searched for methods but that have not yet been searched. Initially, it includes only the class on which the method was actually called, Foo in this case:
my @classes = ($bottom_class);
As long as some class remains unsearched, this loop will continue to
look for more methods. It begins by grabbing the next class off the
stack:
while (@classes) {
my $class = shift @classes;
Foo inherits from Help too, but we don't want our
error message to mention that, so the search skips Help:
next if $class eq __PACKAGE__;
(__PACKAGE__ expands at compile time to the name of the
current package.)Before the loop actually looks at the methods in the current class it's searching, it looks to see if the class has any base classes. If there are any, it pushes them onto the stack to be searched next:
unshift @classes, @{"$class\::ISA"};
Now the real meat of the loop: there is a class name in
$class, say Foo,
and we want the program to find all the methods in that class. Perl
makes the symbol table for the Foo package available in the
hash %Foo::. Keys in this hash are variable, subroutine, and
filehandle names.To find out if a name denotes a subroutine, we use defined(&{subroutine_name}) for each name in the package symbol table. If there is a subroutine by that name, the program inserts it and the class name into %known_method. Otherwise, the name is a variable or filehandle name and is ignored:
for my $name (keys %{"$class\::"}) {
next unless defined &{"$class\::$name"};
$known_method{$name} ||= $class;
}
}
The ||= sets a new value for $name in the hash only
if there was not one already. If a method name appears in more than
one class, it is recorded as being in the first one found in the
search. Since the search is proceeding in the same order that Perl
uses, the one recorded is the one that Perl will actually find. For
example, if Foo inherits from Bar, and both classes
define a this method, the search will find Foo::this
before Bar::this, and that is what will be recorded in the
hash. This is correct, because Foo's this method
overrides Bar's.If you have any clever techniques for identifying other stuff that should be omitted from the output, this is where you would put them. For example, many authors use the convention that functions whose names have a leading underscore are private to the implementation, and should not be called by outsiders. We might omit such items from the output by adding a line here:
next if $name =~ /^_/;
After the loop finishes searching all the base classes, the
%known_method hash looks something like this:
(
this => Foo,
that => Foo,
new => Base,
blookus => Mixin::Blookus,
other => Foo
)
This means that methods this, that, and
other were defined in Foo itself, but that
new is inherited from Base and that blookus
was inherited from Mixin::Blookus.
Printing the reportThe AUTOLOAD function then prints out some error messages:
warn "Unknown method '$method' called on object of class $bottom_class\n";
warn "Perhaps try:\n";
And at last the payoff: It prints out the list of methods that the
programmer could have called:
for my $name (sort keys %known_method) {
warn " $name " .
($known_method{$name} eq $bottom_class
? ""
: "(inherited from $known_method{$name})") .
"\n";
}
croak "Aborting";
}
Each method name is printed. If the class in which the method was
found is not the bottom class, the name is annotated with the message
(inherited from wherever).The output for my example would look like this:
Unknown method 'nosuchmethod' called on object of class Foo:
Perhaps try:
blookus (inherited from Mixin::Blookus)
new (inherited from Base)
other
that
this
Aborting at YourErroneousModule.pm line 679
Finally the function throws a fatal exception. If we had used
die here, the fatal error message would look like
Aborting at Help.pm line 34, which is extremely unhelpful.
Using croak instead of die makes the message look
like Aborting at test.pl line 5 instead. That is, it reports
the error as coming from the place where the erroneous method was
actually called.
Synthetic calls
Suppose you want to force the help message to come out. One way is to
call $object->fgsfds, since probably the object does not
provide a fgsfds method. But this is ugly, and it might not
work, because the object might provide a fgsfds
method. So Help.pm provides another way.You can always force the help message by calling $object->Help::help. This calls a method named help, and it starts the inheritance search in the Help package. Control is transferred to the following help method:
sub help {
$AUTOLOAD = ref($_[0]) . '::(none)';
goto &AUTOLOAD;
}
The Help::help method sets up a fake $AUTOLOAD
value and then uses "magic goto" to transfer control to the real
AUTOLOAD function. "Magic goto" is not the evil bad goto
that is Considered Harmful. It is more like a function call. But
unlike a regular function call, it erases the calling function
(help) from the control stack, so that to subsequently
executed code it appears that AUTOLOAD was called directly in
the first place.Calling AUTOLOAD in the normal way, without goto, would have worked also. I did it this way just to be a fusspot.
DESTROYWhenever a Perl object is destroyed, its DESTROY method is called, if it has one. If not, method search looks for an AUTOLOAD method, if there is one, as usual. If this lookup fails, no fatal exception is thrown; the object is sliently destroyed and execution continues.It is very common for objects to lack a DESTROY method; usually nothing additional needs to be done when the object's lifetime is over. But we do not want the Help::AUTOLOAD function to be invoked automatically whenever such an object is destroyed! So Help defines a last-resort DESTROY method that is called instead; this prevents Perl from trying the AUTOLOAD search when an object with no DESTROY method is destroyed:
sub DESTROY {}
This DESTROY method restores the default
behavior, which is to do nothing.
Living dangerouslyPerl has a special package, called UNIVERSAL. Every class inherits from UNIVERSAL. If you want to apply Help to every class at once, you can try:
use Help 'UNIVERSAL';
but don't blame me if something weird happens.
About use strictWhenever I present code like this, I always get questions (or are they complaints?) from readers about why I omitted "use strict". "Always use strict!" they say.Well, this code will not run with "use strict". It does a lot of stuff on purpose that "strict" was put in specifically to keep you from doing by accident. At some point you have to take off the training wheels, kiddies.
LicenseCode in this article is hereby placed in the public domain.Share and enjoy.
[Other articles in category /prog/perl] permanent link Tue, 08 Jan 2008
Clubbing someone to death with a loaded Uzi
foreach $k (keys %in) {
if ($k eq q1) {
if ($in{$k} eq agree) {
$count{q10} = $count{q10} + 1;
}
if ($in{$k} eq disaagree) {
$count{q11} = $count{q11} + 1;
}
}
if ($k eq q2) {
@q2split = split(/\0/, $in{$k});
foreach (@q2split) {
$count{$_} = $count{$_} + 1;
}
}
if ($k eq q3) {
$count{$in{$k}} = $count{$in{$k}} + 1;
}
...
}
There is a lot wrong with this code, but it's all trivial compared
with the one big problem, which is the wholly unnecessary loop and
tests. The whole thing could be (and should be, and was) rewritten
as:
if ($in{q1} eq agree) {
$count{q10} = $count{q10} + 1;
}
if ($in{q1} eq disaagree) {
$count{q11} = $count{q11} + 1;
}
@q2split = split(/\0/, $in{q2});
foreach (@q2split) {
$count{$_} = $count{$_} + 1;
}
$count{$in{q3}} = $count{$in{q3}} + 1;
...
After which one could start addressing the smaller problems, like the
fact that "disagree" is misspelled.This is the sort of mistake you expect from an intern. I chuckled and corrected him. But I've seen it several times since from non-interns. Here's another example. I am not making this up. Whether it's more or less odious than the intern code is up to you to decide:
foreach $location_name (%LOCATION ) {
$location_code = $LOCATION{$location_name};
if ($location_name eq $location ) {
printf FILE "$location_code\,";
printf FILE "%4s", "$min3\,";
printf FILE "%4s", "$max3\,";
printf FILE "%1s", "$wx3\n";
}
}
It could have been written like this:
printf FILE "$LOCATION{$location}\,";
printf FILE "%4s", "$min3\,";
printf FILE "%4s", "$max3\,";
printf FILE "%1s", "$wx3\n";
I started using this problem as an interview question. I'll present
the subject with trivial code like this:
for my $k (keys %hash) {
if ($k eq "name") {
$hash{$k}++;
}
}
and then ask if they have any comments about it. One nice thing about
the question is that it translates naturally into whatever imperative
language they claim expertise in.It's appalling how many supposedly professional programmers see nothing wrong here. They squint at the code, and say "I think you need parentheses around %hash there", or they criticize the choice of variable names. I first used this as an interview question because the Python code sample submitted by a job applicant contained an example of it. "Weird," I thought, "but maybe she's outgrown that." Since she claimed to be an expert Perl user, I asked her about it in Perl, using code like the example above. After she made a syntactic suggestion, I said "It's not a syntax problem, and it's not a trick question." She criticized the syntax some more. Finally I told her the answer: "Couldn't you just use $hash{name}++?" "Oh, yeah, I guess so," she said. A few minutes later we were going over her Python code sample and I pointed out the place where she had done the exact same thing, and asked if she was happy with that loop and wanted to change it. No, she thought it was just fine. "Doesn't this look like the example I showed you on the whiteboard a little while ago?" "Oh, I guess it does." We didn't hire her. Larry Wall once said that iterating over the keys of a hash is like clubbing someone to death with a loaded Uzi. I had already realized that you could, in principle, commit this error with a regular array instead of with a hash, but I had never seen an example until today's episode of the Daily WTF. The Daily WTF code is so awful, all the way through, that I was afraid that people might miss this slightly-more subtle gem lurking in the middle, and that was what motivated me to write this article in the first place. Here's the gem:
// Java
for (int a=1;a<=params.size();a++) switch (a)
{
case 1 : if (params.get(0) != null)
this.one=params.get(0).toString();
break;
case 2 : if (params.get(1) != null)
this.two=params.get(1).toString();
break;
...
case 14 : if (params.get(13) != null)
this.fourteen=params.get(13).toString();
break;
}
}
Wow, that is just, uh, stunning.[ Addendum 20080201: A bit more. ] [ Addendum 20090213: A counterexample. ]
[Other articles in category /prog] permanent link Thu, 03 Jan 2008
Note on point-free programming style
grep '^X-Spam-Level' | sort | uniq | wc -l
and the analogous Haskell code:
length . nub . sort . filter (isPrefixOf "X-Spam-Level")
Neither one explicitly mentions its argument, which is why this is
"point-free". In "point-free" programming, instead of defining a
function in terms of its effect on its arguments, one defines it by
composing the component functions themselves, directly, with
higher-order operators. For example, instead of:
foo x y = 2 * x + yone has, in point-free style:
foo = (+) . (2 *)where (2 *) is the function that doubles its argument, and (+) is the (curried) addition function. The two definitions of foo are entirely equivalent. As the two examples should make clear, point-free style is sometimes natural, and sometimes not, and the example chosen by M. Lai was carefully selected to bias the argument in favor of point-free style. Often, after writing a function in pointful style, I get the computer to convert it automatically to point-free style, just to see what it looks like. This is usually educational, and sometimes I use the computed point-free definition instead. As I get better at understanding point-free programming style in Haskell, I am more and more likely to write certain functions point-free in the first place. For example, I recently wrote:
soln = int 1 (srt (add one (neg (sqr soln))))
and then scratched my head, erased it, and replaced it with the equivalent:
soln = int 1 ((srt . (add one) . neg . sqr) soln)
I could have factored out the int 1 too:
soln = (int 1 . srt . add one . neg . sqr) soln
I could even have removed soln from the right-hand side:
soln = fix (int 1 . srt . add one . neg . sqr)
but I am not yet a perfect sage.Sometimes I opt for an intermediate form, one in which some of the arguments are explicit and some are implicit. For example, as an exercise I wrote a function numOccurrences which takes a value and a list and counts the number of times the value occurs in the list. A straightforward and conventional implementation is:
numOccurrences x [] = 0
numOccurrences x (y:ys) =
if (x == y) then 1 + rest
else rest
where rest = numOccurrences x ys
but the partially point-free version I wrote was much better:
numOccurrences x = length . filter (== x)
Once you see this, it's easy to go back to a fully pointful
version:
numOccurrences x y = length (filter (== x) y)
Or you can go the other way, to a point-free version:
numOccurrences = (length .) . filter . (==)
which I find confusing.Anyway, the point of this note is not to argue that the point-free style is better or worse than the pointful style. Sometimes I use the one, and sometimes the other. I just want to point out that the argument made by M. Lai is deceptive, because of the choice of examples. As an equally biased counterexample, consider:
bar x = x*x + 2*x + 1
which the automatic converter informs me can be written in point-free
style as:
bar = (1 +) . ap ((+) . join (*)) (2 *)
Perusal of this example will reveal much to the attentive reader,
including the definitions of join and ap. But I
don't think many people would argue that it is an improvement on the
original. (Maybe I'm wrong, and people would argue that it was an
improvement. I won't know for sure until I have more experience.)For some sort of balance, here is another example where I think the point-free version is at least as good as the pointful version: a recent comment on Reddit suggested a >>> operator that composes functions just like the . operator, but in the other order, so that:
f >>> g = g . f
or, if you prefer:
>>> f g x = g(f(x))
The point-free definition of >>> is:
(>>>) = flip (.)
where the flip operator takes a function of two arguments and
makes a new function that does the same thing, but with the arguments
in the opposite order. Whatever your feelings about point-free style,
it is undeniable that the point-free definition makes perfectly clear
that >>> is nothing but . with its arguments in
reverse order.
[Other articles in category /prog/haskell] permanent link Sun, 30 Dec 2007
Welcome to my ~/bin
So I went to fix that, and then I couldn't decide which sample slides to show. And I haven't given the tutorial for a couple of years, and I have an upcoming project that will prevent me from giving it for another couple of years. Eh, figuring out what to put online is more trouble than it's worth. I decided it would be a lot less toil to just put the whole thing online. The materials are copyright © 2004 Mark Jason Dominus, and are not under any sort of free license. But please enjoy them anyway. I think the title is an accidental ripoff of an earlier class by Damian Conway. I totally forgot that he had done a class on the same subject, and I think he used the same title. But that just makes us even, because for the past few years he has been making money going around giving talks on "Conference Presentation Aikido", which is a blatant (and deliberate) ripoff of my 2002 Perl conference talk on Conference Presentation Judo. So I don't feel as bad as I might have. Welcome to my ~/bin complete slides and other materials. I hereby wish you a happy new year, unless you don't want one, in which case I wish you a crappy new year instead.
[Other articles in category /prog/perl] permanent link Thu, 20 Dec 2007
Another trivial utility: accumulate
k1 v1
k1 v2
k2 v3
k1 v4
k2 v5
k3 v6
and writes it out in this format:
k1 v1 v2 v4
k2 v3 v5
k3 v6
I wanted it this time because I had a bunch of files that included some
duplicates, and wanted to get rid of the duplicates. So:
md5sum * | accumulate | perl -lane 'unlink @F[2..$#F]'
(Incidentally, people sometimes argue that Perl's .. operator
should count backwards when the left operand exceeds the right one.
These people are wrong. There is only one argument that needs to be
made to refute this idea; maybe it is the only argument that
can be made. And examples of it abound. The code above is one
such example.)I'm afraid of insulting you by showing the source code for accumulate, because of course it is so very trivial, and you could write it in five minutes, as I did. But who knows; maybe seeing the source has some value:
#!/usr/bin/perl
use Getopt::Std;
my %opt = (k => 1, v => 2);
getopts('k:v:', \%opt) or usage();
for (qw(k v)) {
$opt{$_} -= 1 if $opt{$_} > 0;
}
while (<>) {
chomp;
my @F = split;
push @{$K{$F[$opt{k}]}}, $F[$opt{v}];
}
for my $k (keys %K) {
print "$k @{$K{$k}}\n";
}
It's tempting to add a -F option to tell it that the input is
not delimited by white space, or an option to change the output
format, or blah blah blah, but I managed to restrain myself, mostly.Several years ago I wrote a conference tutorial about the contents of my ~/bin directory. The clearest conclusion that transpired from my analysis was that the utilities I write have too many features that I don't use. The second-clearest was that I waste too much time writing custom argument-parsing code instead of using Getopt::Std. I've tried to learn from this. One thing I found later is that a good way to sublimate the urge to put in some feature is to put in the option to enable it, and to document it, but to leave the feature itself unimplemented. This might work for you too if you have the same problem. I did put in -k and -v options to control which input columns are accumulated. These default to the first and second columns, naturally. Maybe this was a waste of time, since it occurs to me now that accumulate -k k -v v could be replaced by cut -fk,v | accumulate, if only cut didn't suck quite so badly. Of course one could use awk {print "$k $v" } | accumulate to escape cut's suckage. And some solution of this type obviates the need for accumulate's putative -F option also. Well, I digress. The accumulate program itself reminds me of a much more ambitious project I worked on for a while between 1998 and 2001, as does the yucky line:
push @{$K{$F[$opt{k}]}}, $F[$opt{v}];
The ambitious project was tentatively named "twingler".Beginning Perl programmers often have trouble with compound data structures because Perl's syntax for the nested structures is so horrendous. Suppose, for example, that you have a reference to a two-dimensional array $aref, and you want to produce a hash, such that each value in the array appears as a key in the hash, associated with a list of strings in the form "m,n" indicating where in the array that value appeared. Well, of course it is obviously nothing more than:
for my $a1 (0 .. $#$aref) {
for my $a2 (0 .. $#{$aref->[$a1]}) {
push @{$hash{$aref->[$a1][$a2]}}, "$a1,$a2";
}
}
Obviously. <sarcasm>Geez, a child could see
that.</sarcasm>The idea of twingler was that you would specify the transformation you wanted declaratively, and it would then write the appropriate Perl code to perform the transformation. The interesting part of this project is figuring out the language for specifying the transformation. It must be complex enough to be able to express most of the interesting transformations that people commonly want, but if it isn't at the same time much simpler than Perl itself, it isn't worth using. Nobody will see any point in learning a new declarative language for expressing Perl data transformations unless it is itself simpler to use than just writing the Perl would have been. There are some hard problems here: What do people need? What subset of this can be expressed simply? How can we design a simple, limited language that people can use to express their needs? Can the language actually be compiled to Perl? I had to face similar sorts of problems when I was writing linogram, but in the case of linogram I was more successful. I tinkered with twingler for some time and made several pages of (typed) notes but never came up with anything I was really happy with. At one point I abandoned the idea of a declarative language, in favor of just having the program take a sample input and a corresponding sample output, and deduce the appropriate transformation from there. For example, you would put in:
[ [ A, B ],
[ C, B ],
[ D, E ] ]
and
{ B => [A, C],
E => [D],
}
and it would generate:
for my $a1 (@$input) {
my ($e1, $e2) = @$a1;
push @{$output{$e2}}, $e1;
}
And then presumably you could eyeball this, and if what you really
wanted was @{$a1}[0, -1] instead of @$a1 you could
tinker it into the form you needed without too much extra trouble.
This is much nicer from a user-experience point of view, but at the
same time it seems more difficult to implement. I had some ideas. One idea was to have it generate a bunch of expressions for mapping single elements from the input to the output, and then to try to unify those expressions. But as I said, I never did figure it out. It's a shame, because it would have been pretty cool if I had gotten it to work. The MIT CS grad students' handbook used to say something about how you always need to have several projects going on at once, because two-thirds of all research projects end in failure. The people you see who seem to have one success after another actually have three projects going on all the time, and you only see the successes. This is a nice example of that.
[Other articles in category /prog] permanent link Mon, 29 Oct 2007
Undefined behavior in Perl and other languages
Undefined behaviorFor people unfamiliar with this concept, I should explain briefly. The C standard is full of places that say "if the program contains x, the behavior is undefined", which really means "C programs do not contain x, so If the program contains x, it is not written in C, and, as this standard only defines the meaning of programs in C, it has nothing to say about the meaning of your program." There are around a couple of hundred of these phrases, and a larger number of places where it is implied.For example, everyone knows that it means when you write x = 4;, but what does it mean if you write 4 = x;? According to clause 6.3.2.1[#1], it means nothing, and this is not a C program. The non-guarantee in this case is extremely strong. The C compiler, upon encountering this locution, is allowed to abort and spontaneously erase all your files, and in doing so it is not violating the requirements of the standard, because the standard does not require any particular behavior in this case. The memorable phrase that the comp.lang.c folks use is that using that construction might cause demons to fly out of your nose. [ Addendum 20071030: I am informed that I misread the standard here, and that the behavior of this particular line is not undefined, but requires a compiler diagnostic. Perhaps a better example would have been x = *(char *)0. ] I mentioned this in passing in one of my recent articles about a C program I wrote:
unsigned strinc(char *s)
{
char *p = strchr(s, '\0') - 1;
while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
if (p < s) return 0;
(*p)++;
return 1;
}
Here the pointer p starts at the end of the string s,
and the loop might stop when p points to the position just
before s. Except no, that is forbidden, and the program might
at that moment cause demons to fly out of your nose. You are allowed
to have a pointer that points to the position just after an
object, but not one that points just before. Well anyway, I seem to have digressed. My point was that M. Gould says that one advantage of languages like Perl that are defined wholly by their (one) implementation is that you never have "undefined behavior". If you want to know what some locution does, you type it in and see what it does. Poof, instant definition. Although I think this is a sound point, it occurred to me that that is not entirely correct. The manual is a specification of sorts, and even if the implementation does X in situation Y, the manual might say "The implementation does X in situation Y, but this is unsupported and may change without warning in the future." Then what you have is not so different from Y being undefined behavior. Because the manual is (presumably) a statement of official policy from the maintainers, and, as a communiqué from the people with the ultimate authority to define the future meaning of the language, it has some of the same status that a formal specification would.
Perl: the static variable hackSuch disclaimers do appear in the Perl documentation. Probably the most significant example of this is the static variable hack. For various implementation reasons, the locution my $static if 0 has a strange and interesting effect:
sub foo {
my $static = 42 if 0;
print "static is now $static\n";
$static++;
}
foo() for 1..5;
This makes $static behave as a "static" variable, and persist
from call to call of foo(). Without the ... if 0,
the code would print "static is now 42" five times. But with
... if 0, it prints:
static is now
static is now 1
static is now 2
static is now 3
static is now 4
This was never an intentional feature. It arose accidentally, and
then people discovered it and started using it. Since the behavior
was the result of a strange quirk of the implementation, caused by the
surprising interaction of several internal details, it was officially
decided by the support group that this behavior would not be supported
in future versions. The manual was amended to say that this behavior
was explicitly undefined, and might change in the future. It can be
used in one-off programs, but not in any important program, one that
might have a long life and need to be run under several different
versions of Perl. Programs that use pointers that point outside the
bounds of allocated storage in C are in a similar position. It might
work on today's system, with today's compiler, today, but you can't do
that in any larger context.Having the "undefined behavior" be determined by the manual, instead of by a language standard, has its drawbacks. The language standard is fretted over by experts for months. When the C standard says that behavior is undefined, it is because someone like Clive Feather or Doug Gwyn or P.J. Plauger, someone who knows more about C than you ever will, knows that there is some machine somewhere on which the behavior is unsupported and unsupportable. When the Perl manual says that some behavior is undefined, you might be hearing from the Perl equivalent of Doug Gwyn, someone like Nick Clark or Chip Salzenberg or Gurusamy Sarathy. Or you might be hearing from a mere nervous-nellie who got their patch into the manual on a night when the release manager had stayed up too late.
Perl: modifying a hash in a loopHere is an example of this that has bothered me for a long time. One can use the each() operator to loop lazily over the contents of a hash:
while (my $key = each %hash) {
# do something with $key and $hash{$key}
}
What happens if you modify the hash in the middle of the loop? For
various implementation reasons, the manual forbids this.For example, suppose the loop code adds a new key to the hash. The hash might overflow as a result, and this would trigger a reorganization that would move everything around, destroying the ordering information. The subsequent calls to each() would continue from the same element of the hash, but in the new order, making it likely that the loop would visit some keys more than once, or some not at all. So the prohibition in that case makes sense: The each() operator normally guarantees to produce each key exactly once, and adding elements to a hash in the middle of the loop might cause that guarantee to be broken in an unpredictable way. Moreover, there is no obvious way to fix this without potentially wrecking the performance of hashes. But the manual also forbids deleting keys inside the loop, and there the issue does not come up, because in Perl, hashes are never reorganized as the result of a deletion. The behavior is easily described: Deleting a key that has already been visited will not affect the each() loop, and deleting one that has not yet been visited will just cause it to be skipped when the time comes. Some people might find this general case confusing, I suppose. But the following code also runs afoul of the "do not modify a hash inside of an each loop" prohibition, and I don't think anyone would find it confusing:
while (my $key = each %hash) {
delete $hash{$key} if is_bad($hash{$key});
}
Here we want to delete all the bad items from the hash. We do this by
scanning the hash and deleting the current item whenever it is bad.
Since each key is deleted only after it is scanned by each,
we should expect this to visit every key in the hash, as indeed it
does. And this appears to be a useful thing to write. The only
alternative is to make two passes, constructing a list of bad keys on
the first pass, and deleting them on the second pass. The code would
be more complicated and the time and memory performance would be much
worse.There is a potential implementation problem, though. The way that each() works is to take the current item and follow a "next" pointer from it to find the next item. (I am omitting some unimportant details here.) But if we have deleted the current item, the implementation cannot follow the "next" pointer. So what happens? In fact, the implementation has always contained a bunch of code, written by Larry Wall, to ensure that deleting the current key will work properly, and that it will not spoil the each(). This is nontrivial. When you delete an item, the delete() operator looks to see if it is the current item of an each() loop, and if so, it marks the item with a special flag instead of deleting it. Later on, the next time each() is invoked, it sees the flag and deletes the item after following the "next" pointer. So the implementation takes some pains to make this work. But someone came along later and forbade all modifications of a hash inside an each loop, throwing the baby out with the bathwater. Larry and perl paid a price for this feature, in performance and memory and code size, and I think it was a feature well bought. But then someone patched the manual and spoiled the value of the feature. (Some years later, I patched the manual again to add an exception for this case. Score!)
Perl: modifying an array in a loopAnother example is the question of what happens when you modify an array inside a loop over the array, as with:
@a = (1..3);
for (@a) {
print;
push @a, $_ + 3 if $_ % 2 == 1;
}
(This prints 12346.) The internals are simple, and the semantics are
well-defined by the implementation, and straightforward, but the
manual has the heebie-jeebies about it, and most of the Perl community
is extremely superstitious about this, claiming that it is "entirely
unpredictable". I would like to support this with a quotation from
the manual, but I can't find it in the enormous and disorganized mass
that is the Perl documentation.[ Addendum: Tom Boutell found it. The perlsyn page says "If any part of LIST is an array, foreach will get very confused if you add or remove elements within the loop body, for example with splice. So don't do that." ] The behavior, for the record, is quite straightforward: On the first iteration, the loop processes the first element in the array. On the second iteration, the loop processes the second element in the array, whatever that element is at the time the second iteration starts, whether or not that was the second element before. On the third iteration, the loop processes the third element in the array, whatever it is at that moment. And so the loop continues, terminating the first time it is called upon to process an element that is past the end of the array. We might imagine the following pseudocode:
index = 0;
while (index < array.length()) {
process element array[index];
index += 1;
}
There is nothing subtle or difficult about this, and claims that the
behavior is "entirely unpredictable" are probably superstitious
confessions of ignorance and fear.Let's try to predict the "entirely unpredictable" behavior of the example above:
@a = (1..3);
for (@a) {
print;
push @a, $_ + 3 if $_ % 2 == 1;
}
Initially the array contains (1, 2, 3), and so the first iteration
processes the first element, which is 1. This prints 1, and, since 1
is odd, pushes 4 onto the end of the array.The array now contains (1, 2, 3, 4), and the loop processes the second element, which is 2. 2 is printed. The loop then processes the third element, printing 3 and pushing 6 onto the end. The array now contains (1, 2, 3, 4, 6). On the fourth iteration, the fourth element (4) is printed, and on the fifth iteration, the fifth element (6) is printed. That is the last element, so the loop is finished. What was so hard about that?
Haskell: n+k patternsMy blog was recently inserted into the feed for planet.haskell.org, and of course I immediately started my first streak of posting code-heavy articles about C and Perl. This is distressing not just because the articles were off-topic for Planet Haskell—I wouldn't give the matter two thoughts if I were posting my usual mix of abstract math and stuff—but it's so off-topic that it feels weird to see it sitting there on the front page of Planet Haskell. So I thought I'd make an effort to talk about Haskell, as a friendly attempt to promote good relations between tribes. I'm not sure what tribe I'm in, actually, but what the heck. I thought about Haskell a bit, and a Haskell example came to mind.Here is a definition of the factorial function in Haskell:
fact 0 = 1
fact n = n * fact (n-1)
I don't need to explain this to anyone, right?Okay, now here is another definition:
fact 0 = 1
fact (n+1) = (n+1) * fact n
Also fine, and indeed this is legal Haskell. The pattern n+1
is allowed to match an integer that is at least 1, say 7, and doing so binds n to
the value 6. This is by a rather peculiar special case in the
specification of Haskell's pattern-matcher. (It is section 3.17.2#8
of Haskell 98 Language and Libraries: The Revised
Report, should you want to look it up.) This peculiar
special case is known sometimes as a "successor pattern" but more
often as an "n+k pattern".The spec explicitly deprecates this feature:
Many people feel that n+k patterns should not be used. These patterns may be removed or changed in future versions of Haskell. (Page 33.) One wonders why they put it in at all, if they were going to go ahead and tell you not to use it. The Haskell committee is usually smarter than this.
I think it might be worth pointing out that the version of the program with the n+k pattern is technically superior to the other version. Given a negative integer argument, the first version recurses forever, possibly taking a long time to fail and perhaps taking out the rest of the system on which it is running. But the n+k version fails immediately, because the n+1 pattern will only match an integer that is at least 1.
XML screws upThe "nasal demons" of the C standard are a joke, but a serious one. The C standard defines what C compilers must do when presented with C programs; it does not define what they do when presented with other inputs, nor what other software does when presented with C programs. The authors of C standard clearly understood the standard's role in the world.Earlier versions of the XML standard were less clear. There was a particularly laughable clause in the first edition of the XML 1,0 standard:
XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid: (Emphasis is mine.) The XML 1.0 spec is just a document. It has no power, except to declare that certain files are XML 1.0 and certain files are not. A file that complies with the requirements of the spec is XML 1.0; all other files are not XML 1.0. But in the emphasized clause, the spec says that certain behavior "is an error" if it is exhibited by documents that do not conform to the spec. That is, it is declaring certain non-XML-1.0 documents "erroneous". But within the meaning of the spec, "erroneous" simply means that the documents are not XML 1.0. So the clause is completely redundant. Documents that do not conform to the spec are erroneous by definition, whether or not they use the value "1.0". It's as if the Catholic Church issued an edict forbidding all rabbis from wearing cassocks, on pain of excommunication. I am happy to discover that this dumb error has been removed from the most recent edition of the XML 1.0 spec.
[Other articles in category /prog/perl] permanent link Sun, 14 Oct 2007
Van der Waerden's problem: programs 3 and 4
If you don't remember what the program does, here's an explanation. Here is program 1, which was an earlier attempt to do the same thing. Here's program 2.
Program 3Complete source code for this version.I said of the previous program:
The problem is all in the implementation. You see, this program actually constructs the entire tree in memory.Somewhere along the line it dawned on me that constructing the tree was unnecessary, so I took that machinery out, and the result was version 3. Consequently, this program is easy to explain once you have seen the previous version: almost all I have to do is list the stuff that I took out. Since this program does not construct a tree of node structures, it omits the definition of the node structure and the macro for manufacturing nodes. Since it gets rid of the node allocation, it also gets rid of the memory leak of the previous version, and so omits the customized memory allocation functions Malloc and Free that performed memory tracking. The previous program had a compiled-in limit on the number of colors it would handle, because at the time I didn't know how to do a dynamic array. In this program, I got rid of the node structures, so there was no array of node structures, so no need for a limit on the number of node structures in the array. And all the code that enforced the limit is gone. The apchk function, which checks to see if a string is good, remains unchanged from the previous version. The makenodes function, which was the principal function in the previous program, remains, but has lost a lot of code. It is simpler to call, too; the node argument is gone:
makenodes(maxlen,"");
I got rid of the silly !howfar test in favor of a more
easily-understood howfar == 0 test. There are lots of times
when ! is appropriate, but testing whether a non-negative
integer has reached zero is not one of them. I was going to comment
earlier about what a novice error this is, and I'm glad to see that I
fixed it.The main use of apchk in the previous program had if (!apchk(...)) { ... }. That was okay, because apchk returns a Boolean result. But the negation is annoying. It suggests that apchk's return value is backward. (Instead of returning true for a bad string, it should return true for a good string.) This is not very much a big deal, and I only brought it up so that I could diffidently confess that these days I would probably have done:
#define unless(c) if(!(c))
...
unless (is_bad(...)) {
}
There are a lot of stories of doofus Pascal programmers who do:
#define begin {
#define end }
and Fortran programmers who do:
#define GT >
#define GE >=
#define LT <
#define LE <=
and I find, to my shame, that I have become one of them. Anyone
seeing #define unless(c) if(!(c)) would snort and say "Oh,
this was obviously written by a Perl programmer."But at least I was a C programmer first. Actually I was a Fortran programmer first. But I was never a big enough doofus to #define GE >=. The big flaw in the current program is the string argument to makenodes. Each call to makenodes copies this string so that it can append a character to the end. I discussed this at some length in the previous article, so I don't want to make too much of it now; I'll just say that a better technique would have reused the string buffer from call to call. This obviously saves a little memory, and since most of the contents of the string doesn't change, it also saves a lot of time. This might be worth seeing, since it seems to me now to be a marvel of wasted code:
ls = strlen(s);
newarg = STRING(ls + 1);
if (!newarg)
{
fprintf(stderr,"Couldn't get %d bytes for newarg in makenodes\n",ls+2);
fprintf(stderr,"Total get was %d.\n",gotten);
fprintf(stderr,"P\n L\n O\n P\n !\n");
abort();
}
strcpy(newarg,s);
newarg[ls+1] = '\0';
newarg[ls] = 'A' + i;
makenodes(howfar-1,newarg);
free(newarg);
The repeated strlen, for example, when ls could be
calculated as maxlen - howfar. The excessively verbose
failure message, which should be inside the STRING macro
anyway. (The code that maintains gotten has gone away with
the debugging allocation routines, so the second fprintf is
superfluous.) And why did I think abort was the right
thing to call on an out-of-memory condition?Oh well, you live and learn.
Program 4Complete source code for this version.The fourth version of the program is even more trimmed-down. In this version of the program I did get the idea to reuse the string buffer instead of copying the string on every recursive call. But I also got an even better idea, and eliminated the recursive call. The makenodes function is now down to one argument, which tells it how deep a tree to search.
void
makenodes(maxdepth)
int maxdepth;
{
int apchk(), depth = 0;
char curlet, *curstring = STRING(maxdepth);
curstring[0] = '\0';
curlet = 'A';
while (depth >= 0)
{
while (curlet <= 'A' - 1 + colors)
{
#ifdef DIAG
printf("%s makenoding with string %s%c, depth %d.\n",
TABS+12-depth,curstring,curlet,depth);
#endif
if (apchk(curstring,curlet))
curlet++;
else
if (depth < maxdepth)
{
curstring[depth] = curlet;
curstring[depth+1] = '\0';
depth += 1;
curlet = 'A';
}
else
{
printf("%s%c\n",curstring,curlet);
curlet++;
}
}
depth -= 1;
curlet = curstring[depth] + 1;
curstring[depth] = '\0';
}
}
This is a better job all around, and not very different from what I
wrote last month to do the same thing. I was going to title this
series of articles "I have become a better programmer!", and now that
I see this version, I'm glad I didn't, because there's no evidence
here that I am much better. This version of the program gets a solid
A from my older self.The value depth scans forward in the string when the search is going well, and is decremented again when the search needs to backtrack. If depth == maxdepth, a witness of the desired length has been found, and is printed out. The curlet ("current letter") variable tracks which branch of the current tree node we are "recursing" down. After the function recurses down, by incrementing depth, curlet is set to 'A' to visit the first sub-node of the new current node. The curstring buffer tracks the path through the tree to the current node. When the function needs to backtrack, it restores the state of curlet from the last character in the buffer and then trims that character off the end of the path. I'd only want to make two changes to this code. One would be to make depth a pointer into the curstring buffer instead of an index into it. Then again, the compiler may well have optimized it into one anyway. But it would also allow me to eliminate curlet in favor of just using *depth everywhere. The other change would address a more serious defect: the contents of curstring are kept properly zero-terminated at all times, whenever depth is advanced or retracted. This zero-termination is unnecessary, since curstring is never used as a string except when depth == maxdepth. When printfing curstring, I could have used something like:
printf("%.*s%c\n",curstring,maxlen,curlet);
which prints exactly maxlen characters from the buffer,
regardless of whether it is zero-terminated. It would, however, have required that I know about %.*s, which I'm sure I did not. Was %.*s even available in 1988? I forget, and my copy of K&R First Edition is in a box somewhere since my recent move. Anyway, if %.*s was unavailable for whatever reason, the code could have had a single curstring[maxdepth] = 0 up front, which would have been quite sufficient for the one printf it needed to do. Coming next: one very different program to solve the same problem, and a comparison with last month's effort.
[Other articles in category /prog] permanent link Fri, 05 Oct 2007
Van der Waerden's problem: program 2
If you don't remember what the program does, here's an explanation. Here is program 1, which was an earlier attempt to do the same thing.
Program 2In yesterday's article I wrote about a crappy program to search for "good" strings in van der Waerden's problem. It was crappy because it searched the entire space of all 327 strings, with no pruning.I can't remember whether I expected this to be practical at the time. Did I really think it would work? Well, there was some sense to it. It does work just fine for the 29 case. I think probably my idea was to do the simplest thing that could possibly work, and get as much information out of it as I could. On my current machine, this method proves that V(3,3) > 19 by finding a witness (RRBRRBBYYRRBRRBBYYB) in under 10 seconds. If we estimate that the computer I had then was 10,000 times slower, then I could have produced the same result in about 28 hours. I was at college, and there was plenty of free computing power available, so running a program for 28 hours was easily done. While I was waiting for it to finish, I could work on a better program. Excerpts of the better program follow. The complete source code is here. The idea behind this program is that the strings of length less than V form a tree, with the empty string as the root, and the children of string s are obtained from s by appending a single character to the end of s. If the string at a node is bad, so will be all the strings under it, and we can prune the entire branch at that node. This leaves us with a tree of all the good strings. The ones farthest from the root will be the witnesses we seek for the values of V(n, C), and we can find these by doing depth-first search on the tree, There is nothing wrong with this idea in principle; that's the way my current program works too. The problem is all in the implementation. You see, this program actually constructs the entire tree in memory:
#define NEWN ((struct tree *) Malloc(sizeof(struct tree)));\
printf("*")
struct tree {
char bad;
struct tree *away[MAXCOLORS];
} *root;
struct tree is a tree node structure. It represents a
string s, and has a flag to record whether s is bad. It
also has pointers to its subnodes, which will represents strings
sA,
sB,
and so on.MAXCOLORS is a compiled-in limit on the number of different symbols the strings can contain, an upper bound on C. Apparently I didn't know the standard technique for avoiding this inflexibility. You declare the array as having length 1, but then when you allocate the structure, you allocate enough space for the array you are actually planning to use. Even though the declared size of the array is 1, you are allowed to refer to node->away[37] as long as there is actually enough space in the allocated chunk. The implementation would look like this:
struct tree {
char bad;
struct tree *away[1];
} ;
struct tree *make_tree_node(char bad, unsigned n_subnodes)
{
struct tree *t;
unsigned i;
t = malloc(sizeof(struct tree)
+ (n_subnodes-1) * sizeof(struct tree *));
if (t == NULL) return NULL;
t->bad = bad;
for (i=0; i < n_subnodes; i++) t->away[i] = NULL;
return t;
}
(Note for those who are not advanced C programmers: I give you my
solemn word of honor that I am not doing anything dodgy or bizarre
here; it is a standard, widely-used, supported technique, guaranteed
to work everywhere.)(As before, this code is in a pink box to indicate that it is not actually part of the program I am discussing.) Another thing I notice is that the NEWN macro is very weird. Note that it may not work as expected in a context like this:
for(i=0; i<10; i++)
s[i] = NEWN;
This allocates ten nodes but prints only one star, because it expands
to:
for(i=0; i<10; i++)
s[i] = ((struct tree *) Malloc(sizeof(struct tree)));
printf("*");
and the for loop does not control the printf. The
usual fix for multiline macros like this is to wrap them in
do...while(0), but that is not appropriate here.
Had I been writing this today, I would have made NEWN a
function, not a macro. Clevermacroitis is a common disorder of
beginning C programmers, and I was no exception.The main business of the program is in the makenodes function; the main routine does some argument processing and then calls makenodes. The arguments to the makenodes function are the current tree node, the current string that that node represents, and an integer howfar that says how deep a tree to construct under the current node. There's a base case, for when nothing needs to be constructed:
if (!howfar)
{
for (i=0; i<colors; i++)
n->away[i] = NULL;
return;
}
But in general the function calls itself recursively:
for (i=0; i<colors; i++)
{
n->away[i] = NEWN;
n->away[i]->bad = 0;
if (apchk(s,'A'+i))
{
n->away[i]->bad = 1;
}
else
...
Recall that apchk checks a string for an arithmetic
progression of equal characters. That is, it checks to see if a
string is good or bad. If the string is bad, the function prunes the
tree at the current node, and doesn't recurse further.Unlike the one in the previous program, this apchk doesn't bother checking all the possible arithmetic progressions. It only checks the new ones: that is, the ones involving the last character. That's why it has two arguments. One is the old string s and the other is the new symbol that we want to append to s. If s would still be good with symbol 'A'+i appended to the end, the function recurses:
...
else
{
ls = strlen(s);
newarg = STRING(ls + 1);
strcpy(newarg,s);
newarg[ls+1] = '\0';
newarg[ls] = 'A' + i;
makenodes(n->away[i],howfar-1,newarg);
Free(newarg,ls+2);
Free(n->away[i],sizeof(struct tree));
}
}
}
The entire string is copied here into a new buffer. A better
technique sould have been to allocate a single buffer back up in
main, and to reuse that buffer over again on each call to
makenodes. It would have looked something like this:
char *s = String(maxlen);
memset(s, 0, maxlen+1);
makenodes(s, s, maxlen);
void
makenodes(char *start, char *end, unsigned howfar)
{
...
for (i=0; i<colors; i++) {
*end = 'A' + i;
makenodes(start, end+1, howfar-1);
}
*end = '\0';
...
}
This would have saved a lot of consing, ahem, I mean a lot of
mallocing. Also a lot of string copying. We could avoid the
end pointer by using start+maxlen-howfar instead,
but this way is easier to understand.I was thinking this afternoon how it's intersting the way I wrote this. It's written the way it would have been done, had I been using a functional programming language. In a functional language, you would never mutate the same string for each function call; you always copy the old structure and construct a new one, just as I did in this program. This is why C programmers abominate functional languages. Had I been writing makenodes today, I would probably have eliminated the other argument. Instead of passing it a node and having it fill in the children, I would have had it construct and return a complete node. The recursive call would then have looked like this:
struct tree *new = NEWN;
...
for (i=0; i<colors; i++) {
new->away[i] = makenodes(...);
...
}
return new;
One thing I left out of all this was the diagnostic printfs;
you can see them in the complete code if you want. But there's one I
thought was worth mentioning anyway:
#define TABS " "
....
#ifdef DIAG
printf("%s makenoding with string %s, depth %d.\n",
TABS+12-maxlen+howfar,s,maxlen-howfar);
#endif
The interesting thing here is the
TABS+12-maxlen+howfar
argument, which indents the display depending on how far the recursion
has progressed. In Perl, which has nonaddressable strings, I usually
do something like this:
my $TABS = " " x (maxlen - howfar);
print $TABS, "....";
The TABS trick here is pretty clever, and I'm a bit surprised
that I thought of it in 1988, when I had been programming in C for
only about a year. It makes an interesting contrast to my failure to
reuse the string buffer in makenodes earlier.(Peeking ahead, I see that in the next version of the program, I did reuse the string buffer in this way.) TABS is actually forty spaces, not tabs. I suspect I used tabs when I tested it with V(2, 3), where maxlen was only 9, and then changed it to spaces for calculating V(3, 3), where maxlen was 27. The apchk function checks to see if a string is good. Actually it gets a string, qq, and a character, q, and checks to see if the concatenation of qq and q would be good. This reduces its running time to O(|qq|) rather than O(|qq|2).
int
apchk(qq,q)
char *qq ,q;
{
int lqq, f, s, t;
t = lqq = strlen(qq);
if (lqq < 2) return NO;
for (f=lqq % 2; f <= lqq - 2; f += 2)
{
s = (f + t) / 2;
if ((qq[f] == qq[s]) && (qq[s] == q))
return YES;
}
return NO;
}
It's funny that it didn't occur to me to include an extra parameter to
avoid the strlen, or to use q instead of
qq[s] in the first == test. Also, as in the previous
program, I seem unaware of the relative precedences of
&& and ==. This is probably a hangover from
my experience with Pascal, where the parentheses are required.It seems I hadn't learned yet that predicate functions like apchk should be named something like is_bad, so that you can understand code like if (is_bad(s)) { ... } without having to study the code of is_bad to figure out what it returns. I was going to write that I hated this function, and that I could do it a lot better now. But then I tried to replace it, and wasn't as successful as I expected I would be. My replacement was:
unsigned
is_bad(char *qq, int q)
{
size_t qql = strlen(qq);
char *f = qq + qql%2;
char *s = f + qql/2;
while (f < s) {
if (*f == q && *s == q) return 1;
f += 2; s += 1;
}
return 0;
}
I could simplify the initializations of f and s,
which are the parts I dislike most here, by making the pointers move
backward instead of forward, but then the termination test becomes
more complicated:
unsigned
is_bad(char *qq, int q)
{
char *s = strchr(qq, '\0')-1;
char *f = s-1;
while (1) {
if (*f == q && *s == q) return 1;
if (f - qq < 2) break;
f -= 2; s -= 1;
}
return 0;
}
Anyway, I thought I could improve it, but I'm not sure I did. On the
one hand, I like the
f -= 2; s -= 1;, which I think is pretty clear. On the other
hand, s = (f + t) / 2 is pretty clear too; s is
midway between f and t. I'm willing to give
teenage Dominus a passing grade on this one.Someone probably wants to replace the while loop here with a for loop. That person is not me. The Malloc and Free functions track memory usage and were presumably introduced when I discovered that my program used up way too much memory and crashed—I think I remember that the original version omitted the calls to free. They aren't particularly noteworthy, except perhaps for this bit, in Malloc:
if (p == NULL)
{
fprintf(stderr,"Couldn't get %d bytes.\n",c);
fprintf(stderr,"Total get was %d.\n",gotten);
fprintf(stderr,"P\n L\n O\n P\n !\n");
abort();
}
Plop!It strikes me as odd that I was using void in 1988 (this is before the C90 standard) but still K&R-style function declarations. I don't know what to make of that.
BehaviorThis program works, almost. On my current machine, it can find the length-26 witnesses for V(3, 3) in no time. (In 1998, it took several days to run on a Sequent Balance 21000.) The major problem is that it gobbles memory: the if (!howfar) base case in makenodes forgets to release the memory that was allocated for the new node. I wonder if the Malloc and Free functions were written in an unsuccessful attempt to track this down.Sometime after I wrote this program, while I was waiting for it to complete, it occurred to me that it never actually used the tree for anything, and I could take it out. I have this idea that one of the principal symptoms of novice programmers is that they take the data structures too literally, and always want to represent data the way it will appear when it's printed out. I haven't developed the idea well enough to write an article about it, but I hope it will show up here sometime in the next three years. This program, which constructs an entirely unnecessary tree structure, may be one of the examples of this idea. I'll show the third version sometime in the next few days, I hope. [ Addendum 20071014: Here is part 3. ]
[Other articles in category /prog] permanent link Thu, 04 Oct 2007
The world's worst macro preprocessor: postmortem
My overall assessment is that it has been a huge success, and that if I were doing it over I would do it the same way. A recent article contained a bunch of red and blue dots:
Well, clearly you can do four: • • • •. And then you can add another red one on the end: • • • • •. And then another that could be either red or blue: • • • • • •. And then the next can be either color, say blue: • • • • • • •.I typed this using these macros:
#define R* <span style="color: red">•</span>
#define B* <span style="color: blue">•</span>
#define Y* <span style="color: yellow">•</span>
Without the macro processor, I would have had to suffer a lot. Then,
a little while later, I needed to prepare this display:
•••••••••••••••••••••••••• •••••••••••••••••••••••••• •••••••••••••••••••••••••• •••••••••••••••••••••••••• •••••••••••••••••••••••••• •••••••••••••••••••••••••• •••••••••••••••••••••••••• ••••••••••••••••••••••••••No problem; the lines just look like R*R*B*B*R*R*B*Y*B*Y*Y*R*Y*R*R*B*R*B*B*Y*R*Y*Y*B*Y*B*. Some time later I realized that this display would be totally illegible to the blind, the color-blind, and people using text-only browsers. So I just changed the macros:
#define R* <span style="color: red">R</span>
#define B* <span style="color: blue">B</span>
#define Y* <span style="color: yellow">Y</span>
Problem solved. • • • • • • • instantly becomes
R R B B R B B. And a good thing, too, because I
discovered afterward that a lot of aggregators, like bloglines and
feedburner, discard the color information.I find that I've used the macro feature 114 times so far. The most common use has been:
#define ^2 <sup>2</sup>But I also have files with:
#define r2 √2
#define R2 √2
#define s2 √2
#define S2 √2
That last one appears in three files. Clearly, making the macros
local to files was a good decision.Those uses are pretty typical. A less typical one is:
#define <OVL> <span style="text-decoration: overline">
#define </OVL> </span>
This is the sort of thing that you can get away with on a one-time
basis, but which you wouldn't want to make a convention of. Since the
purpose of the macro processor is to enable such hacks for the
duration of a single article, it's all good.I did run into at least one problem: I was writing an article in which I had defined ^i to abbreviate <sup><i>i</i></sup>. And then several paragraphs later I had a TeX formula that contained the ^i sequence in its TeX meaning. This was being replaced with a bunch of HTML, which was then passed to TeX, which then produced the wrong output. One can solve this by reordering the plugins. If I had put the TeX plugin before the macro plugin, the problem would have gone away, because the TeX plugin would have replaced the TeX formula with an image element before the macro plugin ever saw the ^i. This approach has many drawbacks. One is that it would no longer have been possible to use Blosxom macros in a TeX formula. I wasn't willing to foreclose this possibility, and I also wasn't sure that I hadn't done it somewhere. If I had, the TeX formula that depended on the macro expansion would have broken. And this is a risk whenever you move the macro plugin: if you move it from before plugin X to after plugin X, you have to worry that maybe something in some article depended on the text passed to X having been macro-processed. When I installed the macro processor, I placed it first in plugin order for precisely this reason. Moving the macro substitution later would have required me to remember which plugins would be affected by the macro substitutions and which not. With the macro processing first, the question has a simple answer: all of them are affected. Also, I didn't ever want to have to worry that some macro definition might mangle the output of some plugin. What if you are hacking on some plugin, and you change it to return <span style="Foo"> instead of <span style="foo">, and then discover that three articles you wrote back in 1997 are now totally garbled because they contained #define Foo >WUGGA<? It's just too unpredictable. Having the macro processing occur first means that you can always see in the original article file just what might be macro-replaced. So I didn't reorder the plugins. Another way to solve the TeX ^i problem would have been to do something like this:
#define ^i <sup><i>i</i></sup>
#define ^*i ^i
with the idea that I could write ^*i in the TeX formula, and
the macro processor would replace it with ^i after it
was done replacing all the ^i's. At present the macro processor does not define any order to macro replacements, but it does guarantee to replace each string only once. That is, the results of macro replacement are not themselves searched for macro replacement. This limits the power of the macro system, but I think that is a good thing. One of the powers that is thus proscribed is the power to get stuck in an infinite loop. It occurs to me now that although I call it the world's worst macro system, perhaps that doesn't give me enough credit for doing good design that might not have been obvious. I had forgotten about my choice of single-substituion behavior, but looking back on it a year later, I feel pleased with myself for it, and imagine that a lot of people would have made the wrong choice instead. (A brief digression: unlimited, repeated substitution is a bad move here because it is complex—much more complex than it appears. A macro system with single substitution is nothing much, but a macro system with repeated substitution is a programming language. The semantics of the λ-calculus is nothing more than simple substitution, repeated as necessary, and the λ-calculus is a maximally complex computational engine. Term-rewriting systems are a more obvious theoretical example, and TeX is a better-known practical example of this phenomenon. I was sure I did not want my macro system to be a programming language, so I avoided repeated substitution.) Because each input text is substituted at most once, the processor's refusal to define the order of the replacements is not something you have to think about, as long as your macros are prefix-unique. (That is, as long as none is a prefix of another.) So you shouldn't define:
#define foo bar #define fool idiotbecause then you don't know if foolish turns into barlish or idiotish. This is not a big deal in practice. Well, anyway, I did not solve the problem with #define ^*i ^i. I took a much worse solution, which was to hack a #undefall directive into the macro processor. In my original article, I boasted that the macro processor "has exactly one feature". Now it has two, and it's not an improvement. I disliked the new feature at the time, and now that I'm reviewing the decision, I think I'm going to take it out. I see that I did use the double-macro solution elsewhere. In the article about Gödel and the U.S. Constitution, I macroed an abbreviation for the umlaut:
#define Godel Gödel
But this sequence also ocurred in the URLs in the link elements, and
the substitution broke the links. I should probably have changed this
to:
#define Go:del Gödel
But instead I added:
#define GODEL Godel
and then used GODEL in the URLs. Oh well, whatever works, I
guess.Perhaps my favorite use so far is in an (unfinished) article about prosopagnosia. I got tired of writing about prosopagnosia and prosopagnosiacs, so
#define PAa prosopagnosia
#define PAic prosopagnosiac
Note that with these definitions, I get PAa's,
and PAics for free. I could use PAac instead of defining
PAic, but that would prevent me from deciding later that
prosopagnosiac should be spelled "prosopagnosic".
[Other articles in category /prog] permanent link Wed, 03 Oct 2007
Van der Waerden's problem: program 1
If you don't remember what the program does, here's an explanation.
Program 1I'm going to discuss the program a bit at a time. The complete program is here.This program does an unpruned exhaustive search of the string space. Since for V(3, 3) the string space contains 327 = 7,625,597,484,987 strings, it takes a pretty long time to finish. I quickly realized that I was wasting my time with this program. The program is invoked with a length argument and an optional colors argument, which defaults to 2. It then looks for good strings of the specified length, printing those it finds. If there are none, one then knows that V(3, colors) > length. Otherwise, one knows that V(3, colors) ≤ length, and has witness strings to prove it. I don't want to spend a lot of time on it because there are plenty of C programming style guides you can read if you care for that. But already on lines 4–5 we have something I wouldn't write today:
#define NO 0
#define YES !NO
Oh well.The program wants to iterate through all Cn strings. How does it know when it's done? It's not easy to make a program as slow as this one even slower, but I found a way to do it.
last = STRING(length);
stuff(last,'A' - 1 + colors);
for (i=0; i<colors; i++)
last[i] = 'A' + i;
for (; strcmp(seq,last); strinc(seq))
...
It manufactures the string ABCDDDDDDDDD....D and compares the
current string to that one every time through the loop. A much simpler
method is to detect completion while incrementing the target string.
The function that does the increment looks like this:
void
strinc(s)
char *s;
{
int i;
for (i= length - 1; i>=0; i--)
{
if (s[i] != 'A' - 1 + colors)
{
s[i]++;
return;
}
s[i] = 'A';
}
return;
}
Had I been writing it today, it would have looked more like this:
unsigned strinc(char *s)
{
char *p = strchr(s, '\0') - 1;
while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
if (p < s) return 0;
(*p)++;
return 1;
}
(This code is in a pink box to show that it is not actually part of
the program I am discussing in this article.)The function returns true on success and false on failure. A false return can be taken by the caller as the signal to terminate the program. This replacement function invokes undefined behavior, because there is no guarantee that p is allowed to run off the beginning of the string in the way that it does. But there is no need to check the strings in lexicographic order. Instead of scanning the strings in the order AAA, AAB, ABA, ABB, BAA, etc., one can scan them in reverse lexicographic order: AAA, BAA, ABA, BBA, AAB, etc. Then instead of running off the beginning of the string, p runs off the end, which is allowed. This fixes the undefined behavior problem and also eliminates the call to strchr that finds the end of the string. This is likely to produce a significant speedup:
unsigned strinc(char *s)
{
while (*s == 'A' + colors - 1) *s++ = 'A';
if (!*s) return 0;
(*s)++;
return 1;
}
Here we're depending on the optimizer to avoid recomputing the value
of 'A' + colors - 1 every time through the loop.The heart of the program is the apchk() function, which checks whether a string q contains an arithmetic progression of length 3:
int
apchk(q)
char *q;
{
int f, s, t;
for (f=0; f <= length - 3; f++)
for (s=f+1; s <= length - 2; s++)
{
t = s+s-f;
if (t >= length) break;
if ((q[f] == q[s]) && (q[s] == q[t])) return YES;
}
return NO;
}
I hesitate to say that this is the biggest waste of time in the whole
program, since after all it is a program whose job is to examine
7,625,597,484,987 strings. But look. 2/3 of the
calls to this function are asking it to check a string that differs
from the previous string in the final character only. Nevertheless,
it still checks all 49 possible arithmetic progressions, even the ones
that didn't change.The t ≥ length test is superfluous, or if it isn't, it should be. Also notice that I wasn't sure of the precendence in the final test. It didn't take me long to figure out that this program was not going to finish in time. I wrote a series of others, which I hope to post here in coming days. The next one sucks too, but in a completely different way. [ Addendum 20071005: Here is part 2. ] [ Addendum 20071014: Here is part 3. ]
[Other articles in category /prog] permanent link Tue, 02 Oct 2007
Van der Waerden's problem
First I'll explain what the programs are about.
Van der Waerden's problemColor each of a row of dots red or blue, so that no three evenly-spaced dots are the same color. (That is, if dots n and n+i are the same color, dot n+2i must be a different color.) How many dots can you do?Well, clearly you can do four: R R B B. And then you can add another red one on the end: R R B B R. And then another that could be either red or blue: R R B B R B. And then the next can be either color, say blue: R R B B R B B. But now you are at the end, because if you make the next dot red, then dots 2, 5, and 8 will all be red (R R B B R B B R), and if you make the next dot blue then dots 6, 7, and 8 will be blue (R R B B R B B B). But maybe we made a mistake somewhere earlier, and if the first seven dots were colored differently, we could have made a row of more than 7 that obeyed the no-three-evenly-spaced-dots requirement. In fact, this is so: R R B B R R B B is an example. But this is the end of the line. Any coloring of a row of 9 dots contains three evenly-spaced dots of the same color. (I don't know a good way to prove this, short of an enumeration of all 512 possible arrangements of dots. Well, of course it is sufficient to enumerate the 256 that begin with R, but that is pretty much the same thing.) Van der Waerden's theorem says that for any number of colors, say C, a sufficiently-long row of colored dots will contain n evenly-spaced same-color dots for any n. Or, put another way, if you partition the integers into C disjoint classes, at least one class will contain arbitrarily long arithmetic progressions. The proof of van der Waerden's theorem works by taking C and n and producing a number V such that a row of V dots, colored with C colors, is guaranteed to contain n evenly-spaced dots of a single color. The smallest such V is denoted V(n, C). For example V(3, 2) is 9, because any row of 9 dots of 2 colors is guaranteed to contain 3 evenly-spaced dots of the same color, but this is not true of such row of only 8 dots. Van der Waerden's theorem does not tell you what V(n, C) actually is; it provides only an upper bound. And here's the funny thing about van der Waerden's theorem: the upper bound is incredibly bad. For V(3, 2), the theorem tells you only that V(3, 2) &le 325. That is, it tells you that any row of 325 red and blue dots must contain three evenly spaced dots of the same color. This is true, but oh, so sloppy, since the same is true of any row of 9 dots. For V(3, 3), the question is how many red, yellow, and blue dots do you need to guarantee three evenly-spaced same-colored dots. The theorem helpfully suggests that:
In fact, there is a rather large cash prize available to be won by the first person who comes up with a general upper bound for V(n, C) that is smaller than a tower of 2's of height n. (That's 222... with n 2's.) In the rest of this series, a string which does not contain three evenly-spaced equal symbols will be called good, and one which does contain three such symbols will be called bad. Then a special case of Van der Waerden's theorem, with n=3, says that, for any fixed number of symbols, all sufficiently long strings are bad. In college I wanted to investigate this a little more. In particular, I wanted to calculate V(3, 3). These days you can just look it up on Wikipedia, but in those benighted times such information was hard to come by. I also wanted to construct the longest possible good strings, witnesses of length V(3, 3)-1. Although I did not know it at the time, V(3, 3) = 27, so a witness should have length 26. It turns out that there are exactly 48 witnesses of length 26. Here are the 1/6 of them that begin with RB or RRB:
RRBBRRBYBYYRYRRBRBBYRYYBYB RRBBYRRYRYBBYYBBYRYRRYBBRR RRBYBRRYRYBBYYBBYRYRRBYBRR RBRRBRBYYBBYYBRBRRBYYRRYRY RBRBBRRYBBYBYRRYYRRYBYBBYR RBRBBRRYBBYBYRRYYRRYBYBBYB RBRBBYBRRYRYYBYBBRBRYYRRYY RBYYBYBRRBBRRBYBYYBRRYYRYR The rest of the witnesses may be obtained by permuting the colors in these eight. I wrote a series of C programs around 1988 to exhaustively search for good strings. Last month I was in a meeting and I decided to write the program again for some reason. I wrote a much better program. This series of articles will compare the five programs. I will post the first one tomorrow. [ Addendum 20071003: Here is part 1. ] [ Addendum 20071005: Here is part 2. ] [ Addendum 20071005: I made a mistake in the expression I gave for the upper bound on V(3,3) and left out a factor of 7 in the exponent on the last 3. I had said that the upper bound was around 102092, but actually it is more like the seventh power of this. ] [ Addendum 20071014: Here is part 3. ]
[Other articles in category /prog] permanent link Sat, 28 Jul 2007
Lightweight Database Strategies for Perl
I don't know why. I tried giving the class a snappier title, but that didn't help. I'm really bad at titles. Maybe people are embarrassed to think about all the lightweight data storage hackery they do in Perl, and feel that they "should" be using a relational database, and don't want to commit more resources to lightweight database techniques. Or maybe they just don't think there is very much to know about it. But there is a lot to know; with a little bit of technique you can postpone the day when you need to go to an RDB, often for quite a long time, and often forever. Many of the techniques fall into the why-didn't-I-think-of-that category, stuff that isn't too weird to write or maintain, but that you might not have thought to try. I think it's a good class, but since it never sold well, I've decided it would do more good (for me and for everyone else) if I just gave away the materials for free.
Table of ContentsThe class is in three sections. The first section is about using plain text files and talks about a bunch of useful techniques, such as how to do binary search on sorted text files (this is nontrivial) and how to replace records in-place, when they might not fit.The second section is about the Tie::File module, which associates a flat text file with a Perl array. The third section is about DBM files, with a comparison of the five major implementations. It finishes up with a discussion of some of Berkeley DB's lesser-known useful features, such as its DB_BTREE file type, which offers fast access like a hash but keeps the records in sorted order
Online materials
[Other articles in category /prog/perl] permanent link Fri, 20 Jul 2007
"More intuitive" programming language syntax
Chromatic says that these arguments are bunk because programming language syntax is much less important than programming language semantics. But I think that is straining at a gnat and swallowing a camel. To argue that a certain programming language feature is bad because it is confusing to beginners, you have to do two things. You have to successfully argue that being confusing to beginners is an important metric. Chromatic's article tries to refute this, saying that it is not an important metric. But before you even get to that stage, you first have to show that the programming language feature actually is confusing to beginners. But these arguments are never presented with any evidence at all, because no such evidence exists. They are complete fabrications, pulled out of the asses of their propounders, and made of equal parts wishful thinking and bullshit. Addendum 20070720:
[ Addendum 20070722: I screwed up the links to the paper when I first posted them; they are fixed now. The paper is here. Thanks to Anton Berezin for pointing this out. ]
[Other articles in category /prog] permanent link Thu, 12 Jul 2007
Another useful utility
#!/usr/bin/perl
my $field = shift or usage();
$field -= 1 if $field > 0;
$|=1;
while (<>) {
chomp;
my @f = split;
print $f[$field], "\n";
}
sub usage {
print STDERR "$0 fieldnumber\n";
exit 1;
}
I got tired of writing awk '{print $11}' when I wanted to
extract the 11th field of some stream of data in a Unix pipeline,
which is something I do about six thousand times a day. So
I wrote this tiny thing. It was probably the most useful piece of
software I wrote in that calendar year, and as you can see from the
length, it certainly had the best cost-to-benefit ratio. I use it
every day.The point here is that you can replace awk '{print $11}' with just f 11. For example, f 11 access_log finds out the referrer URLs from my Apache httpd log. I also frequently use f -1, which prints the last field in each line. ls -l | grep '^l' | f -1 prints out the targets of all the symbolic links in the current directory. Programs like this won't win me any prizes, but they certainly are useful. Anyway, today's post was inspired by another similarly tiny utility that I expect will be similarly useful that I just finished. It's called runN:
#!/usr/bin/perl
use Getopt::Std;
my %opt;
getopts('r:n:c:v', \%opt) or usage();
$opt{n} or usage();
$opt{c} or usage();
@ARGV = shuffle(@ARGV) if $opt{r};
my $N = $opt{n};
my %pid;
while (@ARGV) {
if (keys(%pid) < $N) {
$pid{spawn($opt{c}, split /\s+/, shift @ARGV)} = 1;
} else {
delete $pid{wait()};
}
}
1 while wait() >= 0;
sub spawn {
my $pid = fork;
die "fork: $!" unless defined $pid;
return $pid if $pid;
exec @_;
die "exec: $!";
}
You can tell I just finished it because the shuffle() and
usage() functions are unimplemented.The idea is that you execute the program like this:
runN -n 3 -c foo arg1 arg2 arg3 arg4...and it runs the commands foo arg1, foo arg2, foo arg3, foo arg4, etc., simultaneously, but with no more than 3 running at a time. The -n option says how many commands to run simultaneously; after running that many the main control waits until one has exited before starting another. If I had implemented shuffle(), then -r would run the commands in random order, instead of in the order specified. Probably I should get rid of -c and just have the program take the first argument as the command name, so that the invocation above would become runN -n 3 foo arg1 arg2 arg3 arg4.... The -v flag, had I implemented it, would put the program into verbose mode. I find that it's best to defer the implementation of features like -r and -v until I actually need them, which might be never. In the past I've done post-analyses of the contents of ~mjd/bin, and what I found was that my tendency was to implement a lot more features than I needed or used. In the original implementation, the -n is mandatory, because I couldn't immediately think of a reasonable default. The only obvious choice is 1, but since the point of the program was to run programs concurrently, 1 is not reasonable. But it occurs to me now that if I let -n default to 1, then this command would replace many of my current invocations of:
for i in ...; do cmd $i donewhich I do quite a lot. Typing runN cmd ... would be a lot quicker and easier. As I've written before, when a feature you put in turns out to have unanticipated uses, it's a sign of a good, modular design. The code itself makes me happy for two reasons. One is that the program worked properly on the first try, which does not happen very often for me. When I was in elementary school, my teachers always complained that although I was very bright, I made a lot of careless mistakes because I was not methodical enough. They tried hard to fix this personality flaw. They did not succeed. The other thing I like about the code is that it's so very brief. Not to say that it is any briefer than it should be; I think it's just about perfect. One of the recurring themes of my study of programming for the last few years is that beginner programmers use way more code than is necessary, just like beginning writers use way too many words. The process and concurrency management turned out to be a lot easier than I thought they would be: the default Unix behavior was just exactly what I needed. I am particularly pleased with delete $pid{wait()}. Sometimes these things just come together. The 1 while wait() >= 0 line is a non-obfuscated version of something I wrote in my prize-winning obfuscated program, of all places. Sometimes the line between the sublime and the ridiculous is very fine indeed. Despite my wariness of adding unnecessary features, there is at least one that I will put in before I deploy this to ~mjd/bin and start using it. I'll implement usage(), since experience has shown that I tend to forget how to invoke these things, and reading the usage message is a quicker way to figure it out than is rereading the source code. In the past, usage messages have been good investments. I'm tempted to replace the cut-rate use of split here with something more robust. The problem I foresee is that I might want to run a command with an argument that contains a space. Consider:
runN -n 2 -c ls foo bar "-l baz"This runs ls foo, then ls bar, then ls -l baz. Without the split() or something like it, the third command would be equivalent to ls "-l baz" and would fail with something like -l baz: no such file or directory. (Actually it tries to interpret the space as an option flag, and fails for that reason instead.) So I put the split in to enable this usage. (Maybe this was a you-ain't-gonna-need-it moment; I'm not sure.) But this design makes it difficult or impossible to apply the command to an argument with a space in it. Suppose I'm trying to do ls on three directories, one of which is called old stuff. The natural thing to try is:
runN -n 2 -c ls foo bar "old stuff"But the third command turns into ls old stuff and produces:
ls: old: No such file or directory ls: stuff: No such file or directoryIf the split() were omitted, it would just work, but then the ls -l baz example above would fail. If the split() were replaced by the correct logic, I would be able to get what I wanted by writing something like this:
runN -n 2 -c ls foo bar "'old stuff'"But as it is this just produces another error:
ls: 'old: No such file or directory ls: stuff': No such file or directoryPerl comes standard with a library called ShellWords that is probably close to what I want here. I didn't use it because I wasn't sure I'd actually need it—only time will tell—and because shell parsing is very complicated and error-prone, more so when it is done synthetically rather than by the shell, and even more so when it is done multiple times; you end up with horrible monstrosities like this:
s='q=`echo "$s" | sed -e '"'"'s/'"'"'"'"'"'"'"'"'/'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'/g'"'"'`; echo "s='"'"'"$q"'"'"'"; echo $s' q=`echo "$s" | sed -e 's/'"'"'/'"'"'"'"'"'"'"'"'/g'`; echo "s='"$q"'"; echo $sSo my fear was that by introducing a double set of shell-like interpretation, I'd be opening a horrible can of escape character worms and weird errors, and my hope was that if I ignored the issue the problems might be simpler, and might never arise in practice. We'll see. [ Addendum 20080712: Aaron Crane wrote a thoughtful followup. Thank you, M. Crane. ]
[Other articles in category /prog] permanent link Wed, 21 Feb 2007
A bug in HTML generation
But then I started to see requests in the HTTP error log for URLs like this:
/pictures/blog/tex/total-die-rolls.gif$${6/choose%20k}k!{N!/over%20/prod%20{i!}^{n_i}{n_i}!}/qquad%20/hbox{/rm%20where%20$k%20=%20/sum%20n_i$}$$.gif
Someone must be referring people to these incorrect URLs, and it is
presumably me. The HTML version of the blog looked okay, so I checked
the RSS and Atom files, and found that, indeed, they were malformed.
Instead of <img src="foo.gif" alt="$TeX$">, they
contained codes for <img src="foo.gif$TeX$">.I tracked down and fixed the problem. Usually when I get a bug like this, I ask myself what I could learn from it. This one is unusual. I can't think of much. Here's the bug. The <img> element is generated by a function called imglink. The arguments to imglink are the filename that contains the image (for use in the SRC attribute) and the text for the ALT attribute. The ALT text is optional. If it is omitted, the function tries to locate the TeX source code and fetch it. If this attempt fails, it continues anyway, and omits the ALT attribute. Then it generates and returns the HTML:
sub imglink {
my $file = shift;
...
my $alt = shift || fetch_tex($file);
...
$alt = qq{alt="$alt"} if $alt;
qq{<img $alt border=0 src="$url">};
}
This function is called from several places in the plugin. Sometimes
the TeX source code is available at the place from which the call
comes, and the code has return imglink($file, $tex);
sometimes it isn't and the code has
return imglink($file) and hopes that the imglink
function can retrieve the TeX.One such place is the branch that handles generation of tags for every type of output except HTML. When generating the HTML output, the plugin actually tries to run TeX and generate the resulting image file. For other types of output, it assumes that the image file is already prepared, and just calls imglink to refer to an image that it presumes already exists:
return imglink($file, $tex) unless $blosxom::flavour eq "html";The bug was that I had written this instead:
return imglink($file. $tex) unless $blosxom::flavour eq "html";The . here is a string concatenation operator. It's a bit surprising that I don't make more errors like this than I do. I am a very inaccurate typist. Stronger type checking would not have saved me here. Both arguments are strings, concatenation of strings is perfectly well-defined, and the imglink function was designed and implemented to accept either one or two arguments. The function did note the omission of the $tex argument, attempted to locate the TeX source code for the bizarrely-named file, and failed, but I had opted to have it recover and continue silently. I still think that was the right design. But I need to think about that some more. The only lesson I have been able to extract from this so far is that I need a way of previewing the RSS and Atom outputs before publishing them. I do preview the HTML output, but in this case it was perfectly correct.
[Other articles in category /prog/bug] permanent link Wed, 14 Feb 2007
Subtlety or sawed-off shotgun?
There's a line in one of William Gibson's short stories about how some situations call for a subtle and high-tech approach, and others call for a sawed-off shotgun. I think my success as a programmer, insofar as I have any, comes from knowing when to deploy each kind of approach. In a recent article I needed to produce the table that appears at left.This was generated by a small computer program. I learned a long time ago that although it it tempting to hack up something like this by hand, you should usually write a computer program to do it instead. It takes a little extra time up front, and that time is almost always amply paid back when you inevitably decide that that table should have three columns instead of two, or the lines should alternate light and dark gray, or that you forgot to align the right-hand column on the decimal points, or whatever, and then all you have to do is change two lines of code and rerun the program, instead of hand-editing all 34 lines of the output and screwing up two of them and hand-editing them again. And again. And again. When I was making up the seating chart for my wedding, I used this approach. I wrote a raw data file, and then a Perl program to read the data file and generate LaTeX output. The whole thing was driven by make. I felt like a bit of an ass as I wrote the program, wondering if I wasn't indulging in an excessive use of technology, and whether I was really going to run the program more than once or twice. How often does the seating chart need to change, anyway? Gentle readers, that seating chart changed approximately one million and six times.
But how is the left-hand column generated? In my book, I spent quite a lot of time discussing generation of partitions of an integer, as an example of iterator techniques. Some of these techniques are very clever and highly scalable. Which of these clever partition-generating techniques did I use to generate the left-hand column of the table?
while (<DATA>) {
chomp;
my @p = split //;
...
}
...
__DATA__
1
11
2
111
12
3
...
51
6
I guessed that it would take a lot longer to write code to generate
partitions, or even to find it already written and use it, than it
would just to generate the partitions out of my head and type them in.
This guess was correct.
The only thing wrong with my approach is that it doesn't scale. But
it doesn't need to scale.
[Other articles in category /prog] permanent link Tue, 03 Oct 2006
Ralph Johnson on design patterns
"Patterns" that are used recurringly in one language may be invisible or trivial in a different language. and ended by concluding:
Patterns are signs of weakness in programming languages.Ralph Johnson, one of the four authors of the famous book Design Patterns, took note of my article and responded. I found Johnson's response really interesting, and curious in a number of ways. I think everyone who was interested in my article should read his too. [ Addendum 20070127: The link above to Ralph Johnson's response is correct, but your client will be rejected if you are referred from here. To see his blog page, visit the page without clicking on the link. ] Johnson raises several points. First there is a meta-issue to deal with. Johnson says:
He clearly thinks that what he says is surprising. And other people think it is surprising, too. That is surprising to me.I did think that what I had to say was interesting and worth saying, of course, or I would not have said it. And I was not surprised to find that other people agreed with me. One thing that I did find surprising is the uniformity of other people's surprise and interest. There were dozens of blog posts and comments in the following two weeks, all pretty much saying what a great article I had written and how right I was. I tracked the responses as carefully as I could, and I did not see any articles that called me a dumbass; I did not see any except for Johnson's that suggested that what I was saying was unsurprising. We can't conclude from this that I am right, of course; people agree with all sorts of stupid crap. But we can conclude that that what I said was surprising and interesting, since people were surprised and interested by it, even people who already have some knowledge of this topic. Johnson is right to be surprised by this, because he thought this was obvious and well-known, and that it was clearly laid out in his book, and he was mistaken. Many or most of the readers of his book have completely missed this point. I didn't miss it, but I didn't get it from the book, either. Johnson and his three co-authors wrote this book, Design Patterns, which has had a huge influence on the way that programming is practiced. I think a lot of that influence has been malign. Any practice can be corrupted, of course, by being reduced to its formal aspects and applied in a rote fashion. (There's a really superb discussion of this in A. Ya. Khinchin's essay On the Teaching of Mathematics, and a shorter discussion in Polya's How to Solve It, in the section on "Pedantry and Mastery".) That will happen to any successful movement, and the Gang of Four can't take all the blame for that. But if they really intended that everyone should understand that each design pattern is a demonstration of a weakness in its target language, then they blew it, because it appears that hardly anyone understood that. Let's pause for a moment to imagine an alternate universe in which the subtitle of the Design Patterns book was not "Elements of Reusable Object-Oriented Software" but "Solutions for Recurring Problems in Object-Oriented Languages". And let's imagine that in each section, after "Pattern name", "Intent", "Motivation", "Applicability", and so forth, there was another subsection titled "Prophylaxis" that went something like this: "The need for the Iterator pattern in C++ appears to be due partly to its inflexible type system and partly to its lack of abstract iteration structures. The iterator pattern is unnecessary in the Python language, which avoids these defects as follows: ... at the expense of ... . In Common Lisp, on the other hand, ... (etc.)". I would have liked to have seen that universe, but I suppose it's too late now. Oh well. Anyway, moving on from meta-issues to the issues themselves, Johnson continues:
At the very end, he says that patterns are signs of weakness in programming languages. This is wrong.This is interesting, and I was going to address it later, but I now think that it's the first evidence of a conceptual mistake that Johnson has made that underlies his entire response to my article, so I'll take it up now. At the very end of his response, Johnson says:
No matter how complicated your language will be, there will always be things that are not in the language. These things will have to be patterns. So, we can eliminate one set of patterns by moving them into the language, but then we'll just have to focus on other patterns. We don't know what patterns will be important 50 years from now, but it is a safe bet that programmers will still be using patterns of some sort.Here we are in complete agreement. So, to echo Johnson, I was surprised that he would think this was surprising. But how can we be in complete agrement if what I said was "wrong"? There must be a misunderstanding somewhere. I think I know where it is. When I said "[Design] Patterns are signs of weakness in programming languages," what I meant was something like "Each design pattern is a sign of a weakness in the programming language to which it applies." But it seems that Johnson thinks that I meant that the very existence of design patterns, at all, is a sign of weakness in all programming languages everywhere. If I thought that the existence of design patterns, at all, was a sign that current programming languages are defective, as a group, I would see an endpoint to programming language development: someday, we would have a perfect überlanguage in which it would be unnecessary to use patterns because all possible patterns would have been built in already. I think Johnson thinks this was my point. In the passage quoted above, I think he is addressing the idea of the überlanguage that incorporates all patterns everywhere at all levels of abstraction. And similarly:
Some people like languages with a lot of features. . . . I prefer simple languages.And again:
No matter how complicated your language will be, there will always be things that are not in the language.But no, I don't imagine that someday we will have the ultimate language, into which every conceivable pattern has been absorbed. So a lot of what Johnson has to say is only knocking down a straw man. What I imagine is that when pattern P applies to language L, then, to the extent that some programmer on some project finds themselves needing to use P in their project, the use of P indicates a deficiency in language L for that project. The absence of a convenient and simple way to do P in language L is not always a problem. You might do a project in language L that does not require the use of pattern P. Then the problem does not manifest, and, whatever L's deficiencies might be for other projects, it is not deficient in that way for your project. This should not be difficult for anyone to understand. Perl might be a very nice language for writing a program to compile a bioinformatic data file into a more reasonable form; it might be a terrible language for writing a real-time missile guidance system. Its deficiencies operate in the missile guidance project in a way that they may not in the data munging project. But to the extent that some deficiency does come up in your project, it is a problem, because you are implementing the same design over and over, the same arrangement of objects and classes, to accomplish the same purpose. If the language provided more support for solving this recurring design problem, you wouldn't need to use a "pattern". Consider again the example of the "subroutine" pattern in assembly language: don't you have anything better to do than redesign and re-implement the process of saving the register values in a stack frame, over and over? Well, yes, you do. And that is why you use a language that has that built in. Consider again the example of the "object-oriented class" pattern in C: don't you have anything better to do than redesign and re-implement object-oriented method dispatch with inheritance, over and over? Yes, you do. And that is why you use a language that has that built in, if that is what you need. By Gamma, Helm, Johnson, and Vlissides' own definition, the problems solved by patterns are recurring problems, and programmers must address them recurringly. If these problems recurred in every language, we might conclude that they were endemic to programming itself. We might not, but it's hard to say, since if there are any such problems, they have not yet been brought to my attention. Every pattern discovered so far seems to be specific to only a small subset of the world's languages. So it seems a small step to conclude that these recurring, language-specific problems are actually problems with the languages themselves. No problem is a problem in every language, but rather each problem is a red arrow, pointing at a design flaw in the language in which it appears. Johnson continues:
Patterns might be a sign of weakness, but they might be a sign of simplicity. . . .I think this argument fails, in light of the examples I brought up in my original article. The argument is loaded by the use of the word "simplicity". As Einstein said, things should be as simple as possible, but no simpler. In assembly language, "subroutine call" is a pattern. Does Johnson or anyone seriously think that C++ or Smalltalk or Common Lisp or Java would be improved by having the "subroutine call" pattern omitted? The languages might be "simpler", but would they be better? The alternative, remember, is to require the programmer to use a "pattern": to make them consult a manual of "patterns" to implement a "general arrangement of objects and classes" to solve the subroutine-call problem every time it comes up. I guess you could interpret that as a sign of "simplicity", but it's the wrong kind of simplicity. Language designers have a hard problem to solve. If they don't put enough stuff into the language, it'll be too hard to use. But if they put in too much stuff, it'll be confusing and hard to program, like C++. One reason it's hard to be a language designer is that it's hard to know what to put in and what to leave out. There is an extremely complex tradeoff between simplicity and functionality. But in the case of "patterns", it's much easier to understand the tradeoff. A pattern, remember, is a general method for solving "a recurring design problem". Patterns might be a sign of "simplicity", but if so, they are a sign of simplicity in the wrong place, a place where the language needs to be less simple and more featureful. Because patterns are solutions to recurring design problems. If you're a language designer, and a "pattern" comes to your attention, then you have a great opportunity. The programmers using your language have a recurring problem. They have to implement the same solution to it, over and over. Clearly, this is a good place to try to expend some design effort; perhaps you can trade off a little simplicity for some functionality and fix the language so that the problem is a problem no longer. Getting rid of one recurring design problem might create new ones. But if the new problems are operating at a higher level of abstraction, you may have a win. Getting rid of the need for the "subroutine call" pattern in assembly language opened up all sorts of new problems: when and how do I do recursion? When and how do I do coroutines? Getting rid of the "object-oriented class" pattern in C created a need for higher-level patterns, including the ones described in the Design Patterns book. When people didn't have to worry about implementing inheritance themselves, a lot of their attention was freed up, and they could notice patterns like Façade. As Alfred North Whitehead says, civilization advances by extending the number of important operations which we can perform without thinking about them. The Design Patterns approach seems to be to identify the important operations and then to think about them over and over and over and over and over. Or so it seems to me. Johnson's next paragraph makes me wonder if I've completely missed his point, because it seems completely senseless to me: There is a trade-off between putting something in your programming language and making it be a convention, or perhaps putting it in the library. Smalltalk makes "constructor" be a convention. Arithmetic is in the library, not in the language. Control structures and exception handling are from the library, not in the language.Huh? Why does "library" matter? Unless I have missed something essential, whether something is in the "language" or the "library" is entirely an implementation matter, to be left to the discretion of the compiler writer. Is printf part of the C language, or its library? The library, everyone knows that. Oh, well, except that its behavior is completely standardized by the language standard, and it is completely permissible for the compiler writer to implement printf by putting a special case into the compiler that is enabled when the compiler happens to see the directive #include <stdio.h>. There is absolutely no requirement that printf be loaded from a separate file or anything like that. Or consider Perl's dbmopen function. Prior to version 5.000, it was part of the "language", in some sense; in 5.000 and later, it became part of the "library". But what's the difference, really? I can't find any. Is Johnson talking about some syntactic or semantic difference here? Maybe if I knew more about Smalltalk, I would understand his point. As it is, it seems completely daft, which I interpret to mean that there's something that went completely over my head. Well, the whole article leaves me wondering if maybe I missed his point, because Johnson is presumably a smart guy, but his argument about the built-in features vs. libraries makes no sense to me, his argument about simplicity seems so clearly and obviously dismantled by his own definition of patterns, and his apparent attack on a straw man seems so obviously erroneous. But I can take some consolation in the thought that if I did miss his point, I'm not the only one, because the one thing I can be sure of in all of this is that a lot of other people have been missing his point for years. Johnson says at the beginning that he "wasn't sure whether to be happy or unhappy". If I had written a book as successful and widely read as Design Patterns and then I found out that everyone had completely misunderstood it, I think I would be unhappy. But perhaps that's just my own grumpy personality. [ Addendum 20080303: Miles Gould wrote a pleasant and insightful article on Johnson's point about libraries vs. language features. As I surmised, there was indeed a valuable point that went over my head. I said I couldn't find any difference between "language" and "library", but, as M. Gould explains, there is an important difference that I did not appreciate in this context. ]
[Other articles in category /prog] permanent link
Really real examples of HOP techniques in action
I used HOP stuff less than I might have if I hadn't written the HOP book myself. There's always a tradeoff with the use of any advanced techniques: it might provide some technical benefit, like making the source code smaller, but the drawback is that the other people you work with might not be able to maintain it. Since I'm the author of the book, I can be expected to be biased in favor of the techniques. So I tried to compensate the other way, and to use them only when I was absolutely sure it was the best thing to do. There were two interesting uses of HOP techniques. One was in the username generator for new accounts. The other was in a generic server module I wrote.
Name generationThe name generator is used to offer account names to incoming students and faculty. It is given the user's full name, and optionally some additional information of the same sort. It then generates a bunch of usernames to offer the user. For example, if the user's name is "George Franklin Bauer, Jr.", it might generate usernames like:
george bauer georgef fgeorge fbauer bauerf
gf georgeb fg fb bauerg bf
georgefb georgebf fgeorgeb fbauerg bauergf bauerfg
ge ba gef gbauer fge fba
bgeorge baf gfbauer gbauerf fgbauer fbgeorge
bgeorgef bfgeorge geo bau geof georgeba
fgeo fbau bauerge bauf fbauerge bauergef
bauerfge geor baue georf gb fgeor
fbaue bg bauef gfb gbf fgb
fbg bgf bfg georg georgf gebauer
fgeorg bageorge gefbauer gebauerf fgebauer
The code that did this, before I got to it, was extremely long and
convoluted. It was also extremely slow. It would generate a zillion
names (slowly) and then truncate the list to the required length.It was convoluted because people kept asking that the generation algorithm be tweaked in various ways. Each tweak was accompanied by someone hacking on the code to get it to do things a little differently. I threw it all away and replaced it with a lazy generator based on the lazy stream stuff of Chapter 6. The underlying stream library was basically the same as the one in Chapter 6. Atop this, I built some functions that generated streams of names. For example, one requirement was that if the name generator ran out of names like the examples above, it should proceed by generating names that ended with digits. So:
sub suffix {
my ($s, $suffix) = @_;
smap { "$_$suffix" } $s;
}
# Given (a, b, c), produce a1, b1, c1, a2, b2, c2, a3...
sub enumerate {
my $s = shift;
lazyappend(smap { suffix($s, $_) } iota());
}
# Given (a, b, c), produce a, b, c, a1, b1, c1, a2, b2, c2, a3...
sub and_enumerate {
my $s = shift;
append($s, enumerate($s));
}
# Throw away names that are already used
sub available_filter {
my ($s, $pn) = @_;
$pn ||= PennNames::Generate::InUse->new;
sgrep { $pn->available($_) } $s;
}
The use of the stream approach was strongly indicated here for two
reasons. First, the number of names to generate wasn't known in
advance. It was convenient for the generation module to pass back a
data structure that encapsulated an unlimited number of names, and let
the caller mine it for as many names as were necessary. Second, the frequent changes and tinkerings to the name generation algorithm in the past suggested that an extremely modular approach would be a benefit. In fact, the requirements for the generation algorithm chanced several times as I was writing the code, and the stream approach made it really easy to tinker with the order in which names were generated, by plugging together the prefabricated stream modules.
Generic serverFor a different project, I wrote a generic forking server module. The module would manage a listening socket. When a new connection was made to the socket, the module would fork. The parent would go back to listening; the child would execute a callback function, and exit when the callback returned.The callback was responsible for communicating with the client. It was passed the client socket:
sub child_callback {
my $socket = shift;
# ... read and write the socket ...
return; # child process exits
}
But typically, you don't want to have to manage the socket manually.
For example, the protocol might be conversational: read a request
from the client, reply to it, and so forth:
# typical client callback:
sub child_callback {
my $socket = shift;
while (my $request = <$socket>) {
# generate response to request
print $socket $response;
}
}
The code to handle the loop and the reading and writing was
nontrivial, but was going to be the same for most client functions.
So I provided a callback generator. The input to the callback
generator is a function that takes requests and returns appropriate
responses:
sub child_behavior {
my $request = shift;
if ($request =~ /^LOOKUP (\w+)/) {
my $input = $1;
if (my $result = lookup($input)) {
return "OK $input $result";
} else {
return "NOK $input";
}
} elsif ($request =~ /^QUIT/) {
return;
} elsif ($request =~ /^LIST/) {
my $N = my @N = all_names();
return join "\n", "OK $N", @N, ".";
} else {
return "HUH?";
}
}
This child_behavior function is not suitable as a callback,
because the argument to the callback is the socket handle. But the
child_behavior function can be turned into a callback:
$server->run(CALLBACK => make_callback(\&child_behavior));
make_callback() takes a function like
child_behavior() and wraps it up in an I/O loop to turn it
into a callback function. make_callback() looks something
like this:
sub make_callback {
my $behavior = shift;
return sub {
my $socket = shift;
while (my $request = <$socket>) {
chomp $request;
my $response = $behavior->($request);
return unless defined $response;
print $socket $response;
}
};
}
I think this was the right design; it kept the design modular and
flexible, but also simple.
[Other articles in category /prog] permanent link Wed, 20 Sep 2006
The world's worst macro preprocessor
The goal last time was a macro processor. I write a lot of math articles. I get tired of writing <sup>2</sup> every time I want a superscript 2. Even if I bind a function key to that sequence of characters, it's hard to read. But now, with my new Blosxom macro processor, I just insert a line into my article that says:
#define ^2 <sup>2</sup> and for the rest of the article, ^2 is expanded to <sup>2</sup>. This has turned out really well, and I'm using it for all sorts of stuff. I use it for math notations, such as for making -> an abbreviation for → (→), and for making ~ an abbreviation for ¬ (¬). But I've also used it to #define Godel Gödel. I've used it to #define KK <b>K</b> and #define SS <b>S</b>, which makes an article I'm writing about combinatory logic readable, where it wasn't readable before. In my recent article about job hunting, I used it to #define CV résumé, which saved me from having to interrupt my train of thought several times in the article. There are some important points about the design that I think I got right on the first try. Whenever you write a macro system, you have to ask about escape sequences: what do you do if you don't want a macro expanded? For example, in the combinatory logic article I defined a macro SS. This meant that if I had written MOUSSE in the article somewhere, it would have turned into MOUSE. How should I prevent that kind of error? Answer: I don't. I'm unlikely to do that. But if I do, I'll pick it up during the article proofreading phase. If I can't avoid writing MOUSSE, I have two choices: I can change the name of the SS macro to something easier to avoid—like S*, say, or I can define a second macro: #define !MOUSSE MOUSSE. But so far, it hasn't come up. One alternative solution is to say that macros are expanded only in certain contexts. For example, SS might only be expanded when it is a complete word, not when it is in the middle of a word, as MOUSSE. I resisted this solution. It is much simpler to remember that every macro is expanded everywhere. And it it is much easier to fix the problem of a macro being expanded when I don't want it than it is to fix the problem of a macro not being expanded when I do want it. So every macro is expanded no matter where it appears. Related to the unintentional-expansion issue is that each article has its own private macro set. I don't have to worry that by defining a macro named -> in one article that I might be sabotaging my opportunity to actually write -> in some unknown future article. Each set of macros can be totally ad hoc. I don't have to worry about global tradeoffs. Do I #define --- —, knowing that that will foreclose my opportunity to use --- in any other way? I can make the decision based on simple, local information. It would have been tempting to over-engineer the system and add all sorts of complex escape facilities. I think I made the right choice here by not doing any of that. Another escaping issue: What if I want to write something that looks like a definition but isn't? Here I avoided the problem by choosing a definition syntax that I was unlikely to write in any other context: #define in the leftmost column indicates a definition. In this article, I had to write some similar text. It was no trouble to indent it a couple of spaces, disabling the special meaning. But HTML is already full of escape mechanisms, and it would have been no trouble to write #define instead of #define if for some reason I had really needed it to appear in the leftmost column. (Unlikely anyway, since HTML has no column semantics.) Another right choice I think I made was not to parametrize the macros. An article on algebra might well have:
#define ^2 <sup>2</sup> #define ^3 <sup>3</sup>and it might be oh-so-tempting to try to eliminate the duplication à la C:
#define ^(\w+) <sup>$1</sup>I did not do this. It would have complicated the processing substantially. It would also have complicated the use of the package substantially: I would have to worry a lot more than I do about invoking macros unintentionally. And it is not needed. Not so far, anyway. Because macro definitions only last for the duration of the article, there is no pressure to make a complete or consistent set of definitions. If an article happens to use the notations 2, i, and N, I can define macros for those and only those notations. Also tempting is to extend the macro system to support something like this:
#define BF(.*) <b>$1</b>I have so far resisted this. My feeling is that if I want to do anything like this, I should take it as a sign that I should be writing the articles in some markup system other than HTML. Choice of that markup system should be made carefully, and not organically as an ad-hoc overburdening of the macro system. I did run into one trouble with the macro system. Originally, it was invoked before some of my other plugins and after others. The earlier plugins automatically inserted certain text into the article that sometimes accidentally triggered my macros. I have not had any trouble with this since I changed the plugin order to invoke the macro processor before any of the other plugins. The macro-processing code is about 19 lines long, of which three are diagnostic. It is the world's worst macro system. It has exactly one feature. It is, I think the simplest thing that could possibly work, and so a good companion to Blosxom. For this application, the world's worst macro system is the world's best. [ Addendum 20071004: There's now a one-year retrospective analysis. ]
[Other articles in category /prog] permanent link Mon, 11 Sep 2006
Design patterns of 1972
"Patterns" that are used recurringly in one language may be invisible or trivial in a different language.
Extended Example: "object-oriented class"C programmers have a pattern that might be called "Object-oriented class". In this pattern, an object is an instance of a C struct.
struct st_employee_object *emp;
Or, given a suitable typedef:
EMPLOYEE emp;
Some of the struct members are function pointers. If "emp" is an
object, then one calls a method on the object by looking up the
appropriate function pointer and calling the pointed-to function:
emp->method(emp, args...);
Each struct definition defines a class; objects in the same class have
the same member data and support the same methods. If the structure
definition is defined by a header file, the layout of the structure
can change; methods and fields can be added, and none of the code that
uses the objects needs to know.There are a bunch of variations on this. For example, you can get opaque implementation by defining two header files for each class. One defines the implementation:
struct st_employee_object {
unsigned salary;
struct st_manager_object *boss;
METHOD fire, transfer, competence;
};
The other defines only the interface:
struct st_employee_object {
char __SECRET_MEMBER_DATA_DO_NOT_TOUCH[4];
struct st_manager_object *boss;
METHOD fire, transfer, competence;
};
And then files include one or the other as appropriate. Here "boss"
is public data but "salary" is private.You get abstract classes by defining a constructor function that sets all the methods to NULL or to:
void _abstract() { abort(); }
If you want inheritance, you let one of the structs be a prefix of
another:
struct st_manager_object; /* forward declaration */
#define EMPLOYEE_FIELDS \
unsigned salary; \
struct st_manager_object *boss; \
METHOD fire, transfer, competence;
struct st_employee_object {
EMPLOYEE_FIELDS
};
struct st_manager_object {
EMPLOYEE_FIELDS
unsigned num_subordinates;
struct st_employee_object **subordinate;
METHOD delegate_task, send_to_conference;
};
And if obj is a manager object, you can still treat
it like an employee and call employee methods on
it.This may seem weird or contrived, but the technique is widely used. The C standard contains guarantees that the common fields of struct st_manager_object and struct st_employee_object will be laid out identically in memory, specifically so that this object-oriented class technique can work. The code of the X window system has this structure. The code of the Athena widget toolkit has this structure. The code of the Linux kernel filesystem has this structure. Rob Pike, one of the primary architects of the Plan 9 operating system (the Bell Labs successor to Unix) and co-author (with Brian Kernighan) of The Unix Programming Environment, recommends this technique in his article "Notes on Programming in C".
This is a patternThere's only one way in which this technique doesn't qualify as a pattern according to the definition of Gamma, Helm, Johnson, and Vlissides. They say:
A design pattern systematically names, motivates, and explains a general design that addresses a recurring design problem in object-oriented systems. It describes the problem, the solution, when to apply the solution, and its consequences. It also gives implementation hints and examples. The solution is a general arrangement of objects and classes that solve the problem. The solution is customized and implemented to solve the problem in a particular context.Their definition arbitrarily restricts "design patterns" to addressing recurring design problems "in object-oriented systems", and to being general arrangements of "objects and classes". If we ignore this arbitrary restriction, the "object-oriented class" pattern fits the description exactly. The definition in Wikipedia is:
In software engineering, a design pattern is a general solution to a common problem in software design. A design pattern isn't a finished design that can be transformed directly into code; it is a description or template for how to solve a problem that can be used in many different situations.And the "object-oriented class" solution certainly qualifies.
Codification of patternsPeter Norvig's presentation on "Design Patterns in Dynamic Languages" describes three "levels of implementation of a pattern":
The single major driver for the invention of C++ was to codify this pattern into the language so that it was "invisible". In C++, you don't have to think about the structs and you don't have to worry about keeping data and methods private. You just declare a "class" (using syntax that looks almost exactly like a struct declaration) and annotate the items with "public" and "private" as appropriate. But underneath, it's doing the same thing. The earliest C++ compilers simply translated the C++ code into the equivalent C code and invoked the C compiler on it. There's a reason why the C++ method call syntax is object->method(args...): it's almost exactly the same as the equivalent code when the pattern is implemented in plain C. The only difference is that the object is passed implicitly, rather than explicitly. In C, you have to make a conscious decision to use OO style and to implement each feature of your OOP system as you go. If a program has fifty modules, you need to decide, fifty times, whether you will make the next module an OO-style module. In C++, you don't have to make a decision about whether or not you want OO programming and you don't have to implement it; it's built into the language.
Sherman, set the wayback machine for 1957If we dig back into history, we can find all sorts of patterns. For example:
Recurring problem: Two or more parts of a machine language program need to perform the same complex operation. Duplicating the code to perform the operation wherever it is needed creates maintenance problems when one copy is updated and another is not.This is a "pattern"-style description of the pattern we now know as "subroutine". It addresses a recurring design problem. It is a general arrangement of machine instructions that solve the problem. And the solution is customized and implemented to solve the problem in a particular context. Variations abound: "subroutine with passed parameters". "subroutine call with returned value". "Re-entrant subroutine". For machine language programmers of the 1950s and early 1960's, this was a pattern, reimplemented from scratch for each use. As assemblers improved, the pattern became formal, implemented by assembly-language macros. Shortly thereafter, the pattern was absorbed into Fortran and Lisp and their successors, and is now invisible. You don't have to think about the implementation any more; you just call the functions.
Iterators and model-view-controllerThe last time I wrote about design patterns, it was to point out that although the movement was inspired by the "pattern language" work of Christopher Alexander, it isn't very much like anything that Alexander suggested, and that in fact what Alexander did suggest is more interesting and would probably be more useful for programmers than what the design patterns movement chose to take.One of the things I pointed out was essentially what Norvig does: that many patterns aren't really addressing recurring design problems in object-oriented programs; they are actually addressing deficiencies in object-oriented programming languages, and that in better languages, these problems simply don't come up, or are solved so trivially and so easily that the solution doesn't require a "pattern". In assembly language, "subroutine call" may be a pattern; in C, the solution is to write result = function(args...), which is too simple to qualify as a pattern. In a language like Lisp or Haskell or even Perl, with a good list type and powerful primitives for operating on list values, the Iterator pattern is to a great degree obviated or rendered invisible. Henry G. Baker took up this same point in his paper "Iterators: Signs of Weakness in Object-Oriented Languages". I received many messages about this, and curiously, some made the same point in the same way: they said that although I was right about Iterator, it was a poor example because it was a very simple pattern, but that it was impossible to imagine a more complex pattern like Model-View-Controller being absorbed and made invisible in this way. This remark is striking for several reasons. It is an example of what is perhaps the most common philosophical fallacy: the writer cannot imagine something, so it must therefore be impossible. Well, perhaps it is impossible—or perhaps the writer just doesn't have enough imagination. It is worth remembering that when Edgar Allan Poe was motivated to investigate and expose Johann Maelzel's fraudulent chess-playing automaton, it was because he "knew" it had to be fraudulent because it was inconceivable that a machine could actually exist that could play chess. Not merely impossible, but inconceivable! Poe was mistaken, and the people who asserted that MVC could not be absorbed into a programming language were mistaken too. Since I gave my talk in 2002, several programming systems, such as Ruby on Rails and Subway have come forward that attempt to codify and integrate MVC in exactly the way that I suggested.
Progress in programming languagesHad the "Design Patterns" movement been popular in 1960, its goal would have been to train programmers to recognize situations in which the "subroutine" pattern was applicable, and to implement it habitually when necessary. While this would have been a great improvement over not using subroutines at all, it would have been vastly inferior to what really happened, which was that the "subroutine" pattern was codified and embedded into subsequent languages.Identification of patterns is an important driver of progress in programming languages. As in all programming, the idea is to notice when the same solution is appearing repeatedly in different contexts and to understand the commonalities. This is admirable and valuable. The problem with the "Design Patterns" movement is the use to which the patterns are put afterward: programmers are trained to identify and apply the patterns when possible. Instead, the patterns should be used as signposts to the failures of the programming language. As in all programming, the identification of commonalities should be followed by an abstraction step in which the common parts are merged into a single solution. Multiple implementations of the same idea are almost always a mistake in programming. The correct place to implement a common solution to a recurring design problem is in the programming language, if that is possible. The stance of the "Design Patterns" movement seems to be that it is somehow inevitable that programmers will need to implement Visitors, Abstract Factories, Decorators, and Façades. But these are no more inevitable than the need to implement Subroutine Calls or Object-Oriented Classes in the source language. These patterns should be seen as defects or missing features in Java and C++. The best response to identification of these patterns is to ask what defects in those languages cause the patterns to be necessary, and how the languages might provide better support for solving these kinds of problems. With Design Patterns as usually understood, you never stop thinking about the patterns after you find them. Every time you write a Subroutine Call, you must think about the way the registers are saved and the return value is communicated. Every time you build an Object-Oriented Class, you must think about the implementation of inheritance. People say that it's all right that Design Patterns teaches people to do this, because the world is full of programmers who are forced to use C++ and Java, and they need all the help they can get to work around the defects of those languages. If those people need help, that's fine. The problem is with the philosophical stance of the movement. Helping hapless C++ and Java programmers is admirable, but it shouldn't be the end goal. Instead of seeing the use of design patterns as valuable in itself, it should be widely recognized that each design pattern is an expression of the failure of the source language. If the Design Patterns movement had been popular in the 1980's, we wouldn't even have C++ or Java; we would still be implementing Object-Oriented Classes in C with structs, and the argument would go that since programmers were forced to use C anyway, we should at least help them as much as possible. But the way to provide as much help as possible was not to train people to habitually implement Object-Oriented Classes when necessary; it was to develop languages like C++ and Java that had this pattern built in, so that programmers could concentrate on using OOP style instead of on implementing it.
SummaryPatterns are signs of weakness in programming languages.When we identify and document one, that should not be the end of the story. Rather, we should have the long-term goal of trying to understand how to improve the language so that the pattern becomes invisible or unnecessary. [ Thanks to Garrett Rooney for pointing out some minor errors that I have since corrected. - MJD ] [ Addendum 20061003: There is a followup article to this one, replying to a response by Ralph Johnson, one of the authors of the "Design Patterns" book. This link URL is correct, but Johnson's website will refuse it if you come from here. ]
[Other articles in category /prog] permanent link Sat, 08 Jul 2006
A programmer had a problem...
Yesterday I was about to make the same mistake. I had a problem, and weak references seemed like the solution. Fortunately, it was time to go home, which is a two-mile walk. Taking a two-mile walk is a great way to fix mistakes, especially the ones you haven't made yet. On this particular walk, I came to my senses and avoided the weak references. The problem concerns the following classes and methods. You have a database object $db. You can call @rec = $db->lookup, which may return some record objects that represent records. You then call methods on the records, say $rec[3]->get_color, to extract data from them, or $rec[3]->set_color("purple"), to modify the data in the records. The updating is done in-memory only, and a later call to $db->flush writes all the updates back to the database. The database object needs to store the changes that have been made but not yet written out. The easy way to do this is to have it store a change log of the modified record objects. So set_color first makes its change to the target record object, and then calls an internal _update method on the original database object to attach the record to the change log. Later on, flush will process this array, writing out the indicated changes. In order for set_color to know which database to direct the _update call to, each record object must have a pointer back to the database that created it. This is convenient for other purposes too. Fine. But then if the record object is stored in the change log inside the database object, we now have a reference loop: the database contains a change log with a pointer to the record, which contains a pointer back to the database itself. This means that neither the database nor the record will ever be garbage collected. (This problem is common in complex Perl programs, and would simply vanish if Perl had even a slightly less awful garbage collector. Improvement is unlikely to occur before the release of Perl 6, now scheduled for October 28, 2073.) My first reaction when faced with a problem like this one is to gurgle contentedly in my sleep, turn over, and pull the blankets over my head. This strategy is the primary contributor to my success as a programmer; it is somewhat superior to the typical programmer's response, which is to swing into action, overthink the problem, and come up with an elaborate solution. Aron Nimzovitch once said that the problem chess novices have is the irrepressible urge to always be doing something. Programmers are similar. They are all very bright people, very good at solving problems, and they solve problems all the time, even the ones that don't need to be solved. I seem to be digressing. How unusual. In any case, this problem really did have to be solved. One wants the database object to flush out its pending changes at the time it becomes inacessible. If the object is never garbage collected, then the programmer must always remember to flush out the changes manually. Miss one call to flush, and your updates are lost. This is unacceptable. The primary purpose of a database is to record the updates. So I had to take my head out from under the covers, like it or not. I thought about several solutions, and even tried one out, but it was too complicated and got me into a horrible tar pit, so I threw it away and started over. (That is another superior strategy that programmers don't exercise as often as they should. As Erik Naggum says, they will drive a hundred miles through a forest, stopping every five feet to cut down another tree, instead of pausing to wonder if maybe they shouldn't have driven off the road in the first place.) Then I got the bright idea to use weak references, which seemed like just the thing. That's what weak references are for: breaking dependency loops so that things that need to be garbage collected can be. Fortunately, it was time to go, so I walked home instead of diving into the chyme-filled swimming pool of weak references. With the weak references, you need to decide which reference to weaken. There is a reference to the record object, in the change log inside the database object. And there is a reference to the database object, in the record object. Which do you weaken? If you weaken the reference to the record, you get a disaster:
{
my ($rec) = $db->lookup(...);
$rec->set_color("purple");
}
$db->flush;
When the block is exited, the last strong reference to the record goes
away, and the modified record evaporates, leaving nothing inside the
database object. The flush method can see by the lingering
ghost that there was something there it was supposed to deal with, but
it no longer knows what. So that choice is doomed.What if you weaken the reference inside the record, the one that points back to the database? That is hardly any better:
my $rec;
{
my $db = FlatFile->new(...);
($rec) = $db->lookup(...);
}
$rec->set_color("purple");
We would like the database object to hang around as long as there are
still some extant records from it. But because we weakened the
references from the records to the database, it doesn't; it evaporates
at the end of the block, leaving the record orphaned. The
set_color method then fails, because the database to which it
is supposed to write changes has evaporated.Conclusion: I've heard it before, and it wasn't funny the first time. On the walk home, I realized something else: actually storing the database data inside the record objects is a bad move. The general advice under which this is a bad move is something like Don't store the same data in two places. The specific problems in this instance are exemplified by this:
my ($a) = $db->lookup(unique_id => "142857");
my ($b) = $db->lookup(unique_id => "142857");
$a->set_color("red");
$b->set_color("purple");
$a->color eq "purple"; # True or false?
Since $a and $b represent the same record, the
answer should be true. But in the implementation I had (and still
have, actually; I haven't fixed this yet) it is false. The
set_color method on $b updates the data that is
cached in object $b, but has no idea that it should also
update the data cached in $a.To work properly, $a and $b should be identical objects. One way to do this is to store an object in memory for every record in the database, and hand out these preconstructed objects as needed; then both calls to lookup return the same object. This is time- and memory-intensive. Another way to do this is to cache the record objects as they are constructed, and arrange for lookup to return the cached objects when appropriate. This is more complicated. A simpler solution is not to store the data in memory at all. Record objects are always created as needed, but contain nothing but a database handle and some sort of locator information that says how to get the record data, should it be asked for. ("Any problem can be solved by another layer of indirection," they say, although it's not really true. Still, there are several classes of problems that can be solved by adding another layer of indirection, and this particular object identity problem could serve well as an exemplar of one of those classes.) Then modifications don't go into the record objects themselves. Instead, they go into the database object as an instruction to modify a certain record in a certain way. This solution, however, presupposes that there is a good way to build locator information for a flat file and update it as needed. Fortunately, there is. I did a really good job of solving this problem a few years ago when I wrote the Tie::File module. It represents a text file as a Perl array, so a record locator can simply be an index into the array, and a record object then becomes something like:
{
db => $db,
recno => 37,
}
The change log inside the database object looks something like:
{ 0 => no change,
1 => no change,
2 => "color" field was set to "purple",
3 => no change,
4 => "size" field was set to "unusually large",
...
}
This happily gets rid of the garbage collection problem I had been
trying to solve in the first place.Using Tie::File also eliminates a lot of I/O issues that I had solved before, and gets all the I/O code out of the database module. I had already been thinking about getting rid of the explicit I/O and having the database module depend on Tie::File, and when I recognized the lurking record object identity problem, I was convinced that it had to happen sooner rather than later. Having done it, I'm really pleased with the outcome.
[Other articles in category /prog] permanent link Fri, 07 Jul 2006
On design
The basic usage of the module is as follows: You create a database object that represents the entire database:
my $db = FlatFile->new(FILE => "/etc/passwd",
FIELDS => ['username', 'password', 'uid', 'gid',
'gecos', 'homedir', 'shell'],
FIELDSEP => ':',
) or die ...;
Then you can do queries on the database:
my @roots = $db->lookup(uid => 0);
This returns a list of Record objects. (Actually it returns
a list of FlatFile::Record::A objects, where
FlatFile::Record::A is a dynamically-generated class that was
manufactured at the time you did the new call, and which
inherits from FlatFile::Record, but we can ignore that here.)
Once we have the Record objects, we can query them or modify
them:
for my $root (@roots) {
if ($root->username eq 'root') {
$root->set_shell('/bin/false');
} else {
$root->delete;
}
}
This loops over the records that were selected in the earlier call and
examines the username field in each one. if the username is
root, the program sets the shell in the record to
/bin/false; otherwise it deletes the record entirely.Since lookup returns all the matching records, there is the question of what this should do:
my $root = $db->lookup(uid => 0);
Here we have provided enough room for at most one root user. What if
there is more than one?Every Perl function needs to make a decision about this issue. The function could be called in list context or in scalar context, and you need to choose the two behaviors sensibly. Here are some possibilities for what lookup might do if called in scalar context:
How to decide on the best behavior? This is the kind of problem that I really enjoy. What will people expect? What will they want? What do they need? Two important criteria are:
my $ref = [ $db->lookup(...) ];
Or they can subclass the Record module and add a new one-line
method that does the same:
sub lookup_ref {
my $self = shift;
[ $self->lookup(@_) ];
}
Similarly, behavior #2 (return a count) is so easy to get that
supporting it directly would probably not be a good use of my code or
my precious interface space:
my $N_recs = () = $db->lookup(...);
I had originally planned to do #3 (require that the query produce a
single record, on pain of death), and here's why: in my first forays
into programming with this module, I frequently found myself writing
things like my $rec = $db->lookup(...) without meaning to,
and in spite of the fact that I had documented the behavior in scalar
context as being undefined. I kept doing it unintentionally in cases
where I expected only one record to be returned. So each time I wrote
this code, I was putting in an implicit assumption that there would be
only one match. I would have been quite surprised in each case if
there had actually been multiple matches. That's the sort of
assumption that you might like to have automatically checked.I ran the question by the folks on IRC, and reaction against this design was generally negative. Folks said that it's not the module's job to try to discern the programmer's intention and enforce this inference by committing suicide. I can certainly get behind that point of view. I once wrote an article complaining bitterly about modules that call die. I said it was like when you're having tea and crumpets on your 112-piece Spode china set, and you accidentally chip the teacup, and the butler comes running in, crying "Don't worry, Master! I'll take care of that for you!" and then he whips out a hammer and smashes all 112 pieces of china to tiny bits. I don't think the point applies here, though. I had mentioned it in connection with the Text::ParseWords module, which would throw an exception if the input string was unparseable, hardly an uncommon occurrence, and one that was entirely unavoidable: if I knew that the string would be unparseable, I wouldn't be calling Text::ParseWords to parse it. Folks on IRC said that when the method might call die, you have to wrap every call to it in an exception handler, which I certainly agree is a pain in the ass. But in this example, you do not have to do that. Here, to prevent the function from dying is very easy: just call it in list context; then it will never die. If what you want is behavior #4, to have it discard all the records but the first one, that is easy to get, regardless of the design I adopt for scalar context behavior:
my ($rec) = $db->lookup(...);
This argues against #4 (return the first matching record) in the same
way that we argued against #2 and #5 already: it's so very easy to do
already, maybe we don't need an even easier way to do it. But if so,
couldn't the programmer just:
sub lookup_first {
my $self = shift;
my ($rec) = $self->lookup(@_);
return $rec;
}
A counterargument in favor of #4 might be based on the usefulness
criterion: perhaps this behavior is so commonly wanted that we
really do need an even easier way to do it.I was almost persuaded by the strong opinion in favor of #4, but then Roderick Schertler spoke up in favor of #3, for basically the reasons I set forth. I consider M. Schertler to have higher-than-normal reliability on matters of this type, so his opinion counterbalances several of the counteropinions on the other side. #3 is not too difficult to get, but still scores higher than most of the others on the difficulty scale. There doesn't seem to be a trivial inline expression of it, as there was with #2, #4, and #5. You would have to actually write a method, or else do something nasty like:
(my ($rec) = $db->lookup(...)) < 2 or die ...;
What about the other proposed behaviors? #1 (unconditional fatality)
is simple, but both criteria seem to argue against it. It does,
however, have the benefit of being a good temporary solution since it
is easy to change without breaking backward compatibility. Were I to
adopt it, it would be very unlikely (although not impossible) that
anyone would write a program that would depend on that behavior; I
would then be able to change it later on. #6 (return an iterator object) is very tempting, because it is the only one that scores high on the difficulty criterion scale: it is difficult or impossible to do this any other way, so by providing it, I am providing a real service to users of the module, rather than yet another way to do the same thing. The module's user cannot implement a good iterator interface as a wrapper around lookup, because lookup always searches the entire database before it returns, and allocates enough memory to store every returned record, whereas a good iterator interface will search only as far as is necessary to find the next matching record, and will store only one record at a time. This performance argument would be more important if we expected the databases to be very large. But since this is a module for manipulating plain text files, we can expect that they will not be too big, and perhaps the time and memory costs of searching them will be relatively small, so perhaps this design will score fairly low on the usefulness scale. I still haven't made up my mind, although writing this article has pushed me strongly toward #6. I would be glad to receive email on the matter.
[Other articles in category /prog] permanent link Mon, 15 May 2006
Creeping featurism and the ratchet effect
But the concept of "creeping featurism" his wider applicability than just to program features. We can recognize it in other contexts. For example, someone is reading the Perl manual. They read the section on the unpack function and they find it confusing. So they propose a documentation patch to add a couple of sentences, explicating the confusing point in more detail. It seems like a good idea at the time. But if you do it over and over—and we have—you end up with a 2,000 page manual—and we did. The real problem is that it's easy to see the benefit of any proposed addition. But it is much harder to see the cost of the proposed addition, that the manual is now 0.002% larger. The benefit has a poster child, an obvious beneficiary. You can imagine a confused person in your head, someone who happens to be confused in exactly the right way, and who is miraculously helped out by the presence of the right two sentences in the exact right place. The cost has no poster child. Or rather, the poster child is much harder to imagine. This is the person who is looking for something unrelated to the two-sentence addition. They are going to spend a certain amount of time looking for it. If the two-sentence addition hadn't been in there, they would have found what they were looking for. But the addition slowed them down just enough that they gave up without finding what they needed. Although you can grant that such a person might exist, they really aren't as compelling as the confused person who is magically assisted by timely advice. Even harder to imagine is the person who's kinda confused, and for whom the extra two sentences, clarifying some obscure point about some feature he wasn't planning to use in the first place, are just more confusion. It's really hard to understand the cost of that. But the benefit, such as it is, comes in one big lump, whereas the cost is distributed in tiny increments over a very large population. The benefit is clear, and the cost is obscure. It's easy to make a specific argument in favor of any particular addition ("people might be confused by X, so I'm going to explain it in more detail") and it's hard to make such an argument against the addition. And conversely: it's easy to make the argument that any particular bit of text should stay in, hard to argue that it should be removed. As a result, there's what I call a "ratchet effect": you can make the manual bigger, one tiny notch at a time, and people do. But having done so, you can't make it smaller again; someone will object to almost any proposed deletion. The manual gets bigger and bigger, worse and worse organized, more and more unusable, until finally it collapses under its own weight and all you can do is start over again. You see the same thing happen in software, of course. I maintain the Text::Template Perl module, and I frequently get messages from people saying that it should have some feature or other. And these people sometimes get quite angry when I tell them I'm not going to put in the feature they want. They're angry because it's easy to see the benefit of adding another feature, but hard to see the cost. "If other people don't like it," goes the argument, "they don't have to use it." True, but even if they don't use it, they still pay the costs of slightly longer download times, slightly longer compile times, a slightly longer and more confusing manual, slightly less frequent maintenance updates, slightly less prompt bug fix deliveries, and so on. It is so hard to make this argument, because the cost to any one person is so very small! But we all know where the software will end up if I don't make this argument every step of the way: on the slag heap. This has been on my mind on and off for years. But I just ran into it in a new context. Lately I've been working on a book about code style and refactoring in Perl. One thing you see a lot in Perl programs written by beginners is superfluous parentheses. For example:
next if ($file =~ /^\./);
next if !($file =~ (/[0-9]/));
next if !($file =~ (/txt/));
Or:
die $usage if ($#ARGV < 0);
There are a number of points I want to make about this. First, I'd
like to express my sympathy for Perl programmers, because Perl has
something like 95 different operators at something like 17 different
levels of precedence, and so nobody knows what all the precedences are
and whether parentheses are required in all circumstances. Does the
** operator have higher or lower precedence than the
<<= operator? I really have no idea.So the situation is impossible, at least in principle, and yet people have to deal with it somehow. But the advice you often hear is "if you're not sure of the precedence, just put in the parentheses." I think that's really bad advice. I think better advice would be "if you're not sure of the precedence, look it up." Because Perl's Byzantine operator table is not responsible for all the problems. Notice in the examples above, which are real examples, taken from real code written by other people: Many of the parentheses there are entirely superfluous, and are not disambiguating the precedence of any operators. In particular, notice the inner parentheses in:
next if !($file =~ (/txt/));
Inside the inner parentheses, there are no operators! So they cannot
be disambiguating any precedence, and they are completely unnecessary:
next if !($file =~ /txt/);
People sometimes say "well, I like to put them in anyway, just to be
sure." This is pure superstition, and we should not tolerate it in
people who purport to be engineers. Engineers should be capable of
making informed choices, based on technical realities, not on some
creepy feeling in their guts that perhaps a failure to sprinkle enough
parentheses over their program will invite the wrath the Moon God.By saying "if you're not sure, just avoid the problem" we are encouraging this kind of fearful, superstitious approach to the issue. That approach would be appropriate if it were the only way to deal with the issue, but fortunately it is not. There is a more rational approach: you can look it up, or even try an experiment, and then you will know whether the parentheses are required in a particular case. Then you can make an informed decision about whether to put them in. But when I teach classes on this topic, people sometimes want to take the argument even further: they want to argue that even if you know the precedence, and even if you know that the parentheses are not required, you should put them in anyway, because the next person to see the code might not know that. And there we see the creeping featurism argument again. It's easy to see the potential benefit of the superfluous parentheses: some hapless novice maintenance programmer might misunderstand the expression if I don't put them in. It's much harder to see the cost: The code is fractionally harder for everyone to read and understand, novice or not. And again, the cost of the extra parentheses to any particular person is so small, so very small, that it is really hard to make the argument against it convincingly. But I think the argument must be made, or else the code will end up on the slag heap much faster than it would have otherwise. Programming cannot be run on the convoy system, with the program code written to address the most ignorant, uneducated programmer. I think you have to assume that the next maintenance programmer will be competent, and that if they do not know what the expression means, they will look up the operator precedence in the manual. That assumption may be false, of course; the world is full of incompetent programmers. But no amount of parentheses are really going to help this person anyway. And even if they were, you do not have to give in, you do not have to cater to incompetence. If an incompetent programmer has trouble understanding your code, that is not your fault; it is their fault for being incompetent. You do not have to take special steps to make your code understandable even by incompetents, and you certainly should not do so at the expense of making it harder for competent programmers to read and understand, no, not to the tiniest degree. The advice that one should always put in the parentheses seems to me to be going in the wrong direction. We should be struggling for higher standards, both for ourselves and for our associates. The conventional advice, it seems to me, is to give up.
[Other articles in category /prog] permanent link Sat, 04 Mar 2006
Structured BASIC
That's what I started with, on the Acorn Electron. And I remember being excited about finding and understanding DEF FN. I also remember my disappointment about how limited it was. I remember my frustration whenever BASIC forced me into writing messy code.I remember my frustration with this too. I realized fairly early on that it was important to organize one's code in a modular fashion. My clearest memory of this was in developing an Adventure-style program. Each of the locations in the world was assigned a sequence number. Location #23 was handled by lines 2300--2399 of the program. Lines 2300--2319 would print the description of the location. Line 2320 would set the variables that recorded the player's location, and called the subroutine to print the descriptions of the other objects at that location. Line 2380 would call the subroutine that prompted the user for their next command. Other lines in between would provide the implementation of whatever special effects were required for that location. All the important utility subroutines were at mnemonic line numbers; the main loop was at line 50000, and the command processing was at 51000. Special handling for objects was in the 40000 range, with one hundred statement numbers reserved for each object. After each user command was processed, control was dispatched back to the appropriate part of the program, depending on where the player was now. Microsoft BASIC didn't have a computed GOTO, so the dispatch was performed by a jump table. I was unhappy with the jump table, recognizing that it didn't scale well. Object sizes and descriptions were stored in a table. I don't know why I didn't store the location descriptions in the table in the same way, but I suspect that I tried and found that my microcomputer didn't have enough string memory. I also discovered that the algorithm that mapped statement numbers to code did not scale well to programs with a lot of numbered statements; editing the program grew intolerably slow once the world contained more than about fifty locations. Still, I was pleased with the outcome. My goal (at the tender age of sixteen, or whatever) had been to adopt conventions that made it easy to extend or modify the world and to add new locations or objects, and I felt at the time that I had achieved that. M. Pagaltzis says:
I guess I have a natural penchant for structured code. Penchant? Instinct.I think anyone who is really interested in writing programs in BASIC and who reflects on the results of his projects is going to come to the conclusion that BASIC is a very poor tool for the job. These problems force themselves on everyone, and if you are thoughtful you will see the problems and try to come up with some techniques to solve them. I really wish I could see those old programs again. I'm sure I would learn a lot from them. I do have some code I wrote in C as long ago as 1987. I remember that shortly after that I got sick of programming and took a vacation from it for a year. One day the following year I was reading netnews, and I overheard a colleague complaining about his CS homework. He had to write a program in C to count the number of occurrences of each word in its input, using a binary tree to store the words. I said he was complaining about nothing and that I, a math major, could turn out such a program in two hours. I don't know why I said this, since I hadn't done any C programming in a year, and I didn't have any significant experience with C, but I was inspired, and I did finish it quickly, and it worked. I have been programming regularly ever since. I still have the source code for that program. Here's the funny thing about the programs from that time: when I look at the pre-vacation programs, they look to me as though they were written by someone else. When I look at the tree-sort program or any other program I have written since then, I recognize it as my own code. I don't know what happened in my brain during my one-year vacation, but my current programming style first emerged in that tree-sort program, and the code from after the break has all been a lot better than the code I wrote before. I'd like to take another vacation, but I can't now, because I have to earn a living.
[Other articles in category /prog] permanent link Mon, 30 Jan 2006
Rotten code in a ProFTPD plugin module
Here's the (exceptionally putrid) (relevant portion of the) code:
static int gss_netio_write_cb(pr_netio_stream_t *nstrm, char *buf,size_t buflen) {
int count=0;
int total_count=0;
char *p;
OM_uint32 maj_stat, min_stat;
OM_uint32 max_buf_size;
...
/* max_buf_size = maximal input buffer size */
p=buf;
while ( buflen > total_count ) {
/* */
if ( buflen - total_count > max_buf_size ) {
if ((count = gss_write(nstrm,p,max_buf_size)) != max_buf_size )
return -1;
} else {
if ((count = gss_write(nstrm,p,buflen-total_count)) != buflen-total_count )
return -1;
}
total_count = buflen - total_count > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
}
return buflen;
}
(You know there's something wrong when the comment says "maximal input
buffer size", but the buffer is for performing output. I have not
looked at any of the other code in this module, which is 2,800 lines
long, so I do not know if this chunk is typical.)
Mr. Colleague suggested that p=p+total_count was wrong, and
should be replaced with p=p+max_buf_size. I agreed that it
was wrong, and that his change would fix the problem, although I
suggested that p += count would be a better change.
Mr. Colleague's change, although it would no longer manifest the bug,
was still "wrong" in the sense that it would leave p pointing
to a garbage location (and incidentally invokes behavior not defined
by the C language standard) whereas my change would leave p
pointing to the end of the buffer, as one would expect.Since this is a maintenance programming task, I recommended that we not touch anything not directly related to fixing the bug at hand. But I couldn't stop myself from pointing out that the code here is remarkably badly written. Did I say "exceptionally putrid" yet? Oh, I did. Good. It stinks like a week-old fish. The first thing to notice is that the expression buflen - total_count appears four times in only nine lines of code—five if you count the buflen > total_count comparison. This strongly suggests that the algorithm would be more clearly expressed in terms of whatever buflen - total_count really is. Since buflen is the total number of characters to be written, and total_count is the number of characters that have been written, buflen - total_count is just the number of characters remaining. Rather than computing the same expression four times, we should rewrite the loop in terms of the number of characters remaining.
size_t left_to_write = buflen;
while ( left_to_write > 0 ) {
/* */
if ( left_to_write > max_buf_size ) {
if ((count = gss_write(nstrm,p,max_buf_size)) != max_buf_size )
return -1;
} else {
if ((count = gss_write(nstrm,p,left_to_write)) != left_to_write )
return -1;
}
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
left_to_write -= count;
}
Now we should notice that the two calls to gss_write are
almost exactly the same. Duplicated code like this can almost always
be eliminated, and eliminating it almost always produces a favorable
result. In this case, it's just a matter of introducing an auxiliary
variable to record the amount that should be written:
size_t left_to_write = buflen, write_size;
while ( left_to_write > 0 ) {
write_size = left_to_write > max_buf_size ? max_buf_size : left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
left_to_write -= count;
}
At this point we can see that write_size is going to be
max_buf_size for every write except possibly the last one, so
we can simplify the logic the maintains it:
size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p=p+total_count;
left_to_write -= count;
}
Even if we weren't here to fix a bug, we might notice something fishy:
left_to_write is being decremented by count, but
p, the buffer position, is being incremented by
total_count instead. In fact, this is exactly the bug that
was discovered by Mr. Colleague. Let's fix it:
size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
total_count = left_to_write > max_buf_size ? total_count + max_buf_size : buflen;
p += count;
left_to_write -= count;
}
We could fix up the line the maintains the total_count
variable so that it would be correct, but since total_count
isn't used anywhere else, let's just delete it.
size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) != write_size )
return -1;
p += count;
left_to_write -= count;
}
Finally, if we change the != write_size test to <
0, the function will correctly handle partial writes, should
gss_write be modified in the future to perform them:
size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p,write_size)) < 0 )
return -1;
p += count;
left_to_write -= count;
}
We could trim one more line of code and one more state change by
eliminating the modification of p:
size_t left_to_write = buflen, write_size = max_buf_size;
while ( left_to_write > 0 ) {
if (left_to_write < max_buf_size)
write_size = left_to_write;
if ((count = gss_write(nstrm,p+buflen-left_to_write,write_size)) < 0 )
return -1;
left_to_write -= count;
}
I'm not sure I think that is an improvement. (My idea is that if we
do this, it would be better to create a p_end variable up
front, set to p+buflen, and then use p_end -
left_to_write in place of p+buflen-left_to_write. But
that adds back another variable, although it's a constant one, and the
backward logic in the calculation might be more confusing than the
thing we were replacing. Like I said, I'm not sure. What do you
think?)Anyway, I am sure that the final code is a big improvement on the original in every way. It has fewer bugs, both active and latent. It has the same number of variables. It has six lines of logic instead of eight, and they are simpler lines. I suspect that it will be a bit more efficient, since it's doing the same thing in the same way but without the redundant computations, although you never know what the compiler will be able to optimize away. Right now I'm engaged in writing a book about this sort of cleanup and renovation for Perl programs. I've long suspected that the same sort of processes could be applied to C programs, but this is the first time I've actually done it.
But the really exciting thing I've learned about code like this is that it doesn't matter if you don't already know how to do it right, because you can turn the wrong code into the right code, as we did here, by noticing a few common problems, like duplicate tests and repeated subexpressions, and applying a few simple refactorizations to get rid of them. That's what my book will be about. (I am also very pleased that it has taken me 37 blog entries to work around to discussing any programming-related matters.)
[Other articles in category /prog] permanent link |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||