Archive:
In this section: Subtopics:
Comments disabled |
Sun, 16 Dec 2012
How I got four errors into a one-line program
A The hook I wrote was basically a one-liner, and the reason I am posting this note is because I found three significant programming errors in it in the first day of use. Here's the first cut:
case $2 in message) perl -i -lpe "s/^(e\d+:\s+)?/$(cs -): /" $1 ;; esacThis is a shell script, but the main purpose is to run the perl one-liner. The shell script gets two arguments: $1 is the
path to the file that contains the proposed commit message.
The $2 argument is a tag which describes the commit's
context; it's merge if the commit is a merge commit, for
example; it's template if the commit message is supplied from
a template via -t on the command line or the
commit.template configuration option. The default is the
empty string, and message, which I have here, means that the
message was supplied with the -m command-line option.
The Perl script edits the commit message file, named in (
There is already an error here, although it's a design error, not an
implementation error: the Perl one-liner is only invoked when
So the first change was to let the hook run in the default case as well
as the
```
case $2 in
""|message)
perl -i -lpe "s/^(e\d+:\s+)?/$(cs -): /" $1
;;
esac
```
This was wrong because it inserts the ticket number at the start of
each line; I wanted it only at the start of the first line. So that
was programming error number 1:
```
case $2 in
""|message)
perl -i -lpe "$. == 1 && s/^(e\d+:\s+)?/$(cs -): /" $1
;;
esac
```
So far, so good.
Bug #2 appeared the first time I tried a rebase. The
There are several ways to fix this. The best way would be to make sure
that the current ticket number was stashed somewhere that
```
case $2 in
""|message)
perl -i -lpe "\$. == 1 && !/^e\d+:\s+/ && s/^/$(cs -): /" $1
;;
esac
```
It probably would have been a good idea to leave an escape hatch, and
have cs emit the value of $ENV{TICKET_NUMBER} if
that is set, to allow invocations like TICKET_NUMBER=e71828 git
commit -m …, but I didn't do it, yet.
The third bug appeared when I did So the hook became:
```
case $2 in
""|message)
perl -i -lpe "\$. == 1 && !/^(squash|fixup)! / && !/^e\d+:\s+/ && s/^/$(cs -): /" $1
;;
esac
```
(The exception for squash is similar to the one for
fixup. I never use squash, but it seemed foolish not
to put it in while I was thinking of it.)This is starting to look a little gross, but in a program this small I can tolerate a little grossness.
I thought it was remarkable that such a small program broke in so many
different ways. Much of that is because it must interact with git,
which is very large and complicated, and partly it is that it must
interact with
But the other two errors would have been avoided had the interface to
the hook been more uniform. There seems to be no reason that rebasing
(or cherry-picking) and
```
case $2 in
""|message)
perl -i -lpe "\$. == 1 && !/^(squash|fixup)! / && !/^e\d+:\s+/ && s/^/$(cs -): /" $1
;;
merge|template|squash|commit)
# do nothing
;;
*) # wat
echo "prepare-message-hook: unknown context '$2'" 1>&2
exit 1;
;;
esac
```
But mine doesn't and I bet a lot of others don't either.
Rewriting published history in Git
- Rewriting published history is not nearly as confusing as people seem to think it will be.
- I worked in a very small shop with very talented developers, so the necessary communication was easy.
- Our repository setup and workflow were very well-designed and unusually effective, and made a lot of things easier, including this one.
If there are
There is a master repository to which only a few very responsible
persons can push. It is understood that history in this repository
should almost never be rewritten, only in the most exceptional
circumstances. We usually call this master repository
In addition, each developer has their own semi-public repository,
named after them, which everyone can read, but which nobody but them
can write. Mine is It is understood that this semi-public repository is my sandbox and I am free to rewrite whatever history I want in it. People building atop my branches in this repo, therefore, know that they should be prepared for me to rewrite the history they see there, or to contact me if they want me to desist for some reason.
When I get the changes in my own semi-public repository the way I want
them, When a junior programmer is ready to deploy to the master repository, they can't do it themselves, because they only have read access on the master. Instead, they publish to their own semi-private repository, and then notify a senior programmer to review the changes. The senior programmer will then push those changes to the master repository and deploy them. The semi-public mjd repo has lots of benefits. I can rewrite
my branches 53 times a day (and I do!) but nobody will
care. Conversely, I don't need to know or care how much my co-workers
vacillate.
If I do work from three or four different machines, I can use the
I can create and abandon many topic branches without cluttering up the
master repository's history. If I want to send a change or a new test
file to a co-worker, I can push it to A related note: There is a lot of FUD around the rewriting of published history. For example, the "gitinfo" robot on the #git IRC channel has a canned message:
Rewriting public history is a very bad idea. Anyone else who may have pulled the old history will have toI think this grossly exaggerates the problems. Very bad! Humiliation! The server may deny you! But dealing with a rebased upstream branch is not very hard. It is at worst annoying: you have to rebase your subsequent work onto the rewritten branch and move any refs that pointed to that branch. If you don't have any subsequent work, you might still have to move refs, if you have any that point to it, but you might not have any. [ Thanks to Rik Signes for helping me put this together. ]
On the consistency of PA
Li Fu said, "The axioms are consistent because they have a model."
More about ZF's asymmetry between union and intersection
Let's review those briefly. The relevant axioms concern the operations by which sets can be constructed. There are two that are important. First is the axiom of union, which says that if !!{\mathcal F}!! is a family of sets, then we can form !!\bigcup {\mathcal F}!!, which is the union of all the sets in the family. The other is actually a family of axioms, the specification axiom schema. It says that for any one-place predicate !!\phi(x)!! and any set !!X!! we can construct the subset of !!X!! for which !!\phi!! holds: $$\{ x\in X \;|\; \phi(x) \}$$ Both of these are required. The axiom of union is for making bigger sets out of smaller ones, and the specification schema is for extracting smaller sets from bigger ones. (Also important is the axiom of pairing, which says that if !!x!! and !!y!! are sets, then so is the two-element set !!\{x, y\}!!; with pairing and union we can construct all the finite sets. But we won't need it in this article.)
Conspicuously absent is an axiom of intersection. If you have a
family !!{\mathcal F}!! of sets, and you want a set of every element that is in
Let's begin by defining this compact notation: $$\bigcap_{(X)} {\mathcal F}$$ for this longer formula: $$\{ x\in X \;|\; \forall f\in {\mathcal F} . x\in f \}$$ This is our intersection of the members of !!{\mathcal F}!!, taken "relative to !!X!!", as we say in the biz. It gives us all the elements of !!X!! that are in every member of !!{\mathcal F}!!. The !!X!! is mandatory in !!\bigcap_{(X)}!!, because ZF makes it mandatory when you construct a set by specification. If you leave it out, you get the Russell paradox. Most of the time, though, the !!X!! is not very important. When !!{\mathcal F}!! is nonempty, we can choose some element !!f\in {\mathcal F}!!, and consider !!\bigcap_{(f)} {\mathcal F}!!, which is the "normal" intersection of !!{\mathcal F}!!. We can easily show that $$\bigcap_{(X)} {\mathcal F}\subseteq \bigcap_{(f)} {\mathcal F}$$ for any !!X!! whatever, and this immediately implies that $$\bigcap_{(f)} {\mathcal F} = \bigcap_{(f')}{\mathcal F}$$ for any two elements of !!{\mathcal F}!!, so when !!{\mathcal F}!! contains an element !!f!!, we can omit the subscript and just write $$\bigcap {\mathcal F}$$ for the usual intersection of members of !!{\mathcal F}!!. Even the usually troublesome case of an empty family !!{\mathcal F}!! is no problem. In this case we have no !!f!! to use for !!\bigcap_{(f)} {\mathcal F}!!, but we can still take some other set !!X!! and talk about !!\bigcap_{(X)} \emptyset!!, which is just !!X!!. Now, let's return to topology. I suggested that we should consider the following definition of a topology, in terms of closed sets, but without an a priori notion of the underlying space:
A
- The union of any two elements of !!{\mathcal F}!! is again in !!{\mathcal F}!!, and
- The intersection of any subfamily of !!{\mathcal F}!! is again in !!{\mathcal F}!!.
It now immediately follows that !!U!! itself is a closed set, since it is the intersection !!\bigcap_{(U)} \emptyset!! of the empty subfamily of !!{\mathcal F}!!. If !!{\mathcal F}!! itself is empty, then so is !!U!!, and !!\bigcap_{(U)} {\mathcal F} = \emptyset!!, so that is all right. From here on we will assume that !!{\mathcal F}!! is nonempty, and therefore that !!\bigcap {\mathcal F}!!, with no relativization, is well-defined. We still cannot prove that the empty set is closed; indeed, it might not be, because even !!M = \bigcap {\mathcal F}!! might not be empty. But as David Turner pointed out to me in email, the elements of !!M!! play a role dual to the extratoplogical points of a topological space that has been defined in terms of open sets. There might be points that are not in any open set anywhere, but we may as well ignore them, because they are topologically featureless, and just consider the space to be the union of the open sets. Analogously and dually, we can ignore the points of !!M!!, which are topologically featureless in the same way. Rather than considering !!{\mathcal F}!!, we should consider !!{\widehat{\mathcal F}}!!, whose members are the members of !!{\mathcal F}!!, but with !!M!! subtracted from each one: $${\widehat{\mathcal F}} = \{\hat{f}\in 2^U \;|\; \exists f\in {\mathcal F} . \hat{f} = f\setminus M \}$$ So we may as well assume that this has been done behind the scenes and so that !!\bigcap {\mathcal F}!! is empty. If we have done this, then the empty set is closed. Now we move on to open sets. An open set is defined to be the complement of a closed set, but we have to be a bit careful, because ZF does not have a global notion of the complement !!S^C!! of a set. Instead, it has only relative complements, or differences. !!X\setminus Y!! is defined as: $$X\setminus Y = \{ x\in X \;|\; x\notin Y\} $$ Here we say that the complement of !!Y!! is taken relative to !!X!!. For the definition of open sets, we will say that the complement is taken relative to the universe of discourse !!U!!, and a set !!G!! is open if it has the form !!U\setminus f!! for some closed set !!f!!. Anatoly Karp pointed out on Twitter that we know that the empty set is open, because it is the relative complement of !!U!!, which we already know is closed. And if we ensure that !!\bigcap {\mathcal F}!! is empty, as in the previous paragraph, then since the empty set is closed, !!U!! is open, and we have recovered all the original properties of a topology.
Here Kelley deals with the empty set and the universe in two sentences, and never worries about them again. In contrast, doing the same thing for closed sets was fraught with technical difficulties, mostly arising from ZF. (The exception was the need to repair the nonemptiness of the minimal closed set !!M!!, which was not ZF's fault.)
On the other hand, perhaps this conclusion is knocking down a straw man. I think working mathematicians probably don't concern themselves much with whether their stuff works in ZF, much less with what silly contortions are required to make it work in ZF. I think day-to-day mathematical work, to the extent that it needs to deal with set theory at all, handles it in a fairly naïve way, depending on a sort of folk theory in which there is some reasonably but not absurdly big universe of discourse in which one can take complements and intersections, and without worrying about this sort of technical detail. [ MathJax doesn't work in Atom or RSS syndication feeds, and can't be made to work, so if you are reading a syndicated version of this article, such as you would in Google Reader, or on Planet Haskell or PhillyLinux, you are seeing inlined images provided by the Google Charts API. The MathJax looks much better, and if you would like to compare, please visit my blog's home site. ]
The non-duality of open and closed sets
We can define a topology without reference to
the underlying space as follows: A family !!{\mathfrak I}!! of sets is a topology
if it is closed under pairwise intersections and arbitrary unions, and
we call a set "open" if it is an element of !!{\mathfrak I}!!. From this we can
recover the omitted axiom that says that !!\emptyset!! is open: it must
be in !!{\mathfrak I}!! because it is the empty union !!\bigcup_{g\in\emptyset}
g!!. We can also recover the underlying space of the topology, or at
least
If we choose to work with closed sets instead, we run into problems.
We can try starting out the same way: A family !!{\mathfrak I}!! of sets is a
co-topology if it is closed under pairwise unions and arbitrary
intersections, and we call a set "closed" if it is an element of
!!{\mathfrak I}!!. But we can no longer prove that !!\emptyset\in{\mathfrak I}!!. We can
still recover an underlying space !!X = \bigcup_{c\in{\mathfrak I}} c!!, but we
cannot prove that !!X!! is closed, or identify any maximal closed set
analogous to the maximal open set of the definition of the previous
paragraph. We can construct a We can repair part of this asymmetry by changing the "pairwise unions" axiom to "finite unions"; then the empty set is closed because it is a finite union of closed sets. But we still can't recover any maximal closed set. Given a topology, it is easy to identify the unique maximal closed set, but given a co-topology, one can't, and indeed there may not be one. The same thing goes wrong if one tries to define a topology in terms of a Kuratowski closure operator. We might like to go on and say that complements of closed sets are open, but we can't, because we don't have a universe of discourse in which we can take complements. None of this may make very much difference in practice, since we usually do have an a priori idea of the universe of discourse, and so we do not care much whether we can define a topology without reference to any underlying space. But it is at least conceivable that we might want to abstract away the underlying space, and if we do, it appears that open and closed sets are not as exactly symmetric as I thought they were. Having thought about this some more, it seems to me that the ultimate source of the asymmetry here is in our model of set theory. The role of union and intersection in ZF is not as symmetric as one might like. There is an axiom of union, which asserts that the union of the members of some family of sets is again a set, but there is no corresponding axiom of intersection. To get the intersection of a family of sets !!\mathcal S!!, you use a specification axiom. Because of the way specification works, you cannot take an empty intersection, and there is no universal set. If topology were formulated in a set theory with a universal set, such as NF, I imagine the asymmetry would go away. [ This is my first blog post using MathJax, which I hope will completely replace the ad-hoc patchwork of systems I had been using to insert mathematics. Please email me if you encounter any bugs. ] [ Addendum 20120823: MathJax depends on executing Javascript, and so it won't render in an RSS or Atom feed or on any page where the blog content is syndicated. So my syndication feed is using the Google Charts service to render formulas for you. If the formulas look funny, try looking at http://blog.plover.com/ directly. ] [ Addendum 20120824: There is a followup to this article. ]
The weird ethics of life insurance
Without this clause, the insurance company might find itself in the business of enabling suicide, or even of encouraging people to commit suicide. Completely aside from any legal or financial problems this would cause for them, it is a totally immoral position to be in, and it is entirely creditable that they should try to avoid it. But enforcement of suicide clauses raises some problems. The insurance company must investigate possible suicides, and enforce the suicide clauses, or else they have no value. So the company pays investigators to look into claims that might be suicides, and if their investigators determine that a death was due to suicide, the company must refuse to pay out. I will repeat that: the insurance company has a moral obligation to refuse to pay out if, in their best judgment, the death was due to suicide. Otherwise they are neglecting their duty and enabling suicide. But the company's investigators will not always be correct. Even if their judgments are made entirely in good faith, they will still sometimes judge a death to be suicide when it wasn't. Then the decedent's grieving family will be denied the life insurance benefits to which they are actually entitled. So here we have a situation in which even if everyone does exactly what they should be doing, and behaves in the most above-board and ethical manner possible, someone will inevitably end up getting horribly screwed. [ Addendum 20120816: It has been brought to my attention that this post constains significant omissions and major factual errors. I will investigate further and try to post a correction. ]
My Git Habits
First the short version: I use
But I use
Often I'll be in the middle of something, with a dirty work tree, when
it's time to leave for the day. Then I'll just commit everything with
the subject So the model is that the current head is usually a terrible mess, accumulating changes as it moves forward in time. When I'm done, I will merge the topic into master and run the tests. If they pass, I am not finished. The merge I just created is only a draft merge. The topic branch is often full of all sorts of garbage, commits where I tried one approach, found it didn't work later on, and then tried a different approach, places where I committed debugging code, and so on. So it is now time to clean up the topic branch. Only the cleaned-up topic branch gets published.
## Cleaning up messy topic branchesThe core of the cleanup procedure is to reset the head back to the last place that look good, possibly all the way back to the merge-base if that is not too long ago. This brings all the topic changes into the working directory. Then:
- Compose the commits: Repeat until the working tree is clean:
- Eyeball the output of
`git-diff` - Think of an idea for an intelligible commit
- Use
`git-add -p`to stage the planned commit - Use
`git diff --cached`to make sure it makes sense - Commit it
- Eyeball the output of
- Order the commits: Use
`git-rebase --interactive`
By separating these tasks, I can proceed something like this: I
eyeball the diff, and the first thing I see is something about the
penguin feature. I can immediately say "Great, I'll make up a commit
of all the stuff related to the penguin feature", and proceed to the
When the time comes to put the commits in order, I can do it well because by then I have abstracted away all the details, and reduced each group of changes to a single atomic unit with a one-line description.
For the most complicated cases, I will print out the diffs, read them
over, and mark them up in six colors of highlighter: code to throw
away gets marked in orange; code that I suspect is erroneous is pink.
I make many notes in pen to remind me how I want to divide up the
changes into commits. When a commit occurs to me I'll jot a numbered
commit message, and then mark all the related parts of the diff with
that number. Once I have the commits planned, I'll reset the topic
ref and then run through the procedure above, using
For simple cases I'll just do a series of The very simplest cases of all require no cleanup, of course.
For example, here's my current topic branch, called
055a2f7 correction to bulk consumer template d9630bd DomainActivator half of Pobox Domain consumer ebebb4a Add HasDomain role to provide ->domain reader for domain consumers ade6ac6 stubbed domain test e170e77 start templates for Pobox domain consumers 067ca81 stubbed Domain::ThumbTwiddler 685a3ee cost calculations for DomainActivator ec8b1cc test fixes; trivial domain test passes now 845b1f2 rename InvoiceCharge::CreateDomain to ..::RegisterDomain (e) 6083a97 add durations to Domain consumers and charges c64fda0 tests for Domain::Activator consumer 41e4292 repeat activator tests for 1-year and 3-year durations 7d68065 tests for activator's replacement (d) 87f3b09 move days_in_year to Moonpig::Util 3cd9f3b WIP e5063d4 add test for sent invoice in domain.t c8dbf41 WIP 9e6ffa4 add missing MakesReplacement stuff fc13059 bring in Net::OpenSRS module (c) 52c18fb OpenSRS interface 893f16f notes about why domain queries might fail (b) f64361f rename "croak" method to "fail" to avoid conflicts 4e500ec Domain::Activator initial_invoice_charge_pairs (a) 3c5cdd4 WIP3c5cdd4 (a) was the end-of-day state for yesterday; I made it and pushed it just before I dashed out the door to go home. Such commits rarely survive beyond the following morning, but if I didn't make them, I wouldn't be able to continue work from home if the mood took me to do that.
f64361f (b) is a prime candidate for later squashing. 5c218fb (c)
introduced a module with a "croak" method. This turned out to be a
stupid idea, because this conflicted with the Similarly, 6083a97 (e) added a days_in_year function that I later decided at 87f3b09 (d) should be in a utility module in a different repository. 87f3b09 will eventually be squashed into 6083a97 so that days_in_year never appears in this code at all. I don't know what is in the WIP commits c8dbf41 or 3cd9f3b, for which I didn't invent commit messages. I don't know why those are left in the tree, but I can figure it out later.
## An example cleanupNow I'm going to clean up this branch. First Igit-checkout -b
cleanup c-domain so that if something goes awry I can start over
completely fresh by doing git-reset --hard c-domain. That's
probably superfluous in this case because origin/c-domain is
also pointing to the same place, and origin is my private
repo, but hey, branches are cheap.
The first order of business is to get rid of those
M lib/Pobox/Moonpig/Consumer/Domain/Activator.pm M lib/Pobox/Moonpig/Role/HasDomain.pm M lib/Pobox/Moonpig/TemplateSet.pm ?? bin/register_domains M t/consumer/domain.t ?? t/lib/MockOpenSRS.pm(This is the output from git-status --short, for which I have
an alias, git s. I use this probably 99 times as often as
plain git-status.)
Not too bad, probably no need for a printout. The new
%Next I'll deal with that new mock object class in t/lib/MockOpenSRS.pm. I'll add that, then use git-add
-p to add the related changes from the other files:
%The git ix command at the end there is an alias for git diff
--cached: it displays what's staged in the index. The output
looks good, so I'll commit it:
%Now I want to see if those tests actually pass. Maybe I forgot something! %The git-stash command hides the unrelated changes from the
test suite so that I can see if the tests I just put into
t/consumer/domain.t work properly. They do, so I bring back
the stashed changes and continue. If they didn't, I'd probably amend
the last commit with git commit --amend and try again.Continuing:
%That last bit should have been part of the "mock OpenSRS object" commit, but I forgot it. So I make a fixup commit, which I'll merge into the main commit later on. A fixup commit is one whose subject begins with fixup!. Did you know that you can name a commit
by writing :/, and it names the most recent commit
whose message contains that text?textIt goes on like that for a while:
%The only uncommitted change left in HasDomain.pm was a
superfluous line, so I just threw it away.
%By this time all the remaining changes belong in the same commit, so I use git-add -u to add them all at once. The working tree is
now clean. The history is as I showed above, except that in place of
the final WIP commit, I have:
a3c0b92 new register_domains utility program 53d704d mock OpenSRS object; add tests a24acd8 Domains do not have explicit start dates 17a915d fixup! mock OpenSRS object; add tests 86e472b Activator consumer can generate special charges 5b2ad2b separate templates for domain-registering and domain-renewing consumers(Again the oldest commit is first.) Now I'll get rid of that fixup!:
%Because of --autosquash, the git-rebase menu is
reordered so that the fixup commit is put just after
the commit it fixes up, and its default action is 'fixup' instead of
'pick'. So I don't need to edit the rebase instructions at all. But
I might as well take the opportunity to put the commits in the right
order. The result is:
a3c0b92 new register_domains utility program ea8dacd Domains do not have explicit start dates 297366a separate templates for domain-registering and domain-renewing consumers 4ef0e28 mock OpenSRS object; add tests c3ab1eb Activator consumer can generate special chargesI have two tools for dealing with cleaned-up branches like this one. One is git-vee, which compares two branches. It's
just a wrapper around the command git log --decorate --cherry-mark
--oneline --graph --boundary . A"..."B
Here's a
comparison the original
%This clearly shows where the original and cleaned up branches diverge, and what the differences are. I also use git-vee to compare
pre- and post-rebase versions of branches (with git-vee
ORIG_HEAD) and local branches with their remote tracking branches
after fetching (with git-vee remote or just plain
git-vee).
A cleaned-up branch should usually have the same final tree as the
tree at the end of the original branch. I have another tool,
%which tells me that they are not the same. Most often this
happens because I threw away all the debugging code that I put in
earlier, but this time it was because of that line of superfluous code
I eliminated from HasDomain.pm. When the treehashes differ, I'll use
git-diff to make sure that the difference is innocuous:
%Okay then. The next task is probably to deal with the older WIP commits. This time I'll omit all the details. But the enclosing procedure looks like this:
%The output of git-treehash says that the tree at the end of
the wip-cleanup branch is identical to the one in the WIP
commit it is supposed to replace, so it's perfectly safe to rebase the
rest of the cleanup branch onto it, replacing the one WIP
commit with the four new commits in wip-cleanup. Now the
cleaned up branch looks like this:
% git-vee marks a commit with an equal sign instead of a star
if it's equivalent to a commit in the other branch. The commits in
the middle marked with equals signs are the ones that weren't changed.
The upper WIP was replaced with five commits, and the lower one with
four.I've been planning for a long time to write a tool to help me with breaking up WIP commits like this, and with branch cleanup in general: It will write each changed hunk into a file, and then let me separate the hunk files into several subdirectories, each of which represents one commit, and then it will create the commits automatically from the directory contents. This is still only partly finished, but I think when it's done it will eliminate the six-color diff printouts.
[ Addendum 20120404: Further observation has revealed that I almost
never use [ Addendum 20120825: There is now a followup article about how to manage rewriting of published history. ]
Why can't Git resolve all conflicted merges?
What we need is a nice example. In the past my example was sort of silly. You have a file that contains the instruction:
Pay potato tax every April 15One branch adds an exception: Pay potato tax every April 15 (Except in years of potato blight.)While another branch broadens the original instruction:
Pay all tax due every April 15What's the correct resolution here? It's easy to understand that mashing together the two changes is a recipe for potential catastrophe:
Pay all tax due every April 15 (Except in years of potato blight.)You get fined for tax evasion after the next potato blight. And it's similarly easy to construct scenarios in which the correct resolution is to leave the whole thing in place including the modifier, change the thing to something else completely, delete the whole thing, or to refer the matter to Legal and shut down the whole system until you hear back. Clearly it's outside Git's scope to recognize when to call in the lawyers, much less to predict what their answer will be. But a few months ago I ran into a somewhat less silly example. At work we had two seprate projects, "Moonpig" and "Stick", each in its own repository. Moonpig contained a subsystem, "Collections", which we decided would make more sense as part of Stick. I did this work, removing the Collections code from the Moonpig project and integrating it into the Stick project. From the point of view of the Moonpig repository, the Collections system was deleted entirely. Meanwhile, on a parallel branch of Moonpig, R.J.B. Signes made some changes that included bug fixes to the Collections. After I removed the collections, he tried to merge his changes into the master branch, and got a merge conflict, because some of the files to which he was making bug fixes were no longer there. The correct resolution was to perform the rest of the merge without the bug fixes, which Git could conceivably have done. But then the unapplied bug fixes needed to be applied to the Collections module that was now in the completely separate Stick project, and there is no way Git could have done this, or even to have known it should be done. Human intervention was the only answer.
It came from... the HOLD SPACE
I recently posted about my work on Zach Holman's
One example of awful crap we had to deal with before Perl etc. were invented was that some people used to write 'sed scripts', although I am really not sure how they did it. I tried once, without much success, and then for this talk I tried again, and again did not have much success. "The hold space" is ased-ism. The basic model of sed is that it
reads the next line of data into the 'pattern space', then applies a
bunch of transformations to it, and then prints it out. If you need
to save this line for later examination, or for emitting later on
instead, you can hold it in the 'hold space'. Use of the hold space
is what distinguishes sed experts from mere sed nobodies like me. So
I planned to talk about the hold space, and then I got the happy idea
to analogize the Hold Space to the Twilight Zone, or maybe the Phantom
Zone, a place where you stick naughty data when you don't want it to
escape. I never feel like audiences appreciate the work I put into
this sort of thing; when I'm giving the talk it always sounds too much
like a private joke. Explaining it just feels like everyone is sitting
through my explanation of a private joke.
The little guy to the right is known
as
(Speaking of teletypewriters, ours had an attachment for punching paper tape, which you can see on the left side of the picture. The punched chads fell into a plastic chad box (which is missing in the picture), and when I was about three I spilled the chad box. Chad was everywhere, and it was nearly impossible to pick up. There were still chads stuck in the cracks in the floorboards when we moved out three years later. That's why, when the contested election of 2000 came around, I was one of the few people in North America who was not bemused to learn that there was a name for the little punched-out bits.)
Anyway, back to
I really wanted to present a tour de force of
On slide
28 I called
If you are interested in the details of the
Several people have asked me to clarify my claim to have invented
My own current version of the [ Addendum 20170325: I have revised this talk a couple of times since this blog article was written. Links to particular slides go to the 2011 versions, but the current version is from 2017. There are only minor changes. For example, I removed `awk` from the list of “monsters”. ]
Insane calculations in bash
A while back, Zach Holman (who I hadn't heard of before, but who is
apparently a bigwig at GitHub) implemented a kind of cute little
hack, called "spark".
It's a little shell utility,
spark 2,4,6,8will print out something like:
▃▄▆▇To do this in one of the 'P' languages (Perl, Python, PHP, Puby, or maybe Pickle) takes something like four lines of code. But M. Holman decided to implement it in bash for maximum
portability, so it took 72 lines, not counting comments, whitespace,
etc.Let's begin by discussing the (very simple) mathematics that underlies drawing bar graphs. Suppose you want to generate a set of bars for the numbers $1, $9, $20. And suppose you can actually generate bars of integer heights only, say integers from 0–7: 0 1 ▁ 2 ▂ 3 ▃ 4 ▄ 5 ▅ 6 ▆ 7 ▇(M. Holman 's original program did this, even though a height-8 bar █ is available. But the mathematics is the same either way.)
## Absolute scalingThe first step is to scale the input numbers onto the range of the bars. To do this, we find a scale factorf that maps dollars
onto bar heights, say that f bar units = $1.A reasonable thing to try is to say that since your largest number is $20, we will set 7 bar units = $20. Then 0.35 bar units = $1, and 3.45 bar units = $9. We'll call these the "natural heights" for the bars. Unfortunately we can't render the bars at their natural heights; we can only render them at integer heights, so we have to round off. 0.35 bar units rounds off to 0, so we will represent $1 as no bar at all. 3.45 bar units rounds off, badly, to 3, but that's the way it goes; if you try to squeeze the numbers from 1 to 20 into the range 0 to 7, something has to give. Anyway, this gives
(1,9,20) → ( ▃▇)The formula is: Let max be the largest input number (here,
20) and let n be the size of the largest possible bar (here, 7).
Then an input number x becomes a bar of size
n·x / max:
$$x\rightarrow {n\cdot x \over max } $$ Note that this mapsmax itself to n, and 0 to 0.I'll call this method "absolute scaling", because big numbers turn into big bars. (It fails for negative numbers, but we'll assume that the numbers are non-negative.)
(0…20) → ( ▁▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▆▇▇)There are a couple of variations we might want to apply. First, maybe we don't like that $1 mapped to no bar at all; it's too hard to see, depending on the context. Perhaps we would like to guarantee that only 0 maps to 0. One way to ensure that is to round everything up, instead of rounding to the nearest integer:
(0…20) → ( ▁▁▂▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇▇) (1,9,20) → (▁▄▇)Another benefit of always rounding up is that it uses the bars equally. Suppose we're mapping numbers in the range 1–100 to bars of heights 1–7. If we round off to the nearest integer, each bar
represents 14 or 15 different numbers, except that the tallest bar
only represents the 8 numbers 93–100. This is a typical situation.
If we always round up, each bar corresponds to a nearly equal range of
numbers. (Another way to adjust this is to replace n with
n+½ in the formula.)
## Relative scalingNow consider the numbers $18, $19, $20. Under the absolute scaling method, we get:
(18,19,20) → (▆▇▇)or, if you're rounding up,
(18,19,20) → (▇▇▇)which obscures the difference between the numbers. There's only an 11% difference between the tallest and shortest bar, and that doesn't show up at this resolution. Depending on your application, this might be what you want, but we might also want to avail ourselves of the old trick of adjusting the baseline. Instead of the bottom of the bar being 0, we can say it represents 17. This effectively reduces every bar by 17 before scaling it, so that the number x is now
represented by a bar with natural height
n·(x−17) / (max−17).
Then we get these bars:
(18,19,20) → (▃▅▇)Whether this "relative scaling" is a better representation than ▇▇▇ depends on the application. It emphasizes different properties of the data.
In general, if we put the baseline at $$x\rightarrow {n\cdot (x-b) \over (max-b) } $$ That is the same formula as before, except that everything has been shifted down byb.
A reasonable choice of
## The shell sucksBut anyway, what I really wanted to talk about was how to fix this program, because I think my solution was fun and interesting. There is a tricky problem, which is that you need to calculate values like (n-b)/(x-b), which so you might like to do
some division, but as I wrote earlier, bash has no
facilities for doing fractional arithmetic. The original program
used $((…)) everywhere, which throws away fractions. You
can work around that, because you don't actually the fractional part
of (n-b)/(x-b); you only need the greatest
integer part. But the inputs to the program might themselves be
fractional numbers, like say 3.5, and $((…)) barfs if you
try to operate on such a number:
$ x=3.5; echo $((x + 1)) bash: 3.5: syntax error: invalid arithmetic operator (error token is ".5")and you seemingly cannot work around that.
My first response to this was to replace all the uses of So supposing that you take this complaint seriously, what can you do? Are you just doomed? No, I found a solution to the problem that solves all the problems. It is portable, efficient, and correct. It is also slightly insane.
## Portable fractions in
We cannot use decimal numbers: |