Claude and I write a utility program
Then I had two problems…
A few days ago I got angry at xargs for the hundredth time, because
for me xargs is one of those "then he had two problems" technologies.
It never does what I want by default and I can never remember how to
use it. This time what I wanted wasn't complicated: I had a bunch of
PDF documents in /tmp and I wanted to use GPG to encrypt some of
them, something like this:
gpg -ac $(ls *.pdf | menupick)
menupick is a lovely little utility that reads lines from standard input,
presents a menu, prompts on the terminal for a selection from the
items, and then prints the selection to standard output. Anyway, this
didn't work because some of the filenames I wanted had spaces in them,
and the shell sucks. Also because
gpg probably only does one file at a time.
I could have done it this way:
ls *.pdf | menupick | while read f; do gpg -ac "$f"; done
but that's a lot to type. I thought “aha, I'll use xargs .” Then I
had two problems.
ls *.pdf | menupick | xargs gpg -ac
This doesn't work because xargs wants to batch up the inputs to run
as few instances of gpg as possible, and gpg only does one file at
a time. I glanced at the xargs manual looking for the "one at a
time please" option (which should have been the default) but I didn't
see it amongst the forest of other options.
I think now that I needed -n 1 but I didn't find it immediately, and
I was tired of looking it up every time when it was what I wanted
every time. After many years of not remembering how to get xargs to
do what I wanted, I decided the time had come to write a stripped-down
replacement that just did what I wanted and nothing else.
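For the record, the option I was looking for is -n 1 (illustrated here with echo standing in for gpg, so nothing actually gets encrypted):

```shell
# by default xargs crams as many arguments as possible into one command:
printf 'a.pdf\nb.pdf\n' | xargs echo gpg -ac      # one command: gpg -ac a.pdf b.pdf

# -n 1 runs a separate command for each argument:
printf 'a.pdf\nb.pdf\n' | xargs -n 1 echo gpg -ac # two commands, one file each
```

Note that this still does not fix the filenames-with-spaces problem; that needs -0 or -d '\n' on top of it.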
(In hindsight I should perhaps have looked to see if gpg 's
--multifile option did what I wanted, but it's okay that I didn't,
this solution is more general and I will use it over and over in
coming years.)
xar is a worse version of xargs , but worse is better (for me)
First I wrote a comment that specified the scope of the project:
# Version of xargs that will be easier to use
#
# 1. Replace each % with the filename, if there are any
# 2. Otherwise put the filename at the end of the line
# 3. Run one command per argument unless there is (some flag)
# 4. On error, continue anyway
# 5. Need -0 flag to allow NUL-termination
There! It will do one thing well, as Brian and Rob commanded us in
the Beginning Times.
I wrote a draft implementation that did not even do all those things,
just items 2 and 4, then I fleshed it out with item 1. I decided that
I would postpone 3 and 5 until I needed them. (5 at least isn't a
YAGNI, because I know I have needed it in the past.)
The result was this:
import subprocess
import sys

def command_has_percent(command):
    for word in command:
        if "%" in word:
            return True
    return False

def substitute_percents(target, replacement):
    return [ s.replace("%", replacement) for s in target ]

def run_command_with_filename(command_template, filename):
    command = command_template.copy()
    if not command_has_percent(command):
        command.append("%")
    res = subprocess.run(substitute_percents(command, filename), check=False)
    return res.returncode == 0

if __name__ == '__main__':
    template = sys.argv[1:]
    ok = True
    for line in sys.stdin:
        if line.endswith("\n"):
            line = line[:-1]
        if not run_command_with_filename(template, line):
            ok = False
    exit(0 if ok else 1)
Short, clean, simple, easy to use. I called it xar , ran
ls *.pdf | menupick | xar gpg -ac
and was content.
Now again, with Claude
The following day I thought this would be the perfect opportunity to
try getting some LLM help with programming. I already had a baseline
version of xar working, and had thought through the problem
specification. Now I could do it over with the LLM and compare the
two results. The program I wanted was small and self-contained. If
the LLM sabotaged me somehow, I would probably notice, and if I
didn't, it wouldn't matter, because I had a working version
already.
So I asked Claude about it. I knew Claude was not the right choice,
because Claude is not really designed for this task, and has no
integration with the coding editor, but Claude is what I am familiar
with so I thought I'd give it a shot.
On the whole it went extremely well.
The complete transcript is here. I
imagine this was pretty much a maximally good experience, that all the
terrible problems of LLM coding arise in larger, more complicated
systems. But this program is small and self-contained, with no
subtleties, so the LLM could be at its best.
Phase 1
First I just told Claude what I wanted and let it write the program
from scratch.
Let's write a program in Python called xar . It will be a
stripped-down version of the standard Unix xargs utility, but it
will do less and be simpler to use.
It will get command-line
arguments that specify a command template. Then it will read lines
from stdin. Each line will be appended to the command template to
produce a command. The program will run the resulting command, one
command for each line of input.
Any questions?
Default Claude is very chatty and I am going to omit most of its
chatty comments, which make me impatient.
The program it wrote was not what I would have written, but it was
good enough. If I had just used it right off the bat, instead of
writing my own, it would have been enough, and it would have taken
somewhere between 2% and 10% as long to produce.
So the one-line summary of this article is: I should stop writing
simple command-line utilities, and just have Claude write them
instead, because Claude's are good enough, and definitely better than
yak-shaving.
Here's Claude's initial version:
#!/usr/bin/env python3

import sys
import subprocess
import shlex

def main():
    if len(sys.argv) < 2:
        print("Usage: xar command [args...]", file=sys.stderr)
        sys.exit(1)

    command_template = sys.argv[1:]

    for line in sys.stdin:
        line = line.rstrip('\n')
        full_command = command_template + [line]
        try:
            result = subprocess.run(full_command, check=True)
        except subprocess.CalledProcessError as e:
            print(f"Command failed with exit code {e.returncode}: {' '.join(full_command)}",
                  file=sys.stderr)
        except FileNotFoundError:
            print(f"Command not found: {full_command[0]}", file=sys.stderr)
            sys.exit(1)

if __name__ == "__main__":
    main()
Claude's version had numerous comments, which I have omitted. I later
told it to stop putting in comments, which it did.
Claude's use of check here was not what I wanted, because that makes
subprocess.run raise an exception when the subcommand fails, and
then the exception has to be immediately caught and handled. My
original control flow had been simpler:
res = subprocess.run(substitute_percents(command, filename), check=False)
return res.returncode == 0

…

if not run_command_with_filename(template, line):
    ok = False
Claude's program pulled in shlex without noticing that it was
unused. But Claude did teach me about str.rstrip('\n') which I had
not known about before (or maybe had forgotten), so that was one small
win already.
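One subtlety worth noting: rstrip('\n') strips every trailing newline, not just one, so it is not quite the same as my line[:-1]. When iterating over sys.stdin each line carries at most one newline, so for xar's purposes the two are equivalent, but in general they are not:

```python
assert "a.pdf\n".rstrip('\n') == "a.pdf"
assert "a.pdf\n\n".rstrip('\n') == "a.pdf"       # strips every trailing newline
assert "a.pdf\n\n"[:-1] == "a.pdf\n"             # strips exactly one character
assert "name ends in space \n".rstrip('\n') == "name ends in space "  # other whitespace survives
```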
Argument parsing
The next step was a big win. Python's library for command-line
argument handling is called argparse and it is really nice. If I
were still writing programs in Perl, I would implement a Perl version
of argparse because Perl has 29 available argument parsing libraries
and they are all loathsome. The one problem with argparse is I never
remember off the top of my head how to use it. I think the module is
called argparse but it provides a class called Argparser but I
often get these backward and try to use argparser and Argparse .
Instead of figuring it out every time I usually dig up some previous
Python program and then copy-paste the argument parser from there,
amending it to suit the purpose.
But this time I didn't have to do that. Instead, I just said to
Claude:
This is good, now please add code at the top to handle argument
parsing with the standard Argparse library, even though there are
no options yet.
Claude handed me back pretty much the same program, but with the
argument parser at the top:
def parse_arguments():
    """Parse command-line arguments using argparse."""
    parser = argparse.ArgumentParser(
        description="A simplified version of xargs that appends each line from stdin to a command template.",
        prog="xar"
    )

    # Store all positional arguments as the command to execute
    parser.add_argument(
        "command",
        nargs="+",
        help="The command template to execute for each input line"
    )

    return parser.parse_args()
Let's pause for a moment. Maybe you kids are unimpressed by this.
But if someone had sent this interaction back in time and showed it to
me even as late as five years ago, I would have been stunned. It
would have been completely science-fictional. The code that it
produced is not so impressive, maybe, but that I was able to get it in
response to an unstructured request in plain English is like seeing a
nuclear-powered jetpack in action.
I make some changes
It bothered me that the command-line arguments were being taken
implicitly from sys.argv ; that makes the program hard to test. A
better way to do this is to have the list of command-line arguments be
an argument to the main function, and then have the runtime call
main(sys.argv[1:]) instead of just main() . That way you can test
the program without having to fake up sys.argv . So I changed that,
and while I was making changes I got rid of Claude's comments and the
unnecessary exception handling. I pasted the result back at Claude,
saying:
Good so far. I have adjusted your source code in several ways:
- I removed the comments.
- I removed the exception handling for the subcommand. If one of the subcommands fails, that no longer terminates the main loop
- Command-line arguments are passed to the main() function rather than being accessed directly
Here is the code. Tell me if you see any problems or if you have any questions about it
I had also sabotaged the code with at least one likely error.
Python's subprocess.run returns a “result” structure that includes
the subcommand's output and exit status, and I can never remember the
name of the component that records the exit status. I either look it
up or get it by trial-and-error. This time I didn't bother to either,
I just guessed that it was result.exitcode , which I handed off to
Claude even though I thought was probably wrong.
Claude's reply said:
Here are a couple of issues I noticed:
There's an error in the error handling code - you're using result.exitcode but in the subprocess.run() function, the attribute is called returncode not exitcode
Awesome, now I don't have to remember the name of that component.
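(It is indeed returncode, as a quick check confirms:)

```python
import subprocess
import sys

# run two trivial subcommands and inspect their exit statuses
ok = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
bad = subprocess.run([sys.executable, "-c", "raise SystemExit(3)"])
assert ok.returncode == 0
assert bad.returncode == 3
```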
Claude continued:
The parse_arguments() function definition is missing in the pasted code, but it would need to be updated to accept the argv parameter
I had passed sys.argv[1:] to main , and then had main call parse_arguments(argv) .
But I hadn't changed parse_arguments to expect an argument list. (I
hadn't included it in my paste at all, in fact.) Claude was exactly
right, and continued with the corrected version:
#!/usr/bin/env python3

import sys
import subprocess
import argparse

def parse_arguments(argv):
    …
    return parser.parse_args(argv)

def main(argv):
    # Parse command-line arguments
    args = parse_arguments(argv)
    …

if __name__ == "__main__":
    ok = main(sys.argv[1:])
    exit(0 if ok else 1)
Handling NUL-separated values
I said
It's great, thanks for catching those errors. Now let's add a -0 command-line option that tells the program to split the input on "\0" characters instead of at line breaks.
Recall that my handwritten xar did not do this yet; I had left that
feature for another day. Claude just added it for me, adding
parser.add_argument(
    "-0", "--null",
    action="store_true",
    help="Input items are terminated by a null character instead of by newline"
)
to the argument parser and changing the main loop to:
# Split input based on the delimiter option
if args.null:
    # Split on null characters
    items = stdin_data.split('\0')
else:
    # Split on newlines
    items = stdin_data.splitlines()

for item in items:
    if not item:
        continue
    …
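The empty-item check is doing real work here: NUL-terminated input (such as the output of find -print0) ends with a terminator, so splitting leaves an empty final item.

```python
stdin_data = "a.pdf\0b c.pdf\0"            # NUL-terminated, e.g. from find -print0
items = stdin_data.split('\0')
assert items == ["a.pdf", "b c.pdf", ""]   # trailing terminator leaves an empty item
items = [item for item in items if item]   # which the loop must skip
assert items == ["a.pdf", "b c.pdf"]
```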
I was tired of Claude's comments, so I said
Give me the same thing, with no comments
which it did, so I said
From now on leave out the comments. I'm a Python expert and I don't need them. If there is something I don't understand I will ask you.
Claude complied. If I were going to use Claude again in the future
I would include that in the canned instructions that Claude is given
up front. Instead I will probably use a tool better-suited to
programming, and do whatever the analogous thing is.
Template filling
Now I told Claude to add the % feature:
I changed if not item to if item == "" . Now let's make the following change:
- If the command template includes any % characters, each of these should be replaced with the input item.
- Otherwise, if there were no % characters, the input item should be appended to the end of the command as in the current version
Claude did this. It used an explicit loop instead of the list
comprehension that I had used (and preferred), but it did do it correctly:
for arg in command_template:
    if '%' in arg:
        has_placeholder = True
        full_command.append(arg.replace('%', item))
    else:
        full_command.append(arg)

if not has_placeholder:
    full_command.append(item)
Even without the list comprehension, I would have factored
out the common code:
for arg in command_template:
    if '%' in arg:
        has_placeholder = True
    full_command.append(arg.replace('%', item))

if not has_placeholder:
    full_command.append(item)
But I am not going to complain, my code is simpler but is
doing unnecessary work.
Claude also took my hint to change item == "" even though I didn't
explicitly tell it to change that.
At this point the main loop of the main function was 15 lines long,
because Claude had stuck all the % -processing inline. So I said:
Good, let's extract the command template processing into a subroutine.
It did this right, understanding correctly what code I was referring
to and extracting it into a subroutine called
process_command_template . More science fiction: I can say "command
template processing" and it guesses what I had in mind!
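The transcript elides the extracted subroutine itself; the name process_command_template is Claude's, but the body below is my reconstruction from the loop shown above:

```python
def process_command_template(command_template, item):
    """Build a concrete command from the template and one input item."""
    full_command = []
    has_placeholder = False
    for arg in command_template:
        if '%' in arg:
            has_placeholder = True
            full_command.append(arg.replace('%', item))
        else:
            full_command.append(arg)
    if not has_placeholder:
        full_command.append(item)
    return full_command

# no %: item is appended at the end
assert process_command_template(["gpg", "-ac"], "a.pdf") == ["gpg", "-ac", "a.pdf"]
# with %: every % is replaced, even mid-word
assert process_command_template(["cp", "%", "%.bak"], "a b.pdf") == ["cp", "a b.pdf", "a b.pdf.bak"]
```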
This cut the main loop to 7 lines. That worked so well I tried it
again:
Good, now let's extract the part of main that processes stdin into a subroutine that returns the items array
It pulled the correct code into a function called process_stdin . It
did not make the novice mistake of passing the entire args structure
to this function. In the caller it had process_stdin(args.null) and
inside of process_stdin this parameter was named
use_null_delimiter.
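Again the transcript doesn't show the extracted body, but from the main loop above it was presumably something like this (my reconstruction):

```python
import io
import sys

def process_stdin(use_null_delimiter):
    """Read all of stdin and return the non-empty items."""
    stdin_data = sys.stdin.read()
    if use_null_delimiter:
        items = stdin_data.split('\0')
    else:
        items = stdin_data.splitlines()
    return [item for item in items if item != ""]

# quick check with a fake stdin
sys.stdin = io.StringIO("a.pdf\nb c.pdf\n")
assert process_stdin(False) == ["a.pdf", "b c.pdf"]
```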
YAGNI?
At this point I was satisfied but I thought I might as well ask if it
should do something else before we concluded:
Can you think of any features I left out that would be useful enough
to warrant inclusion? Remember this program is supposed to be small
and easy to use, in contrast to the existing xargs which is very
complicated.
Claude had four suggestions:
1. A -p or --parallel option to run commands in parallel
2. A -n or --max-args option to specify the maximum number of items to pass per command
3. A simple -v or --verbose flag to show commands as they're executed
4. A way to replace the command's standard input with the item instead of adding it as an argument
All reasonable suggestions, nothing stupid. (It also supplied code for #3,
which I had not asked for and did not want, but as I said before,
default Claude is very chatty.)
Parallelization
I didn't want any of these, and I knew that #2–4 would be easy to add if I
did want any of them later. But #1 was harder. I've done code like
this in the past, where the program has a worker pool and runs a new
process whenever the worker pool isn't at capacity. It's not even that
hard. In Perl you can play a cute trick and use something like
$workers{spawn()} = 1 while delete $workers{wait()};
where the workers hash maps process IDs to dummy values. A child
exits, wait() awakens and returns the process ID of the completed
child, which is then deleted from the map, and the loop starts another
worker.
I wanted to see how Claude would do it, and the result was an even
bigger win than I had had previously, because Claude wrote this:
with concurrent.futures.ProcessPoolExecutor(max_workers=args.parallel) as executor:
    futures = [executor.submit(execute_command, cmd, args.verbose)
               for cmd in commands]
    for future in concurrent.futures.as_completed(futures):
        success = future.result()
        if not success:
            ok = False
What's so great about this? What's great is that I hadn't known about
concurrent.futures or ProcessPoolExecutor . And while I might have
suspected that something like them existed, I didn't know what they
were called. But now I do know about them.
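Here is a minimal, self-contained sketch of the same pattern (the square worker and run_parallel wrapper are hypothetical, just for illustration; the __main__ guard matters on platforms that spawn rather than fork worker processes):

```python
import concurrent.futures

def square(n):
    return n * n

def run_parallel(values, workers=2):
    # submit one task per value and collect results as workers finish
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(square, v) for v in values]
        return sorted(f.result() for f in concurrent.futures.as_completed(futures))

if __name__ == "__main__":
    print(run_parallel(range(5)))
```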
If someone had asked me to write the --parallel option, I would have
had to have this conversation with myself:
Python probably has something like this already. But how long will
it take me to track it down? And once I do, will the API
documentation be any good, or will it be spotty and incorrect? And
will there be only one module, or will there be three and I will
have to pick the right one? And having picked module F6, will I
find out an hour later that F6 is old and unmaintained and that
people will tell me “Oh, you should have used A1, it is the new
hotness, everyone knows that.”
When I put all that uncertainty on a balance, and weigh it
against the known costs of doing it myself, which one wins?
The right choice is: I should do the research, find the good module (A1, not
F6), and figure out how to use it.
But one of my biggest weaknesses as a programmer is that I too often
make the wrong choice in this situation. I think “oh, I've done this
before, it will be quicker to just do it myself”, and then I do and it
is.
Let me repeat, it is quicker to do it myself. But that is still
the wrong choice.
Maybe the thing I wrote would have been ready sooner, or been smaller or faster or more
technically suitable to the project than the canned module would have been. But it
would only have been more technically suitable today. If it
needed a new feature in the future it might have to be changed by
someone who had never seen it before, whereas the canned module could well
already have the needed feature ready to go, already documented, and
perhaps already familiar to whoever had to make the change. My
bespoke version would certainly be unfamiliar to every other
programmer on the project — including perhaps myself six months later — so would be the wrong thing to use.
I'm really good at hacking this stuff up. Which is a problem. It
makes me want to hack stuff up, even when I shouldn't.
Claude tips the balance strongly toward the correct side, which is
that I should use the prepackaged module that someone else wrote and
not hack something up.
And now I know about concurrent.futures.ProcessPoolExecutor ! The
world is full of manuals, how can I decide which ones I should read?
One way is to read them all, which used to work back when I was
younger, but now I have more responsibilities and I don't have time to
read the entire Python library reference including all the useless
bits I will never use. But here's Claude pointing out to me that this
is something I need to know about, now, today, and I should read this one. That is
valuable knowledge.
Claude flubs
While testing the program I did
xar --parallel=2 bash -c '...'
which didn't work because the argparse library tried to parse the
-c as a flag to xar , when I wanted it to be a flag to bash .
This was my fault, and there is a standard way around it, which is
that I should have said
xar --parallel=2 -- bash -c '...'
which tells argparse to drop the -- and take everything after it
literally with no interpretation.
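argparse handles the -- pseudo-argument out of the box; everything after it is treated as positional. A stub parser along the lines of xar's shows the behavior:

```python
import argparse

parser = argparse.ArgumentParser(prog="xar")
parser.add_argument("--parallel", type=int, default=1)
parser.add_argument("command", nargs="+")

args = parser.parse_args(["--parallel=2", "--", "bash", "-c", "echo hi"])
assert args.parallel == 2
assert args.command == ["bash", "-c", "echo hi"]  # -c was not parsed as a flag
```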
I thought I'd see what Claude had to say about this, and it dropped
the ball in a very LLM sort of way.
I asked:
When I did xar --parallel=2 bash -c '...' it misunderstood the
arguments. It thought I was trying to pass -c to xar and failed
with unrecognized arguments. What should I do about that?
Claude replied:
There are two main approaches to handle this:
- Use the -- separator to indicate the end of arguments for xar: …
- Modify your code to handle this case specifically by looking for a command terminator in the argument parser.
Here's how you could modify your code to implement the first
approach:
And then it followed with the same, unmodified code…
Which was correct, because -- is already handled by argparse , so
no change was actually required to implement the first approach! But
Claude's characterization of it as a modification was misleading,
wrong, and potentially very confusing.
I said:
I don't see what the difference is in that argument parsing code as compared to what I had before.
and Claude got into the LLM tail-chasing mode where it got more and
more confused:
You're right, I apologize for the confusion. There's no difference
in the argument parsing code I provided compared to what you had
before. The issue is with how argparse handles arguments by
default.
To fix the issue with xar --parallel=2 bash -c '...', you need to
actually implement support for the -- separator in your code…
# Find the index of '--' if it exists
try:
    separator_index = argv.index('--')
    xar_args = argv[:separator_index]
    command_args = argv[separator_index+1:]
except ValueError: …
Ugh, all wrong in the worst way. The code probably works, but it is
completely unnecessary. Claude's claim that “you need to actually
implement support for the -- separator” is flat wrong. I pointed
this out and Claude got more confused. Oh well, nobody is perfect!
Lessons learned
A long time ago, when syntax-coloring editors were still new, I tried
one and didn't like it, then tried again a few years later and
discovered that I liked it better than I had before, and not for the
reasons that anyone had predicted or that I would have been able to
predict.
(I wrote an article about the surprising reasons to use the syntax coloring.)
This time also. As usual, an actual experiment produced unexpected
results, because the world is complicated and interesting. Some of
the results were unsurprising, but some were not anything I would have
thought of beforehand.
Claude's code is good enough, but it is not a magic oracle
Getting Claude to write most of the code was a lot faster and easier
than writing it myself. This is good! But I was dangerously tempted
to just take Claude's code at face value instead of checking it
carefully. I quickly got used to flying along at great speed, and it
was tough to force myself to slow down and be methodical, looking over
everything as carefully as I would if Claude were a real junior
programmer. It would be easy for me to lapse into bad habits,
especially if I were tired or ill. I will have to be wary.
Fortunately there is already a part of my brain trained to deal with
bright kids who lack experience, and I think perhaps that part of my brain
will be able to deal effectively with Claude.
I did not notice any mistakes on Claude's part — at least this time.
At one point my testing turned up what appeared to be a bug, but it
was not. The testing was still time well-spent.
Claude remembers the manual better than I do
Having Claude remember stuff for me, instead of rummaging the
manual, is great. Having Claude stub out an argument parser,
instead of copying one from somewhere else, was pure win.
Partway along I was writing a test script and I wanted to use that
Bash flag that tells Bash to quit early if any of the subcommands
fails. I can never remember what that flag is called. Normally I
would have hunted for it in one of my own shell scripts, or groveled
over the 378 options in the bash manual. This time I just asked in
plain English “What's the bash option that tells the script to abort
if a command fails?” Claude told me, and we went back to what we were
doing.
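(The flag in question is set -e, also spelled set -o errexit:)

```shell
# with set -e, the shell exits as soon as any command fails,
# so "after" is never printed and the exit status is 1
out=$(bash -c 'set -e; echo before; false; echo after')
echo "captured: $out"
```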
Claude can talk about code with me, at least small pieces
Claude easily does simple refactors. At least at this scale, it got
them right. I was not expecting this to work as well as it did.
When I told Claude to stop commenting every line, it did. I
wonder, if I had told it to use if not expr only for Boolean
expressions, would it have complied? Perhaps, at least for a
while.
When Claude wrote code I wasn't sure about, I asked it what it was
doing and at least once it explained correctly. Claude had written
parser.add_argument(
    "-p", "--parallel",
    nargs="?",
    const=5,
    type=int,
    default=1,
    help="Run up to N commands in parallel (default: 5)"
)
Wait, I said, I know what the const=5 is doing, that's so that if
you have --parallel with no number it defaults to 5. But what is
the default=1 doing here? I just asked Claude and it told me:
that's used if there is no --parallel flag at all.
This was much easier than it would have been for me to pick over
the argparse manual to figure out how to do this in the first
place.
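Claude's explanation is easy to verify with a stub, which is all the checking this claim needed:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--parallel", nargs="?", const=5, type=int, default=1)

assert parser.parse_args([]).parallel == 1            # no flag at all: default
assert parser.parse_args(["-p"]).parallel == 5        # bare flag, no number: const
assert parser.parse_args(["-p", "3"]).parallel == 3   # explicit value wins
```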
More thoughts
On a different project, Claude might have done much worse. It might
have given wrong explanations, or written wrong code. I think that's
okay though. When I work with human programmers, they give wrong
explanations and write wrong code all the time. I'm used to it.
I don't know how well it will work for larger systems. Possibly pretty
well if I can keep the project sufficiently modular that it doesn't get
confused about cross-module interactions. But if the criticism is
“that LLM stuff doesn't work unless you keep the code extremely
modular” that's not much of a criticism. We all need more
encouragement to keep the code modular.
Programmers often write closely-coupled modules knowing that it is bad
and it will cause maintenance headaches down the line, knowing that the
problems will most likely be someone else's to deal with. But what if
writing closely-coupled modules had an immediate cost today, the cost
being that the LLM would be less helpful and more likely to mess up
today's code? Maybe programmers would be more careful about letting
that happen!
Will my programming skill atrophy?
Folks at Recurse Center were discussing this question.
I don't think it will. It will only atrophy if I let it. And I have a
pretty good track record of not letting it. The essence of
engineering is to pay attention to what I am doing and why, to try to
produce a solid product that satisfies complex constraints, to try
to spot problems and correct them. I am not going to stop doing
this. Perhaps the problems will be different ones than they were
before. That is all right.
Starting decades ago I have repeatedly told people
You cannot just paste code with no understanding of
what is going on and expect it to work.
That was true then without Claude and it is true now with Claude. Why
would I change my mind about this? How could Claude change it?
Will I lose anything from having Claude write that complex
parser.add_argument call for me? Perhaps if I had figured it out
on my own, on future occasions I would have remembered the const=5 and default=1
specifications and how they interacted. Perhaps.
But I suspect that I have figured it out on my own in the past, more
than once, and it didn't stick. I am happy with how it went this time.
After I got Claude's explanation, I checked its claimed behavior pretty
carefully with a stub program, as if I had been reviewing a
colleague's code that I wasn't sure about.
The biggest win Claude gave me was that I didn't know about this
ProcessPoolExecutor thing before, and now I do. That is going to
make me a better programmer. Now I know about something useful that
I didn't know before, and I have a pointer to documentation I know I
should study.
My skill at writing ad-hoc process pool managers might atrophy, but if
it does, that is good. I have already written too many ad-hoc
process pool managers. It was a bad habit, I should have stopped long
ago, and this will help me stop.
Conclusion
This works.
Perfectly? No, it's technology, technology never works perfectly.
Have you ever used a computer?
Will it introduce new problems? Probably, it's new technology, and
new technology always introduces new problems.
But is it better than what we had before? Definitely.
I still see some programmers turning up their noses at this technology
as if they were sure it was a silly fad that would burn itself out
once people came to their senses and saw what a terrible idea it was.
I think that is not going to happen, and those nose-turning-up people,
like the people who pointed out all the drawbacks and unknown-unknowns
of automobiles as compared to horse-drawn wagons, are going to look
increasingly foolish.
Because it works.