The Universe of Disco


Sun, 20 May 2018

Proposal for turning off standard I/O buffering

Some Unix commands, such as grep, will have a command-line flag to say that you want to turn off the buffering that is normally done in the standard I/O library. Some just try to guess what you probably want. Every command is a little different and if the command you want doesn't have the flag you need, you are basically out of luck.

Maybe I should explain the putative use case here. You have some command (or pipeline) X that will produce dribbles of data at uncertain intervals. If you run it at the terminal, you see each dribble timely, as it appears. But if you put X into a pipeline, say with

    X | tee ...

or

    X | grep ...

then the dribbles are buffered and only come out of X when an entire block is ready to be written, and the dribbles could be very old before the downstream part of the pipeline, including yourself, sees them. Because this is happening in user space inside of X, there is not a damn thing anyone farther downstream can do about it. The only escape is if X has some mode in which it turns off standard I/O buffering. Since standard I/O buffering is on by default, there is a good chance that the author of X did not think to affirmatively add this feature.

Note that adding the --unbuffered flag to the downstream grep does not solve the problem; grep will produce its own output timely, but it's still getting its input from X after a long delay.

One could imagine a program which would interpose a pseudo-tty, and make X think it is writing to a terminal, and then the standard I/O library would stay in line-buffered mode by default. Instead of running

    X | tee some-file | ...

or whatever, one would do

    pseudo-tty-pipe -c X | tee some-file | ...

which allocates a pseudo-tty device, attaches standard output to it, and forks. The child runs X, which dribbles timely into the pseudo-tty while the parent runs a read loop to remove dribbles from the master end of the TTY and copy them timely into the pipe. This would work. Although tee itself also has no --unbuffered flag so you might even have to:

    pseudo-tty-pipe -c X | pseudo-tty-pipe -c 'tee some-file' | ...

I don't think such a program exists, and anyway, this is all ridiculous, a ridiculous abuse of the standard I/O library's buffering behavior: we want line buffering, the library will only give it to us if the process is attached to a TTY device, so we fake up a TTY just to fool stdio into giving us what we want. And why? Simply because stdio has no way to explicitly say what we want.

But it could easily expose this behavior as a controllable feature. Currently there is a branch in the library that says how to set up a buffering mode when a stream is opened for the first time:

  • if the stream is for writing, and is attached to descriptor 2, it should be unbuffered; otherwise …

  • if the stream is for writing, and connects descriptor 1 to a terminal device, it should be line-buffered; otherwise …

  • if the moon is waxing …

  • otherwise, the stream should be block-buffered

To this, I propose a simple change, to be inserted right at the beginning:

If the environment variable STDIO_BUF is set to "line", streams default to line buffering. If it's set to "none", streams default to no buffering. If it's set to "block", streams default to block buffered. If it's anything else, or unset, it is ignored.

Now instead of this:

    pseudo-tty-pipe --from X | tee some-file | ...

you write this:

    STDIO_BUF=line X | tee some-file | ...

Problem solved.

Or maybe you would like to do this:

    export STDIO_BUF=line

which then it affects every program in every pipeline in the rest of the session:

    X | tee some-file | ...

Control is global if you want it, and per-process if you want it.

This feature would cost around 20 lines of C code in the standard I/O library and would impose only an insigificant run-time cost. It would effectively add an --unbuffered flag to every program in the universe, retroactively, and the flag would be the same for every program. You would not have to remember that in mysql the magic option is -n and that in GNU grep it is --line-buffered and that for jq is is --unbuffered and that Python scripts can be unbuffered by supplying the -u flag and that in tee you are just SOL, etc. Setting STDIO_BUF=line would Just Work.

Programming languages would all get this for free also. Python already has PYTHONUNBUFFERED but in other languages you have to do something or other; in Perl you use some horrible Perl-4-ism like

    { my $ofh = select OUTPUT; $|++; select $ofh }

This proposal would fix every programming language everywhere. The Perl code would become:

    $ENV{STDIO_BUF} = 'line';

and every other language would be similarly simple:

    /* In C */
    putenv("STDIO_BUF=line");

[ Addendum 20180521: Mariusz Ceier corrects me, pointing out that this will not work for the process’ own standard streams, as they are pre-opened before the process gets a chance to set the variable. ]

It's easy to think of elaborations on this: STDIO_BUF=1:line might mean that only standard output gets line-buffering by default, everything else is up to the library.

This is an easy thing to do. I have wanted this for twenty years. How is it possible that it hasn't been in the GNU/Linux standard library for that long?

[ Addendum 20180521: it turns out there is quite a lot to say about the state of the art here. In particular, NetBSD has the feature very much as I described it. ]


[Other articles in category /Unix] permanent link