The Universe of Discourse


Sat, 08 Dec 2007

Corrections about sync(2)
I made some errors in today's post about sync and fsync.

Most important, I said that "the sync() system call marks all the kernel buffers as dirty". This is totally wrong, and doesn't even make sense. Dirty buffers are those with data that needs to be written out. Marking a non-dirty buffer as dirty is a waste of time, since nothing has changed in the buffer, but it will now be rewritten anyway. What sync() does is schedule all the dirty buffers to be written as soon as possible.

On some recent systems, sync() actually waits for all the dirty buffers to be written, and a bunch of people tried to correct me about this. But my original article was right: historically, it was not so, and even today it's not universally true. In former times, sync() would schedule the buffers for writing, and then return before the data was actually written.
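
Here's a tiny sketch of the distinction (error checking omitted, and the filename is made up): sync() schedules the writes system-wide, while fsync() on a particular descriptor actually waits for that file's data to reach the disk.

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("important.dat", O_WRONLY | O_CREAT, 0644);
                write(fd, "data\n", 5);

                sync();         /* schedules *every* dirty buffer for writing;
                                   historically this returned at once, before
                                   the data was actually on the disk */

                fsync(fd);      /* by contrast, this does not return until
                                   this file's buffers have been written */

                close(fd);
                return 0;
        }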

I said that one of the duties of init was to call sync() every thirty seconds, but this was mistaken. That duty actually fell to a separate program, known as update. While discussing this with one of the readers who wrote to correct me, I looked up the source for Version 7 Unix, to make sure I was right, and it's so short I thought I might as well show it here:

        /*
         * Update the file system every 30 seconds.
         * For cache benefit, open certain system directories.
         */

        #include <signal.h>

        char *fillst[] = {
                "/bin",
                "/usr",
                "/usr/bin",
                0,
        };

        main()
        {
                char **f;

                if(fork())
                        exit(0);
                close(0);
                close(1);
                close(2);
                for(f = fillst; *f; f++)
                        open(*f, 0);
                dosync();
                for(;;)
                        pause();
        }

        dosync()
        {
                sync();
                signal(SIGALRM, dosync);
                alarm(30);
        }
The program is so simple I don't have much more to say about it. It initially invokes dosync(), which calls sync() and then schedules another call to dosync() in 30 seconds. Note that the second argument to open is a literal 0, meaning read-only; the symbolic constant O_RDONLY had not been invented yet. The pause() call suspends the process until a signal arrives, so the main loop simply sleeps until the next SIGALRM causes dosync() to run again.

In systems more recent than V7, the program was known by various names, but it was update for a very long time.
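
For comparison, a rough modern rendering of the same loop might look something like this. This is only a sketch, using sigaction() in place of the old signal() interface and leaving out the directory-caching trick; it is not any real system's update.

        #include <signal.h>
        #include <string.h>
        #include <unistd.h>

        static void wake(int sig)
        {
                (void)sig;      /* nothing to do; the point is just to interrupt pause() */
        }

        int main(void)
        {
                struct sigaction sa;

                memset(&sa, 0, sizeof sa);
                sa.sa_handler = wake;
                sigemptyset(&sa.sa_mask);
                sigaction(SIGALRM, &sa, NULL);

                for (;;) {
                        sync();         /* schedule all dirty buffers for writing */
                        alarm(30);      /* ask for a SIGALRM in 30 seconds */
                        pause();        /* sleep until that signal arrives */
                }
        }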

Several people wrote to correct me about the:

        # sync
        # sync
        # sync
        # halt
thing, some saying that I had the reason wrong, or that it did not make sense, or that only two syncs were used, rather than three. But I had it right. People did use three, and they did it for the reason I said, whether that makes sense or not. (Some of the people who miscorrected me were unaware that sync() would return before the data was actually written.) See, for example, this old Usenet thread for a discussion of the topic that confirms what I said.

Nobody disputed my contention that Linus was suffering from the promptings of the Evil One when he tried to change the semantics of fsync(), and nobody seems to know the proper name of the false god of false efficiency. I'll give this some thought and see what I can come up with.

Thanks to Tony Finch, Dmitry Kim, and Stefan O'Rear for discussion of these points.



Dirty, dirty buffers!
One side issue that arose during my talk on Monday about inodes was the write-buffering normally done by Unix kernels. I wrote a pretty long note to the PLUG mailing list about it, and I thought I'd repost it here.

When your process asks the kernel to write data:

        int bytes_written = write(file_descriptor,
                                  buffer,
                                  n_bytes);
the kernel normally copies the data from your buffer into a kernel buffer, and then, instead of writing out the data to disk, it marks its buffer as "dirty" (that is, as needing to be written eventually), and reports success back to the process immediately, even though the dirty buffer has not yet been written, and the data is not yet on the disk.

Normally, the kernel writes out the dirty buffer in due time, and the data makes it to the disk, and you are happy because your process got to go ahead and do some more work without having to wait for the disk, which could take milliseconds. ("A long time", as I so quaintly called it in the talk.) If some other process reads the data before it is written, that is okay, because the kernel can give it the updated data out of the buffer.

But if there is a catastrophe, say a power failure, then you see the bad side of this asynchronous writing technique, because the data, which your process thought had been written, and which the kernel reported as having been written, has actually been lost.
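
To make that concrete, here is a small sketch (the filename and the minimal error handling are only illustrative): the write() below can report success even though nothing has reached the disk yet.

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
                if (fd < 0)
                        return 1;

                /* This "succeeds" as soon as the data has been copied into a
                   dirty kernel buffer; it is not necessarily on the disk yet. */
                if (write(fd, "entry\n", 6) != 6)
                        return 1;

                /* If the power fails right here, the entry may be lost,
                   even though write() reported success. */

                close(fd);
                return 0;
        }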

There are a number of mechanisms in place to deal with this. The oldest is the sync() system call, which marks all the kernel buffers as dirty. All Unix systems run a program called init, and one of init's principal duties is to call sync() every thirty seconds or so, to make sure that the kernel buffers get flushed to disk at least every thirty seconds, and so that no crash will lose more than about thirty seconds' worth of data.

(There is also a command-line program sync which just does a sync() call and then exits, and old-time Unix sysadmins are in the habit of halting the system with:

        # sync
        # sync
        # sync
        # halt
because the second and third syncs give the kernel time to actually write out the buffers that were marked dirty by the first sync. Although I suspect that few of them know why they do this. I swear I am not making this up.)

But for really crucial data, sync() is not enough, because, although it marks the kernel buffers as dirty, it still does not actually write the data to the disk.

So there is also an fsync() call; I forget when this was introduced. The process gives fsync() a file descriptor, and the call demands that the kernel actually write the associated dirty buffers to disk, and it does not return until they have been written. And since, unlike write(), it actually waits for the data to go to the disk, a successful return from fsync() indicates that the data is truly safe.

The mail delivery agent will use this when it is writing your email to your mailbox, to make sure that no mail is lost.
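
A minimal sketch of that write-then-fsync pattern (my example; the function and mailbox path are made up, and a real delivery agent does considerably more):

        #include <fcntl.h>
        #include <unistd.h>

        /* Append a message and do not report success until it is on disk. */
        int deliver(const char *mbox, const char *msg, size_t len)
        {
                int fd = open(mbox, O_WRONLY | O_APPEND);
                if (fd < 0)
                        return -1;

                if (write(fd, msg, len) != (ssize_t)len) {
                        close(fd);
                        return -1;
                }

                /* fsync() does not return until the dirty buffers for
                   this file have actually been written to the disk. */
                if (fsync(fd) < 0) {
                        close(fd);
                        return -1;
                }

                return close(fd);
        }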

Some systems have an O_SYNC flag that the process can supply when it opens the file for writing:

        int fd = open("blookus", O_WRONLY | O_SYNC);
This sets the O_SYNC flag in the kernel file pointer structure, which means that whenever data is written to this file pointer, the kernel, contrary to its usual practice, will implicitly fsync() the descriptor.
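
For example, here is a sketch of a synchronous write along those lines (illustrative only; the data and the extra open flags are arbitrary):

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("blookus", O_WRONLY | O_CREAT | O_SYNC, 0644);
                if (fd < 0)
                        return 1;

                /* With O_SYNC set, this write() does not return until the
                   data has been written to the disk, much as if each write
                   were followed by an fsync(). */
                if (write(fd, "precious\n", 9) != 9)
                        return 1;

                close(fd);
                return 0;
        }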

Well, that's not what I wanted to write about here. What I meant to discuss was...

No, wait. That is what I wanted to write about. How about that?

Anyway, there's an interesting question that arises in connection with fsync(): suppose you fsync() a file. That guarantees that the data will be written. But does it also guarantee that the mtime and the file extent of the file will be updated? That is, does it guarantee that the file's inode will be written?

On most systems, yes. But on some versions of Linux's ext2 filesystem, no. Linus himself broke this as a sacrifice to the false god of efficiency, a very bad decision in my opinion, for reasons that should be obvious to everyone but those in the thrall of Mammon. (Mammon's not right here. What is the proper name of the false god of efficiency?)

Sanity eventually prevailed. Recent versions of Linux have an fsync() call, which updates both the data and the inode, and an fdatasync() call, which guarantees only to update the data.
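
Sketched out, the difference looks like this (my example; whether fdatasync() can actually skip the inode update in a given case depends on what metadata is needed to get the data back):

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("blookus", O_WRONLY | O_CREAT, 0644);
                if (fd < 0)
                        return 1;

                write(fd, "data\n", 5);

                fsync(fd);      /* waits for the data *and* the inode
                                   (mtime, file size, and so on) */

                write(fd, "more\n", 5);

                fdatasync(fd);  /* guarantees only the data itself, plus any
                                   metadata needed to retrieve it; a pure
                                   mtime update may be skipped */

                close(fd);
                return 0;
        }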

[ Addendum 20071208: Some of this is wrong. I posted corrections. ]

