Sat, 08 Dec 2007
Dirty, dirty buffers!
When your process asks the kernel to write data:
int bytes_written = write(file_descriptor, buffer, n_bytes);the kernel normally copies the data from your buffer into a kernel buffer, and then, instead of writing out the data to disk, it marks its buffer as "dirty" (that is, as needing to be written eventually), and reports success back to the process immediately, even though the dirty buffer has not yet been written, and the data is not yet on the disk.
Normally, the kernel writes out the dirty buffer in due time, and the data makes it to the disk, and you are happy because your process got to go ahead and do some more work without having to wait for the disk, which could take milliseconds. ("A long time", as I so quaintly called it in the talk.) If some other process reads the data before it is written, that is okay, because the kernel can give it the updated data out of the buffer.
But if there is a catastrophe, say a power failure, then you see the bad side of this asynchronous writing technique, because the data, which your process thought had been written, and which the kernel reported as having been written, has actually been lost.
There are a number of mechanisms in place to deal with this. The oldest is the sync() system call, which marks all the kernel buffers as dirty. All Unix systems run a program called init, and one of init's principal duties is to call sync() every thirty seconds or so, to make sure that the kernel buffers get flushed to disk at least every thirty seconds, and so that no crash will lose more than about thirty seconds' worth of data.
(There is also a command-line program sync which just does a sync() call and then exits, and old-time Unix sysadmins are in the habit of halting the system with:
# sync # sync # sync # haltbecause the second and third syncs give the kernel time to actually write out the buffers that were marked dirty by the first sync. Although I suspect that few of them know why they do this. I swear I am not making this up.)
But for really crucial data, sync() is not enough, because, although it marks the kernel buffers as dirty, it still does not actually write the data to the disk.
So there is also an fsync() call; I forget when this was introduced. The process gives fsync() a file descriptor, and the call demands that the kernel actually write the associated dirty buffers to disk, and does not return until they have been. And since, unlike write(), it actually waits for the data to go to the disk, a successful return from fsync() indicates that the data is truly safe.
The mail delivery agent will use this when it is writing your email to your mailbox, to make sure that no mail is lost.
Some systems have an O_SYNC flag than the process can supply when it opens the file for writing:
int fd = open("blookus", O_WRONLY | O_SYNC);This sets the O_SYNC flag in the kernel file pointer structure, which means that whenever data is written to this file pointer, the kernel, contrary to its usual practice, will implicitly fsync() the descriptor.
Well, that's not what I wanted to write about here. What I meant to discuss was...
No, wait. That is what I wanted to write about. How about that?
Anyway, there's an interesting question that arises in connection with fsync(): suppose you fsync() a file. That guarantees that the data will be written. But does it also guarantee that the mtime and the file extent of the file will be updated? That is, does it guarantee that the file's inode will be written?
On most systems, yes. But on some versions of Linux's ext2 filesystem, no. Linus himself broke this as a sacrifice to the false god of efficiency, a very bad decision in my opinion, for reasons that should be obvious to everyone but those in the thrall of Mammon. (Mammon's not right here. What is the proper name of the false god of efficiency?)
Sanity eventually prevailed. Recent versions of Linux have an fsync() call, which updates both the data and the inode, and a fdatasync() call, which only guarantees to update the data.
[ Addendum 20071208: Some of this is wrong. I posted corrections. ]