

When we create a transaction file, we need to make sure that not only its data
hits the disk, but also the directory metadata, so we can be sure we will be
able to access the file afterwards.

Flushing directory metadata is quite messy because it's not clearly
standardized, so it depends a lot on the OS/filesystem combination.

On some systems, fsync() on a file is guaranteed to also flush the metadata
needed to access the file (Linux/ext3, all BSDs), so nothing else is needed.

On other systems, fsync() on the directory holding the file is needed
(Linux/ext2). This is the proper Linux way to do things.

This gets even weirder, because it is also possible that neither works and
you need a sync() to do it, but the standard allows sync() to return before
the data has really hit the disk (although nobody sane does that these days,
some old systems work this way, e.g. Linux < 1.3.20). Luckily, all current
systems seem to fall within the previous two categories.

God knows what happens over NFS on different client-server combinations. It
will probably work on most, though (at least from reading the source it seems
like the Linux client and server do the right thing).

What this patch does is try to cope with all those cases by always calling
fsync() on the parent directory and, if that fails with EINVAL or EBADF,
falling back to sync(). It's the best I can do.
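
As a standalone illustration of that fallback, here is a minimal sketch. The
helper name flush_dir_metadata() is hypothetical and not part of libjio, and
unlike the patch (which keeps the directory open in fs->jdirfd) it opens the
directory on every call:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of libjio): flush the metadata of the
 * directory at dirpath. Returns 0 on success, -1 on failure. */
int flush_dir_metadata(const char *dirpath)
{
	int dfd, rv, saved_errno;

	dfd = open(dirpath, O_RDONLY);
	if (dfd < 0)
		return -1;

	rv = fsync(dfd);
	if (rv != 0 && (errno == EINVAL || errno == EBADF)) {
		/* fsync() on directories is not supported here, so fall
		 * back to a global sync(); it may still return before the
		 * metadata is on disk, but there is nothing better */
		sync();
		rv = 0;
	}

	/* preserve the fsync() error across close() */
	saved_errno = errno;
	close(dfd);
	errno = saved_errno;

	return rv == 0 ? 0 : -1;
}
```

A caller would fsync() the transaction file first and then call
flush_dir_metadata() on the directory that holds it.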

Linux, FreeBSD, NetBSD, Solaris and MacOS X return OK when doing a directory
fsync(), so this should not cause unnecessary sync()s these days.

For reference, look at the huge number of posts on the subject on lkml, and
read fsync()'s SUSv3 reference.




---

 cur-root/libjio.h |    1 +
 cur-root/trans.c  |   23 +++++++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff -puN trans.c~fsync_dir trans.c
--- cur/trans.c~fsync_dir	2004-07-13 10:51:21.000000000 -0300
+++ cur-root/trans.c	2004-07-13 18:10:34.107495832 -0300
@@ -324,8 +324,21 @@ int jtrans_commit(struct jtrans *ts)
 	 * everything O_SYNC, we sync at this point only, this way we avoid
 	 * doing a lot of very small writes; in case of a crash the
 	 * transaction file is only useful if it's complete (ie. after this
-	 * point) so we only flush here */
-	fsync(fd);
+	 * point) so we only flush here (both data and metadata) */
+	if (fsync(fd) != 0)
+		goto exit;
+	if (fsync(ts->fs->jdirfd) != 0) {
+		/* it seems to be legal that fsync() on directories is not
+		 * implemented, so if this fails with EINVAL or EBADF, just
+		 * call a global sync(); which is awful (and might still
+		 * return before metadata is done) but it seems to be the
+		 * saner choice; otherwise we just fail */
+		if (errno == EINVAL || errno == EBADF) {
+			sync();
+		} else {
+			goto exit;
+		}
+	}
 
 	/* now that we have a safe transaction file, let's apply it */
 	written = 0;
@@ -473,6 +486,12 @@ int jopen(struct jfs *fs, const char *na
 	if (rv < 0 || !S_ISDIR(sinfo.st_mode))
 		return -1;
 
+	/* open the directory, we will use it to flush transaction files'
+	 * metadata in jtrans_commit() */
+	fs->jdirfd = open(jdir, O_RDONLY);
+	if (fs->jdirfd < 0)
+		return -1;
+
 	snprintf(jlockfile, PATH_MAX, "%s/%s", jdir, "lock");
 	jfd = open(jlockfile, O_RDWR | O_CREAT, 0600);
 	if (jfd < 0)
diff -puN libjio.h~fsync_dir libjio.h
--- cur/libjio.h~fsync_dir	2004-07-13 10:51:21.000000000 -0300
+++ cur-root/libjio.h	2004-07-13 18:10:34.107495832 -0300
@@ -24,6 +24,7 @@ extern "C" {
 struct jfs {
 	int fd;			/* main file descriptor */
 	char *name;		/* and its name */
+	int jdirfd;		/* journal directory file descriptor */
 	int jfd;		/* journal's lock file descriptor */
 	int flags;		/* journal flags */
 	pthread_mutex_t lock;	/* a soft lock used in some operations */

_
