libjio - A library for journaled I/O 

Alberto Bertogli (albertogli@telpin.com.ar) 

Table of Contents

1 Introduction
2 General on-disk data organization
    2.1 The transaction file
3 The commit procedure
4 The rollback procedure
5 The recovery procedure
6 Advanced flags
    6.1 Avoid rollbacking
    6.2 Skip locking
    6.3 Lingering transactions
7 UNIX API
8 ACID (or How does libjio fit into theory)
9 Working from outside



1 Introduction

libjio is a library for doing journaled,
transaction-oriented I/O, providing atomicity guarantees
and a simple but powerful API.

This document explains the design of the library, how
it works internally and why it works that way. You
should read it even if you don't plan to use the
library in strange ways: it provides (or at least tries
to =) an inside view of how the library performs its
job, which can be very valuable knowledge when working
with it. It assumes some basic knowledge of how the
library is used, which can be found in the manpage or
in the programmer's guide.

To the user, libjio provides two groups of functions:
a UNIX-like one that implements journaled versions
of the classic functions (open(), read(), write() and
friends), and a lower-level one that centers on
transactions and allows the user to manipulate them
directly by providing means of committing and
rolling back. The former, as expected, is based on the
latter and interacts safely with it. Besides, the
library is designed in a way that allows efficient and
safe interaction with I/O performed from outside it,
in case you need that.

The following sections describe the different concepts
and procedures that the library bases its work on. They
are not intended to be a replacement for reading the
source code: please do so if you have any doubts, it's
not big at all (less than 1500 lines, including
comments) and I hope it's readable enough. If you think
that's not the case, please let me know and I'll try to
give you a hand.

2 General on-disk data organization

On the disk, the file you are working on will look
exactly as you expect, without a single bit of
difference from what you would get using the regular
API. But, besides the working file, you will find a
directory named after it where the journaling
information lives.

Inside, there are two kinds of files: the lock file and
the transaction files. The lock file is used as a
general lock and holds the next transaction ID to
assign; there is only one of it. Each transaction file
holds one transaction, which is composed of a
fixed-size header and a variable-size payload; there
can be as many of them as there are in-flight
transactions.

This imposes some restrictions on the kind of 
operations you can perform over a file while it's 
currently being used: you can't move it (because the 
journal directory name depends on the filename) and you 
can't unlink it (for similar reasons). 

These warnings are no different from the ones that
apply to normal simultaneous use under classic UNIX
environments, but they are here to remind you that even
though the library guarantees a lot and eases many
things for its user (especially in complex cases, like
multiple threads using the file at the same time), you
should still be careful when doing strange things with
files while working on them.

2.1 The transaction file

The transaction file is composed of three main parts: 
the header, the payload and the checksum.

The header holds basic information about the
transaction itself, including the ID, some flags, and
the number of operations it includes. Then the payload
has all the operations one after the other, each
divided in two parts: the first one holds static
information about the operation (the length of the
data, the offset of the file where it should be
applied, etc.) and the second one holds the data
itself, which is saved by the library prior to applying
the commit, so transactions can be reapplied if
necessary. The last part is just a 32 bit integer with
the checksum of all the previous data, used for
integrity verification during the recovery process.
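To make the layout a bit more concrete, here is a
minimal sketch in C of how a one-operation transaction
could be laid out in a buffer and verified later. Note
that everything in it is an assumption made for the
example: the sk_* names, the field widths and order, the
host byte order and the toy checksum are hypothetical,
and are not libjio's real on-disk format or checksum
algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative checksum only (a rotating XOR), not libjio's. */
static uint32_t sk_checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	for (size_t i = 0; i < len; i++)
		sum = ((sum << 1) | (sum >> 31)) ^ buf[i];
	return sum;
}

/* Append n bytes from src at position pos; returns the new position. */
static size_t sk_put(unsigned char *out, size_t pos, const void *src,
		size_t n)
{
	memcpy(out + pos, src, n);
	return pos + n;
}

/* Serialize a one-operation transaction into out, in host byte order:
 *   header:    id, flags, number of operations (hypothetical fields)
 *   operation: static part (len, offset), then the data itself
 *   checksum:  32 bit checksum of everything before it
 * Returns the number of bytes used, or 0 if out is too small. */
static size_t sk_serialize(unsigned char *out, size_t outlen, uint32_t id,
		uint64_t offset, const void *data, uint32_t len)
{
	uint32_t flags = 0, numops = 1, sum;
	size_t need = 3 * sizeof(uint32_t)            /* header */
		+ sizeof(uint32_t) + sizeof(uint64_t) /* static part */
		+ (size_t) len                        /* data */
		+ sizeof(uint32_t);                   /* checksum */
	size_t pos = 0;

	assert(out != NULL && data != NULL);
	if (outlen < need)
		return 0;

	pos = sk_put(out, pos, &id, sizeof(id));
	pos = sk_put(out, pos, &flags, sizeof(flags));
	pos = sk_put(out, pos, &numops, sizeof(numops));
	pos = sk_put(out, pos, &len, sizeof(len));
	pos = sk_put(out, pos, &offset, sizeof(offset));
	pos = sk_put(out, pos, data, len);

	sum = sk_checksum(out, pos);
	pos = sk_put(out, pos, &sum, sizeof(sum));
	return pos;
}

/* Returns 1 if the trailing checksum matches the rest of the buffer,
 * 0 otherwise (including buffers too short to hold a checksum). */
static int sk_verify(const unsigned char *buf, size_t len)
{
	uint32_t stored;

	assert(buf != NULL);
	if (len < sizeof(stored))
		return 0;
	memcpy(&stored, buf + len - sizeof(stored), sizeof(stored));
	return stored == sk_checksum(buf, len - sizeof(stored));
}
```

Putting the checksum at the very end means that a
transaction file that was only partially written will
almost certainly fail the check, which is what lets a
recovery pass tell complete transactions from
incomplete ones.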

3 The commit procedure

We call "commit" the action of safely and atomically
writing some given data to the disk.

The former, safely, means that after a commit has been 
done we can assume the data will not get lost and can 
be retrieved, unless of course some major event happens 
(like a physical hard disk crash). For us, this means 
that the data was effectively written to the disk and 
if a crash occurs after the commit operation has 
returned, the operation will be complete and data will 
be available from the file.

The latter, atomically, guarantees that the operation
is either completely done, or not done at all. This is
a really common word, especially if you have worked
with multiprocessing, and should be quite familiar. We
implement atomicity by combining fine-grained locks and
journaling, which together let us both recover from
crashes and have exclusive access to a portion of the
file without any other transaction overlapping it.

Well, so much for talking, now let's get real; libjio 
applies commits in a very simple and straightforward 
way, inside jtrans_commit():

* Open the transaction file

* Write the header

* Lock the file offsets where the commit takes place

* Read all the previous data from the file

* Write the data in the transaction

* Write the data to the file

* Mark the transaction as committed by setting a flag in
  the header

* Unlink the transaction file

* Unlock the offsets where the commit takes place

This may look like a lot of steps, but there are not as
many as it seems inside the code, and they allow
recovery from an interruption at every step of the way
(or even in the middle of a step).
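The steps above can be sketched with plain POSIX calls.
This is not the code of jtrans_commit(), just a
simplified illustration under several assumptions: the
sk_* names and the jpath parameter are hypothetical, the
transaction holds a single operation, and the header,
the "committed" flag and the checksum are left out.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Lock or unlock (type is F_WRLCK or F_UNLCK) a byte range. */
static int sk_range(int fd, short type, off_t off, off_t len)
{
	struct flock fl = { 0 };

	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = off;
	fl.l_len = len;
	return fcntl(fd, F_SETLKW, &fl);
}

/* Commit one write of len bytes at offset off of fd, journaling it in
 * the (hypothetical) transaction file jpath.
 * Returns 0 on success, -1 on error. */
static int sk_commit(int fd, const char *jpath, const void *buf,
		size_t len, off_t off)
{
	int jfd, rv = -1;
	char *prev;
	ssize_t plen;

	assert(buf != NULL && len > 0);
	prev = malloc(len);
	if (prev == NULL)
		return -1;

	/* Open the transaction file (writing the header is elided). */
	jfd = open(jpath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (jfd < 0)
		goto out_free;

	/* Lock the file offsets where the commit takes place. */
	if (sk_range(fd, F_WRLCK, off, (off_t) len) != 0)
		goto out_close;

	/* Read the previous data from the file (short at EOF). */
	plen = pread(fd, prev, len, off);
	if (plen < 0)
		goto out_unlock;

	/* Write the data in the transaction and make sure it is on
	 * the media before touching the real file. */
	if (write(jfd, prev, (size_t) plen) != plen ||
			write(jfd, buf, len) != (ssize_t) len ||
			fsync(jfd) != 0)
		goto out_unlock;

	/* Write the data to the file, synchronously. */
	if (pwrite(fd, buf, len, off) != (ssize_t) len || fsync(fd) != 0)
		goto out_unlock;

	/* Unlink the transaction file: the commit is complete. */
	if (unlink(jpath) == 0)
		rv = 0;

out_unlock:
	/* Unlock the offsets where the commit took place. */
	sk_range(fd, F_UNLCK, off, (off_t) len);
out_close:
	close(jfd);
out_free:
	free(prev);
	return rv;
}
```

Note that on failure the sketch leaves the transaction
file in place, so a later recovery pass would still have
something to look at.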

4 The rollback procedure

First of all, rolling back is like "undoing" a commit:
returning the data to the state it had exactly before a
given commit was applied. Due to the way we handle
commits, doing this operation becomes quite simple and
straightforward.

In the previous section we said that each transaction
holds the data that was in the file before committing.
That data is saved precisely to be able to roll back.
So, to roll back a transaction, all that has to be done
is recover that "previous data" from the transaction we
want to roll back, and save it to the disk. In the end,
this amounts to a new transaction with the previous
data as the new one, so we do exactly that: create a
new transaction structure, fill in the data from the
transaction we want to roll back, and commit it. All
this is performed by jtrans_rollback().

By doing this we can provide the same guarantees a
commit has; it's really fast, eases the recovery, and
the code is simple and clean. What a deal.

But be aware that rolling back is dangerous. And I
really mean it: you should only do it if you're really
sure it's ok. Consider, for instance, that you commit
transaction A, then B, and then you roll back A. If A
and B happen to touch the same portion of the file, the
rollback will, of course, not return the state previous
to B, but the one previous to A. If it's not done
safely, this can lead to major corruption. Now, if you
add to this transactions that extend the file (so that
rolling them back truncates it), you not only have
corruption but data loss. So, again, be aware, I can't
stress this enough: roll back only if you really,
really know what you are doing.
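The A, B, rollback A case is easy to reproduce with a
toy model. The following sketch uses an in-memory buffer
instead of a real file, and its sk_* names and struct
are hypothetical; it only mimics the idea of "rollback
is a commit of the previous data", it is not the
library's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX 16

/* A toy transaction over an in-memory "file": where it writes, the
 * new data, and the previous data saved at commit time. */
struct sk_trans {
	size_t off, len;
	char new_data[SK_MAX];
	char prev_data[SK_MAX];
};

/* Commit: save what is currently there, then overwrite it. */
static void sk_commit(char *file, struct sk_trans *t)
{
	assert(t->len <= SK_MAX);
	memcpy(t->prev_data, file + t->off, t->len);
	memcpy(file + t->off, t->new_data, t->len);
}

/* Rollback: commit a new transaction whose data is the saved one. */
static void sk_rollback(char *file, const struct sk_trans *t)
{
	struct sk_trans undo = { .off = t->off, .len = t->len };

	memcpy(undo.new_data, t->prev_data, t->len);
	sk_commit(file, &undo);
}

/* Commit A, commit B over part of the same range, then roll back A.
 * The caller must pass a buffer of at least 6 bytes. */
static void sk_demo(char *file)
{
	struct sk_trans a = { .off = 0, .len = 4, .new_data = "AAAA" };
	struct sk_trans b = { .off = 2, .len = 4, .new_data = "BBBB" };

	sk_commit(file, &a);    /* "........" becomes "AAAA...." */
	sk_commit(file, &b);    /* then "AABBBB.." */
	sk_rollback(file, &a);  /* then "....BB..": half of B is gone */
}
```

Starting from "........", the final state is "....BB..":
it is neither the state before A nor the state after B,
which is exactly the kind of corruption described above.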

5 The recovery procedure

Recovering from crashes is done by the jfsck() call (or
the program jiofsck, which is just a simple invocation
of that function), which opens the file and goes
through all the transactions in the journal (remember
that transactions are removed from the journal
directory after they're applied), loading them and
rolling them back if necessary. There are several steps
where it can fail: there could be no journal, a given
transaction file might be corrupted, incomplete, and so
on; but in the end, there are two cases regarding each
transaction: either it's complete and can be rolled
back, or not.

If the transaction is not complete, there is no
possibility that it has been partially applied to the
disk: remember from the commit procedure that we only
apply the transaction after saving it in the journal,
so there is really nothing left to be done. If the
transaction is complete, we only need to roll it back.

In any case, after making the recovery you can simply 
remove the journal entirely and let the library create 
a new one, and you can be sure that transaction 
atomicity was preserved.

6 Advanced flags

The library allows setting flags on transactions in
order to support special features and behaviour changes
that might be useful in special cases. In this section,
we describe the most relevant ones.

6.1 Avoid rollbacking

If you are completely sure that you will never need to
roll back a transaction, there is one flag,
J_NOROLLBACK, that tells the library to avoid reading
the rollback information from the file when applying a
transaction. It can be useful when transactions are
very big, or there are several memory constraints, or
reading is really synchronous. It is also very
dangerous, because if for some reason the transaction
fails to apply you will not be able to recover it.

6.2 Skip locking

In some cases, you might not want the library to lock
the file itself, because you need to do it yourself.
For these cases, the flag J_NOLOCK makes the commit
procedure skip locking regions. You need to be quite
careful with this flag, because if you don't take good
care of locking, it will lead to corruption.

6.3 Lingering transactions

We call lingering transactions a small but
interesting variant of the regular transactions
described throughout this text.

If we go back to the commit procedure, we will see that
first we save all the data to the transaction file,
then write the file, and finally remove the transaction
file, so data gets written twice synchronously.

The problem with this approach is performance: it's
quite slow because of all the writes and seeks involved.
Besides, it makes no use of the OS write caching
capabilities, and it optimizes for the uncommon case of
a crash.

Lingering transactions are a special way of dealing with
the transactions we have already seen. After writing 
the transaction file and making sure it has hit the 
media, the data is already safe. So then we write to 
the real file, but this time asynchronously, and let 
the OS perform the write caching and defer the real 
operation to the media. Then, instead of removing the 
transaction file, we leave it. At this point, we know 
the transaction file is safe, but as the real file has 
not been synchronized yet, the data state is still 
uncertain; however, if we crash, there will be enough 
data to recover.

Usually, operating systems do write caching: they delay
the proper write to the media and perform it when the
time is right or when it's forced by an fsync(), so the
performance goes up a lot.

In this mode, you should call jsync() frequently, which 
calls fsync() on the file making sure the data is safe, 
and after that removes all the lingering transactions.

The downsides of lingering transactions are the
additional space needed to hold them, and the fact that
if you crash there will be more transactions to
reapply, so recovery might take longer. But if you
jsync() often, that shouldn't be noticeable.
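The idea can be sketched with plain POSIX calls. Again,
this is only an illustration under assumptions, not the
library's implementation: the sk_* names and the journal
paths are hypothetical, and locking, the previous data
and the transaction header are left out to keep it short.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Lingering commit: the transaction file is written synchronously,
 * the real file is not, and the transaction file is left behind.
 * Returns 0 on success, -1 on error. */
static int sk_linger_commit(int fd, const char *jpath, const void *buf,
		size_t len, off_t off)
{
	int jfd, ok;

	assert(buf != NULL);
	jfd = open(jpath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (jfd < 0)
		return -1;

	/* The transaction file must hit the media before we go on. */
	ok = write(jfd, buf, len) == (ssize_t) len && fsync(jfd) == 0;
	if (close(jfd) != 0 || !ok)
		return -1;

	/* Plain write, no fsync(): the OS may cache and defer it. */
	if (pwrite(fd, buf, len, off) != (ssize_t) len)
		return -1;

	return 0;  /* jpath is left in place on purpose */
}

/* Flush the real file, then remove the lingering transaction files.
 * The order matters: a transaction file may only go away once the
 * data it protects is known to be on the media.
 * Returns 0 on success, -1 on the first error. */
static int sk_sync(int fd, const char *const *jpaths, size_t n)
{
	assert(jpaths != NULL || n == 0);
	if (fsync(fd) != 0)
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (unlink(jpaths[i]) != 0)
			return -1;
	}
	return 0;
}
```

Compared to a regular commit, the synchronous work per
transaction drops to a single fsync() on the transaction
file, and the fsync() on the real file is paid once per
sk_sync() call instead of once per commit.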

7 UNIX API

We call UNIX API the functions provided by the
library that emulate the good old UNIX file
manipulation calls. Most of them are just wrappers
around commits, and implement proper locking when
operating in order to allow simultaneous operations
(either across threads or processes). They are
described in detail in the manual pages; we'll only
list them here for completeness:

* jopen()

* jread(), jpread(), jreadv()

* jwrite(), jpwrite(), jwritev()

* jtruncate()

* jclose()

8 ACID (or How does libjio fit into theory)

I haven't read much theory about this, and the library
was implemented basically by common sense and not
theoretical study.

However, I'm aware that database people like ACID
(well, that's not news for anybody ;), which they say
means "Atomicity, Consistency, Isolation, Durability"
(yeah, right!).

So, even though libjio is not a purely database thing,
it can be used to achieve those attributes in a simple
and efficient way.

Let's take a look one by one:

* Atomicity: In a transaction involving two or more
  discrete pieces of information, either all of the
  pieces are committed or none are. This has been
  talked about before and we've seen how the library
  achieves this point, mostly based on locks and
  relying on a commit procedure.

* Consistency: A transaction either creates a new and
  valid state of data, or, if any failure occurs,
  returns all data to its state before the transaction
  was started. This, like atomicity, has been discussed
  before, especially in the recovery section, when we
  saw how in case of a crash we end up with a fully
  applied transaction, or no transaction applied at all.

* Isolation: A transaction in process and not yet
  committed must remain isolated from any other
  transaction. This comes as a side effect of doing
  proper locking on the sections each transaction
  affects, and guarantees that there can't be two
  transactions working on the same section at the same
  time.

* Durability: Committed data is saved by the system
  such that, even in the event of a failure and system
  restart, the data is available in its correct state.
  For this point we rely on the disk as a method of
  permanent storage, and expect that when we do
  synchronous I/O, data is safely written and can be
  recovered after a crash.

9 Working from outside

If you want, and are careful enough, you can safely do
I/O without using the library. Here I'll give you some
general guidelines that you need to follow in order to
prevent corruption. Of course you can bend or break
them according to your use; this is just a general
overview of how to interact from outside.

* Lock the sections you want to use: the library, as we
  have already explained, relies on fcntl locking; so,
  if you intend to operate on parts of the file while
  using it, you should lock them.

* Don't truncate, unlink or rename: these operations
  have serious implications when they're done while
  using the library, because the library itself assumes
  that names don't change, and files don't disappear
  beneath it. It could potentially lead to corruption,
  although most of the time you would just get errors
  from every call.
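As an example of the first guideline, here is a minimal
sketch of a write done from outside the library while
holding an fcntl() lock on the affected range. The
function name sk_locked_pwrite() is hypothetical, and the
sketch assumes that locking exactly the bytes you write
is enough for your use case.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes at offset off while holding an fcntl() write lock
 * on exactly that range. fd must be open for writing.
 * Returns 0 on success, -1 on error. */
static int sk_locked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	struct flock fl = { 0 };
	int rv = 0;

	/* A length of 0 would mean "until the end of the file" to
	 * fcntl(), so we refuse it here. */
	assert(buf != NULL && len > 0);

	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = off;
	fl.l_len = (off_t) len;

	/* F_SETLKW blocks until nobody else holds the range. */
	if (fcntl(fd, F_SETLKW, &fl) != 0)
		return -1;

	if (pwrite(fd, buf, len, off) != (ssize_t) len)
		rv = -1;

	/* Always release the range, even if the write failed. */
	fl.l_type = F_UNLCK;
	if (fcntl(fd, F_SETLKW, &fl) != 0)
		rv = -1;

	return rv;
}
```

Keep in mind that fcntl() locks are advisory and are
owned by the process, so this protects you from other
processes that also lock, not from other threads of
your own program.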
