================
nmdb User Guide
================
:Author: Alberto Bertogli (albertito@gmail.com)


Introduction
============

nmdb_ is a simple and fast cache and database for TIPC_ clusters. It allows
applications in the cluster to use a centralized, shared cache and database in
a very easy way. It stores *(key, value)* pairs, with each key having only one
associated value.

This document explains how to setup nmdb and a simple guide to writing
clients. It also includes a "quick start" section for the anxious.


Installing nmdb
===============

If you installed nmdb using your Linux distribution package system, you can
skip this section entirely.


Prerequisites
-------------

Before you install nmdb, you will need the following software:

- libevent_, a library for fast event handling.
- `Linux kernel`_ 2.6.16 or newer, compiled with TIPC_ support.
- QDBM_, for the database backend.


Compiling and installing
------------------------

There are three components of the nmdb tarball: the server in the *nmdb/*
directory, the C library in *libnmdb/*, and the Python module in *python/*.

- To install the server, run ``cd nmdb; make install``.
- To install the C library, run ``cd libnmdb; make install; ldconfig``.
- To install the Python module, run ``cd python; python setup.py install``.


Quick start
===========

For a very quick start, using a single host, you can do the following::

  # dpmgr create /tmp/nmdb-db   # create the backend database
  # nmdb -d /tmp/nmdb-db        # start the server

At this point you have created a database and started the server. An easy and
simple way to test it is to use the python module, like this::

  # python
  Python 2.5 (r25:51908, Sep 21 2006, 20:38:23)
  [GCC 4.1.1 (Gentoo 4.1.1)] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import nmdb               # import the module
  >>> db = nmdb.DB()            # connect to the server
  >>> db['x'] = 1               # store some data
  >>> db[(1, 2)] = (2, 6)
  >>> print db['x'], db[(1, 2)] # retreive the values
  1 (3, 5)
  >>> del db['x']               # delete from the database

Everything should have worked as shown, and you are now ready to use some
nmdb application, or develop your own.

If you want to use this with several machines, read the next section to find
out how to setup a simple TIPC cluster.


TIPC setup
==========

If you want to use the server and the clients in different machines, you need
to setup your TIPC network. If you just want to run everything in one machine,
or you already have a TIPC network set up, you can skip this section.

Before we begin, all the machines should already be connected in an Ethernet
LAN, and have the tipc-config application that should come with your Linux
distribution with a package named "tipcutils" or similar (if it doesn't, you
can find it at http://tipc.sourceforge.net/download.html).

The only thing you will need to do is assign each machine a TIPC address and
specify which interface to use for the network connection. You do it like
this::

  # tipc-config -a=1.1.10 -be=eth:eth0

The *-a* parameter specifies the address, and *-be* the type and name of the
interface to use.

Addresses are composed of three integers. They represent the zone number, the
cluster number, and the node number respectively. The zone number and cluster
number should be the same for all nodes in your network, so you should change
the last one for each machine. Each machine can have only one address.

That should be enough to get you started for a small network. If you have a
very big network, or want to use some of the advanced TIPC features like link
redundancy, you should read TIPC's docs.


Example
-------

If you have five machines, you can assign each one their address like this::

  box1# tipc-config -a=1.1.1 -be=eth:eth0
  box2# tipc-config -a=1.1.2 -be=eth:eth0
  box3# tipc-config -a=1.1.3 -be=eth:eth0
  box4# tipc-config -a=1.1.4 -be=eth:eth0
  box5# tipc-config -a=1.1.5 -be=eth:eth0


Starting the server
===================

Before starting the server, there are some things you need to know about it:

Port numbers
  Each server instance in your network (even the ones running in the same
  machine) should get a **unique** port to listen to requests. Ports identify
  an application instance inside the whole network, not just the machine as in
  TCP/IP.

  The port space is very very large, and it's private to nmdb, so you can
  choose numbers without fear of colliding with other TIPC applications. The
  default port is 10.

  So, if you are going to start more than one nmdb server, **be careful**. If
  you assign two active servers the same port you will get no error, but
  everything will act weird.

Cache size
  nmdb's cache is a main component of the server. In fact you can use it
  exclusively for caching purposes, like memcached_. So the size becomes an
  important issue if you have performance requirements.
  
  It is only possible to limit the cache size by the maximum number of objects
  in the cache.

Backend database
  You will need to create a backend database using QDBM_'s utilities. This is
  quite simple, just run ``dpmgr create /path/to/the/database`` and you're
  done.

  If for some reason (hardware failure, for instance) the database becomes
  corrupt, you should use QDBM's utilities to fix it. It shouldn't happen, so
  it's a good idea to report it if it does.

  QDBM databases are not meant to be shared among processes, so avoid having
  other processes using them.

Database redundancy
  If you want to have redundancy over the database, you can start a "passive
  server" along a normal one using the same port number. It will listen to
  database requests and act upon them, but it will not reply anything.

  It is only useful to keep a live mirror of the database. Note that it does
  not do replication or failure detection, it's just a mirror.

  This is the only case where you want to start two servers with the same port.

Distributed queries
  If you have more than one server in the network, the library can distribute
  the queries among them. This is entirely done on the client side and the
  server doesn't know about it.


Now that you know all that, starting a server should be quite simple: first
create the database as explained above, and then run the daemon with
``nmdb -d /path/to/the/database``.

To change the port, use ``-l port``, to change the cache size, use ``-c nobj``
(where *nobj* is the number of objects in thousands), to make the server
passive, use ``-p``. Of course you won't remember all that (I know I don't),
that's why ``-h`` is your friend.

Nothing prevents you from starting more than one server in the same machine,
so be careful to select different ports and databases for each one.


Example
-------

Following the previous example, if you want to start three servers you can do
it like this::

  box1# ndbm -d /var/lib/nmdb/db-1 -l 11
  box2# ndbm -d /var/lib/nmdb/db-2 -l 12
  box3# ndbm -d /var/lib/nmdb/db-3 -l 13


Writing clients
===============

At the moment you can write clients in C (documented in the *libnmdb*'s
manpage) and in Python (documented using Python docstrings). In this guide we
will give some examples of common use as an introduction, you should consult
the appropriate documentation when doing serious development.

Before we begin, you should know about the following things:

Thread safety
  While the library itself is thread safe, neither the C library connections
  nor the Python objects are. So don't share *nmdb_t* variables (C) or
  *nmdb.** objects (Python) among threads; instead, create one for each thread
  that needs it.

Available operations
  You can request the server to do three operations: *set* a value to a key,
  *get* the value associated with the given key, and *delete* a given key
  (with its associated value).

Request modes
  For each operation, you will have three different modes available:

  - A *normal mode*, which makes the operation impact on the database
    asynchronously (ie.  the functions return right after the operation was
    queued, there is no completion notification).
  - A *synchronous mode* similar to the previous one, but when the functions
    return, the operation has hit the disk.
  - A *cache-only mode* where the operations do not impact the database, only
    the cache, and can be used to implement distributed caching in a similar
    way to memcached_.

  Be careful with the last one, because mixing cache-only with database
  operations is a recipe for disaster.

Atomicity and coherence
  All operations are atomic, and synchronous and asynchronous operations are
  fully coherent.

Distributed queries
  You can distribute your queries among several servers, and this is entirely
  done on the client side. To do this, you should add each server (identified
  by their port numbers) to the connection **before beginning to interact with
  them**.


For all examples we will assume that you have three servers running in your
network, in ports 11, 12 and 13.


The Python module
------------------

The Python module it's quite easy to use, because its interface is very
similar to a dictionary. It has similar limitations regarding the key (it must
be an object you can use as a key in a dictionary), and the values must be
pickable objects (see the *pickle* module documentation for more information).
In short, you should only use number, strings or tuples as keys, and simple
objects as values, unless you know what you are doing.

To start a connection to the servers, you must first decide which mode you are
going to use: the normal database-backed mode, database-backed with
synchronous access, or cache only. Let's say you want to use the normal mode
and connect to the server at port 11, and then add the other two servers::

  import nmdb
  db = nmdb.DB(11)
  db.add_server(12)
  db.add_server(13)

Now you're ready to use it. Let's suppose you want to write a recursive
function to calculate the factorial of a number. But before doing the
calculation, you can check if the previous factorial already is in the
database to avoid recalculating it::

  def fact(n):
      if n == 1:
          return 1
      if db.has_key(n):
          return db[n]

      result = n * fact(n - 1)
      db[n] = result
      return result

That was easy, wasn't it? You can use the same trick for SQL queries, complex
distributed calculations, geographical data processing, whatever you want.

Now let's have some fun and do something a little advanced: a decorator for a
distributed function cache. If Python magic scares you, look away and skip to
the next section.

Some functions (usually the mathematical ones) have the property that the
value they return depends only on the parameters, and not on the context.  So
they can be cached, using the parameters as keys, with the function's result
as their associated values. Applying this technique is commonly known as
*memoization*, and when we apply it to a function we say we're *memoizing* it.

We can use a local dictionary to cache the data, but that would mean we would
have to write some cache management code to avoid using too much memory, and,
worse of all, each instance of the code running in the network would have its
own private cache and can't reuse calculations performed by other instances.
Instead, we can use nmdb to make a cache that is shared among the network.

The functions are usually restricted to using simple types as input, like
numbers, strings, tuples or dictionaries. We will take advantage of this by
using as a key to the cache the string ``<function module>-<function
name>-<string representation of the arguments>``. So to cache an invocation
like ``mod.f(1, (2, 6))`` that returns ``26``, we want to have the following
association in the database: ``mod-f-(1, (2, 6)) = 26``.

We will use nmdb in cache-only mode, where the things we store are not saved
permanently to a database, but live in the server's memory. This is very
similar to what we did before, and has the advantage of not having to write
our own cache management routines::

  import nmdb
  db = nmdb.Cache(11)
  db.add_server(12)
  db.add_server(13)

Let's write the decorator::

  def shared_memoize(f):
      def newf(*args, **kwargs):
          key = '%s-%s-%s-%s' % (f.__module__, f.__name__,
                                 repr(args), repr(kwargs))
          if key in db:
              return db[key]
          r = f(*args, **kwargs)
          db[key] = r
          return r
      return newf

Now we can use it with a normal implementation of the recursive factorial
function like we did before, and a function that calculates tetrations_::

  @shared_memoize
  def fact(n):
      if n == 1:
          return 1
      return n * fact(n - 1)

  @shared_memoize
  def tetration(a, b):
      if b == 1:
          return a
      return pow(a, tetration(a, b - 1))

As you can see, the module is very easy to use, but you can do useful things
with it. For more information you can read the module's built-in
documentation.


The C library
-------------

The C library is in essence similar to the Python module, so we won't make a
very long example here, only a brief display of the available functions.

Let's begin by creating a "nmdb descriptor" which is of type *nmdb_t*, and
connecting it to your three servers::

  unsigned char *key, *val;
  size_t ksize, vsize;
  nmdb_t *db;

  db = nmdb_connect(11);
  nmdb_add_server(db, 12);
  nmdb_add_server(db, 13);

Now you can do some operations (allocations and checks are not shown for brevity)::

  r = nmdb_set(db, key, ksize, val, vsize);
  ...
  r = nmdb_get(db, key, ksize, val, vsize);
  ...
  r = nmdb_del(db, key, ksize);

And finally close and free the connection::

  nmdb_free(db);

The operation functions have variants for cache-only (*nmdb_cache_**) and synchronous
operation (*nmdb_sync_**). For more information you should check the manpage.


Where to go from here
=====================

The best place to go from here is to your text editor, to start writing some
simple clients to play with.

If you are in doubt about something, you can consult the manpages or the
documentation inside the *doc/* directory. And if you can't find an answer to
your question there, you can ask me, Alberto Bertogli, at
*albertito@gmail.com*.


.. _nmdb: http://auriga.wearlab.de/~alb/nmdb/
.. _libevent: http://www.monkey.org/~provos/libevent/
.. _TIPC: http://tipc.sf.net
.. _memcached: http://www.danga.com/memcached/
.. _QDBM: http://qdbm.sf.net
.. _`Linux kernel`: http://kernel.org
.. _tetrations: http://en.wikipedia.org/wiki/Tetration