Ugarit
Documentation

Installation

Install Chicken Scheme, following its installation instructions.

Ugarit can then be installed by typing (as root):

chicken-install ugarit

See the chicken-install manual for details if you have any trouble, or wish to install into your home directory.

Setting up a vault

Firstly, you need to know the vault identifier for the place you'll be storing your vaults. This depends on your backend. The vault identifier is actually the command line used to invoke the backend for a particular vault; communication with the vault happens via the backend's standard input and output, which makes it easy to tunnel over ssh.

Local filesystem backends

These backends use the local filesystem to store the vaults. Of course, the "local filesystem" on a given server might be an NFS mount or mounted from a storage-area network.

Logfile backend

The logfile backend works much like the original Venti system. It's append-only - you won't be able to delete old snapshots from a logfile vault, even when I implement deletion. It stores the vault in two parts: a directory of log files containing data blocks, split at a specified maximum size, and a metadata file - an SQLite database used to track the location of blocks in the log files, the contents of tags, and a count of the log files so a filename can be chosen for a new one.

To set up a new logfile vault, just choose where to put the two parts. It's a good idea to put the metadata file on a different physical disk from the logs directory, to reduce seeking. If you only have one disk, you can put the metadata file in the log directory ("metadata" is a good name).

You can then refer to it using the following vault identifier:

"backend-fs splitlog ...log directory... ...metadata file..."

SQLite backend

The sqlite backend works a bit like a Fossil repository; the storage is implemented as a single file, which is actually an SQLite database containing blocks as blobs, along with tags and configuration data in their own tables.

It supports unlinking objects, and the use of a single file to store everything is convenient. However, storing everything in a single file with random access is slightly riskier than the simple structure of an append-only log file: it is less tolerant of corruption, which can easily render the entire storage unusable. Also, that one file can get very large.

SQLite has internal limits on the size of a database, but they're quite large - you'll probably hit a size limit at about 140 terabytes.

To set up an SQLite vault, just choose a place to put the file. I usually use an extension of .vault; note that SQLite will create temporary files alongside it with additional extensions, too.

Then refer to it with the following vault identifier:

"backend-sqlite ...path to vault file..."

Filesystem backend

The filesystem backend creates vaults by storing each block or tag in its own file, in a directory. To keep the objects-per-directory count down, it'll split the files into subdirectories. Because of this, it uses a stupendous number of inodes (more than the filesystem being backed up). Only use it if you don't mind that; splitlog is much more efficient.

To set up a new filesystem-backend vault, just create an empty directory that Ugarit will have write access to when it runs. It will probably run as root in order to be able to access the contents of files that aren't world-readable (although that's up to you), so unless you access your storage via ssh, or use sudo to run the backend as another user, be careful of NFS mounts that have maproot=nobody set!

You can then refer to it using the following vault identifier:

"backend-fs fs ...path to directory..."

Proxying backends

These backends wrap another vault identifier which the actual storage task is delegated to, but add some value along the way.

SSH tunnelling

It's easy to access a vault stored on a remote server. The caveat is that the backend then needs to be installed on the remote server! Since vaults are accessed by running the supplied command and talking to it via stdin and stdout, the vault identifier need only be:

"ssh ...hostname... '...remote vault identifier...'"

Cache backend

The cache backend is used to cache a list of which blocks exist in the proxied backend, so that it can rapidly answer queries about the existence of a block, even when the proxied backend is on the end of a high-latency link (e.g., the Internet). This should speed up snapshots, as existing files are identified by asking the backend if the vault already has them.

The cache backend works by storing the cache in a local SQLite file. Given a place for it to store that file, usage is simple:

"backend-cache ...path to cachefile... '...proxied vault identifier...'"

The cache file will be automatically created if it doesn't already exist, so make sure there's write access to the containing directory.
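For example, with a hypothetical cache file at /var/ugarit/cache.sqlite in front of a remote vault accessed over ssh:

"backend-cache /var/ugarit/cache.sqlite 'ssh ugarit@backuphost backend-sqlite /backup/ugarit.vault'"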

- WARNING - WARNING - WARNING - WARNING - WARNING - WARNING -

If you use a cache on a vault shared between servers, make sure that you either:

- never delete anything from the vault,

or

- perform every deletion through, or follow it by flushing, every cache used on that vault.

If a block is deleted from a vault, and a cache on that vault is not aware of the deletion (as it did not go "through" the caching proxy), then the cache will record that the block exists in the vault when it does not. Any snapshot made through the cache that would use that block will then assume the block is already present, so it will not be uploaded, and a dangling reference will result!

Some setups which *are* safe:

- a single server accessing a vault through a single cache;
- multiple servers sharing a vault, each with their own cache, as long as nothing is ever deleted from the vault;
- multiple servers sharing a vault, all accessing it through the same cache.

Writing a ugarit.conf

ugarit.conf should look something like this:

(storage <vault identifier>)
(hash tiger "<salt>")
[double-check]
[(compression [deflate|lzma])]
[(encryption aes <key>)]
[(cache "<path>")|(file-cache "<path>")]
[(rule ...)]

Hashing

The hash line chooses a hash algorithm. Currently Tiger-192 (tiger), SHA-256 (sha256), SHA-384 (sha384) and SHA-512 (sha512) are supported. If you omit the line entirely, Tiger will still be used, but as a simple hash of the block with the block type appended; since the hash is computed over the unencrypted block and is not itself encrypted, this reveals to attackers what blocks you have. That is useful for development and testing, or for use with trusted vaults, but not advised for vaults that attackers may snoop at. Providing a salt string produces a hash function that hashes the block, the type of block, and the salt string together, producing hashes that attackers who can snoop the vault cannot use to find known blocks (see the "Security model" section below for more details).

I would recommend that you create a salt string from a secure entropy source, such as:

dd if=/dev/random bs=1 count=64 | base64 -w 0

Whichever hash function you use, you will need to install the required Chicken egg with one of the following commands:

chicken-install -s tiger-hash  # for tiger
chicken-install -s sha2        # for the SHA hashes

Compression

lzma is the recommended compression option for low-bandwidth backends or when space is tight, but it's very slow to compress; deflate or no compression at all are better for fast local vaults. To have no compression at all, just remove the (compression ...) line entirely. Likewise, to use compression, you need to install a Chicken egg:

chicken-install -s z3       # for deflate
chicken-install -s lzma     # for lzma

WARNING: The lzma egg is currently rather difficult to install, and needs rewriting to fix this problem.
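In ugarit.conf, the choice then appears as a single line; for example, for a fast local vault:

(compression deflate)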

Encryption

Likewise, the (encryption ...) line may be omitted to have no encryption. The only currently supported algorithm is aes (in CBC mode), with a key given in hex, as a passphrase (hashed to get a key), or as a passphrase read from the terminal on every run. The key may be 16, 24, or 32 bytes long, for 128-bit, 192-bit or 256-bit AES. To specify a hex key, just supply it as a string, like so:

(encryption aes "00112233445566778899AABBCCDDEEFF")

...for 128-bit AES,

(encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")

...for 192-bit AES, or

(encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")

...for 256-bit AES.

Alternatively, you can provide a passphrase, and specify how large a key you want it turned into, like so:

(encryption aes (24|32 "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))

I would recommend that you generate a long passphrase from a secure entropy source, such as:

dd if=/dev/random bs=1 count=64 | base64 -w 0

Finally, the extra-paranoid can request that Ugarit prompt for a passphrase on every run and hash it into a key of the specified length, like so:

(encryption aes (24|32 prompt))

(note the lack of quotes around prompt, distinguishing it from a passphrase)

Please read the Security model documentation for details on the implications of different encryption setups.

Again, as it is an optional feature, to use encryption, you must install the appropriate Chicken egg:

chicken-install -s aes

Caching

Ugarit can use a local cache to speed up various operations. If a path to a file is provided through the cache or file-cache directives, then a file will be created at that location and used as a cache. If not, then a default path of ~/.ugarit-cache will be used instead.
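For example, to keep the cache at a hypothetical path under /var/ugarit:

(cache "/var/ugarit/cache")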

WARNING: If you use multiple different vaults from the same UNIX account, and the same tag names are used in those different vaults, and you use the default cache path (or explicitly specify cache paths that point to the same file), you will get a somewhat confused cache. The effects of this will be annoying (searches finding things that then can't be fetched) rather than damaging, but it's still best avoided!

The cache is used to cache snapshot records and archive import records; it is used by operations that extract snapshot history and archive objects. Snapshots are stored as a linked list of snapshot objects, each referring to the previous one, so reading the history of a snapshot tag requires reading many objects from the storage, which can be time-consuming for a remote storage! Similarly, archives are represented as a linked list of imports, and searching for an object in the archive can involve traversing the chain of imports until a match is found (and then searching on until the end to see if any further matches can be found!).

The cache is even more important for archive imports: not only does it keep a local copy of all the import information, it also records the "current" metadata of every object in the archive (so that we don't need to search through superseded previous versions of an object's metadata when looking for something), and uses B-tree indexes to enable fast searching of the cached metadata.

If you configure the cache path with file-cache rather than just cache, then as well as the snapshot/archive metadata caching, you will also enable file hash caching.

This significantly speeds up subsequent snapshots of a filesystem tree. The file cache maps filenames to (mtime,size,hash) tuples; as it scans the filesystem, if it finds a file in the cache and the mtime and size have not changed, it will assume it is already stored under the specified hash. This saves it from having to read the entire file to hash it and then check if the hash is present in the vault. In other words, if only a few files have changed since the last snapshot, then snapshotting a directory tree becomes an O(N) operation, where N is the number of files, rather than an O(M) operation, where M is the total size of files involved.
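As a rough sketch of that check (not Ugarit's actual code: hash-file stands in for whatever content-hashing and vault lookup Ugarit performs, and an in-memory table stands in for the on-disk cache):

(use posix srfi-69)   ; file times/sizes and hash tables

;; path -> (mtime size hash); stands in for the on-disk file cache
(define file-cache (make-hash-table))

(define (snapshot-hash path hash-file)
  (let ((mtime (file-modification-time path))
        (size  (file-size path))
        (entry (hash-table-ref/default file-cache path #f)))
    (if (and entry
             (= (car entry) mtime)
             (= (cadr entry) size))
        (caddr entry)                 ; unchanged since last snapshot: reuse hash
        (let ((h (hash-file path)))   ; new or changed: read and hash the file
          (hash-table-set! file-cache path (list mtime size h))
          h))))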

WARNING: If you use a file cache, and a file is cached in it but then subsequently deleted from the vault, Ugarit will fail to re-upload it at the next snapshot. If you are using a file cache and you go deleting things from your vault (should that be implemented in future), you'll want to flush the cache afterwards. We might implement automatic removal of deleted files from the local cache, but file caches on other Ugarit installations that use the same vault will not be aware of the deletion.

Other options

double-check, if present, causes Ugarit to perform extra internal consistency checks during backups, which will detect bugs but may slow things down.

Example

For example:

(storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata'")
(hash tiger "i3HO7JeLCSa6Wa55uqTRqp4jppUYbXoxme7YpcHPnuoA+11ez9iOIA6B6eBIhZ0MbdLvvFZZWnRgJAzY8K2JBQ")
(encryption aes (32 "FN9m34J4bbD3vhPqh6+4BjjXDSPYpuyskJX73T1t60PP0rPdC3AxlrjVn4YDyaFSbx5WRAn4JBr7SBn2PLyxJw"))
(compression lzma)
(file-cache "/var/ugarit/cache")

Be careful to put a set of parentheses around each configuration entry. White space isn't significant, so feel free to indent things and wrap them over lines if you want.

Keep copies of this file safe - you'll need it to do extractions! Print a copy out and lock it in your fire safe! Ok, currently, you might be able to recreate it if you remember where you put the storage, but encryption keys and hash salts are harder to remember...