Downloads

 * [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.5.tar.gz?uuid=1.0.5|1.0.5]
 * [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.6.tar.gz?uuid=1.0.6|1.0.6]
 * [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.7.tar.gz?uuid=1.0.7|1.0.7]
 * [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.8.tar.gz?uuid=1.0.8|1.0.8]
 * [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.9.tar.gz?uuid=1.0.9|1.0.9]
 * [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-2.0.tar.gz?uuid=2.0|2.0]

Source Control

You can obtain the latest sources, all history, and a local copy of the ticket database using [http://www.fossil-scm.org/|Fossil], like so:

 fossil clone https://www.kitten-technologies.co.uk/project/ugarit ugarit.fossil

Introduction

Ugarit is a backup/archival system based around content-addressable storage. [./docs/intro.wiki|Learn more...]

News

Development priorities are: performance, better error handling, and fixing bugs! After I've cleaned house a little, I'll be focussing on replicated backend storage (ticket [f1f2ce8cdc]), as I now have a cluster of storage devices at home.


About Ugarit


What's content-addressable storage?

Traditional backup systems work by storing copies of your files somewhere. Perhaps they go onto tapes, or perhaps they're in archive files written to disk. They will either be full dumps, containing a complete copy of your files, or incrementals or differentials, which only contain files that have been modified since some point. This saves making repeated copies of unchanging files, but it means that to do a full restore, you need to start by extracting the last full dump then applying one or more incrementals, or the latest differential, to get the latest state.

Not only do differentials and incrementals let you save space, they also give you a history - you can restore to a previous point in time, which is invaluable if the file you want to restore was deleted a few backup cycles ago!

This technology was developed when the best storage technology for backups was magnetic tape, because each dump is written sequentially (and restores are largely sequential, unless you're skipping bits to pull out specific files).

However, these days, random-access media such as magnetic disks and SSDs are cheap enough to compete with magnetic tape for long-term bulk storage (especially when one considers the cost of a tape drive or two). And having fast random access means we can take advantage of different storage techniques.

A content-addressable store is a key-value store, except that the keys are always computed from the values. When a given object is stored, it is hashed, and the hash is used as the key. This means you can never store the same object twice; the second time, you'll get the same hash, see the object is already present, and re-use the existing copy. Therefore, you get deduplication of your data for free.

But, I hear you ask, how do you find things again, if you can't choose the keys?

When an object is stored, you need to record the key so you can find it again later.
In Ugarit, everything is stored in a tree-like directory structure. Files are uploaded and their hashes obtained, and then a directory object is constructed containing a list of the files in the directory, listing the key of the Ugarit object that stores the contents of each file. This directory object itself has a hash, which is stored inside the directory entry in the parent directory, and so on up to the root. The root of a tree stored in a Ugarit vault has no parent directory to contain it, so at that point, we store the key of the root in a named "tag" that we can look up by name when we want it.

Therefore, everything in a Ugarit vault can be found by starting with a named tag and retrieving the object whose key it contains, then finding keys inside that object and looking up the objects they refer to, until we find the object we want.

When you use Ugarit to back up your filesystem, it uploads a complete snapshot of every file in the filesystem, like a full dump. But because the vault is content-addressed, it automatically avoids uploading anything it already has a copy of, so all we upload is an incremental dump - but in the vault, it looks like a full dump, and so can be restored on its own without having to restore a chain of incrementals.

Also, the same storage can be shared between multiple systems that all back up to it - and the incremental upload algorithm means that any files shared between the servers only need to be uploaded once. If you back up a complete server, then go and back up another that is running the same distribution, then all the files in /bin and so on that are already in the storage will not need to be backed up again; the system will automatically spot that they're already there, and not upload them again.

As well as storing backups of filesystems, Ugarit can also be used as the primary storage for read-only files, such as music and photos.
The principle is exactly the same; the only difference is in how the files are organised - rather than as a directory structure, the files are referenced from metadata objects that specify information about the file (so it can be found) and a reference to the contents. Sets of metadata objects are pointed to by tags as well, so they can also be found.
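The hash-as-key idea above can be sketched in a few lines of shell. This is a toy illustration only: the paths are scratch locations, and the choice of SHA-256 is arbitrary - it is not Ugarit's on-disk format or hash configuration.

```shell
# Toy content-addressable store: the key is always the hash of the value.
store=$(mktemp -d)

cas_put() {
  key=$(sha256sum "$1" | cut -d' ' -f1)  # compute the key from the content
  if [ -e "$store/$key" ]; then
    echo "dedup: $key already present"   # same content => same key => free dedup
  else
    cp "$1" "$store/$key"
    echo "stored: $key"
  fi
}

printf 'hello\n' > /tmp/demo-file
cas_put /tmp/demo-file   # first store uploads the block
cas_put /tmp/demo-file   # second store finds the key already present
```

Storing the same content twice costs nothing beyond a hash lookup - which is exactly why Ugarit's snapshots only ever upload what is new.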

So what's that mean in practice?


Backups

You can run Ugarit to back up any number of filesystems to a shared storage area (known as a vault), and on every backup, Ugarit will only upload files or parts of files that aren't already in the vault - be they from the previous snapshot, earlier snapshots, snapshots of entirely unrelated filesystems, etc. Every time you do a snapshot, Ugarit builds a complete directory tree of the snapshot in the vault - but it reuses any parts of files, whole files, or entire directories that already exist anywhere in the vault, and only uploads what doesn't already exist.

The support for parts of files means that, in many cases, gigantic files like database tables and virtual disks for virtual machines will not need to be uploaded entirely every time they change, as only the changed sections will be identified and uploaded.

Because a complete directory tree exists in the vault for every snapshot, the extraction algorithm is incredibly simple - and, therefore, incredibly reliable and fast. Simple, reliable, and fast are just what you need when you're trying to reconstruct the filesystem of a live server.

Also, it means that you can do lots of small snapshots. If you run a snapshot every hour, then only a megabyte or two might have changed in your filesystem, so you only upload a megabyte or two - yet you end up with a complete history of your filesystem at hourly intervals in the vault.

Conventional backup systems usually store a full backup then incrementals to their archives, meaning that doing a restore involves reading the full backup then reading every incremental since and applying them - so to do a restore, you have to download *every version* of the filesystem you've ever uploaded - or you have to do periodic full backups (even though most of your filesystem won't have changed since the last full backup) to reduce the number of incrementals required for a restore.
Better results are had from systems that use a special backup server to look after the archive storage, which accepts incremental backups and applies them to the snapshot it keeps, in order to maintain a most-recent snapshot that can be downloaded in a single run; but these then restrict you to using dedicated servers as your archive stores, ruling out cheaply scalable solutions like Amazon S3, or just backing up to a removable USB or eSATA disk you attach to your system whenever you do a backup. And dedicated backup servers are complex pieces of software; can you rely on something complex for the fundamental foundation of your data security system?

Archives

You can also use Ugarit as the primary storage for read-only files. You do this by creating an archive in the vault, and importing batches of files into it along with their metadata (arbitrary attributes, such as "author", "creation date" or "subject").

Just as you can keep snapshots of multiple systems in a Ugarit vault, you can also keep multiple separate archives, each identified by a named tag.

However, as it's all within the same vault, the usual de-duplication rules apply. The same file may be in multiple archives, with different metadata in each, as the file contents and metadata are stored separately (and associated only within the context of each archive). And, of course, the same file may appear in snapshots and in archives; perhaps a file was originally downloaded into your home directory, where it was backed up into Ugarit snapshots, and then you imported it into your archive. The archive import would not have had to re-upload the file, as its contents would have already been found in the vault, so all that needs to be uploaded is the metadata.

Although we have mainly spoken of storing files in archives, the objects in archives can be directories full of files as well as single files. This is useful for storing MacOS-style files that are actually directories, or for archiving things like completed projects for clients, which can be entire directory structures.

System Requirements

Ugarit should run on any POSIX-compliant system that can run [http://www.call-with-current-continuation.org/|Chicken Scheme]. It stores and restores all the file attributes reported by the stat system call - POSIX mode permissions, UID, GID, mtime, and optionally atime and ctime (although the ctime cannot be restored due to POSIX restrictions). Ugarit will store files, directories, device and character special files, symlinks, and FIFOs.

Support for extended filesystem attributes - ACLs, alternative streams, forks and other metadata - is possible, thanks to the extensible directory entry format; support for such metadata will be added as required.

Currently, only local filesystem-based vault storage backends are complete: these are suitable for backing up to a removable hard disk or a filesystem shared via NFS or other protocols. However, the backend can be accessed via an SSH tunnel, so a remote server that you are able to install Ugarit on to run the backends can be used as a remote vault.

The next backend to be implemented will be one for Amazon S3, followed by an SFTP backend for storing vaults anywhere you can ssh to. Other backends will be implemented on demand; a vault can, in principle, be stored on anything that can store files by name, report on whether a file already exists, and efficiently download a file by name. This rules out magnetic tapes, due to their requirement for sequential access.

Although we need to trust that a backend won't lose data (for now), we don't need to trust the backend not to snoop on us, as Ugarit optionally encrypts everything sent to the vault.

Terminology

A Ugarit backend is the software module that handles backend storage. An actual storage area - managed by a backend - is called a storage, and is used to implement a vault. Currently, every storage is a valid vault, but the planned future introduction of a distributed storage backend will enable multiple storages (which are not, themselves, valid vaults, as they only contain some subset of the information required) to be combined into an aggregate storage, which then holds the actual vault. Note that the contents of a storage are purely a set of blocks, and a series of named tags containing references to them; the storage does not know the details of encryption and hashing, so cannot make any sense of its contents.

For example, if you use the recommended "splitlog" filesystem backend, your vault might be /mnt/bigdisk on the server prometheus. The backend (which is compiled along with the other filesystem backends in the backend-fs binary) must be installed on prometheus, and Ugarit clients all over the place may then use it via ssh to prometheus. However, even with the filesystem backends, the actual storage might not be on prometheus, where the backend runs - /mnt/bigdisk might be an NFS mount, or a mount from a storage-area network. This ability to delegate via SSH is particularly useful with the "cache" backend, which reduces latency by storing a cache of what blocks exist in a backend, thereby making it quicker to identify already-stored files; a cluster of servers all sharing the same vault might all use SSH tunnels to access an instance of the "cache" backend on one of them (using some local disk to store the cache), which proxies the actual vault storage to a vault on the other end of a high-latency Internet link, again via an SSH tunnel.

A vault is where Ugarit stores backups (as chains of snapshots) and archives (as chains of archive imports).
Backups and archives are identified by tags, which are the top-level named entry points into a vault. A vault is built on top of a storage, along with a choice of hash function, compression algorithm, and encryption, which are used to map the logical world of snapshots and archive imports into the physical world of blocks stored in the storage.

A snapshot is a copy of a filesystem tree in the vault, with a header block that gives some metadata about it. A backup consists of a number of snapshots of a given filesystem.

An archive import is a set of filesystem trees, each along with metadata about it. Whereas a backup is organised around a series of timed snapshots, an archive is organised around the metadata; the filesystem trees in the archive are identified by their properties.

So what, exactly, is in a vault?

A Ugarit vault contains a load of blocks, each up to a maximum size (usually 1MiB, although some backends might impose smaller limits). Each block is identified by the hash of its contents; this is how Ugarit avoids ever uploading the same data twice - it checks whether the data to be uploaded already exists in the vault by looking up the hash. The contents of the blocks are compressed and then encrypted before upload.

Every file uploaded is, unless it's small enough to fit in a single block, chopped into blocks, and each block is uploaded. This way, the entire contents of your filesystem can be uploaded - or, at least, only the parts of it that aren't already there! The blocks are then tied together to create a snapshot: blocks full of the hashes of the data blocks are uploaded, and directory blocks are uploaded listing the names and attributes of the files in each directory, along with the hashes of the blocks that contain the files' contents. Even the blocks that contain lists of hashes of other blocks are checked for pre-existence in the vault; if only a few MiB of your hundred-GiB filesystem has changed, then even the index blocks and directory blocks are re-used from previous snapshots.

Once uploaded, a block in the vault is never again changed. After all, if its contents changed, its hash would change, so it would no longer be the same block! However, every block has a reference count, tracking the number of index blocks that refer to it. This means that the vault knows which blocks are shared between multiple snapshots (or shared *within* a snapshot - if a filesystem has more than one copy of the same file, still only one copy is uploaded), so that if a given snapshot is deleted, the blocks that only that snapshot uses can be deleted to free up space, without corrupting other snapshots by deleting blocks they share.
Keep in mind, however, that not all storage backends support this - there are certain advantages to being an append-only vault. For a start, you can't delete something by accident! The supplied fs and sqlite backends support deletion, while the splitlog backend does not yet. However, the actual snapshot deletion command in the user interface hasn't been implemented yet either, so it's a moot point for now...

Finally, the vault contains objects called tags. Unlike the blocks, the tags' contents can change, and they have meaningful names rather than being identified by hash. Tags identify the top-level blocks of snapshots within the system, from which (by following the chain of hashes down through the index blocks) the entire contents of a snapshot may be found. Unless you happen to have recorded the hash of a snapshot somewhere, the tags are where you find snapshots when you want to do a restore.

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the files, directories, and index blocks required, it looks up the tag you have identified as the target of the snapshot. If the tag already exists, then the snapshot it currently points to is recorded in the new snapshot as the "previous snapshot"; then the snapshot header - containing the previous snapshot hash, along with the date and time and any comments you provide for the snapshot - is uploaded (as another block, identified by its hash). The tag is then updated to point to the new snapshot.

This way, each tag actually identifies a chronological chain of snapshots. Normally, you would use a tag to identify a filesystem being backed up; you'd keep snapshotting the filesystem to the same tag, resulting in all the snapshots of that filesystem hanging from the tag.
But if you want to remember any particular snapshot (perhaps the snapshot you take before a big upgrade or other risky operation), you can duplicate the tag, in effect 'forking' the chain of snapshots, much like a branch in a version control system.

Archive imports cause the creation of one or more archive metadata blocks, each of which lists the hashes of files or filesystem trees in the archive, along with their metadata. Each import then has a single archive import block pointing to the sequence of metadata blocks, and pointing to the previous archive import block in that archive. The same filesystem tree can be imported more than once to the same archive, and the "latest" metadata always wins.

Generally, you should create lots of small archives for different categories of things - such as one for music, one for photos, and so on. You might well create separate archives for the music collections of different people in your household, unless they overlap, and another for Christmas music so it doesn't crop up in random shuffle play! It's easy to merge archives if you over-compartmentalise them, but harder to split an archive if you find it too cluttered with unrelated things.

I've spoken of archive imports and backup snapshots each having a "previous" reference to the last import or snapshot in the chain, but it's actually more complex than that: they have an arbitrary list of zero or more previous objects. As such, it's possible for several imports or snapshots to have the same "previous" - known as a "fork" - and it's possible to have an import or snapshot that merges multiple previous ones.

Forking is handy if you want to basically duplicate an archive, creating two new archives with the same contents to begin with, but each then capable of diverging thereafter.
You might do this to keep the state of an archive before doing a big import, so you can go back to the original state if you regret the import, for instance.

Forking a backup tag is a more unusual operation, but also useful. Perhaps you have a server running many stateful services, and the hardware becomes overloaded, so you clone the basic setup onto another server, and run half of the services on the original and half on the new one; if you fork the backup tag of the original server to create a backup tag for the new server, then both servers' snapshot histories will share the original state.

Merging is most useful for archives; you might merge several archives into one, as mentioned.

And, of course, you can merge backup tags as well. If your earlier splitting of one server into two doesn't work out (perhaps your workload reduces, or you can now afford a single, more powerful, server to handle everything in one place), you might rsync the service state from the two servers back onto the new server, so it's all merged in the new server's filesystem. To preserve this in the snapshot history, you can merge the two backup tags of the two servers to create a backup tag for the single new server, which will accurately reflect the history of the filesystem.

Also, tags might fork by accident - I plan to introduce a distributed storage backend, which will replicate blocks and tags across multiple storages to create a single virtual storage to build a vault on top of; in the event of the network of actual storages suffering a failure, it may be that snapshots and imports are only applied to some of the storages - and then subsequent snapshots and imports only get applied to some other subset of the storages.
When the network is repaired and all the storages are again visible, they will have divergent, inconsistent states for their tags, and the distributed storage system will resolve the situation by keeping the majority state as the state of the tag on all the backends, while preserving any other states by creating new tags with the original name plus a suffix. These can then be merged to "heal" the conflict.
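The chop-into-blocks scheme described in this section can be illustrated with standard tools. The block size matches the usual 1MiB limit, but the paths, the use of SHA-256, and the flat text "index" are illustrative stand-ins, not Ugarit's actual formats:

```shell
# Chop a file into 1 MiB blocks and build a toy "index block" of their hashes.
mkdir -p /tmp/chunk-demo
dd if=/dev/zero of=/tmp/chunk-demo/bigfile bs=1024 count=2500 2>/dev/null  # ~2.44 MiB
split -b 1048576 /tmp/chunk-demo/bigfile /tmp/chunk-demo/blk.
(cd /tmp/chunk-demo && sha256sum blk.*) > /tmp/chunk-demo/index
cat /tmp/chunk-demo/index   # three blocks: 1 MiB + 1 MiB + the remainder
```

Note that the first two blocks are identical 1 MiB runs of zeroes, so they hash the same - in a content-addressed vault, only one copy of that block would ever be stored, even within a single file.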

Using Ugarit


Installation

Install [http://www.call-with-current-continuation.org/|Chicken Scheme] using their [http://wiki.call-cc.org/man/4/Getting%20started|installation instructions].

Ugarit can then be installed by typing (as root):

 chicken-install ugarit

See the [http://wiki.call-cc.org/manual/Extensions#chicken-install-reference|chicken-install manual] for details if you have any trouble, or wish to install into your home directory.

Setting up a vault

Firstly, you need to know the vault identifier for the place you'll be storing your vaults. This depends on your backend. The vault identifier is actually the command line used to invoke the backend for a particular vault; communication with the vault is via standard input and output, which makes it easy to tunnel via ssh.

Local filesystem backends

These backends use the local filesystem to store the vaults. Of course, the "local filesystem" on a given server might be an NFS mount or mounted from a storage-area network.

Logfile backend

The logfile backend works much like the original Venti system. It's append-only - you won't be able to delete old snapshots from a logfile vault, even when I implement deletion. It stores the vault in two sets of files; one is a log of data blocks, split at a specified maximum size, and the other is the metadata: an sqlite database used to track the location of blocks in the log files, the contents of tags, and a count of the logs so a filename can be chosen for a new one.

To set up a new logfile vault, just choose where to put the two parts. It would be nice to put the metadata file on a different physical disk from the logs directory, to reduce seeking. If you only have one disk, you can put the metadata file in the log directory ("metadata" is a good name).

You can then refer to it using the following vault identifier:

 "backend-fs splitlog ...log directory... ...metadata file..."
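As a concrete sketch of the single-disk setup described above (the paths here are hypothetical examples, not defaults):

```shell
# Illustrative single-disk splitlog layout: metadata lives inside the log directory.
mkdir -p /tmp/ugarit-demo/logs
# The vault identifier for this layout would then be:
#   "backend-fs splitlog /tmp/ugarit-demo/logs /tmp/ugarit-demo/logs/metadata"
echo "log directory ready"
```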

SQLite backend

The sqlite backend works a bit like a [http://www.fossil-scm.org/|Fossil] repository; the storage is implemented as a single file, which is actually an SQLite database containing blocks as blobs, along with tags and configuration data in their own tables.

It supports unlinking objects, and the use of a single file to store everything is convenient; but storing everything in a single randomly-accessed file is slightly riskier than the simple structure of an append-only log file: it is less tolerant of corruption, which can easily render the entire storage unusable. Also, that one file can get very large.

SQLite has internal limits on the size of a database, but they're quite large - you'll probably hit a size limit at about 140 terabytes.

To set up an SQLite storage, just choose a place to put the file. I usually use an extension of .vault; note that SQLite will create additional temporary files alongside it with additional extensions, too.

Then refer to it with the following vault identifier:

 "backend-sqlite ...path to vault file..."

Filesystem backend

The filesystem backend creates vaults by storing each block or tag in its own file, in a directory. To keep the objects-per-directory count down, it'll split the files into subdirectories. Because of this, it uses a stupendous number of inodes (more than the filesystem being backed up). Only use it if you don't mind that; splitlog is much more efficient.

To set up a new filesystem-backend vault, just create an empty directory that Ugarit will have write access to when it runs. It will probably run as root in order to be able to access the contents of files that aren't world-readable (although that's up to you), so unless you access your storage via ssh or sudo to run the backend under another user, be careful of NFS mounts that have maproot=nobody set!

You can then refer to it using the following vault identifier:

 "backend-fs fs ...path to directory..."

Proxying backends

These backends wrap another vault identifier, to which the actual storage task is delegated, but add some value along the way.

SSH tunnelling

It's easy to access a vault stored on a remote server. The caveat is that the backend then needs to be installed on the remote server! Since vaults are accessed by running the supplied command and talking to it via stdin and stdout, the vault identifier need only be:

 "ssh ...hostname... '...remote vault identifier...'"

Cache backend

The cache backend is used to cache a list of what blocks exist in the proxied backend, so that it can answer queries as to the existence of a block rapidly, even when the proxied backend is on the end of a high-latency link (eg, the Internet). This should speed up snapshots, as existing files are identified by asking the backend if the vault already has them.

The cache backend works by storing the cache in a local sqlite file. Given a place for it to store that file, usage is simple:

 "backend-cache ...path to cachefile... '...proxied vault identifier...'"

The cache file will be automatically created if it doesn't already exist, so make sure there's write access to the containing directory.

 - WARNING - WARNING - WARNING - WARNING - WARNING -

If you use a cache on a vault shared between servers, make sure that you either:

 * Never delete things from the vault

or

 * Make sure all access to the vault is via the same cache

If a block is deleted from a vault, and a cache on that vault is not aware of the deletion (as it did not go "through" the caching proxy), then the cache will record that the block exists in the vault when it does not. This means that if a snapshot made through the cache would use that block, it will be assumed that the block already exists in the vault when it does not. Therefore, the block will not be uploaded, and a dangling reference will result!

Some setups which *are* safe:

 * A single server using a vault via a cache, not sharing it with anyone else.

 * A pool of servers using a vault via the same cache.

 * A pool of servers using a vault via one or more caches, and maybe some not via the cache, where nothing is ever deleted from the vault.

 * A pool of servers using a vault via one cache, and maybe some not via the cache, where deletions are only performed on servers using the cache, so the cache is always aware.
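The hazard in the warning above can be simulated with plain files standing in for the vault and the sqlite cache (a toy model; the key and paths are invented for illustration):

```shell
# Toy demonstration of the stale-cache hazard: a deletion that bypasses
# the cache leaves the cache claiming a block exists when it doesn't.
vault=$(mktemp -d)
cache=$(mktemp)
key=0123abcd                       # stand-in for a block hash
touch "$vault/$key"                # block uploaded to the vault...
echo "$key" >> "$cache"            # ...and the cache records its existence
rm "$vault/$key"                   # deleted WITHOUT going through the cache
if grep -q "$key" "$cache"; then
  echo "cache: block present - skipping upload"
fi
[ -e "$vault/$key" ] || echo "vault: block missing - dangling reference!"
```

The cache confidently reports the block as present, so a snapshot would never re-upload it - which is exactly why deletions must always pass through the cache, or never happen at all.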

Writing a ugarit.conf

ugarit.conf should look something like this:

 (storage "<vault identifier>")
 (hash tiger "<salt>")
 [(double-check)]
 [(compression [deflate|lzma])]
 [(encryption aes <key>)]
 [(file-cache "<path to file cache>")]
 [(rule ...)]

The hash line chooses a hash algorithm. Currently Tiger-192 (tiger), SHA-256 (sha256), SHA-384 (sha384) and SHA-512 (sha512) are supported; if you omit the line then Tiger will still be used, but it will be a simple hash of the block with the block type appended, which reveals to attackers what blocks you have (as the hash is of the unencrypted block, and the hash is not encrypted). This is useful for development and testing, or for use with trusted vaults, but not advised for vaults that attackers may snoop at. Providing a salt string produces a hash function that hashes the block, the type of block, and the salt string, producing hashes that attackers who can snoop the vault cannot use to find known blocks (see the "Security model" section below for more details).

I would recommend that you create a salt string from a secure entropy source, such as:

 dd if=/dev/random bs=1 count=64 | base64 -w 0

Whichever hash function you use, you will need to install the required Chicken egg with one of the following commands:

 chicken-install -s tiger-hash  # for tiger
 chicken-install -s sha2        # for the SHA hashes

double-check, if present, causes Ugarit to perform extra internal consistency checks during backups, which will detect bugs but may slow things down.

lzma is the recommended compression option for low-bandwidth backends or when space is tight, but it's very slow to compress; deflate or no compression at all are better for fast local vaults. To have no compression at all, just remove the (compression ...) line entirely.
Likewise, to use compression, you need to install the appropriate Chicken egg:

 chicken-install -s z3    # for deflate
 chicken-install -s lzma  # for lzma

WARNING: The lzma egg is currently rather difficult to install, and needs rewriting to fix this problem.

Likewise, the (encryption ...) line may be omitted to have no encryption; the only currently supported algorithm is aes (in CBC mode), with a key given in hex, as a passphrase (hashed to get a key), or as a passphrase read from the terminal on every run. The key may be 16, 24, or 32 bytes, for 128-bit, 192-bit or 256-bit AES. To specify a hex key, just supply it as a string, like so:

 (encryption aes "00112233445566778899AABBCCDDEEFF")

...for 128-bit AES,

 (encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")

...for 192-bit AES, or

 (encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")

...for 256-bit AES.

Alternatively, you can provide a passphrase, and specify how large a key you want it turned into, like so:

 (encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))

I would recommend that you generate a long passphrase from a secure entropy source, such as:

 dd if=/dev/random bs=1 count=64 | base64 -w 0

Finally, the extra-paranoid can request that Ugarit prompt for a passphrase on every run and hash it into a key of the specified length, like so:

 (encryption aes ([16|24|32] prompt))

(note the lack of quotes around prompt, distinguishing it from a passphrase)

Please read the "Security model" section below for details on the implications of different encryption setups.
- -Again, as it is an optional feature, to use encryption, you must -install the appropriate Chicken egg: - - chicken-install -s aes - -A file cache, if enabled, significantly speeds up subsequent snapshots -of a filesystem tree. The file cache is a file (which Ugarit will -create if it doesn't already exist) mapping filenames to -(mtime,size,hash) tuples; as it scans the filesystem, if it finds a -file in the cache and the mtime and size have not changed, it will -assume it is already stored under the specified hash. This saves it -from having to read the entire file to hash it and then check if the -hash is present in the vault. In other words, if only a few files -have changed since the last snapshot, then snapshotting a directory -tree becomes an O(N) operation, where N is the number of files, rather -than an O(M) operation, where M is the total size of files involved. - -For example: - - (storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata'") - (hash tiger "i3HO7JeLCSa6Wa55uqTRqp4jppUYbXoxme7YpcHPnuoA+11ez9iOIA6B6eBIhZ0MbdLvvFZZWnRgJAzY8K2JBQ") - (encryption aes (32 "FN9m34J4bbD3vhPqh6+4BjjXDSPYpuyskJX73T1t60PP0rPdC3AxlrjVn4YDyaFSbx5WRAn4JBr7SBn2PLyxJw")) - (compression lzma) - (file-cache "/var/ugarit/cache") - -Be careful to put a set of parentheses around each configuration -entry. White space isn't significant, so feel free to indent things -and wrap them over lines if you want. - -Keep copies of this file safe - you'll need it to do extractions! -Print a copy out and lock it in your fire safe! Ok, currently, you -might be able to recreate it if you remember where you put the -storage, but encryption keys and hash salts are harder to remember... - -
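The file-cache decision logic described above can be sketched in a few lines of Python. This is purely illustrative: Ugarit's real cache is a persistent file, the hash function is whichever one you configured, and the in-memory dictionary here is a stand-in.

```python
import hashlib
import os

# Illustrative in-memory file cache: path -> (mtime, size, hash).
cache = {}

def hash_file(path):
    """Hash a file's contents (stand-in for Ugarit's configured hash)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot_file(path):
    """Return the file's hash, reading the file only if it has changed."""
    st = os.stat(path)
    entry = cache.get(path)
    if entry is not None and entry[0] == st.st_mtime and entry[1] == st.st_size:
        return entry[2]  # cache hit: skip reading and hashing the file
    digest = hash_file(path)  # cache miss: read and hash the whole file
    cache[path] = (st.st_mtime, st.st_size, digest)
    return digest
```

The second snapshot of an unchanged file costs only a stat call, which is what makes re-snapshotting scale with the number of files rather than their total size.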

Your first backup

- -Think of a tag to identify the filesystem you're backing up. If it's
/home on the server gandalf, you might call it gandalf-home. If
it's the entire filesystem of the server bilbo, you might just call
it bilbo.
-
-Then from your shell, run (as root):
-
- # ugarit snapshot [-c] [-a]
-
-For example, if we have a ugarit.conf in the current directory:
-
- # ugarit snapshot ugarit.conf -c localhost-etc /etc
-
-Specify the -c flag if you want to store ctimes in the vault;
since it's impossible to restore ctimes when extracting from a
vault, doing this is useful only for informational purposes, so it's
not done by default. Similarly, atimes aren't stored in the vault
unless you specify -a, because otherwise, there will be a lot of
directory blocks uploaded on every snapshot, as the atime of every
file will have been changed by the previous snapshot - so with -a
specified, on every snapshot, every directory in your filesystem will
be uploaded! Ugarit will happily restore atimes if they are found in
a vault; their storage is made optional simply because uploading
them is costly and rarely useful.
-

Exploring the vault

- -Now that you have a backup, you can explore the contents of the
vault. This need not be done as root, as long as you can read
ugarit.conf; however, if you want to extract files, run it as root
so the uids and gids can be set.
-
- $ ugarit explore ugarit.conf
-
-This will put you into an interactive shell exploring a virtual
filesystem. The root directory contains an entry for every tag; if you
type ls you should see your tag listed, and within that
tag, you'll find a list of snapshots, in descending date order, with a
special entry current for the most recent
snapshot. Within a snapshot, you'll find the root directory of your
snapshot under contents, and the details of the snapshot itself in
properties.sexpr, and you'll be able to cd into
subdirectories, and so on:
-
- > ls
- localhost-etc/
- > cd localhost-etc
- /localhost-etc> ls
- current/
- 2015-06-12 22:49:34/
- 2015-06-12 22:49:25/
- /localhost-etc> cd current
- /localhost-etc/current> ls
- log.sexpr
- properties.sexpr
- contents/
- /localhost-etc/current> cat properties.sexpr
- ((previous . "a140e6dbe0a7a38f8b8c381323997c23e51a39e2593afb61")
- (mtime . 1434102574.0)
- (contents . "34eccf1f5141187e4209cfa354fdea749a0c3c1c4682ec86")
- (stats (blocks-stored . 12)
- (bytes-stored . 16889)
- (blocks-skipped . 50)
- (bytes-skipped . 6567341)
- (file-cache-hits . 0)
- (file-cache-bytes . 0))
- (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
- (hostname . "ahe")
- (source-path . "/etc")
- (notes)
- (files . 112)
- (size . 6563588))
- /localhost-etc/current> cd contents
- /localhost-etc/current/contents> ls
- zoneinfo
- vconsole.conf
- udev/
- tmpfiles.d/
- systemd/
- sysctl.d/
- sudoers.tmp~
- sudoers
- subuid
- subgid
- static
- ssl/
- ssh/
- shells
- shadow-
- shadow
- services
- samba/
- rpc
- resolvconf.conf
- resolv.conf
- -- Press q then enter to stop or enter for more... 
- q - /localhost-etc/current/contents> ls -ll resolv.conf - -rw-r--r-- 0 0 [2015-05-23 23:22:41] 78B/-: resolv.conf - key: #f - contents: "e33ea1394cd2a67fe6caab9af99f66a4a1cc50e8929d3550" - size: 78 - ctime: 1432419761.0 - -As well as exploring around, you can also extract files or directories -(or entire snapshots) by using the get command. Ugarit -will do its best to restore the metadata of files, subject to the -rights of the user you run it as. - -Type help to get help in the interactive shell. - -The interactive shell supports command-line editing, history and tab -completion for your convenience. - -

Extracting things directly

- -As well as using the interactive explore mode, it is also possible to
directly extract something from the vault, given a path.
-
-For example, given a vault with a snapshot under a tag named
Test, it would be possible to extract its
README.txt file with the following command:
-
- ugarit extract ugarit.conf /Test/current/contents/README.txt
-

Forking tags

- -As mentioned above, you can fork a tag, creating two tags that -refer to the same snapshot and its history but that can then have -their own subsequent history of snapshots applied to each -independently, with the following command: - - $ ugarit fork - -

Merging tags

- -And you can also merge two or more tags into one. It's possible to -merge a bunch of tags to make an entirely new tag, or you can merge a -tag into an existing tag, by having the "output" tag also be one of -the "input" tags. - -The command to do this is: - - $ ugarit merge - -For instance, to import your classical music collection into your main -musical collection, you might do: - - $ ugarit merge ugarit.conf my-music my-music classical-music - -Or if you want to create a new all-music archive from the archives -bobs-music and petes-music, you might do: - - $ ugarit merge ugarit.conf all-music bobs-music petes-music - -

Archive operations

- -

Importing

- -To import some files into an archive, you must create a manifest file
listing them, and their metadata. The manifest can also list
metadata for the import as a whole, perhaps naming the source of the
files, or the reason for importing them.
-
-The metadata for a file (or an import) is a series of named
properties. The value of a property can be any Scheme value, written
in Scheme syntax (with strings double-quoted unless they are to be
interpreted as symbols), but strings and numbers are the most useful
types.
-
-You can use whatever names you like for properties in metadata, but
there are some that the system applies automatically, and an informal
standard of sorts, which is documented in [docs/archive-schema.wiki].
-
-You can produce a manifest file by hand, or use the Ugarit Manifest
Maker to produce one for you. You do this by installing it like so:
-
- $ chicken-install ugarit-manifest-maker
-
-And then running it, giving it any number of file and directory names
on the command line. When given directories, it will recursively scan
them to find all the files contained therein and put them in the
manifest; it will not put directories in the manifest, although it is
perfectly legal for you to do so when writing a manifest by hand. This
is because the manifest maker can't do much useful analysis on
directories to suggest default metadata for them (so there isn't much
point in listing them), and it's far more useful for it to make it easy
for you to import a large number of files individually by referencing
the directory containing them.
-
-The manifest is sent to standard output, so you need to redirect it to
a file, like so:
-
- $ ugarit-manifest-maker ~/music > music.manifest
-
-You can specify command-line options, as well. 
-e PATTERN
-or --exclude=PATTERN introduces a glob pattern for files
to exclude from the manifest, and -D KEY=VALUE or
---define=KEY=VALUE provides a property to be added to
every file in the manifest (as opposed to an import property, which is
part of the metadata of the overall import). Note that
VALUE must be double-quoted if it's a string, as per
Scheme value syntax.
-
-One might use this like so (note that the glob pattern is quoted to
stop the shell expanding it):
-
- $ ugarit-manifest-maker -e '*.txt' -D rating=5 ~/favourite-music > music.manifest
-
-The manifest maker simplifies the writing of manifests for files, by
listing the files in manifest format along with useful metadata
extracted from the filename and the file itself. For supported file
types (currently, MP3 and OGG music files), it will even look inside
the file to extract metadata.
-
-The manifest file it generates will contain lots of comments
mentioning things it couldn't automatically analyse (such as unknown
OGG/ID3 tags, or unknown types of files); and for metadata properties
it thinks might be relevant but can't automatically provide, it
suggests them with an empty property declaration, commented out. The
idea is that, after generating a manifest, you read it by hand in a
text editor to attempt to improve it.
-

The format of a manifest file

- -Manifest files have a relatively simple format. They are based on
Scheme s-expressions, so can contain comments. From any semicolon (not
in a string or otherwise quoted) to the end of the line is a comment,
and #; in front of something comments out that something.
-
-Import metadata properties are specified like so:
-
- (KEY = VALUE)
-
-...where, as usual, VALUE must be double-quoted if it's a
string.
-
-Files to import, with their metadata, are specified like so:
-
- (object "PATH OF FILE TO IMPORT"
- (KEY = VALUE)
- (KEY = VALUE)...
- )
-
-The closing parenthesis need not be on a line of its own; it's
conventionally placed after the closing parenthesis of the final
property.
-
-Ugarit, when importing the files in the manifest, will add the
following properties if they are not already specified:
-
-
import-path
-
The path the file was imported from
- -
dc:format
-
A guess at the file's MIME type, based on the extension
- -
mtime
-
The file's modification time (as the number of seconds since the -UNIX epoch)
- -
ctime
-
The file's change time (as the number of seconds since the UNIX -epoch)
- -
filename
-
The name of the file, stripped of any directory components, and -including the extension.
- -
- -The following properties are placed in the import metadata, -automatically: - -
-
hostname
-
The hostname the import was performed on.
- -
manifest-path
-
The path to the manifest file used for the import.
- -
mtime
-
The time (in seconds since the UNIX epoch) at which the import was -committed.
- -
stats
-
A Scheme alist of statistics about the import (number of -files/blocks uploaded, etc).
-
- -So, to wrap that all up, here's a sample import manifest file: - - -(notes = "A bunch of old CDs I've finally ripped") - -(object "/home/alaric/newrip/track01.mp3" - (filename = "track01.mp3") - (dc:format = "audio/mpeg") - - (dc:publisher = "Go! Beat Records") - (dc:created = "1994") - (dc:contributor = "Portishead") - (dc:subject = "Trip-Hop") - (superset:size = 1) - (superset:index = 1) - (set:title = "Dummy") - (set:size = 11) - (set:index = 1) - (dc:creator = "Portishead") - (dc:title = "Wandering Star") - - (mtime = 1428962299.0) - (ctime = 1428962299.0) - (file-size = 4703055)) - -;;... and so on, for ten more MP3s on this CD, then several other CDs... - - -
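The manifest format above is simple enough to generate mechanically. Here is a minimal sketch in Python of emitting one (object ...) entry per file; it is not the Ugarit Manifest Maker, only an illustration of the format, and its property set and MIME-type guessing are deliberately simplified:

```python
import mimetypes
import os

def manifest_entry(path):
    """Emit one (object ...) form in the manifest format described above."""
    st = os.stat(path)
    # Properties Ugarit would otherwise fill in itself; dc:format is a
    # guess from the file extension, as the documentation describes.
    props = [
        ("filename", '"%s"' % os.path.basename(path)),
        ("dc:format", '"%s"' % (mimetypes.guess_type(path)[0]
                                or "application/octet-stream")),
        ("mtime", repr(st.st_mtime)),
        ("file-size", str(st.st_size)),
    ]
    lines = ['(object "%s"' % path]
    lines += ["  (%s = %s)" % (key, value) for key, value in props]
    return "\n".join(lines) + ")"
```

A real manifest would add the descriptive dc:* and set:* properties by hand, as in the sample above.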

Actually importing a manifest

- -Well, when you finally have a manifest file, importing it is easy: - - $ ugarit import - -

How do I change the metadata of an already-imported file?

- -That's easy; the "current" metadata of a file is the metadata of its
most recent import. Just import the file again, in a new manifest, with new
metadata, and it will overwrite the old. However, the old metadata is
still preserved in the archive's history; tags forked from the archive
tag before the second import will still see the original state of the
archive, by design.
-

Exploring

- -Archives are visible in the explore interface. For instance, an import -of some music I did looks like this: - -
-> ls
-localhost-etc/ <tag>
-archive-tag/ <tag>
-> cd archive-tag
-/archive-tag> ls
-history/ <archive-history>
-/archive-tag> cd history
-/archive-tag/history> ls
-2015-06-12 22:53:13/ <import>
-/archive-tag/history> cd 2015-06-12 22:53:13
-/archive-tag/history/2015-06-12 22:53:13> ls
-log.sexpr <file>
-properties.sexpr <inline>
-manifest/ <import-manifest>
-/archive-tag/history/2015-06-12 22:53:13> cat properties.sexpr
-((stats (blocks-stored . 2046)
-        (bytes-stored . 1815317503)
-        (blocks-skipped . 9)
-        (bytes-skipped . 8388608)
-        (file-cache-hits . 0)
-        (file-cache-bytes . 0))
- (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
- (mtime . 1434135993.0)
- (contents . "fcdd5b996914fdcac1e8a6cfbc67663e08f6eaf0cc952e21")
- (hostname . "ahe")
- (notes . "A bunch of music, imported as a demo")
- (manifest-path . "/home/alaric/tmp/test.manifest"))
-/archive-tag/history/2015-06-12 22:53:13> cd manifest
-/archive-tag/history/2015-06-12 22:53:13/manifest> ls
-1d4269099189234eefeb80b95370eaf280730cf4d591004d:03 The Lemon Song.mp3 <file>
-7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 <file>
-64092fa12c2800dda474b41e5ebe8c948f39a59ee91c120b:09 How Many More Times.mp3 <file>
-1d79148d1e1e8947c50b44cf2d5690588787af328e82eeef:2-07 Going to California.mp3 <file>
-e3685148d0d12213074a9fdb94a00e05282aeabe77fa60d5:1-01 You Shook Me.mp3 <file>
-d73904f371af8d7ca2af1076881230f2dc1c2cf82416880a:03 Strangers.mp3 <file>
-9c5a0efb7d397180a1e8d42356d8f04c6c26a83d3b05d34a:09 Uptight.mp3 <file>
-01a069aec2e731e18fcdd4ecb0e424f346a2f0e16910f5e9:07 Numb.mp3 <file>
-7ea1ab7fbd525c40e21d6dd25130e8c70289ad56c09375b0:08 She.mp3 <file>
-009dacd8f3185b7caeb47050002e584ab86d08cf9e9aceec:1-03 Communication Breakdown.mp3 <file>
-26d264d629e22709f664ed891741f690900d45cd4fd44326:1-03 Dazed and Confused.mp3 <file>
-d879761195faf08e4e95a5a2398ea6eefb79920710bfeab6:1-10 Band Introduction _ How Many More Times.mp3 <file>
-83244601db42677d110fc8522c6a3cbbc1f22966a779f876:06 All My Love.mp3 <file>
-5eebee9a2ad79d04e4f69e9e2a92c4e0a8d5f21e670f89da:07 Tangerine.mp3 <file>
-dd6f1203b5973ecd00d2c0cee18087030490230727591746:2-08 That's the Way.mp3 <file>
-c0acea15aa27a6dd1bcaff1c13d4f3d741a40a46abeca3fc:04 The Crunge.mp3 <file>
-ea7727ad07c6c82e5c9c7218ee1b059cd78264c131c1438d:1-02 I Can't Quit You Baby.mp3 <file>
-10fda5f46b8f505ca965bcaf12252eedf5ab44514236f892:14 F.O.D..mp3 <file>
-a99ca9af5a83bde1c676c388dc273051defa88756df26e95:1-03 Good Times Bad Times.mp3 <file>
-b5d7cfe9808c7fc0dedbd656d44e4c56159cbd3c2ed963bb:1-15 Stairway to Heaven.mp3 <file>
-79c87e3c49ffdac175c95aae071f63d3a9efdf2ddb84998c:08.Batmilk.ogg <file>
--- Press q then enter to stop or enter for more...
-q
-/archive-tag/history/2015-06-12 22:53:13/manifest> ls -ll 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
--r--------     -     - [2015-04-13 21:46:39] -/-: 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
-key: #f
-contents: "7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
-import-path: "/home/alaric/archive/sorted-music/Led Zeppelin/Led Zeppelin/04 Dazed and Confused.mp3"
-filename: "04 Dazed and Confused.mp3"
-dc:format: "audio/mpeg"
-dc:publisher: "Atlantic"
-dc:subject: "Classic Rock"
-dc:title: "Dazed and Confused"
-dc:creator: "Led Zeppelin"
-dc:created: "1982"
-dc:contributor: "Led Zeppelin"
-set:title: "Led Zeppelin"
-set:index: 4
-set:size: 9
-superset:index: 1
-superset:size: 1
-ctime: 1428957999.0
-file-size: 15448903
-
-
-

Searching

- -However, the explore interface to an archive is far from pleasant. You
need to go to the correct import, and find your file by name, and then
identify it with a big long name composed of its hash and the original
filename to find its properties and extract it.
-
-I hope to add property-based searching to explore mode in future
(which is why you need to go into a history directory
within the archive directory, as other ways of exploring the archive
will appear alongside). This will be particularly useful when the
explore-mode virtual filesystem is mounted over 9P!
-
-However, even that interface, being constrained to look like a
filesystem, will be limited. The ugarit command-line tool
provides a very powerful search interface that exposes the full power
of the archive metadata.
-

Metadata filters

- -Files (and directories) in an archive can be searched for using -"metadata filters", which are descriptions of what you're looking for -that the computer can understand. They are represented as Scheme -s-expressions, and can be made up of the following components: - -
-
#t
-
This filter matches everything. It's not very useful.
- -
#f
-
This filter matches nothing. It's not very useful.
- -
(and FILTER FILTER...)
-
This filter matches files for which all of the inner filters match.
- -
(or FILTER FILTER...)
-
This filter matches files for which any of the inner filters match.
- -
(not FILTER)
-
This filter matches files which do not match the inner filter.
- -
(= ($ PROP) VALUE)
-
This filter matches files which have the given -PROPerty equal to that VALUE in their metadata.
- -
(= key HASH)
-
This filter matches the file with the given hash.
- -
(= ($import PROP) VALUE)
-
This filter matches files which have the given -PROPerty equal to that VALUE in the metadata -of the import that last imported them.
-
- -
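The filter language above can be modelled with a small evaluator. This sketch represents filters as nested Python tuples rather than Scheme s-expressions, and evaluates them against plain dictionaries of properties; it is an illustration of the semantics, not Ugarit's implementation:

```python
def matches(filt, props, key=None, import_props=None):
    """Evaluate a metadata filter against a file's properties.

    Filters mirror the s-expression forms above, as tuples:
      True / False, ("and", f...), ("or", f...), ("not", f),
      ("=", ("$", prop), value), ("=", "key", hash),
      ("=", ("$import", prop), value).
    """
    if filt is True or filt is False:
        return filt
    op = filt[0]
    if op == "and":
        return all(matches(f, props, key, import_props) for f in filt[1:])
    if op == "or":
        return any(matches(f, props, key, import_props) for f in filt[1:])
    if op == "not":
        return not matches(filt[1], props, key, import_props)
    if op == "=":
        lhs, value = filt[1], filt[2]
        if lhs == "key":
            return key == value  # match on the file's hash
        kind, prop = lhs  # ("$", prop) or ("$import", prop)
        source = props if kind == "$" else (import_props or {})
        return source.get(prop) == value
    raise ValueError("unknown filter: %r" % (filt,))
```

For example, the "Led Zeppelin" filter used in the next section becomes `("or", ("=", ("$", "dc:creator"), "Led Zeppelin"), ("=", ("$", "dc:contributor"), "Led Zeppelin"))`.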

Searching an archive

- -For a start, you can search for files matching a given metadata filter -in a given archive. This is done with: - - $ ugarit search - -For instance, let's look for music by Led Zeppelin: - - $ ugarit search ugarit.conf music '(or - (= ($ dc:creator) "Led Zeppelin") - (= ($ dc:contributor) "Led Zeppelin"))' - -The result looks like the explore-mode view of an archive manifest, -listing the file's hash followed by its title and extension: - - -7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 -834a1619a59835e0c27b22801e3c829b40be583dadd19770:2-08 No Quarter.mp3 -9e8bc4954838bd9c671f275eb48595089257185750d63894:1-12 I Can't Quit You Baby.mp3 -6742b3bebcdd9cae5ec5403c585935403fa74d16ed076cf2:02 Friends (1).mp3 -07d161f4bd684e283f7f2cf26e0b732157a8e95ef66939c3:05 Carouselambra.mp3 -[...] - - -What of all our lovely metadata? You can view that if you add the word -"verbose" to the end of the command line, which allows you to specify -alternate output formats: - - $ ugarit search ugarit.conf music '(or - (= ($ dc:creator) "Led Zeppelin") - (= ($ dc:contributor) "Led Zeppelin"))' verbose - -Now the output looks like: - - -object a444ff6ef807b080b536155f58d246d633cab4a0eabef5bf - (ctime = 1428958660.0) - (dc:contributor = "Led Zeppelin") - (dc:created = "2008") - (dc:creator = "Led Zeppelin") -[... all the usual file properties omitted ...] - import a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c - (stats = ((blocks-stored . 2046) (bytes-stored . 1815317503) (blocks-skipped . 9) (bytes-skipped . 8388608) (file-cache-hits . 0) (file-cache-bytes . 0))) - (log = "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a") -[... all the usual import properties omitted ...] -object b4cadf48b2c07ccf0303fc4064b292cb222980b0d4223641 - (ctime = 1428958673.0) - (dc:contributor = "Led Zeppelin") - (dc:created = "2008") - (dc:creator = "Led Zeppelin") - (dc:creator = "Jimmy Page/John Paul Jones/Robert Plant") -[...and so on...] 
-
-
-As you can see, it lists the hash of each file, its metadata, the hash
-of the import that last imported it, and the metadata of that import.
-
-That's quite verbose, so you'd probably be wanting to take that as
-input to another program to do something nicer with it. But it's laid
-out for human reading, not for machine parsing. Thankfully, we have
-other formats for that, alist and
-alist-with-imports.
-
-They look like:
-
- $ ugarit search ugarit.conf music '(or
- (= ($ dc:creator) "Led Zeppelin")
- (= ($ dc:contributor) "Led Zeppelin"))' alist
-
-This outputs one Scheme s-expression list per match, the first element
-of which is the hash as a string, the rest of which is an alist of properties:
-
-
-("7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
- (ctime . 1428957999.0)
- (dc:contributor . "Led Zeppelin")
- (dc:created . "1982")
- (dc:creator . "Led Zeppelin")
-[... elided file properties ...]
- (superset:index . 1)
- (superset:size . 1))
-("77c960d09eb21ed72e434ddcde0bd3781a4f3d6ee7a6eb66"
- (ctime . 1428958981.0)
- (dc:contributor . "Led Zeppelin")
-[...]
-
-
- $ ugarit search ugarit.conf music '(or
- (= ($ dc:creator) "Led Zeppelin")
- (= ($ dc:contributor) "Led Zeppelin"))' alist-with-imports
-
-This outputs one s-expression list per match, with four
-elements. The first is the key string, the second is an alist of file
-properties, the third is the import's hash, and the last is an alist
-containing the import's properties. It looks like:
-
-
-("64fa08a0080aee6ef501c408fd44dfcc634cfcafd8006fc4"
- ((ctime . 1428958683.0)
- (dc:contributor . "Led Zeppelin")
- (dc:created . "2008")
- (dc:creator . "Led Zeppelin")
-[... elided file properties ...]
- (superset:index . 1)
- (superset:size . 1))
- "a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c"
- ((stats (blocks-stored . 2046)
- (bytes-stored . 1815317503)
-[... elided manifest properties ...]
- (manifest-path . "test.manifest")))
-("4cd56f916a63399b252976e842dcae0b87f058b5a60c93a4"
- ((ctime . 
1428958437.0) - (dc:contributor . "Led Zeppelin") -[...] - - -And finally, you might just want to get the hashes of matching files -(which are particularly useful for extraction operations, which we'll -come to next). To do this, specify a format of "keys", which outputs -one line per match, containing just the hash: - - $ ugarit search ugarit.conf music '(or - (= ($ dc:creator) "Led Zeppelin") - (= ($ dc:contributor) "Led Zeppelin"))' keys - - -ce6f6484337de772de9313038cb25d1b16e28028136cc291 -6af5c664cbfa1acb22a377e97aee35d94c0fc003d239dd0c -92e91e79b384478b5aab31bf1b2ff9e25e7e2c4b48575185 -6ddb9a41d4968468a904f05ecf7e0e73d2c7c7ad76bc394b -a074dddcef67cd93d92c6ffce845894aa56594674023f6e1 -4f65f735bbb00a6fda4bc887b370b3160f55e5e07ec37ffa -97cc8b8ba70c39387fc08ef62311b751aea4340d636eb421 -72358dbe3eb60da42eadcf6de325b2a6686f4e17ea41fa60 -[...] - - -However, to write filter expressions, you need to know what properties -you have available to search on. You might remember, or go for -standard properties, or look at existing files in verbose mode to find -some; but you can also just ask Ugarit what properties it has in an -archive, like so: - - $ ugarit search-props - -You can even ask what properties are available for files matching an -existing filter: - - $ ugarit search-props - -This is useful if you're interested in further narrowing down a -filter, and so only care about properties that files already matching -that filter have. - -For a bunch of music files imported with the Ugarit Manifest Maker, -you can expect to see something like this: - - -ctime -dc:contributor -dc:created -dc:creator -dc:format -dc:publisher -dc:subject -dc:title -file-size -filename -import-path -mtime -set:index -set:size -set:title -superset:index -superset:size - - -Now you know what properties to search, next you'll be wanting to know -what values to look for. 
Again, Ugarit has a command to query the -available values of any given property: - - $ ugarit search-values - -And you can limit that just to files matching a given filter: - - $ ugarit search-values - -The resulting list of values is ordered by popularity, so the most -widely-used values will be listed first. Let's see what genres of -music were in my sample of music files I imported: - - $ ugarit search-values test.conf archive-tag dc:subject - -The result is: - - -Classic Rock -Alternative & Punk -Electronic -Trip-Hop - - -Ok, let's now use a filter to find out what artists -(dc:creator) I have that made Trip-Hop music (what even -IS that?): - - $ ugarit search-values test.conf archive-tag \ - '(= ($ dc:subject) "Trip-Hop")' \ - dc:creator - -The result is: - - -Portishead - - -Ah, OK, now I know what "Trip-Hop" is. - -
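The "ordered by popularity" behaviour of search-values is worth being precise about: distinct values are counted across all matching files, then listed most common first. Conceptually (a sketch against plain dictionaries, not Ugarit's actual code):

```python
from collections import Counter

def values_by_popularity(property_name, files):
    """Return the distinct values of a property across a set of files,
    most widely-used first, mirroring what search-values reports."""
    counts = Counter(
        f[property_name] for f in files if property_name in f
    )
    return [value for value, _ in counts.most_common()]
```

So in the dc:subject example above, "Classic Rock" listed first simply because more of the imported files carried that value.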

Extracting

- -All this searching is lovely, but what it gets us, in the end, is a -bunch of file hashes. Perhaps we might want to actually play some -music, or look at a photo, or something. To do that, we need to -extract from the archive. - -We've already seen the contents of an archive in the explore mode -virtual filesystem, so we could go into the archive history, find the -import, go into the manifest, pick the file out there, and use -get to extract it, but that would be yucky. Thankfully, -we have a command-line interface to get things from archives, in one -of two ways. - -Firstly, we can extract a file (or a directory tree) from an archive, -out into the local filesystem: - - $ ugarit archive-extract - -The "target" is the name to give it in the local filesystem. We could -pull out that Led Zeppelin song from our search results above, like so: - - $ ugarit archive-extract test.conf archive-tag \ - ce6f6484337de772de9313038cb25d1b16e28028136cc291 foo.mp3 - -We now have a foo.mp3 file in the current directory. - -However, sometimes it would be nicer to have it streamed to standard -output, which can be done like so: - - $ ugarit archive-stream - -This lets us write a command such as: - - $ ugarit archive-stream test.conf archive-tag \ - ce6f6484337de772de9313038cb25d1b16e28028136cc291 | mpg123 - - -...to play it in real time. - -

Storage administration

- -Each backend offers a number of commands for
administering the storage underlying vaults. These are accessible via
the ugarit-storage-admin command line interface.
-
-To use it, run it with the following command:
-
- $ ugarit-storage-admin ''
-
-The available commands differ between backends, but all backends
support the info and help commands, which
give basic information about the vault, and list all available
commands, respectively. Some offer a stats command that
examines the vault state to give interesting statistics, but which may
be a time-consuming operation.
-

Administering splitlog storages

- -The splitlog backend offers a wide selection of administrative -commands. See the help command on a splitlog vault for -details. The following commands are available: - -
- -
help
-
List the available commands.
- -
info
-
List some basic information about the storage.
- -
stats
-
Examine the metadata to provide overall statistics about the -archive. This may be a time-consuming operation on large -storages.
- -
set-block-size! BYTES
-
Sets the block size to the given number of bytes. This will affect -new blocks written to the storage, and leave existing blocks -untouched, even if they are larger than the new block size.
- -
set-max-logfile-size! BYTES
-
Sets the size at which a log file is finished and a new one -started (likewise, existing log files will be untouched; this will -only affect new log files)
- -
set-commit-interval! UPDATES
-
Sets the frequency of automatic synching of the storage -state to disk. Lowering this harms performance when writing to the -storage, but decreases the number of in-progress block writes that -can fail in a crash.
- -
write-protect!
-
Disables updating of the storage.
- -
write-unprotect!
-
Re-enables updating of the storage.
- -
reindex!
-
Reindex the storage, rebuilding the block and tag state from the -contents of the log. If the metadata file is damaged or lost, -reindexing can rebuild it (although any configuration changes made -via other admin commands will need manually repeating as they are -not logged).
-
- -

Administering sqlite storages

- -The sqlite backend has a similar administrative interface to the -splitlog backend, except that it does not have log files, so lacks the -set-max-logfile-size! and reindex! commands. - -

Administering cache storages

- -The cache backend provides a minimalistic interface: - -
- -
help
-
List the available commands.
- -
info
-
List some basic information about the storage.
- -
stats
-
Report on how many entries are in the cache.
- -
clear!
-
Clears the cache, dropping all the entries in it.
- -
- -

.ugarit files

- -By default, Ugarit will vault everything it finds in the filesystem
tree you tell it to snapshot. However, this might not always be
desired, so we provide the facility to override this with .ugarit
files, or global rules in your .conf file.
-
-Note: The syntax of these files is provisional; I want to
experiment with usability, as the current syntax is ugly. So please
don't be surprised if the format changes in incompatible ways in
subsequent versions!
-
-In quick summary, if you want to ignore all files or directories
matching a glob in the current directory and below, put the following
in a .ugarit file in that directory:
-
- (* (glob "*~") exclude)
-
-You can write quite complex expressions as well as just globs. The
full set of rules is:
-
- * (glob "pattern") matches files and directories whose names
 match the glob pattern
-
- * (name "name") matches files and directories with exactly that
 name (useful for files called *...)
-
- * (modified-within number seconds) matches files and
 directories modified within the given number of seconds
-
- * (modified-within number minutes) matches files and
 directories modified within the given number of minutes
-
- * (modified-within number hours) matches files and directories
 modified within the given number of hours
-
- * (modified-within number days) matches files and directories
 modified within the given number of days
-
- * (not rule) matches files and directories that do not match
 the given rule
-
- * (and rule rule...) matches files and directories that match
 all the given rules
-
- * (or rule rule...) 
matches files and directories that match - any of the given rules - -Also, you can override a previous exclusion with an explicit include -in a lower-level directory: - - (* (glob "*~") include) - -You can bind rules to specific directories, rather than to "this -directory and all beneath it", by specifying an absolute or relative -path instead of the `*`: - - ("/etc" (name "passwd") exclude) - -If you use a relative path, it's taken relative to the directory of -the .ugarit file. - -You can also put some rules in your .conf file, although relative -paths are illegal there, by adding lines of this form to the file: - - (rule * (glob "*~") exclude) - -
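The rule forms above compose recursively, which a short evaluator makes concrete. This is an illustrative Python sketch of the matching semantics only (rules as nested tuples, mtimes as seconds since the epoch), not Ugarit's actual implementation:

```python
import fnmatch
import time

def rule_matches(rule, name, mtime, now=None):
    """Evaluate a .ugarit-style matching rule against a file's
    name and modification time (seconds since the UNIX epoch)."""
    now = time.time() if now is None else now
    op = rule[0]
    if op == "glob":
        return fnmatch.fnmatch(name, rule[1])
    if op == "name":
        return name == rule[1]
    if op == "modified-within":
        amount, unit = rule[1], rule[2]
        seconds = amount * {"seconds": 1, "minutes": 60,
                            "hours": 3600, "days": 86400}[unit]
        return now - mtime <= seconds
    if op == "not":
        return not rule_matches(rule[1], name, mtime, now)
    if op == "and":
        return all(rule_matches(r, name, mtime, now) for r in rule[1:])
    if op == "or":
        return any(rule_matches(r, name, mtime, now) for r in rule[1:])
    raise ValueError("unknown rule: %r" % (rule,))
```

For instance, `("and", ("glob", "*.log"), ("not", ("name", "keep.log")))` matches every .log file except keep.log, just as the corresponding s-expression rule would.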

Questions and Answers

- -

What happens if a snapshot is interrupted?

- -Nothing! Whatever blocks have been uploaded will remain in the vault, but the
snapshot is only added to the tag once the entire filesystem has been
snapshotted. So just start the snapshot again. Any files that have
already been uploaded will then not need to be uploaded again, so the
second snapshot should proceed quickly to the point where it failed
before, and continue from there.
-
-Unless the vault ends up with a partially-uploaded corrupted block
due to being interrupted during upload, you'll be fine. The filesystem
backend has been written to avoid this by writing the block to a file
with the wrong name, then renaming it to the correct name when it's
entirely uploaded.
-
-Actually, there is *one* caveat: blocks that were uploaded, but never
make it into a finished snapshot, will be marked as "referenced" but
there's no snapshot to delete to un-reference them, so they'll never
be removed when you delete snapshots. (Not that snapshot deletion is
implemented yet, mind). If this becomes a problem for people, we could
write a "garbage collect" tool that regenerates the reference counts
in a vault, leading to unused blocks (with a zero refcount) being
unlinked.
-
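The write-then-rename trick used by the filesystem backend can be sketched as follows. The ".tmp" naming convention here is hypothetical (the real backend simply uses "the wrong name"); the point is the atomic rename:

```python
import os

def store_block_atomically(directory, key, data):
    """Write a block under a temporary name, then rename it into place.

    A reader never sees a half-written block: until the rename, the
    block only exists under the temporary name, and rename() within a
    single filesystem is atomic on POSIX systems.
    """
    final_path = os.path.join(directory, key)
    tmp_path = final_path + ".tmp"  # hypothetical temporary-name convention
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ensure the data reaches disk before the rename
    os.rename(tmp_path, final_path)
```

If the process dies mid-write, only the temporary file is left behind; the block under its real name either exists in full or not at all.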

Should I share a single large vault between all my filesystems?

- -I think so. Using a single large vault means that blocks shared -between servers - eg, software installed from packages and that sort -of thing - will only ever need to be uploaded once, saving storage -space and upload bandwidth. However, do not share a vault between -servers that do not mutually trust each other, as they can all update -the same tags, so can meddle with each other's snapshots - and read -each other's snapshots. - -

CAVEAT

- -It's not currently safe to have multiple concurrent snapshots to the -same split log backend; this will soon be fixed, however. - -

Security model

- -I have designed and implemented Ugarit to be able to handle cases -where the actual vault storage is not entirely trusted. - -However, security involves tradeoffs, and Ugarit is configurable in -ways that affect its resistance to different kinds of attacks. Here I -will list different kinds of attack and explain how Ugarit can deal -with them, and how you need to configure it to gain that -protection. - -

Vault snoopers

- -This might be somebody who can intercept Ugarit's communication with -the vault at any point, or who can read the vault itself at their -leisure. - -Ugarit's splitlog backend creates files with "rw-------" permissions -out of the box to try and prevent this. This is a pain for people who -want to share vaults between UIDs, but we can add a configuration -option to override this if that becomes a problem. - -

Reading your data

- -If you enable encryption, then all the blocks sent to the vault are -encrypted using a secret key stored in your Ugarit configuration -file. As long as that configuration file is kept safe, and the AES -algorithm is secure, then attackers who can snoop the vault cannot -decode your data blocks. Enabling compression will also help, as the -blocks are compressed before encrypting, which is thought to make -cryptographic analysis harder. - -Recommendations: Use compression and encryption when there is a risk -of vault snooping. Keep your Ugarit configuration file safe using -UNIX file permissions (make it readable only by root), and maybe store -it on a removable device that's only plugged in when -required. Alternatively, use the "prompt" passphrase option, and be -prompted for a passphrase every time you run Ugarit, so it isn't -stored on disk anywhere. - -

Looking for known hashes

- -A block is identified by the hash of its content (before compression -and encryption). If an attacker was trying to find people who own a -particular file (perhaps a piece of subversive literature), they could -search Ugarit vaults for its hash. - -However, Ugarit has the option to "key" the hash with a "salt" stored -in the Ugarit configuration file. This means that the hashes used are -actually a hash of the block's contents *and* the salt you supply. If -you do this with a random salt that you keep secret, then attackers -can't check your vault for known content just by comparing the hashes. - -Recommendations: Provide a secret string to your hash function in your -Ugarit configuration file. Keep the Ugarit configuration file safe, as -per the advice in the previous point. - -
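The effect of keying can be sketched with standard primitives; here HMAC-SHA256 stands in for the keyed hash (an illustrative choice on my part; Ugarit's actual keyed-hash construction and configured hash function may differ):

```python
import hashlib
import hmac

block = b"the same file everyone has"

# Unkeyed: anyone holding the same bytes computes the same key,
# so a snooper can test the vault for known files.
plain = hashlib.sha256(block).hexdigest()

# Keyed with a secret salt; HMAC-SHA256 stands in for the keyed hash here.
keyed = hmac.new(b"secret salt from ugarit.conf", block, hashlib.sha256).hexdigest()

# Without the salt, the keyed value is unpredictable to an attacker.
assert plain != keyed
```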

Vault modifiers

- -These folks can modify Ugarit's writes into the vault, its reads -back from the vault, or can modify the vault itself at their leisure. - -Modifying an encrypted block without knowing the encryption key can at -worst be a denial of service, corrupting the block in an unknown -way. An attacker who knows the encryption key could replace a block -with valid-seeming but incorrect content. In the worst case, this -could exploit a bug in the decompression engine, causing a crash or -even an exploit of the Ugarit process itself (thereby gaining the -powers of a process inspector, as documented below). We can but hope -that the decompression engine is robust. Exploits of the decryption -engine, or other parts of Ugarit, are less likely due to the nature of -the operations performed upon them. - -However, if a block is modified, then when Ugarit reads it back, the -hash will no longer match the hash Ugarit requested, which will be -detected and an error reported. The hash is checked after -decryption and decompression, so this check does not protect us -against exploits of the decompression engine. - -This protection is only afforded when the hash Ugarit asks for is not -tampered with. Most hashes are obtained from within other blocks, -which are therefore safe unless that block has been tampered with; the -nature of the hash tree conveys the trust in the hashes up to the -root. The root hashes are stored in the vault as "tags", which a -vault modifier could alter at will. Therefore, the tags cannot be -trusted if somebody might modify the vault. This is why Ugarit -prints out the snapshot hash and the root directory hash after -performing a snapshot, so you can record them securely outside of the -vault. - -The most likely threat posed by vault modifiers is that they could -simply corrupt or delete all of your vault, without needing to know -any encryption keys. - -Recommendations: Secure your vaults against modifiers, by whatever -means possible. 
If vault modifiers are still a potential threat, -write down a log of your root directory hashes from each snapshot, and keep -it safe. When extracting your backups, use the ls -ll command in the -interface to check the "contents" hash of your snapshots, and check -they match the root directory hash you expect. - -
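The read-back check described above can be sketched as follows (a Python toy, ignoring encryption and compression; the names are mine, not Ugarit's):

```python
import hashlib

def fetch(store, key):
    """Read a block back, then check it still hashes to the key we asked for."""
    block = store[key]
    if hashlib.sha256(block).hexdigest() != key:
        raise IOError("block does not match its hash: corrupted or tampered with")
    return block

store = {}
key = hashlib.sha256(b"directory block").hexdigest()
store[key] = b"directory block"
assert fetch(store, key) == b"directory block"  # intact blocks read back fine

store[key] = b"evil replacement"  # a vault modifier rewrites the block...
try:
    fetch(store, key)
    raise RuntimeError("tampering went undetected")
except IOError:
    pass  # ...and the mismatch is detected on read
```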

Process inspectors

- -These folks can attach debuggers or similar tools to running -processes, such as Ugarit itself. - -Ugarit backend processes only see encrypted data, so people who can -attach to that process gain the powers of vault snoopers and -modifiers, and the same conditions apply. - -People who can attach to the Ugarit process itself, however, will see -the original unencrypted content of your filesystem, and will have -full access to the encryption keys and hashing keys stored in your -Ugarit configuration. When Ugarit is running with sufficient -permissions to restore backups, they will be able to intercept and -modify the data as it comes out, and probably gain total write access -to your entire filesystem in the process. - -Recommendations: Ensure that Ugarit does not run under the same user -ID as untrusted software. In many cases it will need to run as root in -order to gain unfettered access to read the filesystems it is backing -up, or to restore the ownership of files. However, when all the files -it backs up are world-readable, it could run as an untrusted user for -backups, and where file ownership is trivially reconstructible, it can -do restores as a limited user, too. - -

Attackers in the source filesystem

- -These folks create files that Ugarit will back up one day. By having -write access to your filesystem, they already have some level of -power, and standard Unix security practices such as storage quotas -should be used to control them. They may be people with logins on your -box, or more subtly, people who can cause servers to write files; -somebody who sends an email to your mailserver will probably cause -that message to be written to queue files, as will people who can -upload files via any means. - -Such attackers might use up your available storage by creating large -files. This creates a problem in the actual filesystem, but that -problem can be fixed by deleting the files. If those files get -stored into Ugarit, then they are a part of that snapshot. If you -are using a backend that supports deletion, then (when I implement -snapshot deletion in the user interface) you could delete that entire -snapshot to recover the wasted space, but that is a rather serious -operation. - -More insidiously, such attackers might attempt to abuse a hash -collision in order to fool the vault. If they have a way of creating -a file that, for instance, has the same hash as your shadow password -file, then Ugarit will think that it already has that file when it -attempts to snapshot it, and store a reference to the existing -file. If that snapshot is restored, then they will receive a copy of -your shadow password file. Similarly, if they can predict a future -hash of your shadow password file, and create a shadow password file -of their own (perhaps one giving them a root account with a known -password) with that hash, they can then wait for the real shadow -password file to have that hash. If the system is later restored from -that snapshot, then their chosen content will appear in the shadow -password file. However, doing this requires a very fundamental break -of the hash function being used. 
- -Recommendations: Think carefully about who has write access to your -filesystems, directly or indirectly via a network service that stores -received data to disk. Enforce quotas where appropriate, and consider -not backing up "queue directories" where untrusted content might -appear; migrate incoming content that passes acceptance tests to an -area that is backed up. If necessary, the queue might be backed up to -a non-snapshotting system, such as rsyncing to another server, so that -any excessive files that appear in there are removed from the backup -in due course, while still affording protection. +

Documentation

+ + * [./docs/intro.wiki|Introduction to Ugarit] + * [./docs/installation.wiki|Installation and Configuration] + * [./docs/commands.wiki|Command reference] + * [./docs/storage-admin.wiki|Storage backend administration] + * [./docs/dot-ugarit.wiki|Fine-tuning snapshots with .ugarit files] + * [./docs/archive-schema.wiki|Archive metadata schema] + * [./docs/faq.wiki|Frequently Asked Questions] + * [./docs/security.wiki|Security guide]

Acknowledgements

The Ugarit implementation contained herein is the work of Alaric Snell-Pym and Christian Kellermann, with advice, ideas, encouragement @@ -1863,166 +110,8 @@ And I'd like to thank my wife for putting up with me spending several evenings and weekends and holiday days working on this thing...

Version history

- * 2.0: Archival mode [dae5e21ffc], and to support its integration - into Ugarit, implemented typed tags [08bf026f5a], displaying tag - types in the VFS [30054df0b6], refactoring the Ugarit internals - [5fa161239c], made the storage of logs in the vault better - [68bb75789f], made it possible to view logs from within the VFS - [4e3673e0fe], supported hidden tags [cf5ef4691c], recording - configuration information in the vault (and providing instant - notification if your vault hashing/encryption setup is incorrect, - thanks to a clever idea by Andy Bennett) [0500d282fc], rearranged - how local caching is handled [b5911d321a], and added support for - the history of a snapshot or archive tag to have arbitrary - branches and merges [a987e28fef], which (as a side-effect) - improved the performance of running "ls" in long snapshot - histories [fcf8bc942a]. Also added an sqlite backend - [8719dfb84f], which makes testing easier but is useful in its own - right as it's fully-featured and crash-safe, while storing the - vault in a single file; and improved the appearance of the - explore mode ls command, as the VFS layout has become more - complex with the new log/properties views and all the archive - mode stuff. + * 2015-06-12: [./docs/release-2.0.wiki|Version 2.0] - * 1.0.9: More humane display of sizes in explore's directory - listings, using low-level I/O to reduce CPU usage. Myriad small - bug fixes and some internal structural improvements. - - * 1.0.8: Bug fixes to work with the latest chicken master, and - increased unit test coverage to test stuff that wasn't working - due to chicken bugs. Looking good! - - * 1.0.7: Fixed bug with directory rules (errors arose when files - were skipped). I need to improve the test suite coverage of - high-level components to stop this happening! 
- - * 1.0.6: Fixed missing features from v1.0.5 due to a fluffed merge - (whoops), added tracking of directory sizes (files+bytes) in the - vault on snapshot and the use of this information to display - overall percentage completion when extracting. Directory sizes - can be seen in the explore interface when doing "ls -l" or "ls -ll". - - * 1.0.5: Changed the VFS layout slightly, making the existence of - snapshot objects explicit (when you go into a tag, then go into a - snapshot, you now need to go into "contents" to see the actual - file tree; the snapshot object itself now exists as a node in the - tree). Added traverse-vault-* functions to the core API, and tests - for same, and used traverse-vault-node to drive the cd and get - functions in the interactive explore mode (speeding them up in the - process!). Added "extract" command. Added a progress reporting - callback facility for snapshots and extractions, and used it to - provide progress reporting in the front-end, every 60 seconds or - so by default, not at all with -q, and every time something - happens with -v. Added tab completion in explore mode. - - * 1.0.4: Resurrected support for compression and encryption and SHA2 - hashes, which had been broken by the failure of the - autoload egg to continue to work as it used to. Tidying - up error and ^C handling somewhat. - - * 1.0.3: Installed sqlite busy handlers to retry when the database is - locked due to concurrent access (affects backend-fs, backend-cache, - and the file cache), and gained an EXCLUSIVE lock when locking a - tag in backend-fs; I'm not clear if it's necessary, but it can't - hurt. - - BUGFIX: Logging of messages from storage backends wasn't - happening correctly in the Ugarit core, leading to errors when the - cache backend (which logs an info message at close time) was closed - and the log message had nowhere to go. - - * 1.0.2: Made the file cache also commit periodically, rather than on - every write, in order to improve performance. 
Counting blocks and - bytes uploaded / reused, and file cache bytes as well as hits; - reporting same in snapshot UI and logging same to snapshot - metadata. Switched to the posix-extras egg and ditched our own - posixextras.scm wrappers. Used the parley egg in the ugarit - explore CLI for line editing. Added logging infrastructure, - recording of snapshot logs in the snapshot. Added recovery from - extraction errors. Listed lock state of tags in explore - mode. Backend protocol v2 introduced (retaining v1 for - compatibility) allowing for an error on backend startup, and logging - nonfatal errors, warnings, and info on startup and all protocol - calls. Added ugarit-archive-admin command line interface to - backend-specific administrative interfaces. Configuration of the - splitlog backend (write protection, adjusting block size and logfile - size limit and commit interval) is now possible via the admin - interface. The admin interface also permits rebuilding the metadata - index of a splitlog vault with the reindex! admin command. - - BUGFIX: Made the file cache check that the file hashes it finds in the - cache actually exist in the vault, to protect against the case - where a crash of some kind has caused unflushed changes to be - lost; the file cache may well have committed changes that the - backend hasn't, leading to references to nonexistent blocks. Note - that we assume that vaults are sequentially safe, eg if the - final indirect block of a large file made it, all the partial - blocks must have made it too. - - BUGFIX: Added an explicit flush! command to the backend - protocol, and put explicit flushes at critical points in higher - layers (backend-cache, the vault abstraction in the Ugarit - core, and when tagging a snapshot) so that we ensure the blocks we - point at are flushed before committing references to them in the - backend-cache or file caches, or into tags, to ensure crash - safety. 
- - BUGFIX: Made the splitlog backend never exceed the file size limit - (except when passed blocks that, plus a header, are larger than - it), rather than letting a partial block hang over the 'end'. - - BUGFIX: Fixed tag locking, which was broken all over the - place. Concurrent snapshots to the same tag should now block for - one another, although why you'd want to *do* that is questionable. - - BUGFIX: Fixed generation of non-keyed hashes, which was - incorrectly appending the type to the hash without an outer - hash. This breaks backwards compatibility, but nobody was using - the old algorithm, right? I'll introduce it as an option if - required. - - * 1.0.1: Consistency check on read blocks by default. Removed warning - about deletions from backend-cache; we need a new mechanism to - report warnings from backends to the user. Made backend-cache and - backend-fs/splitlog commit periodically rather than after every - insert, which should speed up snapshotting a lot, and reused the - prepared statements rather than re-preparing them all the - time. - - BUGFIX: splitlog backend now creates log files with - "rw-------" rather than "rwx------" permissions; and all sqlite - databases (splitlog metadata, cache file, and file-cache file) are - created with "rw-------" rather than "rw-r--r--". - - * 1.0: Migrated from gdbm to sqlite for metadata storage, removing the - GPL taint. Unit test suite. backend-cache made into a separate - backend binary. Removed backend-log. - - BUGFIX: file caching uses mtime *and* - size now, rather than just mtime. Error handling so we skip objects - that we cannot do something with, and proceed to try the rest of the - operation. - - * 0.8: decoupling backends from the core and into separate binaries, - accessed via standard input and output, so they can be run over SSH - tunnels and other such magic. 
- - * 0.7: file cache support, sorting of directories so they're archived - in canonical order, autoloading of hash/encryption/compression - modules so they're not required dependencies any more. - - * 0.6: .ugarit support. - - * 0.5: Keyed hashing so attackers can't tell what blocks you have, - markers in logs so the index can be reconstructed, sha2 support, and - passphrase support. - - * 0.4: AES encryption. - - * 0.3: Added splitlog backend, and fixed a .meta file typo. - - * 0.2: Initial public release. - - * 0.1: Internal development release. + * [./docs/release-old.wiki|Previous Releases] Index: RELEASE.wiki ================================================================== --- RELEASE.wiki +++ RELEASE.wiki @@ -1,18 +1,25 @@ +The tip of trunk is "what's live"; the documentation, ugarit.setup, +and ugarit.release-info from there is what gets served to the public +at the canonical URLs. + +Do not merge documentation changes onto the trunk until you're +releasing, or the live docs will be ahead of the available version! + How to do a release: * Merge desired changes onto the trunk * Update ugarit.setup to set the new version * Install and test to make sure you didn't break it! - * Update ugarit.release-info to refer to the new release * Commit, and tag the commit with the version number * Run ../kitten-technologies/bin/generate-download-page to update DOWNLOAD.wiki + * Update ugarit.release-info to refer to the new release * Commit again * Announce on Google Plus etc. See also: http://www.kitten-technologies.co.uk/project/kitten-technologies/doc/trunk/README.wiki In future, expand this with a way of tagging a pre-release beta in Fossil for fossil followers to try out, before we tag it for henrietta. Index: docs/archive-schema.wiki ================================================================== --- docs/archive-schema.wiki +++ docs/archive-schema.wiki @@ -1,73 +1,201 @@ +

Ugarit Archive Metadata Schema

+ Any symbol can be used as an archive metadata property name, but here are some -standard ones, defined for the sake of interoperability. - -

System-provided import properties

- -previous -contents -mtime -log -stats -hostname -manifest-path - -

System-provided object properties

- -import-path - full path to imported file -filename - filename and extension -dc:format - guessed MIME type - -

Object properties provided by the manifest maker

- -file-size -mtime -ctime -filename -dc:title - made from file name, or in-file metadata -dc:format - MIME type +suggested ones, defined for the sake of interoperability. + +Where possible, we have used the +[http://dublincore.org/documents/2001/04/12/usageguide/generic.shtml|Dublin +Core] vocabulary, as it's a good fit for the kinds of things archive +mode is designed for. Properties imported from Dublin Core are +identified with a dc: prefix. + +Some of these properties are automatically applied by the import +process. However, if these properties are specified in the import +manifest file, then the specified value from the manifest overrides +the default. + +

Import properties

+ +These are properties applied to an import object, rather than to an +individual object in an archive. + +

Internal

+ +These properties are all provided by the system itself, and must not +be specified in an import manifest. + +
+
previous (hash)
+
The hash of a previous import. If there is +no instance of this property, then this is the first import in a +sequence. If there is more than one instance, then this is a +merge.
+ +
contents (hash)
+
The hash of the imported archive manifest. This is probably not of +much interest beyond the Ugarit internals.
+ +
mtime (number)
+
The UNIX timestamp of the import.
+ +
log (hash)
+
The hash of the import log file.
+ +
stats (alist)
+
An alist of import statistics.
+ +
manifest-path
+
The path to the manifest file that was used for the import.
+ +
hostname
+
The hostname on which the import was performed.
+ +
+ +

Core object properties

+ +These object properties apply usefully to almost anything in an archive. + +
import-path
+
The path the file was imported from, as taken from the import +manifest file. (DEFAULT: The path from the manifest file)
+ +
filename
+
The name of the file, including the extension (if applicable), but +not any directory path. This is usually the name the file had when it +was imported (eg, the latter part of import-path), but if +it was imported from some temporary file name while the system knows +of a "proper" filename other than that, they may differ. (DEFAULT: The +import path, minus any directory path)
+ +
dc:format
+
The MIME type of the file. (DEFAULT: A MIME type guessed from the +file extension)
+ +
file-size
+
The size of the file. If it's a directory, then this is the sum of +the sizes of the files within it, not including any directory +metadata.
+ +
mtime (number)
+
The mtime of the file when it was imported, as a UNIX +timestamp.
+ +
ctime
+
The ctime of the file when it was imported, as a UNIX +timestamp.
+ +
dc:title
+
The title of the object. This should be a proper human-readable +title, not just a filename, where possible.
+ +
dc:description
+
A longer description of the object.

Object properties for music

-dc:title -dc:creator -dc:contributor -dc:publisher -dc:created - date -dc:subject - genre -set:title - title of album -set:index - track number -set:size - track count -superset:index - disc number -superset:size - number of discs +Music files should put the song title in dc:title. + +
dc:creator
+
The creator of the piece, generally the artist name.
+ +
dc:contributor
+
Some other contributor to the piece, other than the artist.
+ +
dc:publisher
+
The name of the publisher.
+ +
dc:created
+
The creation date, in YYYY-MM-DD form.
+ +
dc:subject
+
The name of the genre.
+ +
set:title
+
The title of the album.
+ +
set:index
+
Track number within the album.
+ +
set:size
+
Track count within the album.
+ +
superset:index
+
For multi-disk albums, the disk number.
+ +
superset:size
+
For multi-disk albums, the number of disks.

Object properties for photographs

-dc:creator - photographer -dc:description -dc:subject - keyword, person/thing in photo -dc:spatial - place name, or lat/long/alt -dc:temporal - name of event featured -dc:created - timestamp +Use dc:description for a description of the photo. + +
dc:creator
+
The name of the photographer.
+ +
dc:subject
+
Something in the photograph (names of photographed people or +things, or more general keywords).
+ +
dc:spatial
+
The name of the place the photo was taken, or coordinates as a +[https://en.wikipedia.org/wiki/Geo_URI|geo: URL].
+ +
dc:temporal
+
The name of the event the photograph was from.
+ +
dc:created
+
The creation timestamp of the photo, in YYYY-MM-DD format, +optionally with a 24-hour UTC HH:MM:SS time.

Object properties for PDF/PS/ebooks

-dc:title -dc:creator -dc:subject -dc:description -dc:created -dc:publisher -dc:identifier - ISBN -dc:source - download from URL +Use dc:title for the title of the work. + +
dc:creator
+
The name of the author.
+ +
dc:subject
+
A subject or keyword.
+ +
dc:created
+
The creation date in YYYY-MM-DD format.
+ +
dc:publisher
+
The name of the publisher.
+ +
dc:identifier
+
An ISBN, ISSN, or similar identifier, in +[https://en.wikipedia.org/wiki/Uniform_resource_name|URN format] (eg: +urn:isbn:0451450523).
+ +
dc:source
+
The original URL the thing was downloaded from.

Other useful Dublin Core properties

-See -[http://dublincore.org/documents/2001/04/12/usageguide/generic.shtml] -for inspiration. +
dc:alternative
+
An alternative title.
+ +
dc:extent
+
Size, duration, etc. Not the size of the file in bytes, but the +duration of a recording, the size of an image in pixels, etc.
+ +
dc:language
+
The language of the object. en, en-GB, +jbo, etc.
+ +
dc:license
+
A description of the license the file is under.
+ +
dc:accessRights
+
A space-separated list of names of groups that should be allowed to +access the object, under some means of publishing all or part of an +archive. public should refer to unrestricted access.
+ +

Please contribute!

-dc:alternative - alternative name -dc:extent - size, duration, etc. -dc:language - "en", "jbo", etc -dc:license - licensing statement -dc:accessRights - "public" or "private" (the latter being the default) +The above are the conventions I have started to settle towards with +the kinds of things I am using Ugarit archives for. If you use it for +something else, please drop me a line and I'll be glad to help you +choose a good schema, and publish the results here for others to share! ADDED docs/commands.wiki Index: docs/commands.wiki ================================================================== --- docs/commands.wiki +++ docs/commands.wiki @@ -0,0 +1,726 @@ +

Ugarit command-line reference

+ +

Your first backup

+ +Think of a tag to identify the filesystem you're backing up. If it's +/home on the server gandalf, you might call it gandalf-home. If +it's the entire filesystem of the server bilbo, you might just call +it bilbo. + +Then from your shell, run (as root): + +
# ugarit snapshot <ugarit.conf> [-c] [-a] <tag> <path>
+ +For example, if we have a ugarit.conf in the current directory: + +
# ugarit snapshot ugarit.conf -c localhost-etc /etc
+ +Specify the -c flag if you want to store ctimes in the vault; +since it's impossible to restore ctimes when extracting from a +vault, doing this is useful only for informational purposes, so it's +not done by default. Similarly, atimes aren't stored in the vault +unless you specify -a, because otherwise, there will be a lot of +directory blocks uploaded on every snapshot, as the atime of every +file will have been changed by the previous snapshot - so with -a +specified, on every snapshot, every directory in your filesystem will +be uploaded! Ugarit will happily restore atimes if they are found in +a vault; their storage is made optional simply because uploading +them is costly and rarely useful. + +

Exploring the vault

+ +Now you have a backup, you can explore the contents of the +vault. This need not be done as root, as long as you can read +ugarit.conf; however, if you want to extract files, run it as root +so the uids and gids can be set. + +
$ ugarit explore ugarit.conf
+ +This will put you into an interactive shell exploring a virtual +filesystem. The root directory contains an entry for every tag; if you +type ls you should see your tag listed, and within that +tag, you'll find a list of snapshots, in descending date order, with a +special entry current for the most recent +snapshot. Within a snapshot, you'll find the root directory of your +snapshot under contents, and the details of the snapshot itself in +properties.sexpr, and will be able to cd into +subdirectories, and so on: + +
> ls
+localhost-etc/ 
+> cd localhost-etc
+/localhost-etc> ls
+current/ 
+2015-06-12 22:49:34/ 
+2015-06-12 22:49:25/ 
+/localhost-etc> cd current
+/localhost-etc/current> ls
+log.sexpr 
+properties.sexpr 
+contents/ 
+/localhost-etc/current> cat properties.sexpr
+((previous . "a140e6dbe0a7a38f8b8c381323997c23e51a39e2593afb61")
+ (mtime . 1434102574.0)
+ (contents . "34eccf1f5141187e4209cfa354fdea749a0c3c1c4682ec86")
+ (stats (blocks-stored . 12)
+  (bytes-stored . 16889)
+  (blocks-skipped . 50)
+  (bytes-skipped . 6567341)
+  (file-cache-hits . 0)
+  (file-cache-bytes . 0))
+ (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
+ (hostname . "ahe")
+ (source-path . "/etc")
+ (notes)
+ (files . 112)
+ (size . 6563588))
+/localhost-etc/current> cd contents
+/localhost-etc/current/contents> ls
+zoneinfo 
+vconsole.conf 
+udev/ 
+tmpfiles.d/ 
+systemd/ 
+sysctl.d/ 
+sudoers.tmp~ 
+sudoers 
+subuid 
+subgid 
+static 
+ssl/ 
+ssh/ 
+shells 
+shadow- 
+shadow 
+services 
+samba/ 
+rpc 
+resolvconf.conf 
+resolv.conf 
+-- Press q then enter to stop or enter for more...
+q
+/localhost-etc/current/contents> ls -ll resolv.conf
+-rw-r--r--     0     0 [2015-05-23 23:22:41] 78B/-: resolv.conf
+key: #f
+contents: "e33ea1394cd2a67fe6caab9af99f66a4a1cc50e8929d3550"
+size: 78
+ctime: 1432419761.0
+ +As well as exploring around, you can also extract files or directories +(or entire snapshots) by using the get command. Ugarit +will do its best to restore the metadata of files, subject to the +rights of the user you run it as. + +Type help to get help in the interactive shell. + +The interactive shell supports command-line editing, history and tab +completion for your convenience. + +

Extracting things directly

+ +As well as using the interactive explore mode, it is also possible to +directly extract something from the vault, given a path. + +Given the sample vault from the previous example, it would be possible +to extract the resolv.conf file with the following +command: + +
$ ugarit extract ugarit.conf /localhost-etc/current/contents/resolv.conf
+ +

Forking tags

+ +As mentioned above, you can fork a tag, creating two tags that +refer to the same snapshot and its history but that can then have +their own subsequent history of snapshots applied to each +independently, with the following command: + +
$ ugarit fork <ugarit.conf> <existing tag> <new tag>
+ +

Merging tags

+ +And you can also merge two or more tags into one. It's possible to +merge a bunch of tags to make an entirely new tag, or you can merge a +tag into an existing tag, by having the "output" tag also be one of +the "input" tags. + +The command to do this is: + +
$ ugarit merge <ugarit.conf> <output tag> <input tag> ...
+ +For instance, to import your classical music collection into your main +musical collection, you might do: + +
$ ugarit merge ugarit.conf my-music my-music classical-music
+ +Or if you want to create a new all-music archive from the archives +bobs-music and petes-music, you might do: + +
$ ugarit merge ugarit.conf all-music bobs-music petes-music
+ +

Archive operations

+ +

Importing

+ +To import some files into an archive, you must create a manifest file +listing them, and their metadata. The manifest can also list +metadata for the import as a whole, perhaps naming the source of the +files, or the reason for importing them. + +The metadata for a file (or an import) is a series of named +properties. The value of a property can be any Scheme value, written +in Scheme syntax (with strings double-quoted unless they are to be +interpreted as symbols), but strings and numbers are the most useful +types. + +You can use whatever names you like for properties in metadata, but +there are some that the system applies automatically, and an informal +standard of sorts, which is documented in [docs/archive-schema.wiki]. + +You can produce a manifest file by hand, or use the Ugarit Manifest +Maker to produce one for you. You do this by installing it like so: + +
$ chicken-install ugarit-manifest-maker
+
+Then run it, giving it any number of file and directory names
+on the command line. When given directories, it will recursively scan
+them to find all the files contained therein and put those in the
+manifest; it will not put the directories themselves in the manifest,
+although it is perfectly legal for you to do so when writing a
+manifest by hand. This is because the manifest maker can't do much
+useful analysis on a directory to suggest default metadata for it;
+naming directories is more useful as an easy way to import a large
+number of files individually.
+
+The manifest is sent to standard output, so you need to redirect it to
+a file, like so:
+
$ ugarit-manifest-maker ~/music > music.manifest
+ +You can specify command-line options, as well. -e PATTERN +or --exclude=PATTERN introduces a glob pattern for files +to exclude from the manifest, and -D KEY=VALUE or +--define=KEY=VALUE provides a property to be added to +every file in the manifest (as opposed to an import property, that is +part of the metadata of the overall import). Note that +VALUE must be double-quoted if it's a string, as per +Scheme value syntax. + +One might use this like so: + +
$ ugarit-manifest-maker -e '*.txt' -D rating=5 ~/favourite-music > music.manifest
+ +The manifest maker simplifies the writing of manifests for files, by +listing the files in manifest format along with useful metadata +extracted from the filename and the file itself. For supported file +types (currently, MP3 and OGG music files), it will even look inside +the file to extract metadata. + +The manifest file it generates will contain lots of comments +mentioning things it couldn't automatically analyse (such as unknown +OGG/ID3 tags, or unknown types of files); and for metadata properties +it thinks might be relevant but can't automatically provide, it +suggests them with an empty property declaration, commented out. The +idea is that, after generating a manifest, you read it by hand in a +text editor to attempt to improve it. + +

The format of a manifest file

+
+Manifest files have a relatively simple format. They are based on
+Scheme s-expressions, so they can contain comments: everything from a
+semicolon (not in a string or otherwise quoted) to the end of the
+line is a comment, and #; in front of an expression comments out
+that expression.
+
+Import metadata properties are specified like so:
+
(KEY = VALUE)
+ +...where, as usual, VALUE must be double-quoted if it's a +string. + +Files to import, with their metadata, are specified like so: + +
(object "PATH OF FILE TO IMPORT"
+  (KEY = VALUE)
+  (KEY = VALUE)...
+)
+
+The closing parenthesis need not be on a line of its own; it's
+conventionally placed after the closing parenthesis of the final
+property.
+
+Ugarit, when importing the files in the manifest, will add the
+following properties if they are not already specified:
+
+
import-path
+
The path the file was imported from
+ +
dc:format
+
A guess at the file's MIME type, based on the extension
+ +
mtime
+
The file's modification time (as the number of seconds since the +UNIX epoch)
+ +
ctime
+
The file's change time (as the number of seconds since the UNIX +epoch)
+ +
filename
+
The name of the file, stripped of any directory components, and +including the extension.
+ +
+ +The following properties are placed in the import metadata, +automatically: + +
+
hostname
+
The hostname the import was performed on.
+ +
manifest-path
+
The path to the manifest file used for the import.
+ +
mtime
+
The time (in seconds since the UNIX epoch) at which the import was +committed.
+ +
stats
+
A Scheme alist of statistics about the import (number of +files/blocks uploaded, etc).
+
+ +So, to wrap that all up, here's a sample import manifest file: + + +(notes = "A bunch of old CDs I've finally ripped") + +(object "/home/alaric/newrip/track01.mp3" + (filename = "track01.mp3") + (dc:format = "audio/mpeg") + + (dc:publisher = "Go! Beat Records") + (dc:created = "1994") + (dc:contributor = "Portishead") + (dc:subject = "Trip-Hop") + (superset:size = 1) + (superset:index = 1) + (set:title = "Dummy") + (set:size = 11) + (set:index = 1) + (dc:creator = "Portishead") + (dc:title = "Wandering Star") + + (mtime = 1428962299.0) + (ctime = 1428962299.0) + (file-size = 4703055)) + +;;... and so on, for ten more MP3s on this CD, then several other CDs... + + +

Actually importing a manifest

+ +Well, when you finally have a manifest file, importing it is easy: + +
$ ugarit import <ugarit.conf> <archive tag> <manifest file>
+ +

How do I change the metadata of an already-imported file?

+
+That's easy; the "current" metadata of a file is the metadata from its
+most recent import. Just import the file again, in a new manifest,
+with new metadata, and it will overwrite the old. However, the old
+metadata is still preserved in the archive's history; tags forked from
+the archive tag before the second import will still see the original
+state of the archive, by design.
+
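+
+For instance, a later import correcting the genre of a track might
+look like this (a purely illustrative manifest; the path and values
+are made up). Since the new metadata overwrites the old, restate any
+properties you want to keep:
+
+(notes = "Correcting the genre of track01")
+
+(object "/home/alaric/newrip/track01.mp3"
+ (filename = "track01.mp3")
+ (dc:format = "audio/mpeg")
+ (dc:creator = "Portishead")
+ (dc:title = "Wandering Star")
+ (dc:subject = "Trip-Hop"))
+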

Exploring

+ +Archives are visible in the explore interface. For instance, an import +of some music I did looks like this: + +
> ls
+localhost-etc/ <tag>
+archive-tag/ <tag>
+> cd archive-tag
+/archive-tag> ls
+history/ <archive-history>
+/archive-tag> cd history
+/archive-tag/history> ls
+2015-06-12 22:53:13/ <import>
+/archive-tag/history> cd 2015-06-12 22:53:13
+/archive-tag/history/2015-06-12 22:53:13> ls
+log.sexpr <file>
+properties.sexpr <inline>
+manifest/ <import-manifest>
+/archive-tag/history/2015-06-12 22:53:13> cat properties.sexpr
+((stats (blocks-stored . 2046)
+        (bytes-stored . 1815317503)
+        (blocks-skipped . 9)
+        (bytes-skipped . 8388608)
+        (file-cache-hits . 0)
+        (file-cache-bytes . 0))
+ (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
+ (mtime . 1434135993.0)
+ (contents . "fcdd5b996914fdcac1e8a6cfbc67663e08f6eaf0cc952e21")
+ (hostname . "ahe")
+ (notes . "A bunch of music, imported as a demo")
+ (manifest-path . "/home/alaric/tmp/test.manifest"))
+/archive-tag/history/2015-06-12 22:53:13> cd manifest
+/archive-tag/history/2015-06-12 22:53:13/manifest> ls
+1d4269099189234eefeb80b95370eaf280730cf4d591004d:03 The Lemon Song.mp3 <file>
+7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 <file>
+64092fa12c2800dda474b41e5ebe8c948f39a59ee91c120b:09 How Many More Times.mp3 <file>
+1d79148d1e1e8947c50b44cf2d5690588787af328e82eeef:2-07 Going to California.mp3 <file>
+e3685148d0d12213074a9fdb94a00e05282aeabe77fa60d5:1-01 You Shook Me.mp3 <file>
+d73904f371af8d7ca2af1076881230f2dc1c2cf82416880a:03 Strangers.mp3 <file>
+9c5a0efb7d397180a1e8d42356d8f04c6c26a83d3b05d34a:09 Uptight.mp3 <file>
+01a069aec2e731e18fcdd4ecb0e424f346a2f0e16910f5e9:07 Numb.mp3 <file>
+7ea1ab7fbd525c40e21d6dd25130e8c70289ad56c09375b0:08 She.mp3 <file>
+009dacd8f3185b7caeb47050002e584ab86d08cf9e9aceec:1-03 Communication Breakdown.mp3 <file>
+26d264d629e22709f664ed891741f690900d45cd4fd44326:1-03 Dazed and Confused.mp3 <file>
+d879761195faf08e4e95a5a2398ea6eefb79920710bfeab6:1-10 Band Introduction _ How Many More Times.mp3 <file>
+83244601db42677d110fc8522c6a3cbbc1f22966a779f876:06 All My Love.mp3 <file>
+5eebee9a2ad79d04e4f69e9e2a92c4e0a8d5f21e670f89da:07 Tangerine.mp3 <file>
+dd6f1203b5973ecd00d2c0cee18087030490230727591746:2-08 That's the Way.mp3 <file>
+c0acea15aa27a6dd1bcaff1c13d4f3d741a40a46abeca3fc:04 The Crunge.mp3 <file>
+ea7727ad07c6c82e5c9c7218ee1b059cd78264c131c1438d:1-02 I Can't Quit You Baby.mp3 <file>
+10fda5f46b8f505ca965bcaf12252eedf5ab44514236f892:14 F.O.D..mp3 <file>
+a99ca9af5a83bde1c676c388dc273051defa88756df26e95:1-03 Good Times Bad Times.mp3 <file>
+b5d7cfe9808c7fc0dedbd656d44e4c56159cbd3c2ed963bb:1-15 Stairway to Heaven.mp3 <file>
+79c87e3c49ffdac175c95aae071f63d3a9efdf2ddb84998c:08.Batmilk.ogg <file>
+-- Press q then enter to stop or enter for more...
+q
+/archive-tag/history/2015-06-12 22:53:13/manifest> ls -ll 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
+-r--------     -     - [2015-04-13 21:46:39] -/-: 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
+key: #f
+contents: "7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
+import-path: "/home/alaric/archive/sorted-music/Led Zeppelin/Led Zeppelin/04 Dazed and Confused.mp3"
+filename: "04 Dazed and Confused.mp3"
+dc:format: "audio/mpeg"
+dc:publisher: "Atlantic"
+dc:subject: "Classic Rock"
+dc:title: "Dazed and Confused"
+dc:creator: "Led Zeppelin"
+dc:created: "1982"
+dc:contributor: "Led Zeppelin"
+set:title: "Led Zeppelin"
+set:index: 4
+set:size: 9
+superset:index: 1
+superset:size: 1
+ctime: 1428957999.0
+file-size: 15448903
+
+ +

Searching

+
+However, the explore interface to an archive is far from pleasant. You
+need to go to the correct import, find your file by name, and then
+identify it with a big long name composed of its hash and the original
+filename in order to view its properties or extract it.
+
+I hope to add property-based searching to explore mode in future
+(which is why you need to go into a history directory
+within the archive directory; other ways of exploring the archive
+will appear alongside it). This will be particularly useful when the
+explore-mode virtual filesystem is mounted over 9P!
+
+However, even that interface, being constrained to look like a
+filesystem, will be limited. The ugarit command-line tool
+provides a very powerful search interface that exposes the full power
+of the archive metadata.
+

Metadata filters

+ +Files (and directories) in an archive can be searched for using +"metadata filters", which are descriptions of what you're looking for +that the computer can understand. They are represented as Scheme +s-expressions, and can be made up of the following components: + +
+
#t
+
This filter matches everything. It's not very useful.
+ +
#f
+
This filter matches nothing. It's not very useful.
+ +
(and FILTER FILTER...)
+
This filter matches files for which all of the inner filters match.
+ +
(or FILTER FILTER...)
+
This filter matches files for which any of the inner filters match.
+ +
(not FILTER)
+
This filter matches files which do not match the inner filter.
+ +
(= ($ PROP) VALUE)
+
This filter matches files which have the given +PROPerty equal to that VALUE in their metadata.
+ +
(= key HASH)
+
This filter matches the file with the given hash.
+ +
(= ($import PROP) VALUE)
+
This filter matches files which have the given +PROPerty equal to that VALUE in the metadata +of the import that last imported them.
+
+ +
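+
+These components nest freely. For instance, a filter matching files
+credited to Led Zeppelin but not filed under Classic Rock (the
+property names here are simply those the manifest maker produces)
+might be:
+
+(and (or (= ($ dc:creator) "Led Zeppelin")
+         (= ($ dc:contributor) "Led Zeppelin"))
+     (not (= ($ dc:subject) "Classic Rock")))
+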

Searching an archive

+ +For a start, you can search for files matching a given metadata filter +in a given archive. This is done with: + +
$ ugarit search <ugarit.conf> <archive tag> <filter>
+ +For instance, let's look for music by Led Zeppelin: + +
$ ugarit search ugarit.conf music '(or
+   (= ($ dc:creator) "Led Zeppelin")
+   (= ($ dc:contributor) "Led Zeppelin"))'
+ +The result looks like the explore-mode view of an archive manifest, +listing the file's hash followed by its title and extension: + + +7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 +834a1619a59835e0c27b22801e3c829b40be583dadd19770:2-08 No Quarter.mp3 +9e8bc4954838bd9c671f275eb48595089257185750d63894:1-12 I Can't Quit You Baby.mp3 +6742b3bebcdd9cae5ec5403c585935403fa74d16ed076cf2:02 Friends (1).mp3 +07d161f4bd684e283f7f2cf26e0b732157a8e95ef66939c3:05 Carouselambra.mp3 +[...] + + +What of all our lovely metadata? You can view that if you add the word +"verbose" to the end of the command line, which allows you to specify +alternate output formats: + +
$ ugarit search ugarit.conf music '(or
+   (= ($ dc:creator) "Led Zeppelin")
+   (= ($ dc:contributor) "Led Zeppelin"))' verbose
+ +Now the output looks like: + + +object a444ff6ef807b080b536155f58d246d633cab4a0eabef5bf + (ctime = 1428958660.0) + (dc:contributor = "Led Zeppelin") + (dc:created = "2008") + (dc:creator = "Led Zeppelin") +[... all the usual file properties omitted ...] + import a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c + (stats = ((blocks-stored . 2046) (bytes-stored . 1815317503) (blocks-skipped . 9) (bytes-skipped . 8388608) (file-cache-hits . 0) (file-cache-bytes . 0))) + (log = "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a") +[... all the usual import properties omitted ...] +object b4cadf48b2c07ccf0303fc4064b292cb222980b0d4223641 + (ctime = 1428958673.0) + (dc:contributor = "Led Zeppelin") + (dc:created = "2008") + (dc:creator = "Led Zeppelin") + (dc:creator = "Jimmy Page/John Paul Jones/Robert Plant") +[...and so on...] + + +As you can see, it lists the hash of each file, its metadata, the hash +of the import that last imported it, and the metadata of that import. + +That's quite verbose, so you'd probably be wanting to take that as +input to another program to do something nicer with it. But it's laid +out for human reading, not for machine parsing. Thankfully, we have +other formats for that, alist and +alist-with-imports. + +Try this: + +
$ ugarit search ugarit.conf music '(or
+   (= ($ dc:creator) "Led Zeppelin")
+   (= ($ dc:contributor) "Led Zeppelin"))' alist
+ +This outputs one Scheme s-expression list per match, the first element +of which is the hash as a string, the rest of which is an alist of properties: + + +("7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382" + (ctime . 1428957999.0) + (dc:contributor . "Led Zeppelin") + (dc:created . "1982") + (dc:creator . "Led Zeppelin") +[... elided file properties ...] + (superset:index . 1) + (superset:size . 1)) +("77c960d09eb21ed72e434ddcde0bd3781a4f3d6ee7a6eb66" + (ctime . 1428958981.0) + (dc:contributor . "Led Zeppelin") +[...] + + +
$ ugarit search ugarit.conf music '(or
+   (= ($ dc:creator) "Led Zeppelin")
+   (= ($ dc:contributor) "Led Zeppelin"))' alist-with-imports
+
+This outputs one s-expression list per match, with four
+elements. The first is the key string, the second is an alist of file
+properties, the third is the import's hash, and the last is an alist
+containing the import's properties. It looks like:
+
+
+("64fa08a0080aee6ef501c408fd44dfcc634cfcafd8006fc4"
+ ((ctime . 1428958683.0)
+ (dc:contributor . "Led Zeppelin")
+ (dc:created . "2008")
+ (dc:creator . "Led Zeppelin")
+[... elided file properties ...]
+ (superset:index . 1)
+ (superset:size . 1))
+ "a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c"
+ ((stats (blocks-stored . 2046)
+ (bytes-stored . 1815317503)
+[... elided manifest properties ...]
+ (manifest-path . "test.manifest")))
+("4cd56f916a63399b252976e842dcae0b87f058b5a60c93a4"
+ ((ctime . 1428958437.0)
+ (dc:contributor . "Led Zeppelin")
+[...]
+
+
+And finally, you might just want to get the hashes of matching files
+(which are particularly useful for extraction operations, which we'll
+come to next). To do this, specify a format of "keys", which outputs
+one line per match, containing just the hash:
+
$ ugarit search ugarit.conf music '(or
+   (= ($ dc:creator) "Led Zeppelin")
+   (= ($ dc:contributor) "Led Zeppelin"))' keys
+ + +ce6f6484337de772de9313038cb25d1b16e28028136cc291 +6af5c664cbfa1acb22a377e97aee35d94c0fc003d239dd0c +92e91e79b384478b5aab31bf1b2ff9e25e7e2c4b48575185 +6ddb9a41d4968468a904f05ecf7e0e73d2c7c7ad76bc394b +a074dddcef67cd93d92c6ffce845894aa56594674023f6e1 +4f65f735bbb00a6fda4bc887b370b3160f55e5e07ec37ffa +97cc8b8ba70c39387fc08ef62311b751aea4340d636eb421 +72358dbe3eb60da42eadcf6de325b2a6686f4e17ea41fa60 +[...] + + +However, to write filter expressions, you need to know what properties +you have available to search on. You might remember, or go for +standard properties, or look at existing files in verbose mode to find +some; but you can also just ask Ugarit what properties it has in an +archive, like so: + +
$ ugarit search-props <ugarit.conf> <archive tag>
+ +You can even ask what properties are available for files matching an +existing filter: + +
$ ugarit search-props <ugarit.conf> <archive tag> <filter>
+ +This is useful if you're interested in further narrowing down a +filter, and so only care about properties that files already matching +that filter have. + +For a bunch of music files imported with the Ugarit Manifest Maker, +you can expect to see something like this: + + +ctime +dc:contributor +dc:created +dc:creator +dc:format +dc:publisher +dc:subject +dc:title +file-size +filename +import-path +mtime +set:index +set:size +set:title +superset:index +superset:size + + +Now you know what properties to search, next you'll be wanting to know +what values to look for. Again, Ugarit has a command to query the +available values of any given property: + +
$ ugarit search-values <ugarit.conf> <archive tag> <property>
+ +And you can limit that just to files matching a given filter: + +
$ ugarit search-values <ugarit.conf> <archive tag> <filter> <property>
+ +The resulting list of values is ordered by popularity, so the most +widely-used values will be listed first. Let's see what genres of +music were in my sample of music files I imported: + +
$ ugarit search-values test.conf archive-tag dc:subject
+ +The result is: + + +Classic Rock +Alternative & Punk +Electronic +Trip-Hop + + +Ok, let's now use a filter to find out what artists +(dc:creator) I have that made Trip-Hop music (what even +IS that?): + +
$ ugarit search-values test.conf archive-tag \
+    '(= ($ dc:subject) "Trip-Hop")' \
+    dc:creator
+ +The result is: + +Portishead + +Ah, OK, now I know what "Trip-Hop" is. + +

Extracting

+ +All this searching is lovely, but what it gets us, in the end, is a +bunch of file hashes. Perhaps we might want to actually play some +music, or look at a photo, or something. To do that, we need to +extract from the archive. + +We've already seen the contents of an archive in the explore mode +virtual filesystem, so we could go into the archive history, find the +import, go into the manifest, pick the file out there, and use +get to extract it, but that would be yucky. Thankfully, +we have a command-line interface to get things from archives, in one +of two ways. + +Firstly, we can extract a file (or a directory tree) from an archive, +out into the local filesystem: + +
$ ugarit archive-extract <ugarit.conf> <archive tag> <hash> <target>
+ +The "target" is the name to give it in the local filesystem. We could +pull out that Led Zeppelin song from our search results above, like so: + +
$ ugarit archive-extract test.conf archive-tag \
+    ce6f6484337de772de9313038cb25d1b16e28028136cc291 foo.mp3
+ +We now have a foo.mp3 file in the current directory. + +However, sometimes it would be nicer to have it streamed to standard +output, which can be done like so: + +
$ ugarit archive-stream <ugarit.conf> <archive tag> <hash>
+ +This lets us write a command such as: + +
$ ugarit archive-stream test.conf archive-tag \
+    ce6f6484337de772de9313038cb25d1b16e28028136cc291 | mpg123 -
+ +...to play it in real time. + ADDED docs/dot-ugarit.wiki Index: docs/dot-ugarit.wiki ================================================================== --- docs/dot-ugarit.wiki +++ docs/dot-ugarit.wiki @@ -0,0 +1,70 @@ +

.ugarit files

+
+By default, Ugarit will vault everything it finds in the filesystem
+tree you tell it to snapshot. However, this might not always be
+desired; so we provide the facility to override this with .ugarit
+files, or global rules in your .conf file.
+
+Note: All of this only applies to snapshots. Archive mode imports are
+not affected by .ugarit files, or global rules.
+
+Note: The syntax of these files is provisional; the current syntax is
+ugly, and I want to experiment with usability. So please don't be
+surprised if the format changes in incompatible ways in subsequent
+versions!
+
+In quick summary, if you want to ignore all files or directories
+matching a glob in the current directory and below, put the following
+in a .ugarit file in that directory:
+
(* (glob "*~") exclude)
+
+You can write quite complex expressions as well as just globs. The
+full set of rules is:
+
+ * (glob "pattern") matches files and directories whose names
+   match the glob pattern
+
+ * (name "name") matches files and directories with exactly that
+   name (useful for files whose names contain glob metacharacters,
+   such as a file actually called *)
+
+ * (modified-within number seconds) matches files and
+   directories modified within the given number of seconds
+
+ * (modified-within number minutes) matches files and
+   directories modified within the given number of minutes
+
+ * (modified-within number hours) matches files and directories
+   modified within the given number of hours
+
+ * (modified-within number days) matches files and directories
+   modified within the given number of days
+
+ * (not rule) matches files and directories that do not match
+   the given rule
+
+ * (and rule rule...) matches files and directories that match
+   all the given rules
+
+ * (or rule rule...) matches files and directories that match
+   any of the given rules
+
+Also, you can override a previous exclusion with an explicit include
+in a lower-level directory:
+
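+
+These rules can be combined, too. For instance (an illustrative
+sketch, not a recommendation), to skip log files that are still being
+actively written to, you might exclude those modified within the last
+hour:
+
+(* (and (glob "*.log") (modified-within 1 hours)) exclude)
+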
(* (glob "*~") include)
+ +You can bind rules to specific directories, rather than to "this +directory and all beneath it", by specifying an absolute or relative +path instead of the `*`: + +
("/etc" (name "passwd") exclude)
+ +If you use a relative path, it's taken relative to the directory of +the .ugarit file. + +You can also put some rules in your .conf file, although relative +paths are illegal there, by adding lines of this form to the file: + +
(rule * (glob "*~") exclude)
+ ADDED docs/faq.wiki Index: docs/faq.wiki ================================================================== --- docs/faq.wiki +++ docs/faq.wiki @@ -0,0 +1,40 @@ +

Questions and Answers

+ +

What happens if a snapshot is interrupted?

+
+Nothing! Whatever blocks have been uploaded will remain in the vault,
+but the snapshot is only added to the tag once the entire filesystem
+has been snapshotted. So just start the snapshot again. Any files that
+have already been uploaded will not need to be uploaded again, so the
+second snapshot should proceed quickly to the point where it failed
+before, and continue from there.
+
+Unless the vault ends up with a partially-uploaded corrupted block
+due to being interrupted during upload, you'll be fine. The filesystem
+backend has been written to avoid this by writing the block to a file
+with the wrong name, then renaming it to the correct name when it's
+entirely uploaded.
+
+Actually, there is *one* caveat: blocks that were uploaded, but never
+make it into a finished snapshot, will be marked as "referenced", but
+there's no snapshot to delete to un-reference them, so they'll never
+be removed when you delete snapshots. (Not that snapshot deletion is
+implemented yet, mind). If this becomes a problem for people, we could
+write a "garbage collect" tool that regenerates the reference counts
+in a vault, leading to unused blocks (with a zero refcount) being
+unlinked.
+

Should I share a single large vault between all my filesystems?

+ +I think so. Using a single large vault means that blocks shared +between servers - eg, software installed from packages and that sort +of thing - will only ever need to be uploaded once, saving storage +space and upload bandwidth. However, do not share a vault between +servers that do not mutually trust each other, as they can all update +the same tags, so can meddle with each other's snapshots - and read +each other's snapshots. + +

CAVEAT

+
+It's not currently safe to have multiple concurrent snapshots to the
+same splitlog backend; this will soon be fixed, however.

Installation

+ +Install [http://www.call-with-current-continuation.org/|Chicken Scheme] using their [http://wiki.call-cc.org/man/4/Getting%20started|installation instructions]. + +Ugarit can then be installed by typing (as root): + + chicken-install ugarit + +See the [http://wiki.call-cc.org/manual/Extensions#chicken-install-reference|chicken-install manual] for details if you have any trouble, or wish to install into your home directory. + +

Setting up a vault

+ +Firstly, you need to know the vault identifier for the place you'll +be storing your vaults. This depends on your backend. The vault +identifier is actually the command line used to invoke the backend for +a particular vault; communication with the vault is via standard +input and output, which is how it's easy to tunnel via ssh. + +

Local filesystem backends

+ +These backends use the local filesystem to store the vaults. Of +course, the "local filesystem" on a given server might be an NFS mount +or mounted from a storage-area network. + +

Logfile backend

+ +The logfile backend works much like the original Venti system. It's +append-only - you won't be able to delete old snapshots from a logfile +vault, even when I implement deletion. It stores the vault in two +sets of files; one is a log of data blocks, split at a specified +maximum size, and the other is the metadata: an sqlite database used +to track the location of blocks in the log files, the contents of +tags, and a count of the logs so a filename can be chosen for a new one. + +To set up a new logfile vault, just choose where to put the two +parts. It would be nice to put the metadata file on a different +physical disk to the logs directory, to reduce seeking. If you only +have one disk, you can put the metadata file in the log directory +("metadata" is a good name). + +You can then refer to it using the following vault identifier: + + "backend-fs splitlog ...log directory... ...metadata file..." + +
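+
+For instance, assuming the hypothetical paths /backup/log for the log
+directory and /backup/log/metadata for the metadata file, the vault
+identifier would be:
+
+ "backend-fs splitlog /backup/log /backup/log/metadata"
+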

SQLite backend

+ +The sqlite backend works a bit like a +[http://www.fossil-scm.org/|Fossil] repository; the storage is +implemented as a single file, which is actually an SQLite database +containing blocks as blobs, along with tags and configuration data in +their own tables. + +It supports unlinking objects, and the use of a single file to store +everything is convenient; but storing everything in a single file with +random access is slightly riskier than the simple structure of an +append-only log file; it is less tolerant of corruption, which can +easily render the entire storage unusable. Also, that one file can get +very large. + +SQLite has internal limits on the size of a database, but they're +quite large - you'll probably hit a size limit at about 140 +terabytes. + +To set up an SQLite storage, just choose a place to put the file. I +usually use an extension of .vault; note that SQLite will +create additional temporary files alongside it with additional +extensions, too. + +Then refer to it with the following vault identifier: + + "backend-sqlite ...path to vault file..." + +

Filesystem backend

+ +The filesystem backend creates vaults by storing each block or tag +in its own file, in a directory. To keep the objects-per-directory +count down, it'll split the files into subdirectories. Because of +this, it uses a stupendous number of inodes (more than the filesystem +being backed up). Only use it if you don't mind that; splitlog is much +more efficient. + +To set up a new filesystem-backend vault, just create an empty +directory that Ugarit will have write access to when it runs. It will +probably run as root in order to be able to access the contents of +files that aren't world-readable (although that's up to you), so +unless you access your storage via ssh or sudo to use another user to +run the backend under, be careful of NFS mounts that have +maproot=nobody set! + +You can then refer to it using the following vault identifier: + + "backend-fs fs ...path to directory..." + +

Proxying backends

+ +These backends wrap another vault identifier which the actual +storage task is delegated to, but add some value along the way. + +

SSH tunnelling

+
+It's easy to access a vault stored on a remote server. The caveat
+is that the backend then needs to be installed on the remote server!
+Since vaults are accessed by running the supplied command, and then
+talking to them via stdin and stdout, the vault identifier need
+only be:
+
+ "ssh ...hostname... '...remote vault identifier...'"
+
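+
+For instance, to reach a splitlog vault on a remote host (the
+hostname and paths here are hypothetical), the full identifier might
+be:
+
+ "ssh backup.example.com 'backend-fs splitlog /backup/log /backup/log/metadata'"
+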

Cache backend

+
+The cache backend is used to cache a list of what blocks exist in the
+proxied backend, so that it can answer queries as to the existence of
+a block rapidly, even when the proxied backend is on the end of a
+high-latency link (eg, the Internet). This should speed up snapshots,
+as existing files are identified by asking the backend if the vault
+already has them.
+
+The cache backend works by storing the cache in a local sqlite
+file. Given a place for it to store that file, usage is simple:
+
+ "backend-cache ...path to cachefile... '...proxied vault identifier...'"
+
+The cache file will be automatically created if it doesn't already
+exist, so make sure there's write access to the containing directory.
+
+ - WARNING - WARNING - WARNING - WARNING - WARNING - WARNING -
+
+If you use a cache on a vault shared between servers, make sure
+that you either:
+
+ * Never delete things from the vault
+
+or
+
+ * Make sure all access to the vault is via the same cache
+
+If a block is deleted from a vault, and a cache on that vault is
+not aware of the deletion (as it did not go "through" the caching
+proxy), then the cache will record that the block exists in the
+vault when it does not. This will mean that if a snapshot is made
+through the cache that would use that block, then it will be assumed
+that the block already exists in the vault when it does
+not. Therefore, the block will not be uploaded, and a dangling
+reference will result!
+
+Some setups which *are* safe:
+
+ * A single server using a vault via a cache, not sharing it with
+   anyone else.
+
+ * A pool of servers using a vault via the same cache.
+
+ * A pool of servers using a vault via one or more caches, and
+   maybe some not via the cache, where nothing is ever deleted from
+   the vault.
+
+ * A pool of servers using a vault via one cache, and maybe some
+   not via the cache, where deletions are only performed on servers
+   using the cache, so the cache is always aware.
+

Writing a ugarit.conf

+
+ugarit.conf should look something like this:
+
+(storage <vault identifier>)
+(hash tiger "<salt>")
+[double-check]
+[(compression [deflate|lzma])]
+[(encryption aes <key>)]
+[(cache "<path>")|(file-cache "<path>")]
+[(rule ...)]
+
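+
+For instance, a complete ugarit.conf for a compressed, encrypted
+splitlog vault might look like this (the paths and salt here are
+examples only; generate your own salt from a secure entropy source):
+
+(storage "backend-fs splitlog /backup/log /backup/log/metadata")
+(hash tiger "aFKLYLYEaDVWbbjkCK5N3kAjyp3zJre")
+(compression lzma)
+(encryption aes (32 prompt))
+(file-cache "/var/ugarit/cache")
+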

Hashing

+ +The hash line chooses a hash algorithm. Currently Tiger-192 +(tiger), SHA-256 (sha256), SHA-384 +(sha384) and SHA-512 (sha512) are supported; +if you omit the line then Tiger will still be used, but it will be a +simple hash of the block with the block type appended, which reveals +to attackers what blocks you have (as the hash is of the unencrypted +block, and the hash is not encrypted). This is useful for development +and testing or for use with trusted vaults, but not advised for use +with vaults that attackers may snoop at. Providing a salt string +produces a hash function that hashes the block, the type of block, and +the salt string, producing hashes that attackers who can snoop the +vault cannot use to find known blocks (see the "Security model" +section below for more details). + +I would recommend that you create a salt string from a secure entropy +source, such as: + +
dd if=/dev/random bs=1 count=64 | base64 -w 0
+ +Whichever hash function you use, you will need to install the required +Chicken egg with one of the following commands: + +
chicken-install -s tiger-hash  # for tiger
+chicken-install -s sha2        # for the SHA hashes
+ +

Compression

+ +lzma is the recommended compression option for +low-bandwidth backends or when space is tight, but it's very slow to +compress; deflate or no compression at all are better for fast local +vaults. To have no compression at all, just remove the +(compression ...) line entirely. Likewise, to use +compression, you need to install a Chicken egg: + +
chicken-install -s z3       # for deflate
+chicken-install -s lzma     # for lzma
+ +WARNING: The lzma egg is currently rather difficult to install, and +needs rewriting to fix this problem. + +

Encryption

+ +Likewise, the (encryption ...) line may be omitted to have no +encryption; the only currently supported algorithm is aes (in CBC +mode) with a key given in hex, as a passphrase (hashed to get a key), +or a passphrase read from the terminal on every run. The key may be +16, 24, or 32 bytes for 128-bit, 192-bit or 256-bit AES. To specify a +hex key, just supply it as a string, like so: + +
(encryption aes "00112233445566778899AABBCCDDEEFF")
+ +...for 128-bit AES, + +
(encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")
+ +...for 192-bit AES, or + +
(encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")
+ +...for 256-bit AES. + +Alternatively, you can provide a passphrase, and specify how large a +key you want it turned into, like so: + +
(encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))
+ +I would recommend that you generate a long passphrase from a secure +entropy source, such as: + +
dd if=/dev/random bs=1 count=64 | base64 -w 0
+ +Finally, the extra-paranoid can request that Ugarit prompt for a +passphrase on every run and hash it into a key of the specified +length, like so: + +
(encryption aes ([16|24|32] prompt))
+
+(note the lack of quotes around prompt, distinguishing it from a passphrase)
+
+Please read the [./security.wiki|Security model] documentation for
+details on the implications of different encryption setups.
+
+Again, as it is an optional feature, to use encryption, you must
+install the appropriate Chicken egg:
+
+
chicken-install -s aes
+ +

Caching

+
+Ugarit can use a local cache to speed up various operations. If a path
+to a file is provided through the cache or
+file-cache directives, then a file will be created at
+that location and used as a cache. If not, then a default path of
+~/.ugarit-cache will be used instead.
+
+WARNING: If you use multiple different vaults from the same UNIX
+account, and the same tag names are used in those different vaults,
+and you use the default cache path (or explicitly specify cache paths
+that point to the same file), you will get a somewhat confused
+cache. The effects of this will be annoying (searches finding things
+that then can't be fetched) rather than damaging, but it's still best
+avoided!
+
+The cache is used to cache snapshot records and archive import
+records. This is used by operations that extract snapshot history and
+archive objects; snapshots are stored in a linked list of snapshot
+objects, each referring to the previous snapshot. Therefore, reading
+the history of a snapshot tag requires reading many objects from the
+storage, which can be time-consuming for a remote storage! Similarly,
+archives are represented as a linked list of imports, and searching
+for an object in the archive can involve traversing the chain of
+imports until a match is found (and then searching on until the end to
+see if any further matches can be found!). The cache is even more
+important for archive imports, as it not only keeps a local copy of
+all the import information, it also records the "current" metadata of
+every object in the archive (so that we don't need to search through
+superseded previous versions of the metadata of an object when looking
+for something), and uses B-tree indexes to enable fast searching of
+the cached metadata.
+
+If you configure the cache path with file-cache rather
+than just cache, then as well as the snapshot/archive
+metadata caching, you will also enable file hash caching. 
+ +This significantly speeds up subsequent snapshots of a filesystem +tree. The file cache maps filenames to (mtime,size,hash) tuples; as it +scans the filesystem, if it finds a file in the cache and the mtime +and size have not changed, it will assume it is already stored under +the specified hash. This saves it from having to read the entire file +to hash it and then check if the hash is present in the vault. In +other words, if only a few files have changed since the last snapshot, +then snapshotting a directory tree becomes an O(N) operation, where N +is the number of files, rather than an O(M) operation, where M is the +total size of files involved. + +WARNING: If you use a file cache, and a file is cached in it but then +subsequently deleted from the vault, Ugarit will fail to re-upload it +at the next snapshot. If you are using a file cache and you go +deleting things from your vault (should that be implemented in +future), you'll want to flush the cache afterwards. We might implement +automatic removal of deleted files from the local cache, but file +caches on other Ugarit installations that use the same vault will not +be aware of the deletion. + +

Other options

+ +double-check, if present, causes Ugarit to perform extra +internal consistency checks during backups, which will detect bugs but +may slow things down. + +

Example

+ +For example: + +
(storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata'")
+(hash tiger "i3HO7JeLCSa6Wa55uqTRqp4jppUYbXoxme7YpcHPnuoA+11ez9iOIA6B6eBIhZ0MbdLvvFZZWnRgJAzY8K2JBQ")
+(encryption aes (32 "FN9m34J4bbD3vhPqh6+4BjjXDSPYpuyskJX73T1t60PP0rPdC3AxlrjVn4YDyaFSbx5WRAn4JBr7SBn2PLyxJw"))
+(compression lzma)
+(file-cache "/var/ugarit/cache")
+ +Be careful to put a set of parentheses around each configuration +entry. White space isn't significant, so feel free to indent things +and wrap them over lines if you want. + +Keep copies of this file safe - you'll need it to do extractions! +Print a copy out and lock it in your fire safe! Ok, currently, you +might be able to recreate it if you remember where you put the +storage, but encryption keys and hash salts are harder to remember... ADDED docs/intro.wiki Index: docs/intro.wiki ================================================================== --- docs/intro.wiki +++ docs/intro.wiki @@ -0,0 +1,375 @@ +

About Ugarit

+ +

What's content-addressable storage?

+
+Traditional backup systems work by storing copies of your files
+somewhere. Perhaps they go onto tapes, or perhaps they're in archive
+files written to disk. They will either be full dumps, containing a
+complete copy of your files, or incrementals or differentials, which
+only contain files that have been modified since some point. This
+saves making repeated copies of unchanging files, but it means that to
+do a full restore, you need to start by extracting the last full dump
+then applying one or more incrementals, or the latest differential,
+to get the latest state.
+
+Not only do differentials and incrementals let you save space, they
+also give you a history - you can restore up to a previous point in
+time, which is invaluable if the file you want to restore was deleted
+a few backup cycles ago!
+
+This technology was developed when the best storage technology for
+backups was magnetic tape, because each dump is written sequentially
+(and restores are largely sequential, unless you're skipping bits to
+pull out specific files).
+
+However, these days, random-access media such as magnetic disks and
+SSDs are cheap enough to compete with magnetic tape for long-term bulk
+storage (especially when one considers the cost of a tape drive or
+two). And having fast random access means we can take advantage of
+different storage techniques.
+
+A content-addressable store is a key-value store, except that the keys
+are always computed from the values. When a given object is stored, it
+is hashed, and the hash used as the key. This means you can never
+store the same object twice; the second time you'll get the same hash,
+see the object is already present, and re-use the existing
+copy. Therefore, you get deduplication of your data for free.
+
+But, I hear you ask, how do you find things again, if you can't choose
+the keys?
+
+When an object is stored, you need to record the key so you can find
+it again later. 
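The idea can be illustrated with a few lines of Python (a sketch of the concept only; Ugarit itself is written in Chicken Scheme):

```python
import hashlib

# A minimal content-addressable store, as described above: keys are
# always computed from values, so identical data is only ever stored once.

class ContentStore:
    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:   # the second store of the same data is free
            self.blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self.blocks[key]
```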
In Ugarit, everything is stored in a tree-like
+directory structure. Files are uploaded and their hashes obtained, and
+then a directory object is constructed containing a list of the files
+in the directory, and listing the keys of the Ugarit objects storing
+the contents of each file. This directory object itself has a hash,
+which is stored inside the directory entry in the parent directory,
+and so on up to the root. The root of a tree stored in a Ugarit vault
+has no parent directory to contain it, so at that point, we store the
+key of the root in a named "tag" that we can look up by name when we
+want it.
+
+Therefore, everything in a Ugarit vault can be found by starting with
+a named tag and retrieving the object whose key it contains, then
+finding keys inside that object and looking up the objects they refer
+to, until we find the object we want.
+
+When you use Ugarit to back up your filesystem, it uploads a complete
+snapshot of every file in the filesystem, like a full dump. But
+because the vault is content-addressed, it automatically avoids
+uploading anything it already has a copy of, so all we upload is an
+incremental dump - but in the vault, it looks like a full dump, and so
+can be restored on its own without having to restore a chain of incrementals.
+
+Also, the same storage can be shared between multiple systems that all
+back up to it - and the incremental upload algorithm will mean that
+any files shared between the servers will only need to be uploaded
+once. If you back up a complete server, then go and back up another
+that is running the same distribution, then all the files in /bin
+and so on that are already in the storage will not need to be backed
+up again; the system will automatically spot that they're already
+there, and not upload them again.
+
+As well as storing backups of filesystems, Ugarit can also be used as
+the primary storage for read-only files, such as music and photos. 
The +principle is exactly the same; the only difference is in how the files +are organised - rather than as a directory structure, the files are +referenced from metadata objects that specify information about the +file (so it can be found) and a reference to the contents. Sets of +metadata objects are pointed to by tags as well, so they can also be +found. + +
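The tag-to-object lookup described above can be sketched like so (hedged Python with an invented object format; Ugarit's actual representation differs):

```python
import hashlib
import json

# Hedged sketch of the lookup scheme: everything is reached from a named
# tag holding the key of a root object, whose contents name the keys of
# further objects, down to the file contents.

blocks = {}   # the content-addressed store: hash -> serialised object
tags = {}     # the named entry points: tag name -> hash of a root object

def store(obj) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    blocks[key] = data        # storing the same object twice is harmless
    return key

def lookup(tag, *path):
    key = tags[tag]                          # start from the named tag
    for name in path:                        # keys found inside each object
        key = json.loads(blocks[key])[name]  # lead to further objects
    return json.loads(blocks[key])

# A tiny vault: a file, a directory listing it, and a tag naming the root.
tags["backups/home"] = store({"docs": store({"notes.txt": store("hello")})})
```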

So what's that mean in practice?

+ +

Backups

+You can run Ugarit to back up any number of filesystems to a shared
+storage area (known as a vault), and on every backup, Ugarit
+will only upload files or parts of files that aren't already in the
+vault - be they from the previous snapshot, earlier snapshots,
+snapshots of entirely unrelated filesystems, etc. Every time you do a
+snapshot, Ugarit builds an entire complete directory tree of the
+snapshot in the vault - but reusing any parts of files, files, or
+entire directories that already exist anywhere in the vault, and
+only uploading what doesn't already exist.
+
+The support for parts of files means that, in many cases, gigantic
+files like database tables and virtual disks for virtual machines will
+not need to be uploaded entirely every time they change, as the
+changed sections will be identified and uploaded.
+
+Because a complete directory tree exists in the vault for any
+snapshot, the extraction algorithm is incredibly simple - and,
+therefore, incredibly reliable and fast. Simple, reliable, and fast
+are just what you need when you're trying to reconstruct the
+filesystem of a live server.
+
+Also, it means that you can do lots of small snapshots. If you run a
+snapshot every hour, then only a megabyte or two might have changed in
+your filesystem, so you only upload a megabyte or two - yet you end up
+with a complete history of your filesystem at hourly intervals in the
+vault.
+
+Conventional backup systems usually either store a full backup then
+incrementals to their archives, meaning that doing a restore involves
+reading the full backup then reading every incremental since and
+applying them - so to do a restore, you have to download *every
+version* of the filesystem you've ever uploaded, or you have to do
+periodic full backups (even though most of your filesystem won't have
+changed since the last full backup) to reduce the number of
+incrementals required for a restore. 
Better results are had from +systems that use a special backup server to look after the archive +storage, which accept incremental backups and apply them to the +snapshot they keep in order to maintain a most-recent snapshot that +can be downloaded in a single run; but they then restrict you to using +dedicated servers as your archive stores, ruling out cheaply scalable +solutions like Amazon S3, or just backing up to a removable USB or +eSATA disk you attach to your system whenever you do a backup. And +dedicated backup servers are complex pieces of software; can you rely +on something complex for the fundamental foundation of your data +security system? + +

Archives

+ +You can also use Ugarit as the primary storage for read-only +files. You do this by creating an archive in the vault, and importing +batches of files into it along with their metadata (arbitrary +attributes, such as "author", "creation date" or "subject"). + +Just as you can keep snapshots of multiple systems in a Ugarit vault, +you can also keep multiple separate archives, each identified by a +named tag. + +However, as it's all within the same vault, the usual de-duplication +rules apply. The same file may be in multiple archives, with different +metadata in each, as the file contents and metadata are stored +separately (and associated only within the context of each +archive). And, of course, the same file may appear in snapshots and in +archives; perhaps a file was originally downloaded into your home +directory, where it was backed up into Ugarit snapshots, and then you +imported it into your archive. The archive import would not have had +to re-upload the file, as its contents would have already been found +in the vault, so all that needs to be uploaded is the metadata. + +Although we have mainly spoken of storing files in archives, the +objects in archives can be files or directories full of files, as +well. This is useful for storing MacOS-style files that are actually +directories, or for archiving things like completed projects for +clients, which can be entire directory structures. + +

System Requirements

+
+Ugarit should run on any POSIX-compliant system that can run
+[http://www.call-with-current-continuation.org/|Chicken Scheme]. It
+stores and restores all the file attributes reported by the stat
+system call - POSIX mode permissions, UID, GID, mtime, and optionally
+atime and ctime (although the ctime cannot be restored due to POSIX
+restrictions). Ugarit will store files, directories, device and
+character special files, symlinks, and FIFOs.
+
+Support for extended filesystem attributes - ACLs, alternative
+streams, forks and other metadata - is possible, due to the extensible
+directory entry format; support for such metadata will be added as
+required.
+
+Currently, only local filesystem-based vault storage backends are
+complete: these are suitable for backing up to a removable hard disk
+or a filesystem shared via NFS or other protocols. However, the
+backend can be accessed via an SSH tunnel, so a remote server you are
+able to install Ugarit on to run the backends can be used as a remote
+vault.
+
+The next backends to be implemented will be one for Amazon S3,
+and an SFTP backend for storing vaults anywhere you can ssh
+to. Other backends will be implemented on demand; a vault can, in
+principle, be stored on anything that can store files by name, report
+on whether a file already exists, and efficiently download a file by
+name. This rules out magnetic tapes due to their requirement for
+sequential access.
+
+Although we need to trust that a backend won't lose data (for now), we
+don't need to trust the backend not to snoop on us, as Ugarit
+optionally encrypts everything sent to the vault.
+
+

Terminology

+
+A Ugarit backend is the software module that handles backend
+storage. An actual storage area - managed by a backend - is called a
+storage, and is used to implement a vault; currently, every storage is
+a valid vault, but the planned future introduction of a distributed
+storage backend will enable multiple storages (which are not,
+themselves, valid vaults as they only contain some subset of the
+information required) to be combined into an aggregate storage, which
+then holds the actual vault. Note that the contents of a storage are
+purely a set of blocks, and a series of named tags containing
+references to them; the storage does not know the details of
+encryption and hashing, so cannot make any sense of its contents.
+
+For example, if you use the recommended "splitlog" filesystem backend,
+your vault might be /mnt/bigdisk on the server
+prometheus. The backend (which is compiled along with the
+other filesystem backends in the backend-fs binary) must
+be installed on prometheus, and Ugarit clients all over
+the place may then use it via ssh to prometheus. However,
+even with the filesystem backends, the actual storage might not be on
+prometheus where the backend runs -
+/mnt/bigdisk might be an NFS mount, or a mount from a
+storage-area network. This ability to delegate via SSH is particularly
+useful with the "cache" backend, which reduces latency by storing a
+cache of what blocks exist in a backend, thereby making it quicker to
+identify already-stored files; a cluster of servers all sharing the
+same vault might all use SSH tunnels to access an instance of the
+"cache" backend on one of them (using some local disk to store the
+cache), which proxies the actual vault storage to a vault on the other
+end of a high-latency Internet link, again via an SSH tunnel.
+
+A vault is where Ugarit stores backups (as chains of snapshots) and
+archives (as chains of archive imports). 
Backups and archives are +identified by tags, which are the top-level named entry points into a +vault. A vault is based on top of a storage, along with a choice of +hash function, compression algorithm, and encryption that are used to +map the logical world of snapshots and archive imports into the +physical world of blocks stored in the storage. + +A snapshot is a copy of a filesystem tree in the vault, with a header +block that gives some metadata about it. A backup consists of a number +of snapshots of a given filesystem. + +An archive import is a set of filesystem trees, each along with +metadata about it. Whereas a backup is organised around a series of +timed snapshots, an archive is organised around the metadata; the +filesystem trees in the archive are identified by their properties. + +

So what, exactly, is in a vault?

+ +A Ugarit vault contains a load of blocks, each up to a maximum size +(usually 1MiB, although other backends might impose smaller +limits). Each block is identified by the hash of its contents; this is +how Ugarit avoids ever uploading the same data twice, by checking to +see if the data to be uploaded already exists in the vault by +looking up the hash. The contents of the blocks are compressed and +then encrypted before upload. + +Every file uploaded is, unless it's small enough to fit in a single +block, chopped into blocks, and each block uploaded. This way, the +entire contents of your filesystem can be uploaded - or, at least, +only the parts of it that aren't already there! The blocks are then +tied together to create a snapshot by uploading blocks full of the +hashes of the data blocks, and directory blocks are uploaded listing +the names and attributes of files in directories, along with the +hashes of the blocks that contain the files' contents. Even the blocks +that contain lists of hashes of other blocks are subject to checking +for pre-existence in the vault; if only a few MiB of your +hundred-GiB filesystem has changed, then even the index blocks and +directory blocks are re-used from previous snapshots. + +Once uploaded, a block in the vault is never again changed. After all, +if its contents changed, its hash would change, so it would no longer +be the same block! However, every block has a reference count, +tracking the number of index blocks that refer to it. This means that +the vault knows which blocks are shared between multiple snapshots (or +shared *within* a snapshot - if a filesystem has more than one copy of +the same file, still only one copy is uploaded), so that if a given +snapshot is deleted, then the blocks that only that snapshot is using +can be deleted to free up space, without corrupting other snapshots by +deleting blocks they share. 
Keep in mind, however, that not all
+storage backends may support this - there are certain advantages to
+being an append-only vault. For a start, you can't delete something by
+accident! The supplied fs and sqlite backends support deletion, while
+the splitlog backend does not yet. However, the actual snapshot
+deletion command in the user interface hasn't been implemented yet
+either, so it's a moot point for now...
+
+Finally, the vault contains objects called tags. Unlike the blocks,
+the tags' contents can change, and they have meaningful names rather
+than being identified by hash. Tags identify the top-level blocks of
+snapshots within the system, from which (by following the chain of
+hashes down through the index blocks) the entire contents of a
+snapshot may be found. Unless you happen to have recorded the hash of
+a snapshot somewhere, the tags are where you find snapshots from when
+you want to do a restore.
+
+Whenever a snapshot is taken, as soon as Ugarit has uploaded all the
+files, directories, and index blocks required, it looks up the tag you
+have identified as the target of the snapshot. If the tag already
+exists, then the snapshot it currently points to is recorded in the
+new snapshot as the "previous snapshot"; the snapshot header,
+containing the previous snapshot's hash along with the date and time
+and any comments you provide for the snapshot, is then uploaded (as
+another block, identified by its hash). The tag is then updated to
+point to the new snapshot.
+
+This way, each tag actually identifies a chronological chain of
+snapshots. Normally, you would use a tag to identify a filesystem
+being backed up; you'd keep snapshotting the filesystem to the same
+tag, resulting in all the snapshots of that filesystem hanging from
+the tag. 
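The tag-and-chain mechanism can be sketched as follows (a hedged Python illustration with invented formats, not Ugarit's actual representation):

```python
import hashlib
import json

# Hedged sketch: each snapshot header records the previous snapshot's
# key, so a mutable tag identifies an immutable chain of snapshots.

blocks = {}  # immutable: hash -> block (never changed once stored)
tags = {}    # mutable: tag name -> hash of the newest snapshot header

def store(obj) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    blocks[key] = data
    return key

def snapshot(tag, root_key, comment):
    header = {"contents": root_key,
              "comment": comment,
              "previous": tags.get(tag)}  # None for the first snapshot
    tags[tag] = store(header)             # only the tag itself changes

def history(tag):
    key, comments = tags.get(tag), []
    while key is not None:                # walk the chain, newest first
        header = json.loads(blocks[key])
        comments.append(header["comment"])
        key = header["previous"]
    return comments
```

In this model, duplicating a tag's entry in `tags` under a new name is all it takes to fork the chain.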
But if you wanted to remember any particular snapshot +(perhaps if it's the snapshot you take before a big upgrade or other +risky operation), you can duplicate the tag, in effect 'forking' the +chain of snapshots much like a branch in a version control system. + +Archive imports cause the creation of one or more archive metadata +blocks, each of which lists the hashes of files or filesystem trees in +the archive, along with their metadata. Each import then has a single +archive import block pointing to the sequence of metadata blocks, and +pointing to the previous archive import block in that archive. The +same filesystem tree can be imported more than once to the same +archive, and the "latest" metadata always wins. + +Generally, you should create lots of small archives for different +categories of things - such as one for music, one for photos, and so +on. You might well create separate archives for the music collections +of different people in your household, unless they overlap, and +another for Christmas music so it doesn't crop up in random shuffle +play! It's easy to merge archives if you over-compartmentalise them, +but harder to split an archive if you find it too cluttered with +unrelated things. + +I've spoken of archive imports, and backup snapshots, each having a +"previous" reference to the last import or snapshot in the chain, but +it's actually more complex than that: they have an arbitrary list of +zero or more previous objects. As such, it's possible for several +imports or snapshots to have the same "previous", known as a "fork", +and it's possible to have an import or snapshot that merges multiple +previous ones. + +Forking is handy if you want to basically duplicate an archive, +creating two new archives with the same contents to begin with, but +each then capable of diverging thereafter. 
You might do this to keep
+the state of an archive before doing a big import, so you can go back
+to the original state if you regret the import, for instance.
+
+Forking a backup tag is a more unusual operation, but also
+useful. Perhaps you have a server running many stateful services, and
+the hardware becomes overloaded, so you clone the basic setup onto
+another server, and run half of the services on the original and half
+on the new one; if you fork the backup tag of the original server to
+create a backup tag for the new server, then both servers' snapshot
+history will share the original shared state.
+
+Merging is most useful for archives; you might merge several archives
+into one, as mentioned.
+
+And, of course, you can merge backup tags, as well. If your earlier
+splitting of one server into two doesn't work out (perhaps your
+workload reduces, or you can now afford a single, more powerful,
+server to handle everything in one place), you might rsync back the
+service state from the two servers onto the new server, so it's all
+merged in the new server's filesystem. To preserve this in the
+snapshot history, you can merge the two backup tags of the two servers
+to create a backup tag for the single new server, which will
+accurately reflect the history of the filesystem.
+
+Also, tags might fork by accident - I plan to introduce a distributed
+storage backend, which will replicate blocks and tags across multiple
+storages to create a single virtual storage to build a vault on top
+of; in the event of the network of actual storages suffering a
+failure, it may be that snapshots and imports are only applied to some
+of the storages - and then subsequent snapshots and imports only get
+applied to some other subset of the storages. 
When the network is +repaired and all the storages are again visible, they will have +diverged, inconsistent, states for their tags, and the distributed +storage system will resolve the situation by keeping the majority +state as the state of the tag on all the backends, but preserving any +other states by creating new tags, with the original name plus a +suffix. These can then be merged to "heal" the conflict. ADDED docs/release-2.0.wiki Index: docs/release-2.0.wiki ================================================================== --- docs/release-2.0.wiki +++ docs/release-2.0.wiki @@ -0,0 +1,33 @@ +

Ugarit 2.0 release notes

+ +

What's new?

+
+Added archival mode [dae5e21ffc] and, to support its integration into
+Ugarit, implemented typed tags [08bf026f5a], displayed tag types in
+the VFS [30054df0b6], refactored the Ugarit internals [5fa161239c],
+improved the storage of logs in the vault [68bb75789f], made it
+possible to view logs from within the VFS [4e3673e0fe], supported
+hidden tags [cf5ef4691c], recorded configuration information in the
+vault (providing instant notification if your vault
+hashing/encryption setup is incorrect, thanks to a clever idea by Andy
+Bennett) [0500d282fc], rearranged how local caching is handled
+[b5911d321a], and added support for the history of a snapshot or
+archive tag to have arbitrary branches and merges [a987e28fef], which
+(as a side-effect) improved the performance of running "ls" in long
+snapshot histories [fcf8bc942a]. Also added an sqlite backend
+[8719dfb84f], which makes testing easier but is useful in its own
+right, as it's fully-featured and crash-safe while storing the vault
+in a single file; and improved the appearance of the explore mode ls
+command, as the VFS layout has become more complex with the new
+log/properties views and all the archive mode features.
+

Upgrading

+
+Ugarit 2.0 uses a new format for tags and logs, as well as the whole
+new concept of archive tags. As such, the vault format has
+changed. Ugarit 2.0 will read a vault created by prior versions of
+Ugarit, and will silently upgrade it when it adds things to the vault
+(by using the new format for new things, and keeping the old format for
+old things). Therefore, when you upgrade to Ugarit 2.0 and start using
+it on an existing vault, older versions of Ugarit will not be able to
+read things that Ugarit 2.0 has added to the vault.

ADDED docs/release-old.wiki
Index: docs/release-old.wiki
==================================================================
--- docs/release-old.wiki
+++ docs/release-old.wiki
@@ -0,0 +1,143 @@
+

Ugarit v1.* release history

+ + + * 1.0.9: More humane display of sizes in explore's directory + listings, using low-level I/O to reduce CPU usage. Myriad small + bug fixes and some internal structural improvements. + + * 1.0.8: Bug fixes to work with the latest chicken master, and + increased unit test coverage to test stuff that wasn't working + due to chicken bugs. Looking good! + + * 1.0.7: Fixed bug with directory rules (errors arose when files + were skipped). I need to improve the test suite coverage of + high-level components to stop this happening! + + * 1.0.6: Fixed missing features from v1.0.5 due to a fluffed merge + (whoops), added tracking of directory sizes (files+bytes) in the + vault on snapshot and the use of this information to display + overall percentage completion when extracting. Directory sizes + can be seen in the explore interface when doing "ls -l" or "ls -ll". + + * 1.0.5: Changed the VFS layout slightly, making the existence of + snapshot objects explicit (when you go into a tag, then go into a + snapshot, you now need to go into "contents" to see the actual + file tree; the snapshot object itself now exists as a node in the + tree). Added traverse-vault-* functions to the core API, and tests + for same, and used traverse-vault-node to drive the cd and get + functions in the interactive explore mode (speeding them up in the + process!). Added "extract" command. Added a progress reporting + callback facility for snapshots and extractions, and used it to + provide progress reporting in the front-end, every 60 seconds or + so by default, not at all with -q, and every time something + happens with -v. Added tab completion in explore mode. + + * 1.0.4: Resurrected support for compression and encryption and SHA2 + hashes, which had been broken by the failure of the + autoload egg to continue to work as it used to. Tidying + up error and ^C handling somewhat. 
+
+ * 1.0.3: Installed sqlite busy handlers to retry when the database is
+   locked due to concurrent access (affects backend-fs, backend-cache,
+   and the file cache), and gained an EXCLUSIVE lock when locking a
+   tag in backend-fs; I'm not clear if it's necessary, but it can't
+   hurt.
+
+   BUGFIX: Logging of messages from storage backends wasn't
+   happening correctly in the Ugarit core, leading to errors when the
+   cache backend (which logs an info message at close time) was closed
+   and the log message had nowhere to go.
+
+ * 1.0.2: Made the file cache also commit periodically, rather than on
+   every write, in order to improve performance. Counting blocks and
+   bytes uploaded / reused, and file cache bytes as well as hits;
+   reporting same in snapshot UI and logging same to snapshot
+   metadata. Switched to the posix-extras egg and ditched our own
+   posixextras.scm wrappers. Used the parley egg in the ugarit
+   explore CLI for line editing. Added logging infrastructure,
+   recording of snapshot logs in the snapshot. Added recovery from
+   extraction errors. Listed lock state of tags in explore
+   mode. Backend protocol v2 introduced (retaining v1 for
+   compatibility) allowing for an error on backend startup, and logging
+   nonfatal errors, warnings, and info on startup and all protocol
+   calls. Added ugarit-archive-admin command line interface to
+   backend-specific administrative interfaces. Configuration of the
+   splitlog backend (write protection, adjusting block size and logfile
+   size limit and commit interval) is now possible via the admin
+   interface. The admin interface also permits rebuilding the metadata
+   index of a splitlog vault with the reindex! admin command. 
+ + BUGFIX: Made file cache check the file hashes it finds in the + cache actually exist in the vault, to protect against the case + where a crash of some kind has caused unflushed changes to be + lost; the file cache may well have committed changes that the + backend hasn't, leading to references to nonexistent blocks. Note + that we assume that vaults are sequentially safe, e.g. if the + final indirect block of a large file made it, all the partial + blocks must have made it too. + + BUGFIX: Added an explicit flush! command to the backend + protocol, and put explicit flushes at critical points in higher + layers (backend-cache, the vault abstraction in the Ugarit + core, and when tagging a snapshot) so that we ensure the blocks we + point at are flushed before committing references to them in the + backend-cache or file caches, or into tags, to ensure crash + safety. + + BUGFIX: Made the splitlog backend never exceed the file size limit + (except when passed blocks that, plus a header, are larger than + it), rather than letting a partial block hang over the 'end'. + + BUGFIX: Fixed tag locking, which was broken all over the + place. Concurrent snapshots to the same tag should now block for + one another, although why you'd want to *do* that is questionable. + + BUGFIX: Fixed generation of non-keyed hashes, which was + incorrectly appending the type to the hash without an outer + hash. This breaks backwards compatibility, but nobody was using + the old algorithm, right? I'll introduce it as an option if + required. + + * 1.0.1: Consistency check on read blocks by default. Removed warning + about deletions from backend-cache; we need a new mechanism to + report warnings from backends to the user. Made backend-cache and + backend-fs/splitlog commit periodically rather than after every + insert, which should speed up snapshotting a lot, and reused the + prepared statements rather than re-preparing them all the + time. 
+ + BUGFIX: splitlog backend now creates log files with + "rw-------" rather than "rwx------" permissions; and all sqlite + databases (splitlog metadata, cache file, and file-cache file) are + created with "rw-------" rather than "rw-r--r--". + + * 1.0: Migrated from gdbm to sqlite for metadata storage, removing the + GPL taint. Unit test suite. backend-cache made into a separate + backend binary. Removed backend-log. + + BUGFIX: file caching uses mtime *and* + size now, rather than just mtime. Error handling so we skip objects + that we cannot do something with, and proceed to try the rest of the + operation. + + * 0.8: decoupling backends from the core and into separate binaries, + accessed via standard input and output, so they can be run over SSH + tunnels and other such magic. + + * 0.7: file cache support, sorting of directories so they're archived + in canonical order, autoloading of hash/encryption/compression + modules so they're not required dependencies any more. + + * 0.6: .ugarit support. + + * 0.5: Keyed hashing so attackers can't tell what blocks you have, + markers in logs so the index can be reconstructed, sha2 support, and + passphrase support. + + * 0.4: AES encryption. + + * 0.3: Added splitlog backend, and fixed a .meta file typo. + + * 0.2: Initial public release. + + * 0.1: Internal development release. ADDED docs/security.wiki Index: docs/security.wiki ================================================================== --- docs/security.wiki +++ docs/security.wiki @@ -0,0 +1,170 @@ +

Security model

+ +I have designed and implemented Ugarit to be able to handle cases +where the actual vault storage is not entirely trusted. + +However, security involves tradeoffs, and Ugarit is configurable in +ways that affect its resistance to different kinds of attacks. Here I +will list different kinds of attack and explain how Ugarit can deal +with them, and how you need to configure it to gain that +protection. + +

Vault snoopers

+ +This might be somebody who can intercept Ugarit's communication with +the vault at any point, or who can read the vault itself at their +leisure. + +Ugarit's splitlog backend creates files with "rw-------" permissions +out of the box to try to prevent this. This is a pain for people who +want to share vaults between UIDs, but we can add a configuration +option to override this if that becomes a problem. + +

Reading your data

+ +If you enable encryption, then all the blocks sent to the vault are +encrypted using a secret key stored in your Ugarit configuration +file. As long as that configuration file is kept safe and the AES +algorithm is secure, attackers who can snoop the vault cannot +decode your data blocks. Enabling compression will also help, as the +blocks are compressed before being encrypted, which is thought to make +cryptographic analysis harder. + +Recommendations: Use compression and encryption when there is a risk +of vault snooping. Keep your Ugarit configuration file safe using +UNIX file permissions (make it readable only by root), and maybe store +it on a removable device that's only plugged in when +required. Alternatively, use the "prompt" passphrase option, and be +prompted for a passphrase every time you run Ugarit, so it isn't +stored on disk anywhere. + +
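The compress-then-encrypt pipeline described above can be sketched as follows. This is a hypothetical illustration only: Ugarit really uses AES, whereas here a hash-based XOR keystream stands in for the cipher so the sketch is self-contained; the point is the ordering (hash the plaintext, compress, then encrypt).

```python
import hashlib
import zlib

def keystream(key: bytes, n: int) -> bytes:
    # Stand-in for a real cipher (Ugarit uses AES): derive a
    # pseudo-random keystream of n bytes from the secret key.
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def store_block(content: bytes, key: bytes):
    # Hash the *plaintext* -- blocks are addressed by their original
    # content -- then compress, then encrypt. Compressing first
    # strips redundancy that might otherwise aid cryptanalysis.
    block_hash = hashlib.sha256(content).hexdigest()
    return block_hash, xor_encrypt(zlib.compress(content), key)

def fetch_block(stored: bytes, key: bytes) -> bytes:
    # Reverse the pipeline: decrypt, then decompress.
    return zlib.decompress(xor_encrypt(stored, key))

secret = b"secret-from-config-file"
h, stored = store_block(b"hello " * 100, secret)
assert stored != zlib.compress(b"hello " * 100)   # snooper sees only ciphertext
assert fetch_block(stored, secret) == b"hello " * 100
```

Note that because the block is addressed by the hash of its plaintext, deduplication still works across identical content even though the stored bytes are opaque to a vault snooper.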

Looking for known hashes

+ +A block is identified by the hash of its content (before compression +and encryption). If an attacker was trying to find people who own a +particular file (perhaps a piece of subversive literature), they could +search Ugarit vaults for its hash. + +However, Ugarit has the option to "key" the hash with a "salt" stored +in the Ugarit configuration file. This means that the hashes used are +actually a hash of the block's contents *and* the salt you supply. If +you do this with a random salt that you keep secret, then attackers +can't check your vault for known content just by comparing the hashes. + +Recommendations: Provide a secret string to your hash function in your +Ugarit configuration file. Keep the Ugarit configuration file safe, as +per the advice in the previous point. + +
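The keyed-hash defence above amounts to mixing the secret salt into every hash computation. A minimal sketch in Python (hypothetical, using SHA-256 for illustration rather than Ugarit's actual hash configuration):

```python
import hashlib

def block_key(salt: bytes, content: bytes) -> str:
    # Key the hash with a secret salt: the same content yields a
    # different key under every salt, so an attacker who knows a
    # file's plain hash cannot find it in your vault.
    return hashlib.sha256(salt + content).hexdigest()

block = b"a piece of subversive literature"
unkeyed = hashlib.sha256(block).hexdigest()
keyed = block_key(b"my secret salt", block)

assert keyed != unkeyed                               # precomputed lookups fail
assert keyed == block_key(b"my secret salt", block)   # still deterministic,
                                                      # so dedup still works
```

Determinism is the crucial property: deduplication survives keying because anyone holding the salt always derives the same key for the same content, while anyone without it cannot.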

Vault modifiers

+ +These folks can modify Ugarit's writes into the vault, its reads +back from the vault, or can modify the vault itself at their leisure. + +Modifying an encrypted block without knowing the encryption key can at +worst be a denial of service, corrupting the block in an unknown +way. An attacker who knows the encryption key could replace a block +with valid-seeming but incorrect content. In the worst case, this +could exploit a bug in the decompression engine, causing a crash or +even an exploit of the Ugarit process itself (thereby gaining the +powers of a process inspector, as documented below). We can but hope +that the decompression engine is robust. Exploits of the decryption +engine, or other parts of Ugarit, are less likely due to the nature of +the operations performed upon them. + +However, if a block is modified, then when Ugarit reads it back, the +hash will no longer match the hash Ugarit requested, which will be +detected and an error reported. The hash is checked after +decryption and decompression, so this check does not protect us +against exploits of the decompression engine. + +This protection is only afforded when the hash Ugarit asks for is not +tampered with. Most hashes are obtained from within other blocks, +which are therefore safe unless that block has been tampered with; the +nature of the hash tree conveys the trust in the hashes up to the +root. The root hashes are stored in the vault as "tags", which a +vault modifier could alter at will. Therefore, the tags cannot be +trusted if somebody might modify the vault. This is why Ugarit +prints out the snapshot hash and the root directory hash after +performing a snapshot, so you can record them securely outside of the +vault. + +The most likely threat posed by vault modifiers is that they could +simply corrupt or delete all of your vault, without needing to know +any encryption keys. + +Recommendations: Secure your vaults against modifiers, by whatever +means possible. 
If vault modifiers are still a potential threat, +write down a log of your root directory hashes from each snapshot, and keep +it safe. When extracting your backups, use the ls -ll command in the +interface to check the "contents" hash of your snapshots, and check +they match the root directory hash you expect. + +
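The read-time integrity check described above can be sketched like so. This is a hypothetical Python illustration, not Ugarit's code; in Ugarit the block would be decrypted and decompressed before the hash comparison, while here the vault holds plaintext for brevity:

```python
import hashlib

class CorruptBlockError(Exception):
    """Raised when a fetched block does not hash to the requested key."""

def fetch(vault: dict, requested_hash: str) -> bytes:
    content = vault[requested_hash]
    # Recompute the hash of what actually came back; a vault modifier
    # can change the stored bytes, but cannot make them match the
    # hash we asked for without breaking the hash function.
    if hashlib.sha256(content).hexdigest() != requested_hash:
        raise CorruptBlockError(requested_hash)
    return content

vault = {}
block = b"some file data"
key = hashlib.sha256(block).hexdigest()
vault[key] = block
assert fetch(vault, key) == block     # honest vault: block verifies

vault[key] = b"tampered!"             # a vault modifier at work
try:
    fetch(vault, key)
    raise AssertionError("tampering went undetected")
except CorruptBlockError:
    pass                              # tampering detected, as expected
```

This also shows why the tags are the weak point: `requested_hash` is trusted input here, so a modifier who can rewrite the tag you start from defeats the check for everything beneath it.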

Process inspectors

+ +These folks can attach debuggers or similar tools to running +processes, such as Ugarit itself. + +Ugarit backend processes only see encrypted data, so people who can +attach to that process gain the powers of vault snoopers and +modifiers, and the same conditions apply. + +People who can attach to the Ugarit process itself, however, will see +the original unencrypted content of your filesystem, and will have +full access to the encryption keys and hashing keys stored in your +Ugarit configuration. When Ugarit is running with sufficient +permissions to restore backups, they will be able to intercept and +modify the data as it comes out, and probably gain total write access +to your entire filesystem in the process. + +Recommendations: Ensure that Ugarit does not run under the same user +ID as untrusted software. In many cases it will need to run as root in +order to gain unfettered access to read the filesystems it is backing +up, or to restore the ownership of files. However, when all the files +it backs up are world-readable, it could run as an untrusted user for +backups, and where file ownership is trivially reconstructible, it can +do restores as a limited user, too. + +

Attackers in the source filesystem

+ +These folks create files that Ugarit will back up one day. By having +write access to your filesystem, they already have some level of +power, and standard Unix security practices such as storage quotas +should be used to control them. They may be people with logins on your +box, or more subtly, people who can cause servers to write files; +somebody who sends an email to your mailserver will probably cause +that message to be written to queue files, as will people who can +upload files via any means. + +Such attackers might use up your available storage by creating large +files. This creates a problem in the actual filesystem, but that +problem can be fixed by deleting the files. If those files get +stored into Ugarit, then they are a part of that snapshot. If you +are using a backend that supports deletion, then (when I implement +snapshot deletion in the user interface) you could delete that entire +snapshot to recover the wasted space, but that is a rather serious +operation. + +More insidiously, such attackers might attempt to abuse a hash +collision in order to fool the vault. If they have a way of creating +a file that, for instance, has the same hash as your shadow password +file, then Ugarit will think that it already has that file when it +attempts to snapshot it, and store a reference to the existing +file. If that snapshot is restored, then they will receive a copy of +your shadow password file. Similarly, if they can predict a future +hash of your shadow password file, and create a shadow password file +of their own (perhaps one giving them a root account with a known +password) with that hash, they can then wait for the real shadow +password file to have that hash. If the system is later restored from +that snapshot, then their chosen content will appear in the shadow +password file. However, doing this requires a very fundamental break +of the hash function being used. 
+ +Recommendations: Think carefully about who has write access to your +filesystems, directly or indirectly via a network service that stores +received data to disk. Enforce quotas where appropriate, and consider +not backing up "queue directories" where untrusted content might +appear; migrate incoming content that passes acceptance tests to an +area that is backed up. If necessary, the queue might be backed up to +a non-snapshotting system, such as rsyncing to another server, so that +any excessive files that appear in there are removed from the backup +in due course, while still affording protection. ADDED docs/storage-admin.wiki Index: docs/storage-admin.wiki ================================================================== --- docs/storage-admin.wiki +++ docs/storage-admin.wiki @@ -0,0 +1,91 @@ +

Storage administration

+ +Each backend offers a number of administrative commands for +administering the storage underlying vaults. These are accessible via +the ugarit-storage-admin command line interface. + +To use it, run it with the following command: + +
$ ugarit-storage-admin ''
+ +The available commands differ between backends, but all backends +support the info and help commands, which +give basic information about the vault, and list all available +commands, respectively. Some offer a stats command that +examines the vault state to give interesting statistics, although +this may be a time-consuming operation. + +

Administering splitlog storages

+ +The splitlog backend offers a wide selection of administrative +commands. See the help command on a splitlog vault for +details. The following commands are available: + +
+ +
help
+
List the available commands.
+ +
info
+
List some basic information about the storage.
+ +
stats
+
Examine the metadata to provide overall statistics about the +archive. This may be a time-consuming operation on large +storages.
+ +
set-block-size! BYTES
+
Sets the block size to the given number of bytes. This will affect +new blocks written to the storage, and leave existing blocks +untouched, even if they are larger than the new block size.
+ +
set-max-logfile-size! BYTES
+
Sets the size at which a log file is finished and a new one +started (likewise, existing log files will be untouched; this will +only affect new log files).
+ +
set-commit-interval! UPDATES
+
Sets the frequency of automatic synching of the storage +state to disk. Lowering this harms performance when writing to the +storage, but decreases the number of in-progress block writes that +can fail in a crash.
+ +
write-protect!
+
Disables updating of the storage.
+ +
write-unprotect!
+
Re-enables updating of the storage.
+ +
reindex!
+
Reindex the storage, rebuilding the block and tag state from the +contents of the log. If the metadata file is damaged or lost, +reindexing can rebuild it (although any configuration changes made +via other admin commands will need manually repeating as they are +not logged).
+
+ +
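The interplay between the block size and the log file size limit can be illustrated with a sketch of the rotation rule. This is hypothetical Python inferred from the behaviour described in the changelog (a log file never exceeds the limit, except when a single block plus its header is larger than the limit, in which case it must be written anyway); the 16-byte header size is an assumption for illustration:

```python
HEADER_SIZE = 16  # hypothetical per-block header size

def place_blocks(block_sizes, max_logfile_size):
    """Assign each block to a numbered log file, never letting a file
    exceed max_logfile_size -- except for a single oversized block,
    which still has to be written somewhere."""
    placements = []     # log file number chosen for each block
    current, used = 0, 0
    for size in block_sizes:
        needed = size + HEADER_SIZE
        # Rotate if this block would push a non-empty file past the
        # limit; an oversized block into an empty file is allowed.
        if used > 0 and used + needed > max_logfile_size:
            current += 1
            used = 0
        placements.append(current)
        used += needed
    return placements

# Three 400-byte blocks, 1000-byte limit: two fit, the third rotates.
assert place_blocks([400, 400, 400], 1000) == [0, 0, 1]
# An oversized block is still written, alone in its own log file.
assert place_blocks([400, 2000, 400], 1000) == [0, 1, 2]
```

Under this rule, lowering set-max-logfile-size! only affects where future blocks land, which matches the documented behaviour that existing log files are left untouched.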

Administering sqlite storages

+ +The sqlite backend has a similar administrative interface to the +splitlog backend, except that it does not have log files, so lacks the +set-max-logfile-size! and reindex! commands. + +

Administering cache storages

+ +The cache backend provides a minimalistic interface: + +
+ +
help
+
List the available commands.
+ +
info
+
List some basic information about the storage.
+ +
stats
+
Report on how many entries are in the cache.
+ +
clear!
+
Clears the cache, dropping all the entries in it.
+ +
Index: ugarit-api.scm ================================================================== --- ugarit-api.scm +++ ugarit-api.scm @@ -402,12 +402,12 @@ ('store-atime (set! (job-store-atime? (current-job)) #t)) ('store-ctime (set! (job-store-ctime? (current-job)) #t)) (('storage command-line) - (set! *storage* - (with-backend-logging + (set! *storage* + (with-backend-logging (import-storage command-line)))) (('hash . conf) (set! *hash* conf)) (('compression . conf) (set! *compression* conf)) (('encryption . conf) (set! *crypto* conf)) (('cache path) Index: ugarit.release-info ================================================================== --- ugarit.release-info +++ ugarit.release-info @@ -7,5 +7,6 @@ (release "1.0.4") (release "1.0.5") (release "1.0.6") (release "1.0.7") (release "1.0.9") +(release "2.0") Index: ugarit.setup ================================================================== --- ugarit.setup +++ ugarit.setup @@ -1,8 +1,8 @@ (use posix) -(define *version* "1.0.9") +(define *version* "2.0") (define (newer file1 file2) (or (not (get-environment-variable "UGARIT_FAST_BUILD")) (not (file-exists? file2))