Ticket Hash: dae5e21ffcd3d517e021d8b855fb86ff7d9a271a
Title: Archival mode
Status: Closed
Type: Feature_Request
Severity: UNSPECIFIED
Priority: 2_Medium
Subsystem: Archival_Frontend
Resolution: Open
Last Modified: 2014-11-02 01:49:53
Version Found In:
So far, I have been implementing Ugarit's backup facility through its "snapshot" mode, but it's meant to be a backup *and archival* system.

Whereas snapshot mode takes a filesystem tree and adds it to a chain of snapshots of the same tree rooted at a tag, archival mode takes a filesystem tree and inserts it into a differently-structured thing called a library, also rooted at a tag.

A library is implemented as a chain of snapshot-like blocks, each of which refers to the previous library in the chain, has a small amount of metadata, and points to a contents block. However, the contents block is an s-expression stream of metadata entries. Each metadata entry has a hash (pointing to the root block of the archived filesystem tree, which may often be a raw file rather than a directory), then an alist mapping metadata keys to values.
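As a rough illustration of the structure described above, here is a minimal Python sketch. It is not Ugarit's code: the class names, field names, the `store` dictionary standing in for the vault, and the `walk_chain` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataEntry:
    # Hash of the root block of the archived tree (often a raw file).
    tree_hash: str
    # Alist of (key, value) pairs, kept as a list to preserve order.
    properties: list = field(default_factory=list)


@dataclass
class LibraryBlock:
    # Hash of the previous library block, or None at the start of the chain.
    previous: Optional[str]
    # Small amount of library-level metadata, also an alist.
    metadata: list = field(default_factory=list)
    # Stands in for the contents block: a stream of metadata entries.
    contents: list = field(default_factory=list)


def walk_chain(store, head):
    """Yield library blocks from newest to oldest.

    `store` is a hypothetical mapping from block hash to LibraryBlock;
    `head` is the hash the library tag currently points at.
    """
    block_hash = head
    while block_hash is not None:
        block = store[block_hash]
        yield block
        block_hash = block.previous
```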

The metadata for a given archived filesystem tree may be superseded by later libraries in the chain, in which case the earlier metadata is ignored.
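The superseding rule can be sketched as a newest-first walk that keeps only the first metadata seen for each hash. This is an illustrative Python sketch, not Ugarit's implementation; the dictionary layout of a block (`previous`, `contents`) and the `store` mapping are assumptions, and it assumes a hash appears at most once within a single library block.

```python
def resolve_metadata(store, head):
    """Return {tree_hash: alist}, letting later libraries win.

    `store` is a hypothetical mapping from block hash to a dict with
    keys "previous" (hash or None) and "contents" (list of
    (tree_hash, alist) pairs). `head` is the newest library block.
    """
    resolved = {}
    block_hash = head
    while block_hash is not None:
        block = store[block_hash]
        for tree_hash, props in block["contents"]:
            # We walk newest to oldest, so the first entry seen for a
            # hash is the latest one; older entries are ignored.
            resolved.setdefault(tree_hash, props)
        block_hash = block["previous"]
    return resolved
```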

The library metadata should be cached by the front-end in an SQLite database, all keyed on the tag name. The hash of the latest library is stored in the cache, so that whenever the archive is opened, it can be compared to the current state of the library tag and the chain followed (processing updates as we go) until the previous point is found, thereby only importing the latest changes. The metadata of a given filesystem tree in the library is the metadata attached to it by the library entry, plus any metadata attached to the top-level library block itself, which is inherited by all metadata entries created in that library.
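A minimal sketch of such a cache, using Python's standard `sqlite3` module. The table layout, the `store` mapping and the block dictionary shape are assumptions for illustration, not the front-end's real schema. One deliberate simplification: rather than processing updates while walking backwards, the sketch collects the unseen blocks first and applies them oldest-first, so that a plain "replace" implements the superseding rule.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS library_head (
    tag TEXT PRIMARY KEY, hash TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS metadata (
    tag TEXT NOT NULL, tree_hash TEXT NOT NULL,
    key TEXT NOT NULL, value TEXT NOT NULL);
"""


def open_cache(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def sync(conn, tag, store, current_head):
    """Import library blocks added since the last sync; return their count.

    `store` maps block hash -> {"previous", "metadata", "contents"}
    (a hypothetical stand-in for reading blocks from the vault).
    """
    row = conn.execute(
        "SELECT hash FROM library_head WHERE tag = ?", (tag,)).fetchone()
    cached_head = row[0] if row else None

    # Follow the chain back until we reach the point we imported last time.
    new_blocks = []
    block_hash = current_head
    while block_hash is not None and block_hash != cached_head:
        block = store[block_hash]
        new_blocks.append(block)
        block_hash = block["previous"]

    # Apply oldest-first so that later libraries supersede earlier ones.
    for block in reversed(new_blocks):
        for tree_hash, props in block["contents"]:
            conn.execute(
                "DELETE FROM metadata WHERE tag = ? AND tree_hash = ?",
                (tag, tree_hash))
            # Library-level metadata is inherited by every entry in it.
            merged = list(block["metadata"]) + list(props)
            conn.executemany(
                "INSERT INTO metadata VALUES (?, ?, ?, ?)",
                [(tag, tree_hash, k, str(v)) for k, v in merged])

    if current_head is not None:
        conn.execute(
            "INSERT OR REPLACE INTO library_head VALUES (?, ?)",
            (tag, current_head))
    conn.commit()
    return len(new_blocks)
```

Calling `sync` twice with the same head imports nothing the second time, which is the point of storing the latest library hash in the cache.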

The default virtual filesystem presented by the explore command, when it finds a library tag, can present the library chain like a snapshot chain, but the virtual filesystem provided by 9P/NFS/WebDAV/FUSE mode can be configured to provide multiple views on the archive.

One that comes to mind is to specify a number of metadata keys. The virtual filesystem then has one directory level per metadata key, and the innermost directories contain all filesystem trees that have the given set of values and match a global filter restriction. By setting a global restriction of type=music, and giving the directory keys as artist, album and title, we get a nice music browser. Further extensions might be to extend the syntax for directory keys from single symbols that select a metadata key to constructs like (track-number "-" title) to generate compound strings at each level, and configuring what to do with filesystem trees that lack the metadata key in question (the options being to ignore that filesystem tree, or to provide a default value such as "Unknown").
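The key-per-directory-level view can be sketched as a nested grouping. This Python sketch is only an illustration: the function name, the `objects` mapping and the use of a plain dict per object (one value per key, which a real alist need not guarantee) are assumptions. Compound keys such as (track-number "-" title) are not covered.

```python
def build_view(objects, directory_keys, restriction, default=None):
    """Group archived trees into nested directories by metadata values.

    objects:        {tree_hash: {key: value}} (hypothetical layout)
    directory_keys: one metadata key per directory level, e.g.
                    ["artist", "album", "title"]
    restriction:    global filter, e.g. {"type": "music"}
    default:        name used when a key is missing; None means the
                    tree is ignored instead
    """
    if not directory_keys:
        raise ValueError("at least one directory key is required")
    root = {}
    for tree_hash, props in objects.items():
        # Apply the global filter restriction first.
        if any(props.get(k) != v for k, v in restriction.items()):
            continue
        values = [props.get(k, default) for k in directory_keys]
        if None in values:
            continue  # missing key and no default: leave this tree out
        node = root
        for value in values[:-1]:
            node = node.setdefault(value, {})
        # The innermost level lists the trees that share all the values.
        node.setdefault(values[-1], []).append(tree_hash)
    return root
```

With `default="Unknown"`, a track lacking an album lands under an "Unknown" directory; with `default=None` it is left out of the view.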

alaric added on 2012-04-16 13:20:06 UTC:
It would be useful to record the exact absolute path and hostname a file tree came from when it goes into an archive, as that can be useful metadata in figuring out what it is later.

Now, when we import a file into the archive by snapshotting it and then introducing a metadata record about it into an archive delta, we must check to see if the file already exists in that archive, so as to not overwrite previous rich metadata with naff initial auto-generated metadata. However, it might be nice to read in the previous metadata and append a new "archived from" entry specifying the hostname, location, and time. As the metadata is an alist, it will be easy to do this as long as "archived from" is a single property, so we can tie together the hostname/location/time triple as a single item.
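The import rule above might look something like this in Python. It is a sketch under assumptions: the function name is made up, alist entries are modelled as tuples whose first element is the key, and how the existing metadata is looked up is left out.

```python
def import_metadata(existing, auto_generated, archived_from):
    """Return the alist to record for a newly imported file tree.

    existing:       alist already in the archive for this hash, or
                    None if the tree has not been archived before
    auto_generated: initial metadata derived from the file itself
    archived_from:  tuple tying together hostname, location and time
    """
    entry = ("archived-from",) + tuple(archived_from)
    # Keep previous rich metadata if there is any; only fall back to
    # the auto-generated metadata for trees that are new to the archive.
    result = list(existing) if existing is not None else list(auto_generated)
    # archived-from is a single property, so repeated imports simply
    # add further entries (an identical one is not added twice).
    if entry not in result:
        result.append(entry)
    return result
```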

alaric added on 2012-05-04 10:35:07 UTC:
I should also like to add that the "location archived from" should be represented (for convenience) as four components: hostname, absolute directory path, filename, and extension.

So the metadata alist of a file might include one or more of:

(archived-from "2011-04-32 22:45:01" "anger" "/home/alaric/projects/foo" "backup" "tar.gz")

One would hope that the extension would be the same (modulo case?) for all the archives, but we can never be sure :-)

When building search queries, it would therefore be nice to be able to say (second archived-from) to extract the second field from archived-from - or maybe even to define a table of aliases so we can say archived-from-hostname.
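A sketch of how such field selectors and aliases could be evaluated, in Python. Everything here is hypothetical: the query syntax is not settled in this ticket, the alias names other than archived-from-hostname are invented, and the sketch reads "second field" as the second value after the key (the hostname in the example above).

```python
# Position of each ordinal within an entry tuple; index 0 is the key.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}

# Hypothetical alias table mapping a flat name to (ordinal, key).
ALIASES = {
    "archived-from-time": ("first", "archived-from"),
    "archived-from-hostname": ("second", "archived-from"),
    "archived-from-directory": ("third", "archived-from"),
    "archived-from-filename": ("fourth", "archived-from"),
    "archived-from-extension": ("fifth", "archived-from"),
}


def field_values(alist, ordinal, key):
    """Evaluate (ordinal key) against an alist of tuple entries.

    Returns one value per matching entry, since a file may carry
    several archived-from records. Entries that are too short are
    skipped.
    """
    index = ORDINALS[ordinal]
    return [entry[index] for entry in alist
            if entry[0] == key and len(entry) > index]


def alias_values(alist, alias):
    """Evaluate an alias such as archived-from-hostname."""
    ordinal, key = ALIASES[alias]
    return field_values(alist, ordinal, key)
```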

User Comments:
alaric added on 2014-11-02 01:49:53:

The basics are now there. From the command line, you can import a manifest of objects, search for objects matching a query string, list available properties of objects matching a query string, list available values of a property for objects matching a query string, stream a chosen object to stdout (if it's a file), or extract a chosen object to the filesystem.

Next steps are [9c3ac71f94] for a generic property-based explorer in the VFS, [fff691ada2] for customised views, and [33fd928177] for a manifest generator.

Outside of archival mode, but enhancing its utility tremendously, are a 9p/fuse/puffs client to allow mounting a vault as a read-only filesystem; and replicated storage [f1f2ce8cdc] - with archival mode, the vault starts to become the primary storage for data, rather than just a backup, and so internal vault replication for resilience becomes all the more important.

Future work of note includes an archive tagging GUI [7b6588068f], a public gallery viewer for images [5b07f64457], and support for storing emails in an archive [ea1b7f9ad7].