Ticket Hash: f1f2ce8cdc0e6471ddcd36d98a48836c95552a3f
Title: Replicated storage
Status: Claimed
Type: Feature_Request
Severity: UNSPECIFIED
Priority: 2_Medium
Subsystem: Backends
Resolution: Open
Last Modified: 2015-07-31 21:38:49
Version Found In:

Description:
Write a storage backend that acts as a proxy to a number of other storage backends, known as "shards", each tagged with a trust percentage and read/write load weightings. An sqlite database holds the configuration, and a cache of what blocks are on what shards.
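For illustration only, here is a minimal sketch of what that sqlite configuration and cache might hold. It is written in Python for convenience, and the table and column names are invented for this sketch; Ugarit itself is written in Chicken Scheme and may lay this out differently.

    import sqlite3

    def open_config(path):
        """Open (or create) a hypothetical shard-configuration database."""
        db = sqlite3.connect(path)
        db.executescript("""
            -- One row per shard: how to reach it, how much we trust it to
            -- keep data (as a percentage), and how heavily to favour it
            -- for reads and writes.
            CREATE TABLE IF NOT EXISTS shards (
                shard_id     INTEGER PRIMARY KEY,
                url          TEXT NOT NULL,
                trust        INTEGER NOT NULL,   -- trust percentage, 0-100
                read_weight  INTEGER NOT NULL,
                write_weight INTEGER NOT NULL    -- 0 means never write here
            );

            -- Cache of which blocks are believed to live on which shards.
            CREATE TABLE IF NOT EXISTS block_locations (
                block_key TEXT NOT NULL,
                shard_id  INTEGER NOT NULL REFERENCES shards(shard_id),
                PRIMARY KEY (block_key, shard_id)
            );

            -- Flags such as "reconstruct cache" and "insufficient replication".
            CREATE TABLE IF NOT EXISTS flags (
                name  TEXT PRIMARY KEY,
                value INTEGER NOT NULL
            );
        """)
        return db

Keeping the block-location cache alongside the shard list in one database means a single open gives the backend everything it needs to route reads and writes.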
If the configuration lists no shards (eg, when creating the storage initially), then the storage can be opened, but is read-only and empty until shards are added to the configuration. Upon storage open, all the shards are opened, and a vector of available shard metadata and connections is kept in memory. Shards that fail to open are still in the vector, but marked as broken. A shard that appears to be read-only upon connect automatically gets a write weighting of zero, meaning no writes are attempted there, and a warning is logged if the write weighting was not configured to be zero.

Each block that is written is stored on shards chosen according to their write weightings, until the total trust of the shards holding it reaches at least 100%, and the cache is updated to record where it went. When a block is requested from storage, the cache is consulted to find out which shards should carry it, and one is picked at random, subject to the read weighting (sketched below). The read is attempted from that shard, and if it fails with an error (as opposed to the block not existing), the shard is marked as broken and another is tried. If the block is reported as not existing, then the cached reference to that block being on that shard is removed from the cache, and a warning is output. If no shards can be reached to obtain a copy of the block, a fatal error is raised.

If the block is not listed in the cache at all, the behaviour depends on the "reconstruct cache" configuration flag in the sqlite database. If the flag is set, all shards are checked, the cache is updated with who has the block for future reference, and a warning is issued and the "insufficient replication" flag set if the total trust is <100%. If the flag is not set, shards are tried randomly (weighted by read weighting) until the block is found somewhere, that fact is cached, and the search stops there.

When a tag is written to the storage, it is replicated to all non-broken shards, with an update timestamp appended and the name mangled by prepending a prefix. When a tag is read from the storage, it (with the name mangled) is read from all shards. The value with the most recent timestamp is used, and any shards that had older timestamped tags, or which lacked the tag, are updated and a warning issued. If the "insufficient replication" flag is set, a warning is issued when the storage is opened.

There will be storage admin commands for managing the shards and their configuration.
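A minimal sketch of the block read path described above, assuming the in-memory structures named here: a cache mapping a block key to a set of shard ids, and shard objects with read_weight, broken, and a get method that raises on I/O failure and returns None for a missing block. These names are illustrative, not Ugarit's actual (Scheme) API.

    import random

    class AllShardsFailed(Exception):
        """Raised when no reachable shard can supply a cached block."""

    def read_block(key, cache, shards):
        """Try cached locations for `key`, weighted by read weighting.

        `cache` maps a block key to a set of shard ids; `shards` maps a
        shard id to an object with .read_weight, .broken and .get(key).
        """
        candidates = set(cache.get(key, ()))
        while candidates:
            live = [s for s in candidates
                    if not shards[s].broken and shards[s].read_weight > 0]
            if not live:
                break
            # Pick one candidate at random, subject to the read weighting.
            chosen = random.choices(
                live, weights=[shards[s].read_weight for s in live])[0]
            try:
                block = shards[chosen].get(key)
            except IOError:
                # A real error: mark the shard broken and try another.
                shards[chosen].broken = True
                candidates.discard(chosen)
                continue
            if block is None:
                # The shard answered but lacks the block after all: drop the
                # stale cache entry, warn, and try the remaining candidates.
                print("warning: block %s missing from shard %s" % (key, chosen))
                cache[key].discard(chosen)
                candidates.discard(chosen)
                continue
            return block
        raise AllShardsFailed("no reachable shard holds block %s" % key)

The weighted random pick spreads read load across shards in proportion to their configured read weightings, while the retry loop gives the failover and cache-repair behaviour described above.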
alaric added on 2012-12-29 15:25:40 UTC:

We can add revision counters to the contents of the tags stored on the shards to resolve those conflicts trivially in favour of the most recent version, but that will then lose snapshots if we have other clients snapshotting to the same replicated vault during a network partition. This is not very likely with snapshots, but very likely with archival mode. Therefore, as the backend is ignorant of the nature of snapshot chains (as it lacks the encryption keys to even read them!), the best the backend can do when a tag becomes conflicted in this way is to return *all* available values of the tag, extending the backend API to allow for conflicted tags to be returned as a list of strings rather than a string (see the sketch after these comments). The front-end then needs to have a way of handling this case when it arises. This can be done in the procedure that reads the value of a tag, which can fix the conflicted tag up before continuing and returning the resolved tag value.

For snapshot tags, the thing to do would be to create an explicit merge object referring to the list of parents, and reset the tag to point to that. The merge object should contain an alist mapping string prefixes to parent tag values; fold-history can, when it encounters such an object in the chain, recursively fold over the parent histories, merging them together and prefixing the node names with the prefixes from the alist. The merge will preserve the temporal ordering of snapshots in the merged chain. The prefixes would just be "a", "b", "c" etc. for automatic merges. Such merge objects might also be created manually via the command line interface, nominating a list of tags to merge into a newly-created tag; the tag names are then usable as prefixes in that case.

A similar merge object can be used in archive-mode archive history chains; in which case, the merge only needs to worry about conflicting metadata updates to the same object on either side, and choose an order for them (based on a timestamp?). Again, the merges might be created automatically when resolving a conflicted tag, or manually to explicitly merge two archives within the same vault.

alaric added on 2013-02-08 13:32:43 UTC:

Instead, write a special #ugarit-not-a-vault tag that contains a string message stating what it is, and make open-vault check for #ugarit-not-a-vault in the storage it opens, and die if it's found. That can be used as a generic mechanism to mark storages that aren't vaults so they can't be accidentally used as such.
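As a rough illustration of the conflicted-tag behaviour suggested in the first comment, the following sketch returns a single string when all reachable shards agree on a tag's value, and the list of distinct values when they do not, leaving the merge to the vault layer. The mangling prefix, the get_tag accessor, and the shard objects are placeholders invented for this sketch, not Ugarit's real API.

    def read_replicated_tag(tag, shards, mangle=lambda t: "replicated-" + t):
        """Collect a tag's value from every non-broken shard.

        Returns a single string if there is only one distinct value,
        otherwise a list of all distinct values so the front-end can
        build an explicit merge object.
        """
        values = []
        for shard in shards:
            if shard.broken:
                continue
            value = shard.get_tag(mangle(tag))  # hypothetical accessor
            if value is not None:
                values.append(value)
        distinct = sorted(set(values))
        if not distinct:
            return None            # tag absent everywhere
        if len(distinct) == 1:
            return distinct[0]     # shards agree (or stale copies can be repaired)
        return distinct            # conflicted: let the vault layer merge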
User Comments:
alaric added on 2014-10-14 22:49:55:
It's important to preserve the database that says what blocks are where; losing it means we can recover it slowly as we look for blocks, but if we don't have a full list of blocks with their replication data, we can't ensure they're not lost when nodes are removed. Therefore, we should probably, on every flush!, store a log of block replication changes to a special tag on every backend. That means that multiple instances of the replicated backend will also be kept more or less in sync, as they'll check for new updates on open.

As for dealing with tag splits - with [a987e28fef] it is possible to merge both snapshot and archive tags, but it has to happen at the vault level, not at the storage level. The best a backend can do is to spot that a tag has diverged, find the set of distinct values, point the tag at the most popular, and store the others as new tags named by suffixing the original name (see the sketch after these comments). The user can notice this from the log or by observation, and trigger a manual merge.

alaric added on 2014-10-25 18:59:53:

See this blog post for rationale on replication configuration: http://www.snell-pym.org.uk/archives/2014/10/25/configuring-replication/

alaric added on 2015-07-31 21:38:49:

I'm starting to work on this, because I need it - my home vault disk is starting to die, and I've lost some snapshots because it's not replicated!
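A rough sketch of the divergence handling proposed in the 2014-10-14 comment: find the distinct values across shards, point the tag at the most popular one, and park each losing value under a suffixed tag name so nothing is lost before a manual merge. The get_tag/set_tag accessors and the suffix scheme are invented for illustration, not Ugarit's actual interface.

    from collections import Counter

    def resolve_diverged_tag(tag, shards, log=print):
        """Repair a diverged tag as best the backend can, per the comment above."""
        live = [s for s in shards if not s.broken]
        values = [v for v in (s.get_tag(tag) for s in live) if v is not None]
        if not values:
            return None
        counts = Counter(values)
        winner, _ = counts.most_common(1)[0]
        # Keep every losing value under a suffixed tag so it can be merged
        # manually later, then point the original tag at the most popular
        # value on every reachable shard.
        for i, value in enumerate(sorted(v for v in counts if v != winner)):
            for shard in live:
                shard.set_tag("%s-diverged-%d" % (tag, i), value)
        for shard in live:
            shard.set_tag(tag, winner)
        log("warning: tag %s had diverged; kept %d alternate value(s)"
            % (tag, len(counts) - 1))
        return winner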