View Ticket
Ticket Hash: f1f2ce8cdc0e6471ddcd36d98a48836c95552a3f
Title: Replicated storage
Status: Claimed Type: Feature_Request
Severity: UNSPECIFIED Priority: 2_Medium
Subsystem: Backends Resolution: Open
Last Modified: 2015-07-31 21:38:49
Version Found In:
Write a storage backend that acts as a proxy to a number of other storage backends, known as "shards", each tagged with a trust percentage and read/write load weightings.

An sqlite database holds the configuration, and a cache of what blocks are on what shards. If the configuration lists no shards (eg, when creating the storage initially), then the storage can be opened but is read-only and empty until ugarit-storage-admin is used to add some shards.

Upon storage open, all the shards are opened, and a vector of available shard metadata and connections is kept in memory. Shards that fail to open are still in the vector, but marked as broken.

A shard that appears to be read-only upon connect automatically gets a write weighting of zero, meaning no writes are attempted there, and a warning is logged if the write weighting was not configured to be zero.

Each block that is put! into the storage will be uploaded to enough shards to make the total trust of them all be at least 100%, by randomly picking non-broken shards according to their write load weighting. If a shard rejects the attempt to store the block with an error, then it is flagged as broken and a new shard chosen. If it is not possible to find enough working shards to reach a 100% trust level, then an "insufficient replication" flag is set and a warning issued; if it was not possible to attain a non-zero trust level (eg, no shard with nonzero trust would take it), a fatal error is raised.

When a block is requested from storage, the cache is consulted to find out which shards should carry it, and one picked at random (subject to the read weighting). The read is attempted from that shard, and if it fails with an error (as opposed to the block not existing), it's marked as broken and another tried. If the block is reported as not existing, then the cached reference to that block being on that shard is removed from the cache, and a warning output. If no shards can be reached to obtain a copy of the block, a fatal error is raised.

If the block is not listed in the cache, then if the "reconstruct cache" configuration flag is set in the sqlite database, all shards are checked, and the cache updated with who has the block for future reference, and a warning issued and the "insufficient replication" flag set if the total trust is <100%; if the flag is not set, they are tried randomly (weighted by read weighting) until the block is found somewhere, that fact is cached, and the search stops then.

When a tag is written to the storage, it is replicated to all non-broken storages, with an update timestamp appended and the name mangled by prepending #replicated-tag-, in order to hide it if a user accidentally tries to use a shard storage as a vault.

When a tag is read from the storage, it (with the name mangled) is read from all shards. The value with the most recent timestamp is used, and any shards that had older timestamped tags, or which lacked the tag, are updated and a warning issued.

If the "insufficient replication" flag is set upon close!, then a warning telling the user to run the fix-replication! storage admin command is logged.

There will be storage admin commands to:

  • add-shard!: Add a shard. The tag #ugarit-vault-configuration is written to it with a reference to a configuration block identifying this as a shard (see [0500d282fc95ef]) by having the format identified ugarit-replicated-shard-v1 and no test block.
  • remove-shard!: Remove a shard. Removing a shard causes a scan of the cache to be done; if any blocks will end up with less than one hundred percent total trust because of the change, then a fatal error is logged unless a force flag was supplied; the error message is even more alarming if one or more blocks would end up with zero total trust. If the removal proceeds, then all cache entries for that shard are removed.
  • shards: List the current shards, with the broken flags.
  • configure-shard!: Update the trust, read weight, or write weight of an existing shard.
  • stats: Scan the cache and report on how many blocks have >=100, <100 or 0 total trust.
  • fix-replication!: Scan the cache to find blocks with <100 total trust, and re-replicate them to additional shards to make it up to 100. Report warnings for any that cannot be sufficiently replicated.
  • check-cache!: Scan the cache and actually check that each block exists on each shard it's supposed to. Warn and update the cache if not. If an "extra thorough" flag is supplied, also check every shard that's not supposed to carry the block in case it does already.

alaric added on 2012-12-29 15:25:40 UTC:
If a tag is updated while one or more shards are unreachable, then when they come back, we have a conflicting state of the tag on different shards.

We can add revision counters to the contents of the tags stored on the shards to resolve those conflicts trivially in favour of the most recent version, but that will then lose snapshots if we have other clients snapshotting to the same replicated vault during a network partition. This is not very likely with snapshots, but very likely with archival mode.

Therefore, as the backend is ignorant of the nature of snapshot chains (as it lacks the encryption keys to even read them!), the best the backend can do when a tag becomes conflicted in this way is to return *all* available values of the tag, extending the backend API to allow for conflicted tags to be returned as a list of strings rather than a string.

The front-end then needs to have a way of handling this case when it arises.

This can be done in the procedure that reads the value of a tag, which can fix the conflicted tag up before continuing and returning the resolved tag value.

For snapshot tags, the thing to do would be to create an explicit merge object referring to the list of parents, and reset the tag to point to that. The merge object should contain an alist mapping string prefixes to parent tag values; fold-history can, when it encounters such an object in the chain, recursively fold over the parent histories, merging them together and prefixing the node names with the prefixes from the alist. The merge will preserve the temporal ordering of snapshots in the merged chain.

The prefixes would just be "a", "b", "c" etc. for automatic merges.

Such merge objects might also be created manually via the command line interface, nominating a list of tags to merge into a newly-created tag; the tag names are then usable as prefixes in that case.

A similar merge object can be used in archive-mode archive history chains; in which case, the merge only needs to worry about conflicting metadata updates to the same object on either side, and choose an order for them (based on a timestamp?). Again, the merges might be created automatically when resolving a conflicted tag, or manually to explicitly merge two archives within the same vault.

alaric added on 2013-02-08 13:32:43 UTC:
We can't write the #ugarit-vault-configuration tag to a backend without having a vault context to apply the crypto.

Instead, write a special #ugarit-not-a-vault tag that contains a string message stating what it is, and make open-vault check for #ugarit-not-a-vault in the storage it opens, and die if it's found.

That can be used as a generic mechanism to mark storages that aren't vaults so they can't be accidentally used as such.

User Comments:
alaric added on 2014-10-14 22:49:55:

It's important to preserve the database that says what blocks are where; losing it means we can recover it slowly as we look for blocks, but if we don't have a full list of blocks with their replication data, we can't ensure they're not lost when nodes are removed.

Therefore, we should probably, on every flush!, store a log of block replication changes to a special tag on every backend. That means that multiple instances of the replicated backend will also be kept more or less in sync, as they'll check for new updates on open.

As for dealing with tag splits - with [a987e28fef] it is possible to merge both snapshot and archive tags, but it has to happen at the vault level, not at the storage level. The best a backend can do is to spot that a tag has diverged, find the set of distinct values, point the tag at the most popular, and store the others as new tags named by suffixing the original name. The user can notice this from the log or by observation, and trigger a manual merge.

alaric added on 2014-10-25 18:59:53:

See this blog post for rationale on replication configuration:

alaric added on 2015-07-31 21:38:49:

I'm starting to work on this, because I need it - my home vault disk is starting to die, and I've lost some snapshots because it's not replicated!