ARGON
Documentation
Login

TUNGSTEN is the ARGON kernel module responsible for managing persistent storage on each node. It is written in CHROME.

Underlying storage

TUNGSTEN is given access to a number of mass storage devices on each node, as listed in the node boot configuration area managed by HYDROGEN and NITROGEN.

It uses all of these underlying storage devices to provide a single logical storage area - there is no need for administrative decisions about "what to put in which filesystem". However, as documented in the security model, classification and trust considerations can affect which entities are stored on which mass storage devices.

TUNGSTEN may be configured to encrypt data on mass storage devices. It generates a dedicated encryption key for each different classification level of entity it is given on each storage device, and stores them in the key storage area provided by HYDROGEN. If WOLFRAM notifies TUNGSTEN that the node's clearance has been reduced in any way, TUNGSTEN immediately erases the keys to clearances it no longer has, and flushes its caches of decrypted data.

Rather than bothering with inefficient and unreliable technologies such as RAID-5 (with its infamous write hole), TUNGSTEN can be configured to make sure it stores copies of the same data on different storage devices in its pool, to protect against device failures. As TUNGSTEN is aware of the replication and in direct control of it, this results in more efficient and reliable operation.

The amalgamated storage available on a node is used to store a single copy-on-write B-tree (which allows duplicate keys). This gives us cheap atomic transactions, good write throughput as long as the transactions are relatively large, snapshot isolation for reads in transactions, and the ability to continue reading data while that data is being updated without blocking. The B-tree maps arbitrary byte strings (the keys) to other arbitrary byte strings (the values); both could be arbitrarily large, so the on-disk representation of a byte string consists of up to N bytes of the string followed by, if it is required, pointers to the rest of the string stored outside of the B-tree in its own disk blocks. This representation efficiently supports both large and small keys and values.

Disk block pointers are stored as a storage device ID and a variable-length integer referencing the block on that device, to allow efficient storage on small devices while being able to scale to arbitrarily large ones.

New storage devices can be added to the TUNGSTEN pool at any time, and will become available for new allocation. Existing storage devices can be removed, too; TUNGSTEN must be informed so that it can relocate needed data from the device, after which the device can be removed if the relocation was successful.

On boot, TUNGSTEN will identify all attached storage devices that are marked as having TUNGSTEN data in and merge them into the pool. However, TUGNSTEN will store a copy of the node's globally unique ID in the "root block" of a storage device, so that when a storage device that was used by a different node in the past is plugged into the node, it can be identified as "alien" and not automatically merged into the pool (possibly causing disaster).

The contents of the B-tree

Entity state

Most of the records store entity data. For these records, the byte strings used as keys all start with a type byte identifying them as entity data, the volume ID, and the volume-local part of an entity ID (which is a fixed-length unique ID), dividing the key space into a region for each entity. Not all entities in the volume will necessarily be present on all the nodes, however - WOLFRAM distributes entities across the cluster, replicating each entity to a number of nodes. And in future WOLFRAM might distribute the state of an entity across multiple nodes rather than fully replicating the entity onto each node, but that is something I will address only if a need develops for very large entities.

But either way, each entity's state is stored as a number of key-value pairs in the B-tree, kept isolated from every other entity's state. The next part of the keys after the cluster-local part of the entity ID is a variable-length string called the slice ID, used to compartmentalise storage within the entity.

Slices fall into two kinds: mutable and immutable.

Within each mutable slice are stored, logically, a set of tuples, as per the CARBON data model. Each tuple is an IRON record, serialised to a byte string and stored in TUNGSTEN along with some metadata used by WOLFRAM to manage replication, and optionally metadata marking it as a temporary tuple that should be discarded after a specified point in time or can be discarded when mass storage space is short; the key, after the slice ID, is a type byte identifying it as a tuple, a numerical ID referencing the IRON type symbol of the tuple record, and a numerical ID that identifies the tuple.

The presence of the record type byte after the slice ID indicates that other types of record exist within a slice, however. The representation of tuples makes it efficient to find all tuples of a particular type within a slice, as the type symbol's ID is part of the key, but we will often want to find a particular tuple based on the values of its fields, so it is possible to create indices on particular fields. These are represented by records within the slice having a type byte marking them as index records, which map a tuple type symbol ID, a field number within that tuple, and a value to the unique ID of a record with that value in that field; the fact that the B-Tree allows duplicates is used here, as many tuples may have the same value in that field.

Although all of the data stored by an entity is logically represented as a set of tuples, it may not always be physically represented so. For instance, an array of floating point numbers might logically be represented as a set of tuples relating the identity of the array, an index within it, and a floating point value. However, this type of object (among many others) is special-cased within CARBON and a single B-tree record is stored containing the entire array in a packed format. Attempting to write tuples stating the value of array elements causes CARBON to just update the packed array. Many other type bytes are used to store special-cased objects.

Immutable slices are hash references to CAS blocks (discussed below) containing compactly serialised representations of IRON tuples - more on that below. Immutable slices can be created by directly supplying the data to go inside them in ready-serialised form (possibly with cryptographic signatures and other metadata attached), or created by "snapshotting" a mutable slice (optionally applying signatures as we do so).

ID pools

For reasons of compactness, as mentioned briefly above, IRON symbols are not represented directly in their serialised representations within TUNGSTEN, but as numeric IDs. The mapping between symbol IDs and the full text of the symbols, which are really paths in the CARBON global namespace, is represented as special records in the B-Tree outside of any entity (and so having an initial type byte identifying them as not being entity data); each symbol has a record mapping the full serialised form of the sybmol to its assigned numeric ID, and another mapping the numeric ID back to the symbol.

Similarly, entity IDs have a part called the cluster ID that represents the originating entity. The IDs of entities from the same cluster will share the same cluster ID part. As the cluster ID might change, rather than store all those copies of the cluster ID separately, we also store entity IDs with the cluster ID replaced with just the hash of the cluster's master key, and maintain a global per-node mapping from those to the full cluster IDs. See MERCURY for more details.

There are also records outside of entity space used to manage free space on the storage devices, write-ahead logs for core metadata updates, intermediate results of large computations, and things like caching compiled code. Some records are used to maintain an index across all temporary records in the entire system, to enable efficient expiry of outdated records and the rapid location of records that can be dropped to free mass storage space.

CAS objects

TUNGSTEN also manages a content-addressible store, indexed on the hash of the stored content. The records are serialised IRON data, possibly containing embedded CAS link hashes. CAS blocks are stored as a set of B-tree records keyed on the hash and a suffix byte indicating whether this record is the actual data or some metadata - CAS objects may be tagged with a cache expiry time (meaning they can be deleted once the time has passed) and/or a drop cost (meaning, if physical storage is short, then the CAS objects with the lowest drop cost can be deleted). They may also carry a "desired minimum replication" count; WOLFRAM will attempt to ensure that at least that many copies exist in the cluster through active replication. If a node has to request a CAS object from TUNGSTEN storage on another node then, depending on how long it took to arrive and how much local storage is available, it may also replicate it locally as a form of caching, so CAS objects will often be replicated around the cluster anyway if they're popular.

CAS objects may be referenced from within entity storage - immutable entity slices are CAS hash links to serialised state. CAS objects might also be referenced from other CAS objects, and CAS links are probably going to be the contents of all the temporary storage record types listed below. Either way, all CAS objects in the cluster must be referenced from some other TUNGSTEN object in the cluster, and WOLFRAM will probably store back-reference information for CAS objects in TUNGSTEN to provide for efficient garbage collection.

Summary

So, to conclude, the different types of keys found in a TUNGSTEN store include:

IRON symbol to numeric ID

0x01 (IRON symbol :: short symbol) -> (numeric ID :: short integer)

Numeric ID to IRON symbol

0x02 (numeric ID :: short integer) -> (IRON symbol :: short symbol)

Free space

0x03 (device ID :: short integer) (first free block :: short integer) -> (contiguous region of free space on the device)

Write-ahead log

0x04 (device ID :: short integer) -> (contiguous write ahead log on the device)

Global temporary storage

0x05 (storage tag :: short any) 0x00 -> (temporary data :: long any)

Global temporary storage metadata

0x05 (storage tag :: short any) 0x01 -> (record containing expiry time, drop cost, etc for associated temporary storage record :: short record)

Global temporary storage expiry time index entry

0x06 (expiry time :: short integer) -> 0x00 (storage tag :: short any)

Entity temporary tuple expiry time index entry

0x06 (expiry time :: short integer) -> 0x01 (cluster-local entity ID :: short integer) (slice ID :: short any) (type symbol ID :: short integer) (tuple ID :: short integer)

Entity temporary compound object of type 0xNN expiry time index entry

0x06 (expiry time :: short integer) -> 0x02 (cluster-local entity ID :: short integer) (slice ID :: short any) 0xNN (object ID :: short any)

Global temporary storage drop cost index entry

0x07 (drop cost :: short integer) -> 0x00 (storage tag :: short any)

Entity temporary tuple drop cost index entry

0x07 (drop cost :: short integer) -> 0x01 (cluster-local entity ID :: short integer) (slice ID :: short any) (type symbol ID :: short integer) (tuple ID :: short integer)

Entity temporary compound object of type 0xNN drop cost index entry

0x07 (drop cost :: short integer) -> 0x02 (cluster-local entity ID :: short integer) (slice ID :: short any) 0xNN (object ID :: short any)

Cluster master public key hash to full cluster ID

0x08 (hash :: short integer) -> (cluster ID :: short record)

CAESIUM global index

0x08 (volume ID :: short integer) (cluster-local entity ID :: short integer) -> (tuple ID :: short integer)

Entity data slice header

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0x00 -> (slice header record :: long record)
Records whether the slice is mutable, or just a CAS link to an immutable serialised slice.

Entity data tuple

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0x01 (type symbol ID :: short integer) (tuple ID :: short integer) 0x00 -> (tuple as a record :: long record)

Entity data tuple WOLFRAM replication metadata

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0x01 (type symbol ID :: short integer) (tuple ID :: short integer) 0x01 -> (record of WOLFRAM replication metadata :: short record)

Entity data temporary tuple expiry metadata

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0x01 (type symbol ID :: short integer) (tuple ID :: short integer) 0x02 -> (record containing expiry time, drop cost, etc for associated temporary tuple :: short record)

Entity data compound object of type 0xNN

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0xNN (object ID :: short any) 0x00 (optional further key components...) -> (object :: long any)

Entity data compound object of type 0xNN WOLFRAM replication metadata

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0xNN (object ID :: short any) 0x01 -> (record of WOLFRAM replication metadata :: short record)

Entity data temporary compound object of type 0xNN expiry metadata

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0xNN (object ID :: short any) 0x02 -> (record containing expiry time, drop cost, etc for associated temporary compound object :: short record)

Entity data tuple index entry

0x80 (cluster-local entity ID :: short integer) (slice ID :: short any) 0x02 (type symbol :: short integer) (field number :: short integer) (value :: short any) -> (tuple ID :: short integer)

CAS object data

0xA0 (hash algorithm ID :: short integer) (hash :: short any) 0x00 -> (record :: long any)

CAS object expiry data

0xA0 (hash algorithm ID :: short integer) (hash :: short any) 0x01 -> (record containing expiry time, drop cost, etc for associated temporary tuple :: short record)

CAS object WOLFRAM replication metadata

0xA0 (hash algorithm ID :: short integer) (hash :: short any) 0x02 -> (record of WOLFRAM replication metadata :: short record)

The structure of the keys is chosen to ensure that related data are adjacent in the B-Tree where practical, to increase locality of reference. Otherwise, there would be no need to have several type bytes in the same key as I suggest.

Data representation

The key and data structures documented above are a mixture of literal bytes used as type markers and logical data items listed in parentheses. Values in parentheses are either unstructured data, IRON values in unpacked binary format but with symbols encoded as numeric IDs rather than the full symbol format, or IRON values in packed binary format (again, with symbols encoded as numeric IDs).

The meanings of the type suffixes written in the logical data items are:

None specified

This is used purely for values that are used to denote raw regions of the mass storage device, in order to mark them as free or to reserve them for special purposes. The contents are either arbitrary and never examined, such as free space, or written in specialised binary formats, such as write-ahead logs.

short integer

An integer, written in IRON unpacked binary form without any type prefix (as the type is statically known).

short symbol

A symbol (a full CARBON directory path), written in IRON unpacked binary form without any type prefix (as the type is statically known).

short record

An IRON record with symbol tag, written in IRON unpacked binary form without any type prefix (as the type is statically known).

short any

An arbitrary IRON value, written in IRON unpacked binary form with a type prefix (as the type is not statically known).

long record

An IRON record with symbol tag, written in any IRON binary form, with an encoding prefix, but without any type prefix (as the type is statically known).

long any

An arbitrary IRON value, written in any IRON binary form, with an encoding prefix, and with a type prefix (as the type is not statically known).

Removable devices

Removable mass storage devices plugged into an ARGON node can be handled in one of a few ways, depending on the file system type.

Ones with standard file systems such as FAT32 will be handled by FLUORINE, presenting a foreign filesystem interface via a virtual entity under the NITROGEN node entity.

Ones found to contain a TUNGSTEN B-tree are not automatically merged into the storage pool, however; only devices listed in the node's configuration are merged in, as otherwise somebody who inserts a removable device could override arbitrary entity data within the system. The TUNGSTEN implementation could be able to support providing separate copy-on-write B-Trees on other mass storage devices, without impinging on the operation of the pooled one used to implement the node's entity storage, in order to support file transfer between nodes and backups; I have not thought sufficiently about the concepts to write about that yet, however.

Replication

TUNGSTEN data storage is mainly accessed via WOLFRAM, which uses TUNGSTEN storage on each node of the cluster to provide distributed, replicated, persistent storage across the cluster. However, NITROGEN reads data directly from the local TUNGSTEN copy of the node entity in order to read configuration information used to bootstrap the initial state required for WOLFRAM to run.