WOLFRAM is the kernel component in charge of linking the node with other nodes in the cluster.
The cluster entity
At the core of its responsibilities is the maintenance of the cluster entity. This entity's TUNGSTEN state is replicated to every TUNGSTEN node (through the TUNGSTEN replication mechanisms discussed below).
The cluster entity is created automatically when a cluster is first bootstrapped, and contains global configuration and management information used across the cluster.
The node list
A list of all the nodes in the cluster is kept. Each node has some metadata:
- Which volumes are mirrored there (see the section on distributed TUNGSTEN below)
- The ID of its NITROGEN node entity
- A name for administrative convenience
- Arbitrary metadata such as location and serial number
- A public key
- Its IP addresses, and which (if any) of them should be published in the cluster key for other clusters to use as the targets of MERCURY requests
- A priority for its election as a CAESIUM scheduler
Each node is also given a unique integer used to identify it compactly.
Some nodes might not have IP addresses listed - these are "roving nodes" (and as such can't have any IPs enabled for publishing via MERCURY). They will connect to the cluster from random places, and probably shouldn't be in any particularly special communication groups as such! When they connect, their transient IP is broadcast through the cluster by WOLFRAM and used temporarily.
Some facility for asking a node to relay for another one that's behind NAT will probably be a good idea at some point. A roving node behind NAT would need to find a node it can contact that has a stable IP, and ask it to relay for it. This fact would be published across the cluster via WOLFRAM, like a transient IP notification, and nodes wishing to contact that node could then ask the relay node to pass messages on.
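The per-node record described above might be sketched as follows. All field names here are illustrative assumptions, not part of the design; an empty IP list marks a roving node:

```python
from dataclasses import dataclass, field

@dataclass
class NodeRecord:
    """Illustrative sketch of a node's entry in the cluster node list."""
    node_id: int                  # compact unique integer identifier
    name: str                     # administrative name
    nitrogen_entity_id: str       # ID of the node's NITROGEN node entity
    public_key: bytes             # public half of the node keypair
    mirrored_volumes: list = field(default_factory=list)
    ip_addresses: list = field(default_factory=list)   # empty for roving nodes
    published_ips: list = field(default_factory=list)  # subset advertised for MERCURY
    caesium_priority: int = 0     # priority for CAESIUM scheduler election
    metadata: dict = field(default_factory=dict)       # location, serial number, ...

node = NodeRecord(node_id=7, name="rack3-node1", nitrogen_entity_id="E123",
                  public_key=b"...", ip_addresses=["10.0.3.1"])
assert node.published_ips == []   # nothing published via MERCURY by default
```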
Optionally, the administrator can configure "communication groups". These are sets of nodes that happen to have some particularly good network connection between them. Each node may be in zero or more groups.
Each group gets a name for administrative convenience, and a set of network parameters:
- Point-to-point bandwidth available between members of the group
- Total bandwidth available within the group
- Expected latency between members of the group
- Expected packet loss between members of the group
- A requested IRIDIUM keep-alive frequency
- A list of members, each with an optional group-specific network address and IRIDIUM listening port number.
- A list of subordinate groups, the members of which are automatically included in this group (unless already specified).
These parameters can be used to set initial estimates for IRIDIUM flow control, within the group. The optional group-specific network address is used to control routing of the underlying traffic to make sure the special group link is actually used, where necessary.
A special hardcoded "default group" automatically contains all nodes in the system and provides their default networking parameters.
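The group parameters above could be captured in a structure like this. Field names and the concrete default-group numbers are illustrative assumptions only:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GroupMember:
    node_id: int
    address: Optional[str] = None       # group-specific network address, if any
    iridium_port: Optional[int] = None  # group-specific IRIDIUM listening port

@dataclass
class CommGroup:
    """Illustrative sketch of a communication group's parameters."""
    name: str
    p2p_bandwidth_bps: int    # point-to-point bandwidth between members
    total_bandwidth_bps: int  # total bandwidth within the group
    latency_ms: float         # expected latency between members
    packet_loss: float        # expected loss fraction, 0.0-1.0
    keepalive_hz: float       # requested IRIDIUM keep-alive frequency
    members: list = field(default_factory=list)
    subordinate_groups: list = field(default_factory=list)

# The hardcoded default group contains every node; parameters here are made up.
default_group = CommGroup("default", 10_000_000, 100_000_000, 50.0, 0.01, 0.1)
```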
The security architecture
This area can only be modified by security administrators. It contains security configuration for the cluster, such as the list of secrecy levels and the list of codewords that can be used to make up security classifications and clearances, the classification of each volume and the clearance of each node. It also contains the external trust list, which consists of a list of other clusters, each of which has a clearance (how much we trust the cluster as a whole), a default clearance (how much we trust any entity from the cluster), and a list of clearances of particular entities. It contains a list of encryption algorithms and their clearances, and assigns a clearance to each of the WOLFRAM communication groups. See the security model page for details on how these things are used.
The list of volumes
Entities are grouped into volumes for management purposes. Each volume has some metadata: a name for administrative convenience, a replication factor (how many copies of each entity to maintain within the nodes mirroring the volume, or "infinity" to specify "all of them"), a list of which nodes mirror it, and a list of the entities within it.
Every TUNGSTEN node has an implicit volume which is mirrored purely on that node, and there is a cluster volume which is mirrored on every TUNGSTEN node. Non-TUNGSTEN nodes have no mass storage of their own, so cannot mirror entities; they provide their node entity through the special-case magic of NITROGEN.
The list of entities
The cluster entity contains a full roster of the entities existing within the cluster. Each entity in the list has the ID of the volume that contains it, the list of which actual nodes carry replicas of the entity, and scheduling parameters for its handlers when they run.
The list of distributed caches
Memcache-like distributed caches may be created in the cluster, each configured by nominating a number of member nodes to carry a fraction of the cache. Each node in the cache is assigned space quotas for persistent and in-memory storage.
Configuration for other kernel components
For instance, MERCURY stores the cluster's public key here, along with the cluster's master public key and the cluster ID version number.
CAESIUM will store the cluster's schedule here, too.
CARBON will store the namespace override map here, as well.
WOLFRAM is in charge of all communication within the cluster, over an IRIDIUM transport using a default WOLFRAM UDP port unless configured otherwise. It does this using the list of nodes and communication groups. It will attempt to use the "best" link between two nodes, according to the communication groups that the nodes share, but it will also be mindful of available bandwidth within those groups and will attempt to avoid over-committing.
It is responsible for encrypting and authenticating in-cluster communications, except when using links trusted sufficiently that this is deemed unnecessary. This encryption and authentication is performed using the nodes' public keys stored in the node list; a key exchange is performed with the destination node to obtain a session key, and the two nodes authenticate to each other using their node keypairs.
The underlying IRIDIUM link is used by WOLFRAM itself, and also made available to MERCURY for in-cluster communications.
Distributed transactions are available across the cluster. WOLFRAM provides the ability to create, commit, and abort transactions. I need to re-read about Paxos to find out if we can use it, as it should scale better than two-phase commit.
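As a baseline, the two-phase commit the text mentions can be sketched like this (the participant interface is an illustrative assumption; a Paxos-style protocol would replace this coordinator if adopted):

```python
def two_phase_commit(participants):
    """Minimal sketch of two-phase commit: the coordinator commits only
    if every participant votes yes in the prepare phase; otherwise it
    aborts everyone who had already voted yes."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            for q in prepared:
                q.abort()       # roll back everyone who voted yes
            return False
    for p in participants:
        p.commit()
    return True
```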
Integral to the distributed transaction system is the provision of Lamport timestamps, for assigning a causal order to events and for creating cluster-global unique IDs; in both cases the node ID is appended to the Lamport timestamp to disambiguate simultaneous events.
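The Lamport-timestamp-plus-node-ID scheme is a standard construction; a minimal sketch (class and method names are illustrative):

```python
class LamportClock:
    """Lamport timestamp with the node ID appended as a tie-breaker,
    giving both a causal order and cluster-global unique IDs."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event: returns a unique (counter, node_id) stamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_stamp):
        """Merge a stamp received from another node, then tick."""
        self.counter = max(self.counter, remote_stamp[0])
        return self.tick()

a, b = LamportClock(1), LamportClock(2)
s1 = a.tick()          # (1, 1)
s2 = b.observe(s1)     # (2, 2) -- causally after s1
assert s1 < s2         # tuple comparison gives the total order
assert a.tick() != s2  # simultaneous events still differ, via the node ID
```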
WOLFRAM provides distributed access to the local TUNGSTEN storage on each node. By referring to the entity, volume, and node lists, it can obtain the list of nodes which carry replicas of any given entity, and cross-reference that with node liveness information from the in-cluster IRIDIUM links to know which nodes would be good candidates to access entity state from.
Lamport timestamps are used to track the last modification to any given replicated TUNGSTEN tuple, disambiguated by the modifying node index, in the special area for WOLFRAM metadata reserved by TUNGSTEN.
Updates are specified as tuple assertions; such representation considerations as compound objects are handled purely locally on the node.
WOLFRAM replicates updates to all the TUNGSTEN replicas of that entity upon transaction commit, skipping replicas that are currently unreachable. It keeps a log of update transactions that have not been fully replicated and, upon seeing a missing node again, replays them.
Also, a transaction may be flagged to "commit asynchronously", in which case it is simply replicated to every node without a distributed commit, meaning that it may appear on different nodes at different points in time.
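The commit-time fan-out with a replay log might look like the following sketch. The function names, and signalling unreachability with ConnectionError, are illustrative assumptions:

```python
def replicate(update, replicas, send, log):
    """Sketch: push an update to every reachable replica at commit time,
    logging it for any replica that is currently unreachable."""
    for node in replicas:
        try:
            send(node, update)
        except ConnectionError:
            log.setdefault(node, []).append(update)

def on_node_reconnect(node, send, log):
    """Replay, in order, any updates the node missed while it was away."""
    for update in log.pop(node, []):
        send(node, update)
```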
The distributed storage system cooperates closely with NITROGEN to manage the overall state of the node.
Distributed data model
The use of distributed transactions means that, in any given group of currently-interconnected nodes, the shared state of a replicated entity will appear strongly consistent (except when asynchronous commits are used). However, the presence of link failures may cause transactions to arrive asynchronously even when synchronous commits were requested, so this can never be relied upon.
CARBON updates always take the form of a new tuple to add. In effect, the TUNGSTEN storage in a section only ever grows, adding new tuples, with one exception: CARBON allows "negative tuples" of the form [c:not tuple], which cancel out corresponding "positive tuples", and likewise "negative rules" of the form [c:not [c:rule tuple]].
But how do we define "cancel out" in a distributed system with weak timing guarantees? That depends on a notion of what came before. Thankfully, the Lamport modification timestamps attached to tuples give us exactly that: a global ordering on events that is consistent from all frames of reference.
As such, cancellation updates need to be stored in the local TUNGSTEN storage, so that they can correctly cancel out updates that arrive later but have an earlier timestamp; however, they can be marked as temporary records in TUNGSTEN so that they disappear after a reasonable amount of time chosen to be sufficient for all disconnected nodes to have had a chance to reconnect.
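A sketch of timestamp-ordered cancellation, under the assumption that stamps are Lamport (counter, node_id) pairs as above (the dict-based storage and function name are illustrative, not the actual TUNGSTEN representation):

```python
def apply_update(store, tombstones, tup, stamp, negative=False):
    """store maps a tuple to the stamp of its surviving assertion;
    tombstones maps a tuple to the stamp of its latest [c:not ...]
    cancellation, kept (temporarily) to catch late-arriving updates
    that carry an earlier timestamp."""
    if negative:
        if stamp > tombstones.get(tup, (0, 0)):
            tombstones[tup] = stamp
        if tup in store and store[tup] < stamp:
            del store[tup]          # cancel the earlier positive assertion
    else:
        if stamp > tombstones.get(tup, (0, 0)) and stamp > store.get(tup, (0, 0)):
            store[tup] = stamp      # survives: no later cancellation known

store, tombs = {}, {}
apply_update(store, tombs, ("temp", 20), (1, 1))                  # assert tuple
apply_update(store, tombs, ("temp", 20), (2, 2), negative=True)   # cancel it
apply_update(store, tombs, ("temp", 20), (1, 3))  # arrives late, earlier stamp
assert ("temp", 20) not in store   # the stored cancellation still applies
```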
Compound objects are handled below the level of individual updates, and so need to maintain their own update timestamps for their component parts, and will generally have their own "override rules" for contradicting tuples. For instance, an array with individually updatable elements will need to store the update timestamp of every element of the array, so that only the "latest" update asserting the content of an element is kept.
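The per-element-timestamp idea can be sketched as a last-writer-wins array (the class and its "override rule" here are illustrative, not the actual compound-object machinery):

```python
class LwwArray:
    """Sketch of a compound array object that keeps a per-element update
    timestamp, so only the latest assertion of each element's content
    is kept, regardless of arrival order."""
    def __init__(self, size):
        self.values = [None] * size
        self.stamps = [(0, 0)] * size   # Lamport (counter, node_id) per element

    def update(self, index, value, stamp):
        if stamp > self.stamps[index]:  # the later stamp overrides
            self.values[index] = value
            self.stamps[index] = stamp

arr = LwwArray(4)
arr.update(0, "new", (5, 2))
arr.update(0, "old", (3, 1))   # arrives late, but loses: earlier stamp
assert arr.values[0] == "new"
```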
Applications need to be written using suitable idioms to work well with the distributed state model. For instance, storing a counter in a tuple and emitting update transactions of the form ([c:not [counter-value 0]] [counter-value 1]) would be a bad idea, as two concurrent updates to the same counter would produce exactly the same update transaction, resulting in the counter ending up as 1 rather than 2. Instead, it would be better to log individual events as tuples of their own, and then add them up to find out how many have happened. Of course, that wouldn't scale well if done with plain tuples; but compound objects are capable of handling such an interface with an efficient implementation "under the hood".
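The contrast can be made concrete with a sketch (plain Python sets standing in for tuple storage; the event-tuple shape is an illustrative assumption):

```python
# Bad idiom: two nodes concurrently emit the identical update
# ([c:not [counter-value 0]] [counter-value 1]); because replicated
# state is a set of tuples, the two transactions collapse into one
# and the counter converges to 1 rather than 2.

# Better idiom: log each event as its own tuple and sum them.
events = set()

def record_event(node_id, stamp):
    # the (stamp, node_id) pair is cluster-globally unique, so
    # concurrent events on different nodes are distinct tuples
    events.add(("event", stamp, node_id))

record_event(1, 5)
record_event(2, 5)        # concurrent event on another node
assert len(events) == 2   # both survive; the "counter" reads 2
```

A compound object would implement this same interface with an efficient representation under the hood, rather than storing one plain tuple per event.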
WOLFRAM is also responsible for ensuring that each entity is sufficiently replicated within its volume, and that entities are distributed evenly between the nodes mirroring a volume.
When an entity is created, WOLFRAM chooses enough nodes in the volume containing the entity to meet the replication factor, prioritising nodes with fewer bytes of entities (from that volume) currently mirrored on them.
When an entity is deleted, if the nodes that mirrored it are now significantly underloaded compared to the others in terms of bytes of entities in that volume, then some entities may be relocated from more-loaded nodes.
When a node is added, it will shift some entity replicas onto that node to reduce the load on the other nodes.
When a node is removed, any entities that are now replicated to fewer nodes than the volume's replication factor will be copied to additional nodes within the volume, to restore the replication factor.
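The placement and repair rules above can be sketched as follows, assuming per-node byte counts for the volume are at hand (function names are illustrative):

```python
def place_replicas(volume_nodes, bytes_used, replication_factor):
    """Pick replica nodes for a new entity, preferring the least-loaded
    nodes by bytes of this volume's entities currently mirrored on them."""
    ranked = sorted(volume_nodes, key=lambda n: bytes_used.get(n, 0))
    return ranked[:replication_factor]

def repair_after_removal(entity_replicas, removed, volume_nodes,
                         bytes_used, replication_factor):
    """After a node is removed, top the entity back up to the volume's
    replication factor using the least-loaded remaining nodes."""
    replicas = [n for n in entity_replicas if n != removed]
    candidates = [n for n in volume_nodes
                  if n not in replicas and n != removed]
    while len(replicas) < replication_factor and candidates:
        best = min(candidates, key=lambda n: bytes_used.get(n, 0))
        candidates.remove(best)
        replicas.append(best)
    return replicas

assert place_replicas(["a", "b", "c"], {"a": 900, "b": 100, "c": 500}, 2) == ["b", "c"]
```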
Relocating or copying an entity mirror is done by listing the entity as being mirrored on the target node, but with a marker stating that it is being transferred there, so that it should receive updates but not be queried there; this ensures that no updates are missed during the migration process. When it is complete it is upgraded to a full replica, and removed from the source node (unless this is a copy operation). If any other nodes are accessing that entity from the source node, then the node will return an error indicating that the entity is not available on that node, and the other nodes will re-locate it automatically.
Cluster reconfigurations are requested through administrative interfaces on the cluster entity itself. When a new node is added, a bootstrap configuration object is generated containing cluster keys and the node's ID number, which must be given to the new node in order to bootstrap itself into the cluster.
An entity is allowed to delete itself. The deletion is replicated, and its TUNGSTEN state is freed up.
Entities are also allowed to create new entities, within the same volume. The initial TUNGSTEN state of the new entity has to be provided as a CARBON knowledge base per section. The new entity's EID is returned.
Entities are organised into volumes, and each TUNGSTEN node mirrors a specified set of volumes. There's a special node-volume for each node that is only mirrored on it, and there's a special cluster volume that is mirrored to every TUNGSTEN node (and contains the cluster entity). Other volumes can also be created manually through an interface on the cluster entity, and mapped to chosen nodes.
Each volume contains a "volume entity", which presents its administrative interface. Deleting the volume entity automatically deletes the volume. The volume entity, by being primordially created in the volume, is initially the only entity capable of creating entities in that volume, as entities may only directly create other entities in the same volume as themselves. It therefore presents the IODINE storage container interface via MERCURY, to allow its owner to create entities within the volume.
WOLFRAM provides distributed processing facilities to entity code. An entity may request that a specified LITHIUM handler and arguments be used as a distributed job generator. On a chosen set of nodes (FIXME: Should it be all nodes trusted to handle the entity's classification? Just nodes mirroring the entity? Something in between? As the cluster is a global resource, some form of global management may be needed. Perhaps entities should have a processor quota and priority that can be overridden by administrators. Think about how to choose), WOLFRAM will install a HELIUM job generator callback at a specified priority level. This callback invokes the specified LITHIUM handler in the entity to obtain the ID of a second LITHIUM handler, and arguments, representing the actual task to perform. This continues until the original handler stops providing jobs; when all of the jobs have completed, the distributed job as a whole is complete. The job generator system basically implements a lazy list of jobs to do, allowing it to be computed on the fly.
For parallelisable tasks that aren't generated in bulk, it will also be possible to submit a single job to be run, as a LITHIUM handler ID and arguments, which may be distributed to a lightly-loaded node in the cluster if the local node is busy.
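The lazy job-generator pattern can be sketched with threads standing in for cluster nodes running HELIUM callbacks (everything here is illustrative; the real system would distribute polling across nodes rather than threads):

```python
import threading

def run_distributed_job(generate_next_job, workers=4):
    """Poll the generator callback for (handler, args) jobs until it
    returns None; the distributed job as a whole is complete when every
    produced job has finished."""
    lock = threading.Lock()
    results = []

    def worker():
        while True:
            with lock:                   # one poll of the generator at a time
                job = generate_next_job()
            if job is None:
                return                   # generator exhausted: stop this worker
            handler, args = job
            results.append(handler(*args))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

jobs = iter([(pow, (2, 10)), (pow, (3, 3))])
out = run_distributed_job(lambda: next(jobs, None))
assert sorted(out) == [27, 1024]
```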
FIXME: Is it worth having a lighter-weight job generator interface where you provide a local next-job-please closure, and remote nodes call back via WOLFRAM to ask for new jobs? Actually distributing job generation (causing the job generator state to be synchronised at the TUNGSTEN level) might be too heavyweight.
WOLFRAM also provides a distributed cache mechanism.
There are several caches in the system: one cache created implicitly per node (stored only on that node), a global cache that covers all nodes, and any caches the administrator creates in the cluster configuration, each covering a specified set of nodes.
Cached objects have the usual TUNGSTEN temporary storage metadata: an expiry time and a drop priority for automatic removal when space is short. In addition, if the drop priority is not specified (meaning "do not automatically delete"), the cache item can be given a replication factor, indicating how many nodes in the cache it should be replicated to for safety. Objects are allocated to persistent or transient storage depending on whether they have a drop priority, and on whether storage of the preferred type is available.
The cached objects on each node are stored in a combination of TUNGSTEN (if the node has mass storage), using global temporary storage records, and in-memory storage; the storage tag specifies the cache identifier (to keep them disjoint) and the cache tag of the object.
Each node within a given cache will have its own cache limits for persistent (TUNGSTEN) and transient (RAM) storage; the allocation of cached objects uses these as weightings.
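The design doesn't specify how the quota weightings drive allocation; weighted rendezvous hashing is one plausible mechanism, sketched here purely as an illustration (function name and quota shape are assumptions):

```python
import hashlib
import math

def cache_home_nodes(cache_tag, member_quotas, replication_factor=1):
    """Sketch: weighted rendezvous hashing to pick which members of a
    distributed cache hold an object, using each node's storage quota
    as its weight, so better-provisioned nodes attract more objects."""
    def score(node):
        h = hashlib.sha256(f"{cache_tag}:{node}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 0.5) / 2**64  # uniform in (0, 1)
        return -member_quotas[node] / math.log(u)         # classic weighted score
    ranked = sorted(member_quotas, key=score, reverse=True)
    return ranked[:replication_factor]

# Placement is deterministic per tag, so any node can compute it locally.
homes = cache_home_nodes("user:42", {"n1": 100, "n2": 100, "n3": 400}, 2)
assert len(homes) == 2 and len(set(homes)) == 2
```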