
NITROGEN is a kernel component that runs on every ARGON node and implements the special "node entity" representing the node itself. Unlike normal entities, the node entity has no storage in TUNGSTEN, even though it is used to store the node's configuration. LITHIUM is never called upon to handle MERCURY or CARBON requests for the entity, because the NITROGEN kernel component handles those requests directly.

As such, it has direct access to the hardware abstraction layer of HYDROGEN; the scheduling parameters and status reporting from HELIUM and LITHIUM; the network stacks of IRIDIUM, WOLFRAM, MERCURY and FLUORINE; and the storage management of TUNGSTEN and WOLFRAM.

It is able to present information about the running state of the node to interested parties, and to allow configuration of the kernel components that comprise the node. In particular, it drives the facility in HYDROGEN to update the bootstrap code for the node (e.g., to roll out new kernels) and to update the bootstrap configuration (which is used to configure core device drivers, and by TUNGSTEN to find its persistent storage devices and get them up and running). All of the node's configuration is stored in the special bootstrap configuration area provided by HYDROGEN. If the node has storage, then the corresponding node volume will exist and be used to store any other entities created in the node's volume; if the node has no local storage, any attempt to create entities in the node's volume will fail.

Other than the obvious configuration and tuning, one responsibility of note is to start up any device drivers or real-time tasks that need to be started manually on the node because they can't be auto-started through detecting the presence of the device. This might be the case because there isn't a device, as in the case of real-time tasks rather than device drivers, or for devices that just can't be automatically detected. These are blocks of CHROME code that are installed as kernel components so they have full access to the low-level infrastructure of the node.

Another responsibility of NITROGEN is to monitor the cluster configuration. Just like node entities, every cluster has a cluster entity (and a cluster volume) that automatically exists to manage it. The cluster entity is also stored in the HYDROGEN bootstrap configuration area.

Oh, by the way, any cryptographic keys in the cluster or node configuration stored by NITROGEN get put in HYDROGEN's secure key store rather than just dumped into the generic bootstrap configuration area, obviously.

OPEN QUESTION: Should NITROGEN handle MERCURY requests for the cluster entity and cluster security entity, and the special volume admin interface for volume entities? This would make it "official" that these special system entities only store state in TUNGSTEN and never trigger LITHIUM to invoke CHROME code loaded from CARBON - which would have a few advantages:

Operational State Machine

NITROGEN tracks each node's passage through the states of a state machine. The states are:

Non-running states

In the OFF, ADMIN and WIPING states, the system is in low-level administrative mode and is not considered to be "running". Kernel components above the level of HYDROGEN, HELIUM, IRON, CHROME and NITROGEN are not running, and HELIUM's threading is disabled. Only one CPU is active.

OFF

Either power is off, or the node is in the early stages of booting. This state can be entered manually from any other state by removing power from the node, by requesting that the system remove power from itself where the hardware permits, or by requesting a reboot. When power is applied to the node, it will attempt to boot. It goes to ADMIN if there's a configuration, installation, or hardware error, or if a button is pressed or a configuration space directive is set to request administration mode upon boot; otherwise it proceeds to ISOLATED STANDBY, ISOLATED RUNNING, RECOVERING STANDBY, or RECOVERING RUNNING based upon a setting from the configuration space. Nodes that lack the hardware to power themselves off will go to the ADMIN state if a software transition to OFF is requested.
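The decisions made on the way out of OFF can be sketched as a pair of small functions. This is only an illustrative Python sketch: the function names and parameters are hypothetical, and treating an unrecognised configured target as a configuration error is an assumption rather than something the design states.

```python
# Hypothetical sketch of the decisions made when leaving or requesting OFF.
# All names here are illustrative; none come from the NITROGEN source.

# States the configuration space may select as the normal boot target.
BOOT_TARGETS = {
    "ISOLATED STANDBY",
    "ISOLATED RUNNING",
    "RECOVERING STANDBY",
    "RECOVERING RUNNING",
}


def state_after_boot(boot_error: bool, admin_requested: bool, configured_target: str) -> str:
    """Return the state a node enters once it has booted out of OFF."""
    if boot_error or admin_requested:
        return "ADMIN"
    if configured_target not in BOOT_TARGETS:
        # Assumption: an unusable target counts as a configuration error.
        return "ADMIN"
    return configured_target


def state_after_off_request(can_power_off: bool) -> str:
    """Nodes that cannot remove their own power fall back to ADMIN."""
    return "OFF" if can_power_off else "ADMIN"
```

For example, a node configured for RECOVERING RUNNING that boots with the administration button held down would end up in ADMIN rather than starting recovery.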

ADMIN

A manual configuration state entered if there's a problem booting, or upon a manual request to interrupt automatic booting, or entered manually from any other system state, or automatically upon critical system failure from any other system state. Only one CPU is active, with no threading. The CPU is dedicated to providing the administrative console interface. The console should show the reason for entering the state, along with options to repair the problem, alter configuration, and so on. At the administrator's command, the node can continue on to ISOLATED STANDBY, ISOLATED RUNNING, RECOVERING STANDBY, RECOVERING RUNNING or WIPING, or switch to the OFF state.

WIPING

This state is used to decommission the node. Only one CPU is active, with no threading, and that one CPU proceeds to request a secure wipe of the key storage area from HYDROGEN, a wipe of the rest of the configuration and bootstrap space, then a fast wipe of all attached TUNGSTEN storage volumes, then a secure wipe of all attached TUNGSTEN storage volumes. If all that completes, an automatic transition to OFF occurs (or ADMIN if the hardware does not allow a software OFF). The console can be used to manually abort the wipe, either by entering OFF or dropping into ADMIN mode. A post-WIPING node will probably not be able to escape the OFF state in future due to the destruction of bootstrap code and configuration, and the lack of cryptographic data in the key storage area; it might at best make it into ADMIN, but a reinstallation will probably be required to get it booting again.
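The ordering of the wipe matters: the key store goes first, so that an interrupted wipe still leaves the node without its cryptographic material. A minimal Python sketch of the sequence follows. The callbacks `perform` and `abort_target` are hypothetical stand-ins for the HYDROGEN and TUNGSTEN wipe requests and for the console, and polling for an abort only between phases is a simplifying assumption.

```python
from typing import Callable, Optional

# The four wipe phases, in the order the text gives them.
WIPE_PHASES = (
    "secure wipe of HYDROGEN key storage",
    "wipe of configuration and bootstrap space",
    "fast wipe of TUNGSTEN volumes",
    "secure wipe of TUNGSTEN volumes",
)


def run_wipe(
    perform: Callable[[str], None],
    abort_target: Callable[[], Optional[str]],
    can_power_off: bool,
) -> str:
    """Run the phases in order and return the state to enter afterwards.

    `perform` carries out one phase; `abort_target` is polled before each
    phase and returns "OFF" or "ADMIN" if the console asked to abort.
    """
    for phase in WIPE_PHASES:
        target = abort_target()
        if target is not None:
            return target
        perform(phase)
    # Completed: OFF where the hardware allows a software OFF, else ADMIN.
    return "OFF" if can_power_off else "ADMIN"
```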

ISOLATED states

In these states, the node is booted up and connected to the cluster, but any TUNGSTEN local storage on the node is disabled. Nodes without TUNGSTEN storage only have the ISOLATED states as their "running states"; RECOVERING and SYNCHRONISED are not applicable to them. Real-time tasks and device drivers are running in these states, but may be operating in a degraded mode of some kind outside of ISOLATED RUNNING, due to LITHIUM being inactive.

ISOLATED STANDBY

The node remains idle until administratively told to do otherwise. From this state it can be told to switch OFF or go into ADMIN, or to go to ISOLATED RUNNING to start LITHIUM, or into RECOVERING STANDBY to start recovery, or into RECOVERING RUNNING to start both, or into WIPING to erase the node.

ISOLATED RUNNING

The node is accepting requests for LITHIUM from whatever kernel components feel like generating them (MERCURY, CARBON, CAESIUM, etc). The local TUNGSTEN store (if any) is not being kept up to date by WOLFRAM, so any access to entity data has to be obtained from other nodes via WOLFRAM. From this state, it can go to OFF or ADMIN (for a hard shutdown), to WIPING (for a hard wipe), to RECOVERING RUNNING to start recovery, to RECOVERING STANDBY (starting recovery but doing a hard stop of LITHIUM), to ISOLATED STANDBY (for a hard stop) or to ISOLATED STOPPING (in which case a desired target state must be chosen).

ISOLATED STOPPING

This state is used to leave the ISOLATED RUNNING state cleanly. Unlike the direct transitions to OFF, ADMIN, ISOLATED STANDBY, RECOVERING STANDBY or WIPING, which terminate all currently running LITHIUM handlers immediately, the ISOLATED STOPPING state disables starting new LITHIUM handlers but waits until all existing ones have stopped normally, before transitioning to OFF, ADMIN, ISOLATED STANDBY, RECOVERING STANDBY or WIPING. However, the stop process can be manually cancelled by an immediate transition to one of those target states (terminating all currently running LITHIUM handlers), or to ISOLATED RUNNING or RECOVERING RUNNING to abort the shutdown and resume LITHIUM handling, optionally starting recovery at the same time.
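The difference between a soft stop and a hard transition can be shown with a toy model of the handler bookkeeping. This is a sketch under assumptions, not NITROGEN's implementation: the class, its methods and the use of string handler ids are all hypothetical.

```python
class IsolatedNode:
    """Toy model of the ISOLATED RUNNING / ISOLATED STOPPING handler rules."""

    STOP_TARGETS = {"OFF", "ADMIN", "ISOLATED STANDBY", "RECOVERING STANDBY", "WIPING"}
    RESUME_TARGETS = {"ISOLATED RUNNING", "RECOVERING RUNNING"}

    def __init__(self):
        self.state = "ISOLATED RUNNING"
        self.handlers = set()  # ids of currently running LITHIUM handlers
        self.target = None     # chosen target while in ISOLATED STOPPING

    def start_handler(self, handler_id):
        """New handlers are only accepted while in ISOLATED RUNNING."""
        if self.state != "ISOLATED RUNNING":
            return False
        self.handlers.add(handler_id)
        return True

    def finish_handler(self, handler_id):
        """A handler stopped normally; complete the soft stop if it was the last."""
        self.handlers.discard(handler_id)
        self._complete_if_drained()

    def begin_soft_stop(self, target):
        if self.state != "ISOLATED RUNNING" or target not in self.STOP_TARGETS:
            raise ValueError("soft stop not available")
        self.state, self.target = "ISOLATED STOPPING", target
        self._complete_if_drained()

    def force(self, state):
        """Immediate manual transition; returns the handlers that were killed."""
        killed = set()
        if state not in self.RESUME_TARGETS:
            killed, self.handlers = self.handlers, set()
        self.state, self.target = state, None
        return killed

    def _complete_if_drained(self):
        if self.state == "ISOLATED STOPPING" and not self.handlers:
            self.state, self.target = self.target, None
```

A soft stop towards ISOLATED STANDBY with one handler still running leaves the node in ISOLATED STOPPING until `finish_handler` is called for it, whereas `force("ADMIN")` returns that handler in the killed set straight away.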

RECOVERING states

In all of these states, WOLFRAM is attempting to bring the local TUNGSTEN storage up to date with the cluster. These states may only be entered by nodes with TUNGSTEN storage attached. Communication failures with the rest of the cluster that prohibit recovery will result in the node remaining in the same state, retrying, rather than aborting to an ISOLATED state. Successful completion of recovery will cause an automatic transition to a corresponding SYNCHRONISED state. Real-time tasks and device drivers are running in these states, but may be operating in a degraded mode of some kind outside of RECOVERING RUNNING, due to LITHIUM being inactive.

RECOVERING STANDBY

Recovery is occurring without LITHIUM handlers being invoked. When the local storage is up to date, an automatic transition occurs to SYNCHRONISED STANDBY. However, the recovery can be aborted by a manual transition to ISOLATED STANDBY, OFF, ADMIN or WIPING; or aborted while turning LITHIUM on with a manual transition to ISOLATED RUNNING.

RECOVERING RUNNING

Recovery is occurring while LITHIUM is enabled. As the local TUNGSTEN storage is not synchronised, access to entity state must be from other nodes via WOLFRAM, except where it can be proved that the local state required is up to date already. There can be manual transitions to RECOVERING STANDBY (to keep recovering but to do a hard stop of LITHIUM handlers), OFF or ADMIN (for a hard shutdown), WIPING (for a hard wipe), ISOLATED STANDBY (for a hard stop of LITHIUM and to abort recovery), ISOLATED RUNNING (to just abort recovery while keeping LITHIUM running), or to RECOVERING STOPPING (for a soft stop of LITHIUM, and then a transition to a chosen target state). If recovery completes, there is an automatic transition to SYNCHRONISED RUNNING.

RECOVERING STOPPING

This is used to offer an orderly stop of LITHIUM from RECOVERING RUNNING. LITHIUM does not accept new tasks, but existing handlers are allowed to complete. When they are all stopped, an automatic transition to a chosen target state is performed, or it can be performed manually to abort the clean stop (killing all pending LITHIUM handlers if the transition is not to a RUNNING state). The valid target states are OFF, ADMIN, WIPING, ISOLATED STANDBY, ISOLATED RUNNING, RECOVERING STANDBY or RECOVERING RUNNING. If recovery completes while in RECOVERING STOPPING, then an automatic transition to SYNCHRONISED STOPPING occurs.

SYNCHRONISED states

These states can only be entered if WOLFRAM is satisfied that the TUNGSTEN local storage is up to date through completing a RECOVERING state. It can only be maintained while connectivity to the cluster lets WOLFRAM be sure that the local TUNGSTEN storage is being kept synchronised; in the event of failure, an automatic transition to a corresponding RECOVERING state will occur. Real-time tasks and device drivers are running in these states, but may be operating in a degraded mode of some kind outside of SYNCHRONISED RUNNING, due to LITHIUM being inactive.

SYNCHRONISED STANDBY

LITHIUM is not configured to start handlers in this state. Manual transitions to SYNCHRONISED RUNNING, OFF, ADMIN, WIPING, ISOLATED STANDBY or ISOLATED RUNNING are available; all but SYNCHRONISED RUNNING will abandon the synchronisation, requiring recovery to get it back. A synchronisation failure will automatically transition the node to RECOVERING STANDBY.

SYNCHRONISED RUNNING

LITHIUM is configured to start handlers. Manual hard transitions to SYNCHRONISED STANDBY, OFF, ADMIN, WIPING or ISOLATED STANDBY are available, which will kill current LITHIUM handlers in progress. Synchronisation can be stopped without stopping LITHIUM by a manual transition to ISOLATED RUNNING. Soft transitions are available by going to the SYNCHRONISED STOPPING state then on to a chosen target state. A synchronisation failure will automatically transition the node to RECOVERING RUNNING.

SYNCHRONISED STOPPING

This is used for a soft stop from SYNCHRONISED RUNNING. As usual, new LITHIUM handlers are not started, but existing ones are allowed to run to completion, then an automatic transition to SYNCHRONISED STANDBY, OFF, ADMIN, WIPING or ISOLATED STANDBY occurs. Alternatively, a manual transition to any of those states, or back to SYNCHRONISED RUNNING or ISOLATED RUNNING, may be triggered to cancel the clean shutdown. A synchronisation failure will cause an automatic transition to RECOVERING STOPPING.
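The automatic transitions between corresponding RECOVERING and SYNCHRONISED states keep the STANDBY, RUNNING or STOPPING part of the state unchanged. A minimal Python sketch of that pairing follows; the function names are hypothetical and are not taken from NITROGEN.

```python
# Hypothetical helpers; the names are illustrative, not NITROGEN's API.
MODES = ("STANDBY", "RUNNING", "STOPPING")


def _swap_group(state: str, old: str, new: str) -> str:
    """Replace the group part of a state name, keeping its mode."""
    group, _, mode = state.partition(" ")
    if group == old and mode in MODES:
        return f"{new} {mode}"
    return state  # the event does not apply to this state; leave it unchanged


def on_recovery_complete(state: str) -> str:
    """WOLFRAM has brought the local TUNGSTEN storage up to date."""
    return _swap_group(state, "RECOVERING", "SYNCHRONISED")


def on_sync_failure(state: str) -> str:
    """WOLFRAM can no longer keep the local TUNGSTEN storage synchronised."""
    return _swap_group(state, "SYNCHRONISED", "RECOVERING")
```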

A table of valid state transitions

Key: - = no transition (we're already in that state), A = automatic transition will occur when required, M = manual transition is available, M* = manual transition is available but will terminate any currently running LITHIUM handlers, AM = automatic transition will occur when required, or can be manually triggered, AM* = an automatic transition will occur when all currently running LITHIUM handlers have terminated, or a manual transition is available but will terminate any currently running LITHIUM handlers.

This table does not show the fact that the system may automatically go into the ADMIN state from any other state in the event of a system failure, as it's implicit and just made the table look a bit messier. I left it out as it's an exceptional case.

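States are abbreviated: O = OFF, A = ADMIN, W = WIPING; I, R and S stand for ISOLATED, RECOVERING and SYNCHRONISED, followed by S (STANDBY), R (RUNNING) or X (STOPPING). A dot (.) marks a pair of states with no direct transition.

From \ To   O    A    IS   IR   IX   RS   RR   RX   SS   SR   SX   W    Notes
O           -    M    A    A    .    A    A    .    .    .    .    .
A           M    -    M    M    .    M    M    .    .    .    .    M    Administrative console is open.
IS          M    M    -    M    .    M    M    .    .    .    .    M
IR          M*   M*   M*   -    M    M*   M    .    .    .    .    M*   LITHIUM is up.
IX          AM*  AM*  AM*  M    -    AM*  M    .    .    .    .    AM*  LITHIUM is cleanly stopping.
RS          M    M    M    M    .    -    M    .    A    .    .    M    Recovery is in progress.
RR          M*   M*   M*   M    .    M*   -    M    .    A    .    M*   Recovery is in progress, LITHIUM is running.
RX          AM*  AM*  AM*  M    .    AM*  M    -    .    .    A    AM*  Recovery is in progress, LITHIUM is cleanly stopping.
SS          M    M    M    M    .    A    .    .    -    M    .    M    Synchronised.
SR          M*   M*   M*   M    .    .    A    .    M*   -    M    M*   Synchronised, LITHIUM is running.
SX          AM*  AM*  AM*  M    .    .    .    A    AM*  M    -    AM*  Synchronised, LITHIUM is cleanly stopping.
W           AM   AM   .    .    .    .    .    .    .    .    .    -    Secure erasure is in progress.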
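The transition table lends itself to being encoded as data, so that a requested transition can be checked mechanically. The Python sketch below is one possible encoding, not NITROGEN's actual representation. The helper names are hypothetical, "." is used here for pairs with no direct transition, and the placement of those empty cells follows the prose descriptions of each state. As in the table, the implicit automatic drop to ADMIN on system failure is left out.

```python
# Column order of the table: O = OFF, A = ADMIN, W = WIPING; I/R/S are the
# ISOLATED, RECOVERING and SYNCHRONISED groups with S/R/X for
# STANDBY/RUNNING/STOPPING.
STATES = ("O", "A", "IS", "IR", "IX", "RS", "RR", "RX", "SS", "SR", "SX", "W")

# One row per source state, one cell per destination in STATES order.
ROWS = {
    "O":  "-   M   A   A   .   A   A   .   .   .   .   .",
    "A":  "M   -   M   M   .   M   M   .   .   .   .   M",
    "IS": "M   M   -   M   .   M   M   .   .   .   .   M",
    "IR": "M*  M*  M*  -   M   M*  M   .   .   .   .   M*",
    "IX": "AM* AM* AM* M   -   AM* M   .   .   .   .   AM*",
    "RS": "M   M   M   M   .   -   M   .   A   .   .   M",
    "RR": "M*  M*  M*  M   .   M*  -   M   .   A   .   M*",
    "RX": "AM* AM* AM* M   .   AM* M   -   .   .   A   AM*",
    "SS": "M   M   M   M   .   A   .   .   -   M   .   M",
    "SR": "M*  M*  M*  M   .   .   A   .   M*  -   M   M*",
    "SX": "AM* AM* AM* M   .   .   .   A   AM* M   -   AM*",
    "W":  "AM  AM  .   .   .   .   .   .   .   .   .   -",
}

# TABLE[src][dst] is the key code for that transition.
TABLE = {src: dict(zip(STATES, row.split())) for src, row in ROWS.items()}


def manual_allowed(src: str, dst: str) -> bool:
    """True for M, M*, AM and AM* cells."""
    return "M" in TABLE[src][dst]


def automatic_possible(src: str, dst: str) -> bool:
    """True for A, AM and AM* cells."""
    return "A" in TABLE[src][dst]


def manual_kills_handlers(src: str, dst: str) -> bool:
    """True when taking the transition manually terminates running handlers."""
    return TABLE[src][dst].endswith("*")
```

For instance, `manual_allowed("IR", "IX")` is true and `manual_kills_handlers("IR", "IX")` is false, which matches the soft stop out of ISOLATED RUNNING.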

State transitions and WOLFRAM

Note that there are no state transitions triggered by "gaining or losing a connection to the WOLFRAM cluster".

The closest are the automatic transitions between corresponding SYNCHRONISED and RECOVERING states caused by the local TUNGSTEN store gaining or losing synchronisation with the cluster - and loss of synchronisation will usually be caused by failure to be able to communicate with the cluster (the other option being failure of TUNGSTEN storage so we can't write to it). And we will never be able to transition from RECOVERING to SYNCHRONISED if we can't connect to the cluster to gain synchronisation - with the trivial exception being where we are the only member of the cluster, in which case recovery completes immediately.

However, we can never really say for sure "if the cluster is reachable". All we can say is that a particular attempt to communicate with it has failed or not. Failure in synchronisation causes a transition to a RECOVERING state, but that's the only network failure transition in our state diagram.

Notably, failure to contact other nodes to obtain access to entity state in the RUNNING states will simply cause that LITHIUM handler to fail with an exception, aborting it, and will not trigger any node state transition.
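The last point can be made concrete with a small Python sketch in which a failed fetch of entity state aborts only the handler. The exception and class names are hypothetical, chosen for illustration.

```python
class EntityStateUnavailable(Exception):
    """Hypothetical error: entity state could not be fetched from other nodes."""


class Node:
    def __init__(self, state: str):
        self.state = state

    def run_handler(self, handler) -> bool:
        """Run one LITHIUM handler; return True on success, False if aborted."""
        try:
            handler()
            return True
        except EntityStateUnavailable:
            # Only this handler is aborted; self.state is deliberately untouched.
            return False
```

A node in ISOLATED RUNNING whose handler raises `EntityStateUnavailable` is still in ISOLATED RUNNING afterwards and goes on accepting further handlers.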