WORK IN PROGRESS
While IRON is the model for individual particles of data flying about an ARGON system, CARBON is the larger-scale model of data en masse. CARBON is, at heart, defined in terms of IRON (a CARBON knowledge base can all be encoded in IRON), and CARBON is all about providing large-scale structure for bits of IRON data; but CARBON deals with issues of scale that IRON need not concern itself with.
To a programmer, the main facility provided by CARBON is the knowledge base, or KB for short. A KB is a set of tuples, each of which makes a statement of fact about something. The tuples are themselves just IRON records, but with some metadata attached - such as expiry timestamps for transient data, or bookkeeping information relating to merging disparate changes.
A tuple expresses a relationship between some objects. Those objects might be IRON values, such as integers or strings of text; or they might be "distant" objects such as entities (represented by their name in the CARBON name space, which in turn is written as an IRON symbol) or "abstract" objects identified by names in the CARBON name space that don't have an entity ID associated.
Every tuple has a type, which is an IRON symbol (and, therefore, itself names a point in the CARBON name space), and is used to express the meaning of the tuple.
For instance, one might express a title of an entity like so:
!ns C </argon/carbon> [C:object: </example/foo> title: "A nice name" language: "en"]
This is a relationship (/argon/carbon/object:title:language:) which can bind a title to any object, in this case being used to bind a title to an entity.
To give an example of abstract objects, imagine building a Wikipedia-like database of useful facts about things at /wikipedia/. We might want to talk about love, which we might name /wikipedia/love. But there is no "love entity" that represents love in the world of ARGON; love isn't available as a software service (although you can rent a passable substitute, as they say...). But nonetheless, we might want to make some statements about love, such as providing a description of it:
!ns wp </wikipedia> [C:object: wp:love description: "Love is ..." language: "en"]
So what's special about /wikipedia/love that makes it abstract, and /example/foo that makes it refer to a concrete entity? Not much, really - they're both objects, it's just that /example/foo happens to have an EID associated with it so that you can poke at it with MERCURY to ask it to do things. The means by which an EID is associated with an object are explained below under "The Directory".
You might think that CARBON sounds a lot like a relational database, with a table called "descriptions" that has columns "Object", "Description" and "Language", and you wouldn't be far wrong.
However, the difference starts to become apparent when rules appear on the scene. Rules are themselves tuples, but tuples which allow tuples to be created on demand.
[C:rule: [individual: $X descends-from: $Y] if: [C:or [individual: $X child-of: $Y] [C:and [individual: $X child-of: $Z] [individual: $Z descends-from: $Y]]]]
Within a rule, symbols whose last component starts with a $ are considered variables. The above rule explains that somebody is a descendant of somebody else if they are their child, or a descendant of one of their children.
Given the above rule, if CARBON is asked if somebody descends from somebody, it will automatically follow the rule to see if it can find a chain of "child of" relationships joining them, and if so, it will report success.
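To make the rule-following concrete, here is a minimal Python sketch of how such a query might be answered by chasing "child of" facts. The function name descends_from, the set-of-pairs representation and the example individuals are all assumptions made for illustration; this is not CARBON's actual query engine.

```python
# Hypothetical sketch: (child, parent) pairs stand in for
# [individual: X child-of: Y] tuples.
CHILD_OF = {
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "erin"),
}

def descends_from(x, y, facts=CHILD_OF, _seen=None):
    """Return True if x descends from y under the rule in the text:
    x is a child of y, or x is a child of some z that descends from y."""
    seen = set() if _seen is None else _seen
    if x in seen:              # guard against cycles in malformed data
        return False
    seen.add(x)
    for child, parent in facts:
        if child != x:
            continue
        if parent == y:        # first branch of the C:or
            return True
        if descends_from(parent, y, facts, seen):  # second branch
            return True
    return False

print(descends_from("carol", "alice"))  # True
print(descends_from("carol", "erin"))   # False
```

The recursion mirrors the two branches of the rule directly; a real implementation would presumably also bind variables and return the matching tuples rather than a bare yes or no.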
You can also have rules without an if part; they create tuples without requiring any conditions to hold first.
[C:rule [loves alaric $X]]
Alaric loves everything!
Objects with no useful name
Sometimes you have an object that isn't a concrete value such as a number or a string, but which also doesn't have a human-assigned global name.
For instance, an e-commerce system will need a way to identify things like orders.
The thing to do in this case is to still identify the object with a symbol, as it is not a concrete value; but to make a symbol up. This can be done by making a symbol relative to a single assigned namespace chosen for the purpose (eg, /com/mycompany/orders), named with a meaningful prefix followed by a unique number (and WOLFRAM provides us with Lamport timestamps which are guaranteed unique within the cluster). So the end result might be symbols like /com/mycompany/orders/id-2827391.
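As a rough illustration of making up such symbols, the Python sketch below combines a per-node counter with a node number to get cluster-unique values. ToyClock, MAX_NODES and make_symbol are hypothetical names, and the clock is a deliberately simplified stand-in for the unique timestamps WOLFRAM is said to provide; it does not model how real Lamport clocks advance when messages are received.

```python
import itertools

MAX_NODES = 1024  # assumed upper bound on nodes, for illustration only

class ToyClock:
    """Simplified stand-in for a source of cluster-unique numbers."""
    def __init__(self, node_id):
        if not 0 <= node_id < MAX_NODES:
            raise ValueError("node_id out of range")
        self.node_id = node_id
        self._ticks = itertools.count(1)

    def next_unique(self):
        # The node ID occupies the low part of the number, so two nodes
        # can never produce the same value in this toy model.
        return next(self._ticks) * MAX_NODES + self.node_id

def make_symbol(namespace, prefix, clock):
    """Build a made-up symbol such as /com/mycompany/orders/id-1027."""
    return f"{namespace}/{prefix}-{clock.next_unique()}"

clock = ToyClock(node_id=3)
first = make_symbol("/com/mycompany/orders", "id", clock)
second = make_symbol("/com/mycompany/orders", "id", clock)
print(first)   # /com/mycompany/orders/id-1027
print(second)  # /com/mycompany/orders/id-2051
```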
Ideally, as orders are objects that one might perform operations upon (such as cancelling), the shop entity should give its orders entity IDs (as personae of itself, or by creating independent order entities) and assign the names to them.
Although anything can be described as a list of tuples making statements about it, that's not always the most efficient way of doing so.
For instance, a bitmapped image could be described pixel by pixel, like so:
[bitmap: </example/my-face> x: 0 y: 0 Y: 1.0 u: 0.3 v: 0.7] [bitmap: </example/my-face> x: 1 y: 0 Y: 1.0 u: 0.3 v: 0.71] ...
However, that would occupy very many bytes per pixel.
Alternatively, we could represent the entire image as a set of IRON homogenous vectors (they'll be compressed more efficiently if we don't interleave the planes), one for each component of the pixel, with the pixels packed by row:
[bitmap: </example/my-face> width: 640 height: 480 Y: #float<1.0 1.0 ...> u: #float<0.3 0.3 ...> v: #float<0.7 0.71 ...>]
But that leads to problems with concurrent updating. With the tuple-per-pixel model, two different handlers running in the same entity can update different parts of the image at once, and both sets of changes will survive; yet if a tuple update can only update the entire image in one go, then of any two updates, only one will survive. The other will be overwritten.
So what we do is to cheat. CARBON has a number of "compound object types" built into it, which are handled specially. If it sees a tuple declaring that an object is a vector of floats:
[C:vector: </example/my-face-Y> type: </argon/iron/types/float> length: 307200]
...then it will actually allocate space for 307,200 floats in a vector.
If it then sees tuples such as:
[C:vector: </example/my-face-Y> is: 1.0 at: 0]
...it will update the vector rather than keeping the tuple.
We could thus represent our image like so:
[bitmap: </example/my-face> width: 640 height: 480 Y: </example/my-face-Y> u: </example/my-face-u> v: </example/my-face-v>] [C:vector: </example/my-face-Y> type: </argon/iron/types/float> length: 307200] [C:vector: </example/my-face-u> type: </argon/iron/types/float> length: 307200] [C:vector: </example/my-face-v> type: </argon/iron/types/float> length: 307200]
A vector might have individual elements updated one by one, but the compound object handler also responds to tuples stating that a given range of vector elements have a single value, or that a literal IRON vector object represents the values of the vector in a given range, for bulk updating.
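The three kinds of vector update described here can be pictured with a small Python model. VectorObject and its method names are invented for this sketch and say nothing about how a real compound object handler is structured; the point is only that each incoming tuple changes part of the vector in place instead of being stored as a tuple.

```python
class VectorObject:
    """Toy model of a C:vector compound object with a fixed length."""
    def __init__(self, length, default=0.0):
        self.values = [default] * length

    def set_at(self, index, value):
        # Models a tuple like [C:vector: V is: value at: index]
        self.values[index] = value

    def fill_range(self, start, stop, value):
        # Models "a given range of vector elements have a single value"
        for i in range(start, stop):
            self.values[i] = value

    def set_slice(self, start, literal):
        # Models "a literal IRON vector gives the values in a given range"
        if start + len(literal) > len(self.values):
            raise IndexError("literal runs past the end of the vector")
        self.values[start:start + len(literal)] = literal

y_plane = VectorObject(640 * 480)
y_plane.set_at(0, 1.0)                 # one pixel
y_plane.fill_range(640, 1280, 0.5)     # the whole second row
y_plane.set_slice(1, [0.9, 0.8])       # a short run from a literal vector
print(len(y_plane.values), y_plane.values[:3])  # 307200 [1.0, 0.9, 0.8]
```

Because each update touches only the elements it names, two handlers updating different parts of the image no longer overwrite each other's work.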
The result is a lossless representation of the bitmap, something like a PNG file; we just store the image as vectors, and rely on IRON to know tricks for compactly representing numeric vectors. Does that mean that all images need to be stored losslessly under ARGON, though? What about the improved compression of a JPEG file? IRON will always store vectors as honestly as it can, but we can still do lossy compression. If one follows the example of JPEG and breaks each plane of an image into 8x8 cells, then performs a discrete cosine transform on each and quantises them differentially to reserve more bits for more important image components, you end up with an array of 8x8 arrays of small integers for each plane.
We can then represent each plane as a vector made by taking the (0,0) element from each sub-array in turn, then the (0,1), then the (1,0), and so on in a JPEG-esque serpentine transposition that will tend to move all the significant information to the start of the vector, and make the trailing end of the vector largely zero. IRON will then have little difficulty in making a good job of losslessly compressing the result, thanks to the lossy encoding of the image.
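The reordering step can be sketched in Python as follows. It assumes the DCT and quantisation have already been done, so each block is simply an 8x8 list of small integers; zigzag_order and plane_to_vector are hypothetical helper names, and the example uses 2x2 blocks only to keep the output short.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n block in JPEG-style
    zigzag order: (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ..."""
    order = []
    for s in range(2 * n - 1):          # s = row + col, one anti-diagonal
        cells = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:                  # even diagonals run the other way
            cells.reverse()
        order.extend(cells)
    return order

def plane_to_vector(blocks, n=8):
    """Take coefficient (0,0) from every block in turn, then (0,1) from
    every block, and so on, so the significant values cluster at the
    start of the vector."""
    return [block[r][c] for (r, c) in zigzag_order(n) for block in blocks]

blocks = [
    [[9, 1], [2, 0]],
    [[8, 3], [4, 0]],
]
print(plane_to_vector(blocks, n=2))  # [9, 8, 1, 3, 2, 4, 0, 0]
```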
Compound objects allow for compact representation of things that would be messy with tuples; homogenous vectors are the best examples of those. But we also have compound objects that help with parallel updates. For instance, we might be interested in counting how many times some event has occurred. The naive solution is to have a tuple containing the count:
[event: </example/foo> count: 57]
When the event happens again, we look at the current count, add one, and tell CARBON:
[C:not [event: </example/foo> count: 57]] [event: </example/foo> count: 58]
The C:not tells CARBON to forget the old tuple, and then a new tuple is provided with the new count.
However, if we have two such updates occurring in parallel, both of them might read 57 as the current value - and both would then write back 58 as the new count. The event has happened twice, but we've only gone up by one. And we won't even have a clue, as CARBON will not store [event: </example/foo> count: 58] twice - it merges identical tuples.
However, if we declare:
Then CARBON will treat that as an event counter object, storing a numeric counter and a buffer for pending events, initialised to zero and empty.
We can then say:
[C:counter: </example/foo-counter> event: ...some unique ID...]
If the knowledge base is not replicated, or it is but all the replicas are currently accessible, it will atomically increment the counter.
But if we are using a replicated knowledge base with inaccessible replicas, so it cannot atomically increment the counter, it will store the unique ID in the pending event buffer. When all the replicas get together again, they can merge their event buffers (removing duplicates) and atomically increment the counter by the size of the resulting set.
When CARBON is asked to satisfy a query of the form:
[C:counter: </example/foo-counter> count: $X]
...it will add up the counter and the size of the event buffer it has to produce a result.
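The counter behaviour described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: ReplicaCounter, record and reconcile are invented names, each replica is a plain in-memory object, and the "atomic increment" is an ordinary addition applied to every replica.

```python
class ReplicaCounter:
    """One replica's view of an event counter: a settled count plus a
    buffer of event IDs that could not yet be applied everywhere."""
    def __init__(self, count=0):
        self.count = count
        self.pending = set()

    def query(self):
        # Models [C:counter: ... count: $X]: counter plus buffer size
        return self.count + len(self.pending)

def record(replicas, origin, event_id, all_reachable):
    """Record one event arriving at the replica `origin`."""
    if all_reachable:
        for r in replicas:            # stands in for the atomic increment
            r.count += 1
    else:
        origin.pending.add(event_id)  # park the event until replicas meet

def reconcile(replicas):
    """Merge the pending buffers (duplicates collapse in the union) and
    fold the merged set into every replica's settled count."""
    merged = set().union(*(r.pending for r in replicas))
    for r in replicas:
        r.count += len(merged)
        r.pending.clear()

a, b = ReplicaCounter(57), ReplicaCounter(57)
record([a, b], a, "event-1", all_reachable=False)
record([a, b], b, "event-1", all_reachable=False)  # duplicate delivery
record([a, b], b, "event-2", all_reachable=False)
print(a.query(), b.query())   # 58 59
reconcile([a, b])
print(a.query(), b.query())   # 59 59
```

The unique event ID is what makes the merge safe: the same event seen by two replicas is counted once, which the naive read-add-write approach cannot guarantee.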
And the following tuple will hold if the event buffer is empty and all replicas are reachable:
Temporary knowledge bases can be created at will, and are stored purely in RAM. They are just IRON objects with an opaque internal structure.
Tuples are stored within them, and transparently mapped to compound objects where applicable.
Perhaps the most interesting thing about them is that they can "chain" onto other knowledge bases (of any type), which will be consulted to satisfy queries along with the main knowledge base. In the event of any conflict, the main knowledge base has priority, and the chained knowledge bases are listed in priority order when configured into the knowledge base.
Conflicts are detected by consulting the metadata attached to tuple type symbols, which provide rules about what other tuples conflict with tuples using that type symbol.
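A much-simplified Python picture of chaining is given below. It treats a "conflict" as two knowledge bases holding a value under the same key, which is an assumption made to keep the sketch short; real conflict detection would consult the metadata on tuple type symbols as described above. KnowledgeBase and lookup are hypothetical names.

```python
class KnowledgeBase:
    """Toy temporary KB that can chain onto other KBs."""
    def __init__(self, facts=None, chain=()):
        self.facts = dict(facts or {})   # key -> value, standing in for tuples
        self.chain = list(chain)         # chained KBs, highest priority first

    def lookup(self, key):
        if key in self.facts:            # the main KB wins any conflict
            return self.facts[key]
        for kb in self.chain:            # then the chain, in priority order
            try:
                return kb.lookup(key)
            except KeyError:
                continue
        raise KeyError(key)

base = KnowledgeBase({"title": "Base title", "language": "en"})
middle = KnowledgeBase({"title": "Middle title"})
top = KnowledgeBase({"note": "scratch"}, chain=[middle, base])

print(top.lookup("title"))     # Middle title
print(top.lookup("language"))  # en
```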
TODO: Re-read that book on updating logical databases and explain how to handle this!
TUNGSTEN sections: Persistent KBs
Persistent storage of entity state in TUNGSTEN is handled by CARBON, which uses the low-level B-Tree storage management of TUNGSTEN to present a number of knowledge bases, each corresponding to a "section" of the entity's TUNGSTEN storage, in close cooperation with WOLFRAM. WOLFRAM will provide CARBON with tuples specifying updates that need to be made, but they might not arrive "in order"; they will be tagged with Lamport timestamps indicating the order in which they should be applied. As such, CARBON needs to store metadata alongside the tuples in TUNGSTEN, indicating what the last update timestamp was, so it can ignore updates with an earlier timestamp.
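The timestamp bookkeeping can be sketched in Python like so. TimestampedStore is an invented name, timestamps are plain integers, and an in-memory dictionary stands in for the TUNGSTEN B-Tree; the only behaviour being illustrated is that an update carrying an older timestamp than the last one applied is ignored.

```python
class TimestampedStore:
    """Toy store that keeps the last update timestamp beside each value."""
    def __init__(self):
        self._data = {}   # key -> (timestamp, value)

    def apply(self, key, timestamp, value):
        """Apply an update unless an equal or later one has been seen.
        Returns True if the update took effect."""
        current = self._data.get(key)
        if current is not None and current[0] >= timestamp:
            return False              # stale or duplicate update: ignore it
        self._data[key] = (timestamp, value)
        return True

    def get(self, key):
        return self._data[key][1]

store = TimestampedStore()
print(store.apply("title", 5, "newer"))  # True
print(store.apply("title", 3, "older"))  # False, arrived out of order
print(store.get("title"))                # newer
```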
Compound object handlers get to take over the TUNGSTEN storage of their objects, so they can use appropriately compact and updateable representations in terms of the B-Tree. They also need to store their own update timestamps for individually updateable elements of the compound object, so they can correctly maintain the current state.
Note that the entity does not know its own CARBON name, as it might not have one or might have many, so when opening a TUNGSTEN section KB, an initial value for the default namespace needs to be supplied.
Remote KBs via MERCURY
TODO: Note that as the CARBON protocol opens up TUNGSTEN knowledge base sections, it needs to know an initial default namespace - so the CARBON name being used to access the entity must be supplied in requests. If there is none (eg, we are just working from a raw EID), then that EID is mapped into a CARBON symbol of the form /argon/eid/VERSION NUMBER/CLUSTER ID IN BASE64[/CLUSTER-LOCAL PART OF ENTITY ID IN BASE64[/PERSONA CLASS/PERSONA PARAMETERS IN BASE64]]. /argon/eid offers a gateway that resolves these names back to the encoded entity IDs; if no cluster-local part is provided, then an empty string is used, producing the EID of the cluster entity corresponding to the cluster.
How rules can be written whose body is not another CARBON query, but instead a pointer to a request that is sent via MERCURY back to the origin entity, actually causing entity code to run. This is necessary for the interesting cases where we're not just publishing information via CARBON, but instead exposing an actual service that generates or otherwise obtains information on demand, ranging from computational services, access to continuously-changing information sources such as sensors, processes that require access to secret information to generate a result, access to existing information systems (such as gateways to the Internet or "legacy systems"), and so on.
Note that this is only for READING data. Anything that causes changes to the world needs to be a separate MERCURY action via an EID.
Direct publishing from TUNGSTEN
TODO: Direct access via MERCURY to the published TUNGSTEN sections, including support for ACLs and personae (use the persona class).
Sketch the framework of support for a CDN by configuring the cluster to forward the published TUNGSTEN sections of a configured list of its entities out to nominated caching servers, distributed close to spots of anticipated demand.
TODO: Talk about cache-control metadata on CARBON results in the protocol, how they're generated from metadata in TUNGSTEN or explicitly via dynamic services, and how they can be cached by the client entity (or a shared proxy?) to reduce load on the server and decrease latency/bandwidth usage for the client.
TODO: Talk about how multiple concurrent downloads of the same knowledge packet (perhaps detected because it's from the same TUNGSTEN static section, perhaps detected by some hashing scheme) can be detected and subsequent requests for it told to fetch already-fetched blocks from peers who have already downloaded it, sharing the distribution load in the manner of BitTorrent.
To support this, downloads of large CARBON responses are automatically split into blocks by byte range, and a MERCURY connection used to stream them down. However, the protocol on that connection allows for the server to suggest that a block be fetched from another MERCURY endpoint (actually, a list of them is included, to be tried in some order until success is obtained), along with the hash the block should have if it's not been tampered with. The client can still explicitly request the block from the server, though, if it can't fetch it from a peer. So the server serves as both a tracker and a seeder of last resort, in BitTorrent parlance.
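The client side of that exchange might look something like the Python sketch below. Several things here are assumptions: the hash is taken to be SHA-256 (the protocol's actual hash is not specified above), peers and the server are modelled as plain callables returning bytes, and fetch_block is a made-up name.

```python
import hashlib

def fetch_block(expected_sha256, peers, server):
    """Try each suggested peer in order, checking the hash of what comes
    back; fall back to asking the server directly if none succeeds.
    `peers` is a list of callables, `server` a single callable."""
    for peer in peers:
        try:
            data = peer()
        except OSError:
            continue                  # unreachable peer, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data               # block verified against the hash
        # hash mismatch: treat the block as tampered with and move on
    return server()                   # the seeder of last resort

block = b"carbon block 0"
digest = hashlib.sha256(block).hexdigest()

def unreachable_peer():
    raise OSError("peer unreachable")

def tampering_peer():
    return b"not the block you wanted"

def honest_peer():
    return block

result = fetch_block(digest, [unreachable_peer, tampering_peer, honest_peer],
                     server=lambda: block)
print(result == block)  # True
```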
This is also the mechanism by which clients can be directed to CDN servers that have been set up.
Why's it so complex?
The CARBON-over-MERCURY protocol is fairly complex, and here's my justification for why.
On one end of the spectrum, I want it to be as fast as DNS for the common case of following the series of links that let one resolve a symbol into information about it. The basic request for information about a name is a simple MERCURY protocol operation; as long as the request and response fit into an MTU, it can be handled as a single UDP packet in each direction, just like DNS. And bigger responses can be handled by performing an IRIDIUM connection handshake and then streaming the results.
However, in the common case of public published data without any ACLs, those responses can be lifted direct from disk (or in-memory disk cache!) on any node in the cluster without needing to fire up a LITHIUM handler for the entity being asked. And those responses can be cached in the client cluster, meaning that any other requests from within the same cluster can be satisfied from the cache. And for very large responses, all the clusters that need it at the same time can cooperate in a peer-to-peer broadcast network to distribute it efficiently.
And where high demand is anticipated, you can pay the expense of setting up CDN servers around the world, which the latest static data is published to, and which clients are transparently directed to using the peer-to-peer protocol; they're basically configurable extra seeders, which new versions of the data are automatically sent to.
And in the less common case where data isn't published in advance, it can gateway back to the parent entity to compute data on the fly, transparently to the end-user.
We're trying to cover a lot of cases here under a single unified interface. So it's a bit complex, but I think that's justified by the complexity it removes from elsewhere in the system.
A single global root EID, run by a non-profit foundation with a suitable governance structure to prevent it ever being monopolised.
TODO: Explain the means by which a knowledge base can assign EIDs to objects, and how this can be used to recursively seek out published information about the object identified by an IRON symbol, by starting from the root EID and asking it for tuples involving objects identified by symbols which are prefixes of a target symbol (a special kind of query baked into the CARBON-over-MERCURY protocol), which at the higher levels of the tree will usually just be pointers to EIDs associated with parents of the target object; these can be recursively queried, gathering any information about the target symbol that comes up in the process, until we run out of parent EIDs to ask.
Therefore, any symbol maps to a chain of entities, containing at least the root directory entity, and usually containing subsequent child directory entities. The symbol may itself map to an entity, in which case that entity can be asked what information the entity has about "itself". If not, then the nearest parent entity is the authoritative source of information about that object.
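The mapping from a symbol to its chain of entities can be pictured with the Python sketch below. A flat dictionary stands in for the answers that each directory entity would give when queried in turn, and the EID strings are made-up placeholders; prefixes, entity_chain and authoritative_source are hypothetical names.

```python
def prefixes(symbol):
    """'/a/b/c' -> ['/', '/a', '/a/b', '/a/b/c']"""
    parts = [p for p in symbol.split("/") if p]
    result = ["/"]
    for i in range(1, len(parts) + 1):
        result.append("/" + "/".join(parts[:i]))
    return result

def entity_chain(symbol, directory):
    """Return the (prefix, EID) pairs found while walking from the root
    towards `symbol`."""
    return [(p, directory[p]) for p in prefixes(symbol) if p in directory]

def authoritative_source(symbol, directory):
    # The last entity in the chain is the symbol's own entity if it has
    # one, otherwise its nearest parent entity.
    return entity_chain(symbol, directory)[-1]

directory = {
    "/": "eid-root",
    "/com": "eid-com",
    "/com/mycompany": "eid-shop",
}
order = "/com/mycompany/orders/id-2827391"
print(entity_chain(order, directory))
# [('/', 'eid-root'), ('/com', 'eid-com'), ('/com/mycompany', 'eid-shop')]
print(authoritative_source(order, directory))
# ('/com/mycompany', 'eid-shop')
```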
Therefore, any entity can create an arbitrarily large subtree of objects within itself, using its own global name as a prefix, without needing to actually create entities; they can be purely informational objects, containing information but without any identity as an entity. Or the entity can attach EIDs to them that are actually just personae of its own EID; this is particularly useful for gateways to external systems, which can map the external information structure in a CARBON directory tree of objects, each of which appears as an entity acting as a gateway to behaviour that is mapped to the remote system. Or an entity may create actual entities as offspring of itself and then add them to a directory it exports, making them independent while still being children of itself in the CARBON tree.
Where ARGON system software is published from. This is delegated to a non-profit foundation (which may or may not be the same one as providing the root) which
FIXME: Read these documents and gather together all the CARBON names described in them, and document them fully here.
Where global non-profit organisations can register their own subtrees, for a small registration fee to the foundation and a small annual fee, to cover costs and to make sure failed organisations release their names back into the pool.
Where global companies can register their own subtrees, for a less small registration fee and annual fee, but still easily affordable for startups.
Where international public bodies can register their own subtrees, for no fee but needing to prove their identity.
Where anyone who doesn't want to tie their identity to a particular country can register a name, for a small cost-price setup fee and an annual renewal that costs nothing and just involves confirming continued usage (and which might only be required if no other evidence of continued usage can be found automatically). There is no restriction on registration under this prefix - corporations are welcome to, but /com looks better. Registrations are strictly anonymous under /me, with all that implies.
/<ISO two-letter country code>/com|gov|me
As above, but deliberately choosing to be associated with a given geographical jurisdiction. The markup on registration and renewal fees, where it exists, is reduced slightly.
A name that will never be bound, except perhaps to an informational marker, used purely for examples without fear of it ever clashing with anything.
A name that will never be bound in the global directory, except perhaps to an informational marker, reserved as a place for a local override to be presented by user interfaces in order to display information relevant to the context the user interface is in - eg, resources inside the user agent itself, the user's own CARBON space under their user agent entity, links to the user's bookmarked CARBON names, and resources relating to the user interface device being used to interact with the user agent, such as auto-discovered resources on the local network, hardware devices attached to the user interface device, and resources administratively configured into the user interface service such as the nearest printer, information resources about the building containing the device or the organisation providing it, etc.
For convenience, we should give these top-level names.
It would be nice to also reserve a top-level prefix for some kind of distributed name system, which can even be implemented by a local override pointing to a local copy of the entity gatewaying into it so that the foundation cannot be used to control it. /argon/eid is one such, with ugly-looking names. It might be nice to have one that works like tor onion URLs, and maybe one based on something like namecoin, and so on.
Note that published CARBON data from entities comes from a section called </argon/carbon/public> and is made available to all callers; but it is possible to declare (in </argon/carbon/configuration>) other sections to publish with given ACLs.
Talk about the ability to override parts of the namespace on a cluster-wide scale, configuring a list of CARBON namespace roots, each with the EID of the entity to "splice" into the tree at that point. The actual root of the global CARBON directory is configured in at this point, by having a mapping for </>.
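One way the splice list could be consulted is sketched below in Python. Picking the longest matching prefix is an assumption of this sketch, not something settled above, and find_splice and the EID strings are invented for illustration.

```python
def find_splice(symbol, splices):
    """Return the (prefix, EID) of the longest configured namespace root
    that covers `symbol`. Longest-prefix-wins is an assumed policy."""
    best = None
    for prefix, eid in splices.items():
        if prefix == "/":
            covers = symbol.startswith("/")
        else:
            # Match whole path components only, so /com/mycompany does
            # not accidentally cover /com/mycompanyx.
            covers = symbol == prefix or symbol.startswith(prefix + "/")
        if covers and (best is None or len(prefix) > len(best[0])):
            best = (prefix, eid)
    if best is None:
        raise LookupError(f"no namespace root covers {symbol}")
    return best

splices = {
    "/": "eid-global-root",            # the global CARBON directory
    "/com/mycompany": "eid-local-mirror",
}
print(find_splice("/com/mycompany/orders", splices))
# ('/com/mycompany', 'eid-local-mirror')
print(find_splice("/org/somebody", splices))
# ('/', 'eid-global-root')
```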
Also talk about the subsequent ability to demand that a local copy of any given namespace subtree be kept in the cluster at all times, in effect overriding its original name but pointing to a snapshot stored within the cluster. Note that CHROME modules list their dependencies, which can be used to recursively local-copy them. This is used to ensure that critical resources are available "offline", and to configure the cluster to use a specific version of something rather than "the latest". This effectively overrides the remote cache-control headers with a local directive.
Such local copies do not change when upstream changes occur, but an administrator can view a list of newly available things and opt to upgrade, or downgrade when older versions are still available. This is like "installing software". A mechanism to keep the current version archived away somewhere for later downgrading, or even to fetch the current remote version direct into the archive to try later or offline, would be desirable. This suggests a storage model where a given CARBON prefix maps to a dictionary mapping "source CARBON name:version identifier" pairs to CARBON knowledge bundles, with an ability to select one element from the dictionary as "current", and a requirement for a "version" property that can be expressed with a suitable CARBON tuple, to know what version is in a given bundle.
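The storage model suggested here could be prototyped along these lines in Python. LocalCopies and its method names are hypothetical, bundles are represented by plain strings, and nothing is persisted; the sketch only shows the dictionary keyed by (source name, version) with one entry selected as current.

```python
class LocalCopies:
    """Toy model of the archive behind one CARBON prefix."""
    def __init__(self):
        self.bundles = {}     # (source_name, version) -> knowledge bundle
        self.current = None   # key of the selected bundle, if any

    def archive(self, source_name, version, bundle):
        # Store a bundle for later use without making it current.
        self.bundles[(source_name, version)] = bundle

    def select(self, source_name, version):
        # Upgrade or downgrade by choosing among archived bundles.
        key = (source_name, version)
        if key not in self.bundles:
            raise KeyError(f"no archived bundle for {key}")
        self.current = key

    def current_bundle(self):
        if self.current is None:
            raise LookupError("no version selected")
        return self.bundles[self.current]

copies = LocalCopies()
copies.archive("/example/lib", "1.0", "bundle for 1.0")
copies.archive("/example/lib", "1.1", "bundle for 1.1")
copies.select("/example/lib", "1.1")    # upgrade
copies.select("/example/lib", "1.0")    # downgrade again
print(copies.current_bundle())          # bundle for 1.0
```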
The local copies are stored in one or more nominated WOLFRAM distributed caches, with an infinite replication factor and no expiry timestamp or drop priority so they are kept until otherwise specified. By default, they go to the cluster cache, so are replicated to every node.