IRON is the data model upon which ARGON is built. It's the abstract model for data sent over MERCURY, and the underlying representation for CARBON knowledge bases, and the homoiconic representation for CHROME source code.
As well as the abstract model, a textual encoding is defined for humans to read and write; and compact binary encodings for storage and communication.
Model and Textual Encoding
I have a fine blog post describing the concepts and rationale behind the data model.
Integer numbers
Integers are written as a sequence of decimal digits, or the sequence 0x followed by a sequence of hexadecimal digits, or the sequence 0b followed by a sequence of binary digites; all optional prefixed with - to indicate that it's a negative number; or, alternatively, the special sequences #-inf for negative infinity or #+inf for positive infinity.
Rational numbers
Rationals are written in one of two forms: either a finite integer (as described above) followed immediately by a / then a positive finite integer in the same base as the first integer (no further 0x or 0b prefix is necessary or allowed) to represent an arbitrary fraction, or an integer (as described above) with a . character inserted somewhere within it (except at the very end), representing a radix point. The latter form may also be suffixed with an exponent written in the form of an ^ character followed by a further finite integer, in the same base as the first numeric component (no further 0x or 0b prefix is necessary or allowed). The value of the number is, in this case, multiplied by the base to the power of the exponent.
Inexact numbers
Inexact numbers are written either as an integer or rational, prefixed with a ~ symbol, indicating that there is some arbitrary uncertainty in the accuracy of the number. A future extension may allow specifying a precision.
Characters
Any glyph that can be represented in Unicode can be an IRON character. This explicitly does not include any invisible (except for whitespace) or "control" characters, and does include any sequence of combining characters with a final non-combining character. They can be written with the sequence #' followed by the character than a closing ', or a sequence of the form #ucs( followed by zero or more hexadecimal UCS codepoints for combining characters each followed by +, then the UCS codepoint for a non-combining character, then ).
Booleans
Booleans may be written #true or #false
Symbols
IRON symbols have a rich internal structure, and can be written using various shorthands. Fundamentally, an IRON symbol is a list of strings, representing a hierarchical path through the CARBON directory, giving symbols a global identity. These strings are known as "components" and can be any non-empty strings of IRON characters.
A symbol may be written in its full form as the components of the symbol, in order, each prefixed by a / character, terminating in whitespace; any characters in the components that would create ambiguities with other syntax being prefixed with a \.
For instance, /foo/bar/baz.
A symbol may also be written in abbreviated form, with reference to a "current namespace" defined in context, as a sequence of one or more path components separated by /, with characters that would cause problems prefixed with a \ as usual. Such a symbol's value is found by concatenating the current namespace symbol with the additional components contained in this representation.
For instance, bar/baz.
Or it can be written with reference to a previously declared namespace binding. This is written as a single symbol component known as the "prefix" (with all the usual escaping rules), followed by # then one or more additional symbol components, separated by / characters. The actual value of the symbol is the symbol bound to the prefix in the current context, with the remaining components appended to it.
For instance, C#object:title:language:.
The empty list
The empty list is written as ().
Nil
Nil, a value that explicitly represents the absence of a value, is written #nil.
Lists
A basic list can be written as (, zero or more whitespace characters, the value at the head of the list, one or more whitespace characters, ., zero or more whitespace characters, the rest of the list (which may not actually be a list), zero or more whitespace characters, then ).
As a shorthand, chains of lists ending in the empty list can be written as (, zero or more whitespace characters, then one or more values separated by one or more whitespace characters, then zero or more whitespace characters, then ).
And finally, chains of lists ending in any arbitrary value can be written as (, zero or more whitespace characters, then one or more values separated by one or more whitespace characters, then one or more whitespace characters, ., one or more whitespace characters, the rest of the final list, zero or more whitespace characters, then ).
Sets
A set (unordered collection of values without repetition) is written as |(, zero or more whitespace characters, zero or more values separated by one or more whitespace characters, zero or more whitespace characters, then )|
Maps
Maps are notionally sets of pairs, with the constraint that no two pairs in the set may share the first element. They are written as { followed by zero or more whitespace characters then zero or more elements, each written as a value, one or more whitespace characters, :, one or more whitespace characters, then another value. The elements are separated by one or more whitespace characters. Finally, there are zero or more trailing whitespace characters followed by }.
Arrays
An array is an ordered matrix of values, of any dimensionality.
A single-dimensional array is written as #<, zero or more whitespace characters, zero or more values separated by one or more whitespace characters, then zero or more whitespace characters followed by >.
A multi-dimensional array written as #[, followed by N, a finite integer greater than 1 denoting the number of dimensions of the array, then ]< followed by zero or more whitespace, a sequence of zero or more array slices of dimensionality N-1 separated by zero or more whitespace, zero or more additional whitespace, then >; each array slice is either (if of dimensionality one) a sequence of zero or more values surrounded by < and > with zero or more whitespace characters on either side and one or more whitespace characters between each - or if of dimensionality N which is more than one, a sequence of zero or more slices of dimensionality N-1 surrounded by < and > with zero or more whitespace characters on either side and zero or more whitespace characters between each.
For example, #<1 2 3>, #[2]<<1 2 3> <4 5 6>>.
Strings
- A string is a shorthand for a one-dimensional array of characters. It's written as the sequence of characters, surrounded in " quotes, with any \ or " characters prefixed with a \.
Records
A record contains a symbol, known as the record's "type symbol", and one or more values known as "fields".
In the simplest case, where the final component of the type symbol contains no : characters, a record is written as [/tt> followed by the type symbol, one or more whitespace characters, the field values separated by one or more whitespace characters, then ].
For example, [+ 1 2 3] (where the type symbol references the default namespace), [</foo/bar/baz> 1 2 3] (using a full symbol), or [foo:bar/bam 1 2 3] (using a namespace prefix).
However, if the final component of the type symbol contains : characters, then it must be composed of one or more "field headers", those being sequences of characters that end in a :. The record is then written as a [, followed by the type symbol but truncated after the first field header, followed by one or more whitespace characters, followed by a field value, then (if there are any more field headers in the type symbol), each field header in turn with one or more whitespace characters on either side, followed by a corresponding value - then a final ].
For example, [if: foo then: bar else: baz], where the type symbol is the current default namespace with the component if:then:else appended; or [/argon/carbon/object: /example/foo title: "A nice name" language: "en"] for a record with a type symbol of /argon/carbon/object:title:language:.
If the final field header in a type symbol is just : on its own, then the final value must be written AFTER the closing ] - which has a : appended to it.
For example, [note: "This is bogus"]: 123 is a record with type symbol note:: and fields "This is bogus" and 123.
CAS references
A CAS reference is a value that identifies an object in some other, unspecified, storage system; as the name implies, it is a content-addressable store, and the CAS reference is therefore some kind of hash of the object.
A CAS references is written #cas(, followed by a sequence of characters other than : (indicating the hash algorithm used), followed by a :, then a hexadecimal integer, then ).
Comments
-
As a bit of syntactic sugar, a sequence of one or more instances of a newline, zero or more whitespace characters, a ;, zero or more whitespace characters, any non-newline characters (known as "content") then a newline before any value causes that value to be wrapped in a record with type symbol /argon/iron/note::. The first field value is the result of concatenating the "content" strings with a single space character between each of them, and the second field value is the value found after the comment.
If a value is followed by a sequence of one or more instances of one or more whitespace characters, a ;, zero or more whitespace characters, zero or more non-newline characters (the "content") then a newline, then it too is wrapped in a record in the same manner.
For instance, the following examples:
; This is very good, ; I like it 123
and:
123 ; This is very good, ; I like it
...both represent the value that could be written [</argon/iron/note:> "This is very good, I like it" ]: 123.
Namespace bindings
Any value can be prefixed with #!default-namespace, one or more whitespace characters, then a symbol which becomes the default namespace while the value is being parsed.
It can also be prefixed with #!namespace, one or more whitespace characters, a single symbol component called the prefix, one or more whitespace characters, then a symbol, which is then beound to the prefix in the context used to parse the value - shadowing any existing binding for that prefix in the current context.
Type constrained collections
Optionally, collections may constrain the types of their elements. The syntax for this is simply a # followed by a type definition (explained below), followed by a value of that type. However, as a shorthand, arrays can be type-constrained by writing the element type after the initial # - before the optional dimensionality declaration and the <. Note that it is illegal to give such a type prefix to a type other than list, set, map, or array.
Shared values
IRON values may contain other IRON values, but they are not constrained to a tree structure. Although cycles are forbidden, there can be multiple references to a single value, so IRON values can be an arbitrary directed acyclic graph.
Any value can be prefixed with #!def(, followed by a positive integer N, then ). This assigns the prefixed value with the identifier N from this point in the stream onwards.
Anywhere a value is expected, the syntax #!ref(, followed by a positive integer N, then ) can be used to represent a reference to the value previously assigned the identifier N.
Type definitions
Type definitions can be any of the following:
- ( type ) - for a list whose head is of the inner type, and whose tail is either the empty list or (recursively) the same type as this.
- ( type1 . type2 ) - for a list whose head is of type1, and whose tail is either of type2 or (recursively) the same type as this.
- |( type )| - for a set whose elements are all of the inner type.
- { type1 : type2 } - for a map whose keys are all of type1 and whose values are all of type2.
- < type ></77> - for a one-dimensional array whose elements are all of the inner type.
- < type [ N (positive finite integer) ]> - for an N-dimensional array whose elements are all of the inner type.
- < type [*]> - for an array of any dimension whose elements are all of the inner type.
- [ symbol type... ] - for a record of type "symbol" and whose fields are of the listed types.
- [ (field-header type)... ] - for a record whose type symbol is made by concatenating all the field-headers (symbol components ending in :) and whose fields are of the list types.
- integer - for an integer
- rational - for any rational or integer value
- number - for an inexact, rational or integer value
- (integer, rational or number)( min (number) .. max (number) ) for a number of the indicated kind, between the given numerical values of that type (inclusive).
- u8, s8, u16, s16, u32, s32, u64, s64, u128 or s128 as shorthands for integer(0..255), integer(-128,127), and so on - representing integers with the ranges of unsigned or twos-complement signed binary integers of the given bit width.
- ieee754:16, ieee754:32, ieee754:64, ieee754:128, or ieee754:256 for an inexact number with the range and precision of the IEEE 754 binary16, binary32, binary64, binary128 or binary256 types, respectively.
- character - for a character
- boolean - for a boolean
- symbol - for a symbol
- nil - for a nil value
- cas-reference - for a CAS reference
- * - for any value
- type1 | type2 - for a value of either type
ABNF
decimal-integer = *DIGIT hex-integer = "0x" *HEXDIG binary-integer = "0b" *("0" / "1") finite-integer = [ "-" ] positive-integer positive-integer = ( decimal-integer / hex-integer / binary-integer ) integer = finite-integer / "#-inf" / "#+inf" rational-fraction = finite-integer "/" 1*HEXDIG rational-radix = finite-integer "." 1*HEXDIG [ "^" [ "-" ] 1*HEXDIG ] rational = rational-fraction / rational-radix exact-number = (integer / rational) inexact-number = "~" exact-number number = exact-number / inexact-number character = "#'" GLYPH "'" / ( "#ucs(" *( 1*HEXDIG "+") 1*HEXDIG ")" ) boolean = "#true" / "#false" symbol-component = 1*( ( "\" GLYPH ) / UNAMBIGUOUS-GLYPH ) absolute-symbol = 1*( "/" symbol-component ) relative-symbol = symbol-component *( "/" symbol-component ) prefixed-symbol = symbol-component "#" symbol-component (* "/" symbol-component ) symbol = absolute-symbol / relative-symbol / prefixed-symbol empty-list = "(" optional-whitespace ")" nil = "#nil" whitespace = 1*( WHITESPACE ) optional-whitespace = *( WHITESPACE ) list-cell = "(" value whitespace "." whitespace value optional-whitespace ")" condensed-list = "(" value *( whitespace value ) [ whitespace "." whitespace value ] optional-whitespace ")" list = list-cell / condensed-list empty-set = "|(" optional-whitespace ")|" set = empty-set / ( "|(" value *( whitespace value ) optional-whitespace ")|" ) empty-map = "{" optional-whitespace "}" map-entry = value whitespace ":" whitespace value map = empty-map / ( "{" map-entry (* whitespace map-entry ) optional-whitespace "}" ) one-dimensional-array-body = optional-whitespace *( whitespace value ) optional-whitespace array-slices = optional-whitespace "<" array-body ">" optional-whitespace array-body = one-dimensional-array-body | array-slices array = "#" [ type-definition ] [ "[" positive-integer "]" ] "<" array-body ">" array /= DQUOTE *( ("\" DQUOTE ) / "\\" / OTHER-GLYPH ) DQUOTE cas-reference = "#cas(" 1*( GLYPH-EXCEPT-COLON ) ":" 1*HEXDIG ")" plain-record-symbol = symbol ; except no colons in final component field-header = *( GLYPH-EXCEPT-COLON ) ":" field-record-symbol = symbol ; except final component ends in ":" plain-record = plain-record-symbol *( whitespace value ) optional-whitespace "]" record-end = "]" / ( "]:" whitespace value ) record-field = optional-whitespace field-header whitespace value field-record = field-record-symbol *( record-field ) optional-whitespace record-end record = "[" ( plain-record / field-record ) namespace-binding = optional-whitespace type-definition /= "(" optional-whitespace type-definition optional-whitespace ")" type-definition /= "(" optional-whitespace type-definition whitespace "." whitespace type-definition optional-whitespace ")" type-definition /= "|(" optional-whitespace type-definition optional-whitespace ")|" type-definition /= "{" optional-whitespace type-definition whitespace ":" whitespace type-definition optional-whitespace "}" type-definition /= "<" optional-whitespace type-definition optional-whitespace ">" type-definition /= "<" optional-whitespace type-definition optional-whitespace "[" positive-integer "]" optional-whitespace ">" type-definition /= "[" optional-whitespace plain-record-symbol (* whitespace type-definition ) optional-whitespace "]" type-definition /= "[" optional-whitespace field-record-symbol (* whitespace field-header whitespace type-definition ) optional-whitespace "]" type-definition /= "integer" / "rational" / "number" [ optional-whitespace "(" optional-whitespace number optional-whitespace ".." optional-whitespace number optional-whitespace ")" ] type-definition /= "u8" / "s8" / "u16" / "s16" / "u32" / "s32" / "u64" / "s64" / "u128" / "s128" / "u256" / "s256" type-definition /= "ieee754:16" / "ieee754:32" / "ieee754:64" / "ieee754:128" / "ieee754:256" type-definition /= "character" / "boolean" / "symbol" / "nil" / "cas-reference" type-definition /= "*" type-definition /= type-definition 1*( "|" type-definition ) ")" actual-value = number / character / boolean / symbol / ( [ "#" type-definition ] list ) / ( [ "#" type-definition ] set ) / ( [ "#" type-definition ] map ) / ( [ "#" type-definition ] array ) / cas-reference / record namespace-binding = optional-whitespace ( default-namespace-binding / prefix-binding ) default-namespace-binding = "#!default-namespace " whitespace symbol prefix-binding = "#!namespace" whitespace symbol whitespace symbol prefix-comment = NEWLINE optional-whitespace ";" *( GLYPH-EXCEPT-NEWLINE ) NEWLINE suffix-comment = optional-whitespace ";" *( GLYPH-EXCEPT-NEWLINE ) NEWLINE value = *( namespace-binding / prefix-comment ) optional-whitespace value *( suffix-comment )
Binary Encodings
The IRON binary encodings can crop up in various different circumstances, with different requirements, so a few variants on them are available.
Unpacked Binary Encoding
The simplest is the unpacked binary form. This is a fairly traditional binary encoding, with numbers in variable-length twos-complement integer format or an IEEE floating point type; characters as sequences of Unicode code points; symbols as lists of strings; and so on. Variable-length structures are prefixed with a length integer. Type prefixes are used to guide the reader as to how to interpret the following bytes as a value, although the type prefix of the value as a whole can be omitted on output if it is statically known, and supplied to the reader upon input to tell it what to expect. References to shared sub-values are supported by special type prefixes which declare a numeric anchor or a reference to an anchored value; recursion is not permitted. CAS links are represented within the encoding as numeric indices; at the end of the encoding is a table of the actual hashes which those indices refer to.
Packed Binary Encoding
Packed binary encoding offers the option of multiple forms of compression, so a packed binary value always starts with an encoding header before the start of the value. Again, if the type is statically known to writer and reader, then after the encoding header we go straight into the value; otherwise, it is followed by a type prefix, encoded in a form that depends on the contents of the encoding header.
The encoding header starts with a type byte. Available encoding types are:
0
- Unpacked binary encoding, as per the previous section.
1
Large-symbol entropy encoded using a simple move-to-front encoding using a maximum distance of 2^N symbols and LZ77 backreferences of up to M bytes (with backreferences encoded in the 17-bit symbol stream, using a "backreference" symbol followed by variable-length integer encodings of the offset and length). N (0-16) and M (0 for no encoding or values up to 40 as per the LZMA2 header) are stored in a second and third encoding header byte. The move-to-front encoding represents symbols as a variable-length-encoded count of how many symbols ago the next symbol was last seen, or the entire symbol if it does not occur within the maximum distance.
2
As per (1), but with Huffman encoding of the symbols. This is similar to Deflate. The order of the statistical model used to drive the Huffman encoder (eg, the number of previous symbols to take into account when calculating the frequency distribution of symbols, in a hidden Markov model) is stored in the second and third header bytes along with the LZ77 maximum distance.
3
As per (2), but with arithmetic range encoding instead of the Huffman engine. This is similar to LZMA.
Large symbol entropy encoding
IRON values can be represented as a stream of codes. Some codes are literal data - parts of characters, parts of integers, atomic values like #true - and some are markers to denote the start and end of compound values.
Clearly, such codes cannot fit into bytes. With characters being sequences of Unicode codepoints, and variable-length integers, there are a dizzying array of different codes. How many bits do we need to represent them?
Well, by encoding them with an entropy encoder, the number of possible symbols does not matter - it's the probability of any given symbol (given the context of the previous N symbols) that affects the length of its encoded form in bits. I discuss this in more detail in a blog post.
However, the number of possible symbols does have some bearing on the memory requirements of the encoder and decoder as they need to store frequency information for the encountered symbols so far, or move-to-front backreference buffers, so a slightly compacted space of symbol codes is chosen.
1: Compound object terminator
One symbol value is used as a marker to terminate the current compound object.
65536 + 17: Characters
Unicode codepoints are represented with a range of 65536 symbols for actual codepoints, plus 17 symbols known as "plane selectors" that set the current top five bits to be appended to codepoints until otherwise specified. The default at the start of the IRON value encoding is all zero bits.
An IRON character is a sequence of one or more codepoints, the first of which represents the base character and any others of which are combining characters to add to it, possible interspersed with plane selectors. It is terminated by a compound object terminator. No type prefix is required, as the presence of codepoint symbols indicates a character.
256 + 256: Integers
Integers are represented in twos-compliment binary form, split into one byte per symbol. The first symbol contains the most significant byte of the integer including the sign bit, so represents a value from -128 to 127; a range of 256 symbols is used for this. Additional bytes are represented as "continuation byte symbols" (so they can be distinguished from initial byte symbols) in decreasing significance, which use a second set of 256 symbols.
1: Rationals
Rationals are represented with a type prefix symbol introducing a rational number (elided when the type is statically known), followed by three integers: the numerator and denominator, and the exponent. The denominator must be positive.
1: Inexact numbers
A type prefix symbol followed by either an integer or a rational.
2: Booleans
True and False each have a unique symbol, and as such require no type prefixes.
1: General arrays
- General arrays are represented by a type prefix symbol introducing them, elidable when statically known, followed by a positive integer for the dimensionality of the array, then the one-dimensional leaves of the array, each represented as zero or more "chunks". Each chunk is an arbitrary positive integer, followed by that many values to go into the array. The entire leaf can be a single chunk if its length is known at the start of encoding, or may be split into chunks if not. A compound object terminator comes after the last chunk to indicate termination. Each value must have a type prefix where applicable. A single-dimensional array is represented as a single leaf; an N-dimensional array is represnted as a leaf of N-minus-one-dimensional arrays.
1: Heterogenous arrays
Heterogenous arrays are represented like general arrays, but after the type prefix symbol comes the serialised type declaration. The type of the elements is then statically known, so type prefix symbols can be elided, the fixed-width numeric types can be represented using compact binary encodings, and all the exact numeric types can be stored delta-encoded, which will make them more compact for many types of data.
2: Lists
One symbol represents the empty list. Another is a type prefix for a non-empty list, elidable when statically known, followed by the two values comprising the head and tail of the list.
1: Nil
A single symbol represents the Nil value, #nil. This is a type prefix with no value symbols, so a statically-known Nil value can be represented as no symbols whatsoever.
1: Maps
A map is represented as a type prefix symbol (elidable where statically known) followed by the entries, represented as pairs of values, eventually terminated by a compound object terminator.
1: Sets
A set is represented as a type prefix symbol (elidable where statically known) followed by the elements, eventually terminated by a compound object terminator.
3: Symbols
Symbols may be represented as absolute paths, paths relative to a default namespace, or paths relative to a declared namespace prefix. They are introduced by a type prefix symbol, then an arbitrary number of strings representing the components of the path, then a compound object terminator symbol. One type prefix symbol declares an absolute path, another declares a default-relative path, and the third declares that the first string is the named prefix to use and the rest a path relative to that prefix.
1: CAS hash reference
This is followed by an integer (encoded as above) index. At the end of the compressed data stream is a table of actual CAS hashes, which these indices index into. The CAS hashes are uncompressed as they're high-entropy.
1: Records
A record is represented by a type prefix symbol introducing a record (if not statically known), followed by the type symbol of the record (without type prefix) unless it is specified by a static type, then the fields of the record (with their type prefixes as appropriate, unless the record type is statically known), then a compound object terminator.
2: Anchor and backreference (with numeric index)
These two symbols introduce an anchor point, which has no direct effect on the decoded value, or refer back to a previously declared anchor in order to encode a reference to a previously-occurring value. IRON values may not recursively contain themselves, so only complete values can be backreferenced. Both symbols are followed by an integer, which identifies the anchor point being declared or referenced.
FIXME: LZ77 backreference
How best to encode these? Probably with a special range of symbols each for small lengths and distances, plus a symbol for a long distance that is followed by an arbitrary integer. The distance comes first, so long lengths are represented as a distance (as a symbol, or a prefix symbol plus arbitrary integer) followed by an arbitrary integer rather than one of the short length symbols.
2: Namespace declarations
The default namespace may be overridden with a suitable type prefix symbol followed by the new default namespace path as a symbol object. A namespace prefix may be declared with a second type prefix symbol followed by a string naming the prefix, then a symbol object.
1: End of value
This symbol terminates the encoding of the IRON value.
1: Reset statistical model
This symbol resets the statistical model (move-to-front buffer or hidden Markov model) back to its initial state. Encoders can output this whenever they want, for instance if they expect a change in the statistical distribution of values and estimate that better compression will be had by starting afresh.
Implementation
Its in-memory representation, basic algorithms, and converters to and from the textual and binary transfer encodings need to be implemented as HYDROGEN source code using the HELIUM memory management and garbage collection interfaces, as it is depended upon by CHROME.
Status
The draft specification of the data model and the written form is available in DocBook format from the Fossil repository; it has lagged behind developments in this page, however.