IRON is the data model upon which ARGON is built. It's the abstract model for data sent over MERCURY, and the underlying representation for CARBON knowledge bases, and the homoiconic representation for CHROME source code.
As well as the abstract model, a textual encoding is defined for humans to read and write; and compact binary encodings for storage and communication.
Model and Textual Encoding
The IRON binary encodings can crop up in various different circumstances, with different requirements, so a few variants on them are available.
Unpacked Binary Encoding
The simplest is the unpacked binary form. This is a fairly traditional binary encoding, with numbers in variable-length twos-complement integer format or an IEEE floating point type; characters as sequences of Unicode code points; symbols as lists of strings; and so on. Variable-length structures are prefixed with a length integer. Type prefixes are used to guide the reader as to how to interpret the following bytes as a value, although the type prefix of the value as a whole can be omitted on output if it is statically known, and supplied to the reader upon input to tell it what to expect. References to shared sub-values are supported by special type prefixes which declare a numeric anchor or a reference to an anchored value; recursion is not permitted.
Packed Binary Encoding
Packed binary encoding offers the option of multiple forms of compression, so a packed binary value always starts with an encoding header before the start of the value. Again, if the type is statically known to writer and reader, then after the encoding header we go straight into the value; otherwise, it is followed by a type prefix, encoded in a form that depends on the contents of the encoding header.
The encoding header starts with a type byte. Available encoding types are:
- Unpacked binary encoding, as per the previous section.
Large-symbol entropy encoded using a simple move-to-front encoding using a maximum distance of 2^N symbols and LZ77 backreferences of up to M bytes (with backreferences encoded in the 17-bit symbol stream, using a "backreference" symbol followed by variable-length integer encodings of the offset and length). N (0-16) and M (0 for no encoding or values up to 40 as per the LZMA2 header) are stored in a second and third encoding header byte. The move-to-front encoding represents symbols as a variable-length-encoded count of how many symbols ago the next symbol was last seen, or the entire symbol if it does not occur within the maximum distance.
As per (1), but with Huffman encoding of the symbols. This is similar to Deflate. The order of the statistical model used to drive the Huffman encoder (eg, the number of previous symbols to take into account when calculating the frequency distribution of symbols, in a hidden Markov model) is stored in the second and third header bytes along with the LZ77 maximum distance.
As per (2), but with arithmetic range encoding instead of the Huffman engine. This is similar to LZMA.
Large symbol entropy encoding
IRON values can be represented as a stream of codes. Some codes are literal data - parts of characters, parts of integers, atomic values like #true - and some are markers to denote the start and end of compound values.
Clearly, such codes cannot fit into bytes. With characters being sequences of Unicode codepoints, and variable-length integers, there are a dizzying array of different codes. How many bits do we need to represent them?
Well, by encoding them with an entropy encoder, the number of possible symbols does not matter - it's the probability of any given symbol (given the context of the previous N symbols) that affects the length of its encoded form in bits. I discuss this in more detail in a blog post.
However, the number of possible symbols does have some bearing on the memory requirements of the encoder and decoder as they need to store frequency information for the encountered symbols so far, or move-to-front backreference buffers, so a slightly compacted space of symbol codes is chosen.
1: Compound object terminator
One symbol value is used as a marker to terminate the current compound object.
65536 + 17: Characters
Unicode codepoints are represented with a range of 65536 symbols for actual codepoints, plus 17 symbols known as "plane selectors" that set the current top five bits to be appended to codepoints until otherwise specified. The default at the start of the IRON value encoding is all zero bits.
An IRON character is a sequence of one or more codepoints, the first of which represents the base character and any others of which are combining characters to add to it, possible interspersed with plane selectors. It is terminated by a compound object terminator. No type prefix is required, as the presence of codepoint symbols indicates a character.
256 + 256: Integers
Integers are represented in twos-compliment binary form, split into one byte per symbol. The first symbol contains the most significant byte of the integer including the sign bit, so represents a value from -128 to 127; a range of 256 symbols is used for this. Additional bytes are represented as "continuation byte symbols" (so they can be distinguished from initial byte symbols) in decreasing significance, which use a second set of 256 symbols.
Rationals are represented with a type prefix symbol introducing a rational number (elided when the type is statically known), followed by two integers: the numerator and denominator. The denominator must be positive.
Floating point numbers (which follow the IEEE model in IRON, if not necessarily the encoded form...) are represented with one of a range of symbols; no type prefix is required as these symbols are unique to floating point numbers. The symbols represent positive infinity, negative infinity, not-a-number, or minus zero as a single symbol; or a symbol representing a general floating point number followed by two integers that encode the mantissa and base-2 exponent. Where type prefixes are not required due to the type being statically known, the "general floating point number" symbol can be elided in that case, but the special symbols for special cases are still required.
True and False each have a unique symbol, and as such require no type prefixes.
1: General vectors
- General vectors are represented by a type prefix symbol introducing them, elidable when statically known, followed by zero or more "chunks". Each chunk is an arbitrary positive integer, followed by that many values to go into the vector. The entire vector can be a single chunk if its length is known at the start of encoding, or may be split into chunks if not. A compound object terminator comes after the last chunk to indicate termination. Each value must have a type prefix where applicable.
11: Heterogenous vectors
Heterogenous vectors are represented like general vectors, but with one of eleven type prefix symbols that declare the type of the vector's elements: float, symbol, u8, s8, u16, s16, u32, s32, u64, s64 or char. The type of the elements is then statically known, so type prefix symbols can be elided; and the numeric types can be stored delta-encoded, which will make them more compact for many types of data.
2: Pairs and the empty list
One symbol represents the empty list. Another is a type prefix for a pair, elidable when statically known, followed by the two values comprising the pair.
A single symbol represents the Nil value, #nil. This is a type prefix with no value symbols, so a statically-known Nil value can be represented as no symbols whatsoever.
A map is represented as a type prefix symbol (elidable where statically known) followed by the entries, represented as pairs of values, eventually terminated by a compound object terminator.
3: CARBON name symbols
CARBON name symbols may be represented as absolute paths, paths relative to a default namespace, or paths relative to a declared namespace prefix. They are introduced by a type prefix symbol, then an arbitrary number of strings representing the components of the path, then a compound object terminator symbol. One type prefix symbol declares an absolute path, another declares a default-relative path, and the third declares that the first string is the named prefix to use and the rest a path relative to that prefix.
2: CARBON namespace declarations
The default namespace may be overridden with a suitable type prefix symbol followed by the new default namespace path as a symbol object. A namespace prefix may be declared with a second type prefix symbol followed by a string naming the prefix, then a symbol object.
A record is represented by a type prefix symbol introducing a record (if not statically known), followed by the CARBON name symbol of the record (without type prefix symbol), then the fields of the record (with their type prefixes as appropriate, as the types are not statically known), then a compound object terminator.
2: Anchor and backreference (with numeric index)
These two symbols introduce an anchor point, which has no direct effect on the decoded value, or refer back to a previously declared anchor in order to encode a reference to a previously-occurring value. IRON values may not recursively contain themslves, so only complete values can be backreferenced. Both symbols are followed by an integer, which identifies the anchor point being declared or referenced.
FIXME: LZ77 backreference
How best to encode these? Probably with a special range of symbols each for small lengths and distances, plus a symbol for a long distance that is followed by an arbitrary integer. The distance comes first, so long lengths are represented as a distance (as a symbol, or a prefix symbol plus arbitrary integer) followed by an arbitrary integer rather than one of the short length symbols.
1: End of value
This symbol terminates the encoding of the IRON value.
1: Reset statistical model
This symbol resets the statistical model (move-to-front buffer or hidden Markov model) back to its initial state.
Its in-memory representation, basic algorithms, and converters to and from the textual and binary transfer encodings need to be implemented as HYDROGEN source code using the HELIUM memory management and garbage collection interfaces, as it is depended upon by CHROME.
The draft specification of the data model and the written form is available in DocBook format from the Fossil repository; binary encodings are still to be formally specified, and I need to add support for shared sub-values to the model and encodings.