CHROME is the high-level programming language that the bulk of ARGON itself, and user applications written above ARGON, will be written in. HYDROGEN already provides a programming language, but it's terribly low level, and "unsafe"; HYDROGEN has a significant level of access to the low-level aspects of its own implementation. In HYDROGEN one can write code which writes random numbers to random locations in memory until the system falls over. Therefore, HYDROGEN code cannot be accepted from untrusted sources. Apart from low-level libraries, things involved in bootstrapping, and the occasional bit of hand-coded stuff that requires particularly high performance, as much HYDROGEN code as possible should be generated by higher-level language compilers - such as the CHROME compiler.
The design goals of CHROME are, in order of decreasing priority:
- Safety. It should be possible to run untrusted code, giving it access only to resources explicitly granted, with no way for the untrusted code to obtain access to other resources or interfere with the operation of other parts of the system (although issues such as the consumption of CPU time and memory are outside of the language's scope; HELIUM exists to control those).
- Ease of programming. The language should be simple to understand, and easy to reason about. There should be the minimum of surprises and gotchas. The language should be expressive enough to handle complex concepts naturally.
- Efficiency of implementation. The compiler should be able to easily convert the input source code to efficient HYDROGEN VM code with the minimum of complex time-consuming analysis. There are two efficiencies at stake here - the efficiency of the compiler, and the efficiency of the resulting code. I'm also conflating time and space efficiency.
CHROME is a purely functional language with a strong LISPy flavour, representing code in the IRON data model rather than s-expressions. The IRON data model also serves as the fundamental data model of CHROME itself; CHROME programs manipulate a superset of IRON values (in particular, IRON cannot represent function objects directly, just as procedures in Scheme cannot be represented as s-expressions). Functions are first-class and bindings are lexically scoped, so functions close over the bindings in scope where they are defined.
Evaluation is lazy. Laziness can come with a performance cost, but the Haskell folks have done lots of good work on strictness analysis to reduce it. In exchange, you get conditional constructs that don't need macros, streams, infinite data structures, some performance gains, and easier semantics for recursive expressions.
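To make the point about conditionals and infinite structures concrete, here is a minimal sketch in Python (not CHROME, whose concrete syntax isn't defined here) that emulates lazy arguments with explicit thunks. The names `Thunk`, `lazy_if`, `naturals` and `take` are hypothetical and exist only for this illustration; a real lazy implementation would insert the thunks itself.

```python
class Thunk:
    """A delayed computation that is evaluated at most once (call-by-need)."""

    def __init__(self, compute):
        self._compute = compute
        self._forced = False
        self._value = None

    def force(self):
        if not self._forced:
            self._value = self._compute()
            self._compute = None  # drop the closure once it has been used
            self._forced = True
        return self._value


def lazy_if(condition, then_branch, else_branch):
    """An ordinary function acting as a conditional: only one branch is forced."""
    return then_branch.force() if condition else else_branch.force()


def naturals(start=0):
    """An infinite stream, represented as a (head, lazy tail) pair."""
    return (start, Thunk(lambda: naturals(start + 1)))


def take(n, stream):
    """Collect the first n elements of a stream, forcing tails as it goes."""
    out = []
    while n > 0:
        head, tail = stream
        out.append(head)
        stream = tail.force()
        n -= 1
    return out
```

Because the untaken branch is never forced, `lazy_if(True, Thunk(lambda: 1), Thunk(lambda: 1 // 0))` returns 1 without ever dividing by zero, which is the behaviour a strict language needs a macro or special form to get.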
Uniqueness typing has many many advantages, so we'll use that instead of nasty monads. But we can do monads too, when they're useful.
Haskell-style typeclasses and generic functions are a nicer way to provide polymorphism than class- or prototype-based object orientation.
Hygienic macros (with a module system based on the CARBON namespaces used for IRON symbols) are provided for syntactic extensibility. However, they are "second-class values"; they are bound in the lexical environment just like functions and other values, but they're not fully first-class as there is a constraint that macros in code to be compiled must be statically deducible, in that every reference to them must be through an expression whose value can be found at compile time. Attempting to apply a macro at run time is an error.
Macros are arbitrary code, given access to the input expression as a "syntax object", decorating the original source code with useful metadata such as the position in the original input source code expression (so error locations can be mapped back), and the lexical environment of the macro invocation.
But for simple cases, we'll implement syntax-rules style template rewriting macros as a library in terms of the core's low-level macros.
The language defines a range of primitives, for all the usual arithmetic and other core functions that can't practically be implemented otherwise. These primitives are what core language symbols are bound to. They are a subtype of functions, in that they can be applied directly, but unlike functions in general, they allow equality comparison. A macro can tell whether the function being applied is a primitive, and which one it is.
Bounded-space tail calls and first-class partial continuations provide for flow control, but I need to think more about how that will interact with the uniqueness types. A continuation may contain many closed-over unique values, so the continuation itself must be a unique object. That's fine for implementing threading with continuations, but what about amb, exceptions, and so on? Perhaps we need a means for capturing a non-unique continuation that only works if the dynamic environment we close over in the continuation contains no unique values (which can only be checked at runtime, as it depends on the call chain). Code that handles mutation by passing a chain of unique values through it, but which contains try/throw/catch exceptions, would need the throw to embed ALL unique values currently in scope (but not in scope at the catch, which is possible if the throw is within the lexical scope of the catch) in the exception object to avoid leaving them in the abandoned continuation; so invoking a continuation (and thus abandoning the current continuation) would be forbidden if the current continuation closed over unique values (and was therefore unique). You'll need to explicitly destroy them, or pass them to the new continuation.
There's no need for unwind-protect or dynamic-wind in a pure language, thankfully.
Dynamically-scoped parameters are a boon in Scheme, so we'll have them, too.
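As a rough illustration of what dynamically-scoped parameters buy you, here is a sketch in Python loosely modelled on Scheme's `make-parameter` and `parameterize`. The `Parameter` class and the `indent`, `render` and `render_nested` names are hypothetical. The sketch uses a mutable stack for brevity; that is an artefact of Python, not a claim about how a pure language like CHROME would implement it.

```python
from contextlib import contextmanager


class Parameter:
    """A dynamically scoped parameter with a default value."""

    def __init__(self, default):
        self._stack = [default]

    def __call__(self):
        # The current value is whatever the innermost active binding set.
        return self._stack[-1]

    @contextmanager
    def binding(self, value):
        # Rebind for the dynamic extent of the `with` block, then restore.
        self._stack.append(value)
        try:
            yield
        finally:
            self._stack.pop()


indent = Parameter(0)


def render(line):
    # render never receives the indent as an argument; it reads the
    # binding established by whoever called it, however far up the chain.
    return " " * indent() + line


def render_nested(line):
    with indent.binding(indent() + 2):
        return render(line)
```

Calling `render("x")` at top level gives `"x"`, while `render_nested("x")` gives `"  x"`, even though `render` itself is unchanged and takes no extra argument.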
Static typing is important, as it can help to detect a wide range of programming errors. However, it can't become a straitjacket. It's useful in Scheme to be able to (read) arbitrary s-expressions. Haskell's typeclass generics go a long way, but it should be perfectly possible to have some bindings that we know nothing static about - which are entirely dynamically typed - alongside bindings with very narrowly defined static types, which can be implemented fully unboxed in the compiled code, with everything specialised, and type violations detected at compile time. So let's have a type hierarchy that's a pointed complete partial order, giving us an "Any" type we can use in those cases.
Dependent types are awesome, too. Let's have those while we're at it. We're now well into "Type system is Turing complete" territory, so we'd better have function totality as part of the type system to keep our type expressions halting.
Idris does a pretty epic job of making such a type system work in practice, but I do find the Haskelly syntax for types a bit obscure at times; it's too pattern-matchy for me. The way constraints are written in "design by contract" languages (e.g. Eiffel) is much more readable, to my eyes at least - and as far as I can see, identical in expressiveness. Design by Contract and dependent types are much the same idea, just expressed in different ways, so let's bring them together; we can statically check the constraints that are amenable to that, and leave anything else to run time checking. It just becomes another aspect of gradual typing.
Currying is sometimes handy, but (I find) often obscures what's going on; so let's not put currying in the core syntax. We'll still have single-argument lambdas, but we'll pass tuples to them, thereby making it practical to get at the tuple as a first-class value to implement variadic functions. The tuples will normally be unique values, created afresh, so the system will have little difficulty in keeping them off the heap.
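The tuple-passing convention can be sketched in Python as follows. This is only an illustration of the calling convention described above, not CHROME code; the helper names `call`, `subtract`, `total` and `partial` are hypothetical.

```python
def call(fn, args):
    """Every function takes exactly one argument: the argument tuple."""
    return fn(args)


def subtract(args):
    # A fixed-arity function destructures the tuple it is given.
    a, b = args
    return a - b


def total(args):
    # A variadic function simply treats the tuple as a first-class value.
    result = 0
    for x in args:
        result += x
    return result


def partial(fn, first):
    # Partial application becomes an explicit, visible operation rather
    # than something implied by the call syntax, as it is with currying.
    return lambda rest: fn((first,) + rest)
```

For example, `call(total, (1, 2, 3))` gives 6 and `call(partial(subtract, 10), (4,))` gives 6; in both cases the function body sees a single tuple value.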
In summary, it looks a lot like a Scheme/Clean/Idris hybrid written in IRON syntax, without mutation and with generic functions in the core; and with macros as second-class values.
Clearly, I need to write a lot more detail on this. Details matter tremendously in programming language design, but while major innovation happens at the large-scale of a language design, the details are usually a matter of "not making mistakes" rather than "being clever" - so relatively tedious!
The CHROME implementation appears (to the rest of the system) to be an interpreter, accepting CHROME expressions encoded in IRON and returning the IRON value they evaluate to. However, closures are represented as executable code, so evaluating the definition of a function results in its compilation, something like how FORTH (and, therefore, HYDROGEN) interprets source code to produce compiled code. All the ARGON system components other than HYDROGEN (written in a platform-specific language), HELIUM and IRON (written in HYDROGEN), and the CHROME bootstrap are written in CHROME - even CHROME itself; the "CHROME bootstrap" is a very simple and naive CHROME interpreter written in HYDROGEN which is used to run the CHROME compiler, which then compiles itself and replaces the naive interpreter. The boot process of an ARGON node will involve loading the HYDROGEN source code for HELIUM, IRON, and the CHROME bootstrap, compiling the full CHROME compiler, then compiling the rest of the system components from CHROME source code. But the biggest user of CHROME will be LITHIUM, which reads application code from entities, compiles it with CHROME (using a persistent cache of compiled code where useful), and runs it.
As a general principle, the kernel language that macro-expansion results in is designed to "target the worst case"; everything becomes a basic lambda calculus with optional type annotations. Constructs like let are all unravelled into this form, so the final compiler has no special cases for them; by being designed to compile the minimal kernel language as efficiently as possible, it must identify situations where closure construction and application can be optimised away itself without any hand-holding, thereby reducing the pressure on macro authors to attempt to use particular constructs "because they're faster".
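As a sketch of what "unravelling" a construct like let into the kernel might look like, the Python below rewrites a let form into an immediately applied lambda and then evaluates the result with an evaluator that knows nothing about let. The tuple encoding (`"let"`, `"lambda"`, `"app"`), the use of Python strings as variable names, and multi-parameter lambdas are all assumptions made for brevity; the real kernel language and its IRON representation are not specified here.

```python
def desugar(expr):
    """Rewrite ("let", bindings, body) into an applied lambda, recursively."""
    if not isinstance(expr, tuple):
        return expr  # literals and variable names pass through unchanged
    tag = expr[0]
    if tag == "let":
        _, bindings, body = expr
        names = [name for name, _ in bindings]
        values = [desugar(value) for _, value in bindings]
        return ("app", ("lambda", names, desugar(body)), values)
    if tag == "lambda":
        _, params, body = expr
        return ("lambda", params, desugar(body))
    if tag == "app":
        _, fn, args = expr
        return ("app", desugar(fn), [desugar(a) for a in args])
    raise ValueError(f"unknown form: {tag!r}")


def evaluate(expr, env):
    """A naive evaluator that only understands the kernel forms."""
    if isinstance(expr, str):
        return env[expr]  # variable reference (strings are names here)
    if not isinstance(expr, tuple):
        return expr  # literal
    tag = expr[0]
    if tag == "lambda":
        _, params, body = expr
        return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
    if tag == "app":
        _, fn, args = expr
        return evaluate(fn, env)(*[evaluate(a, env) for a in args])
    raise ValueError(f"not a kernel form: {tag!r}")
```

Here `desugar(("let", [("x", 2), ("y", 3)], ("app", "add", ["x", "y"])))` yields `("app", ("lambda", ["x", "y"], ("app", "add", ["x", "y"])), [2, 3])`. Spotting that this closure is applied on the spot and never escapes, and so need not be constructed at all, is then the compiler's job rather than the macro author's.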
System components will be compiled with an initial namespace enriched beyond the standard libraries with interfaces that allow them to inject raw HYDROGEN code, as a kind of "foreign function interface" allowing them to directly access hardware interfaces. User code will not be given this option - any access to hardware facilities beyond the standard library will have to be provided indirectly through system components that expose a "safe" access-controlled interface.
A CHROME implementation is a function from IRON source code comprising a module declaration to a CARBON knowledge base. The top-level unit of source code must be a module declaration - because that is what, through importing other modules, actually defines the language in use (as most of the language is defined as macros and other values in standard modules). Modules are identified by their CARBON names, and their contents are obtained from CARBON. As well as the source code for a module, the implementation is also passed an optional set of additional CARBON data to be made available in the top-level environment, as a mechanism used to provide additional capabilities to privileged code.
The structure of the implementation is a pipeline:
Input IRON source code is converted into syntax objects, node-for-node. Every node is annotated with the path to it from the root in the original source, and is bound to the initial lexical environment, which is filled with bindings from modules the source code imports, plus any additional bindings made available to this particular piece of code.
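A minimal sketch of this first stage, in Python, might look like the following. Nested Python lists stand in for IRON values, and the `Syntax` record, `to_syntax` and `locate` are hypothetical names; the real syntax object would carry whatever metadata the implementation settles on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syntax:
    datum: object  # a leaf value, or a tuple of child Syntax nodes
    path: tuple    # child indices leading here from the root
    env: object    # the lexical environment this node is bound to


def to_syntax(value, env, path=()):
    """Wrap a nested structure node-for-node, recording each node's path."""
    if isinstance(value, list):
        children = tuple(
            to_syntax(child, env, path + (i,)) for i, child in enumerate(value)
        )
        return Syntax(children, path, env)
    return Syntax(value, path, env)


def locate(root, path):
    """Follow a recorded path back to a node, e.g. to report an error location."""
    node = root
    for index in path:
        node = node.datum[index]
    return node
```

With `root = to_syntax(["if", ["lt", "x", 0], "a", "b"], env)`, the node for the literal `0` carries the path `(1, 2)`, and `locate(root, (1, 2))` returns that same node, which is how an error found late in the pipeline could be mapped back to the original source.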
Macro-expansion occurs, by walking the source tree from the top down, performing partial evaluation in the initial environment. Names are looked up in the environment, and the values they refer to are substituted for them; as expansion works down the tree, the environment is passed down too, being modified as binding forms are encountered. When an application turns out to be a macro application, the macro is applied to the body syntax object and the environment in which it appears, and the result must be a fully expanded expression in the kernel language. Macros may recursively invoke the expander on their arguments, in order to manipulate the post-expansion code; they must pass in an environment when invoking the expander, so are free to expand their bodies in the environment of their choice. Where a macro creates new syntax objects afresh, they will inherit the location path from the macro application syntax object; where it substitutes values into a template defined in the macro body, then the nodes in that template will keep their location path from the macro definition; and nodes passed through from the macro call will, of course, preserve their metadata from the input. Therefore, every node in the output has useful metadata. The expander, through its function as a partial evaluator, will expand the application of known functions, but must be careful to avoid being caught in infinite or excessive recursions, by detecting looping expansions and, if necessary, leaving work to be done at run time. Any application of a macro that cannot be performed at expansion time is an error.
Most of what users see as "the language" will just be macros provided in the standard environment; in effect, the standard environment's macros are a compiler into the kernel language, and different initial environments will provide very different languages.
The result of macro-expansion is an expression in the minimal kernel language (wrapped in syntax objects). The partial evaluation process will have simplified the code to a great extent, by performing applications of known functions to known arguments and the like, and the code is at this point an arbitrary DAG rather than a tree: shared subexpressions may exist. Note that expressions in the kernel language are a different type from input expressions - expansion is a bridge between two different languages. As well as invoking macros from the environment, the expander converts normal literal values, function applications, and references to values in the environment by symbolic name into kernel-language constructs. The lexical environments are not part of the kernel language - references to names from top-level imports are replaced by their values, and references to free variables inside function bodies are essentially Skolemised: replaced by references to the function declaration node, wrapped in a tuple reference to the desired argument by number, as the kernel language has no notion of symbolic reference. Because of its low-level nature, a small set of core language macros plus the hardcoded handling of other expression types in the expander will be responsible for generating it, and all other macros will generally generate expressions in the CHROME language then recursively invoke the expander to convert them to the kernel language.
This can then be type-inferred and type-checked, annotating it with types (or rejecting it as incorrect). Type checking may involve arbitrary computation, so this stage might be aborted due to not completing in reasonable time!
We can then perform strictness analysis and wrap lazy values in thunks, thereby converting the expression into a strict dialect of the minimal core language.
The end result, a type-annotated syntax-object expression in the kernel language, can then be finally interpreted to produce a value. All reduction that's possible in advance will already be done - the only reduction that will still be possible is to replace lambda expressions with procedure objects, which (at the implementation's option) may involve some form of compilation to directly executable code. I'm leaning towards the CPS transformation as an implementation model, as documented elsewhere, for this final stage, which will involve an intermediate representation of the expression in a version of the minimal core language with explicit continuations.
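To give a feel for what a representation "with explicit continuations" involves, here is a sketch in Python of a naive CPS conversion for a tiny lambda calculus, together with a loop that runs the converted code. Everything about the encoding is an assumption made for illustration: the tuple forms `"lambda"`, `"app"`, `"cont"`, `"call"` and `"ret"`, the generated names, and the `HALT` sentinel are all hypothetical. The conversion is the textbook naive one, which leaves administrative redexes that a real compiler would clean up.

```python
import itertools

_counter = itertools.count()
HALT = object()  # sentinel for the final continuation


def fresh(prefix):
    return f"{prefix}{next(_counter)}"


def cps(expr, k):
    """Convert expr so that its value is passed to the continuation k."""
    if not isinstance(expr, tuple):
        return ("ret", k, expr)  # literal or variable: hand it straight to k
    if expr[0] == "lambda":
        _, param, body = expr
        kparam = fresh("k")  # every function gains a continuation parameter
        return ("ret", k, ("lambda", param, kparam, cps(body, kparam)))
    if expr[0] == "app":
        _, fn, arg = expr
        f, a = fresh("f"), fresh("a")
        # Evaluate fn, then arg, then call with the current continuation.
        return cps(fn, ("cont", f, cps(arg, ("cont", a, ("call", f, a, k)))))
    raise ValueError(f"unknown form: {expr[0]!r}")


def value_of(atom, env):
    """Turn an atom into a run-time value; lambdas and conts become closures."""
    if isinstance(atom, str):
        return env[atom]
    if isinstance(atom, tuple) and atom[0] == "lambda":
        _, param, kparam, body = atom
        return ("closure", param, kparam, body, env)
    if isinstance(atom, tuple) and atom[0] == "cont":
        _, param, body = atom
        return ("kont", param, body, env)
    return atom


def run(term, env):
    """Run CPS code in a loop: every call is a tail call, so no stack grows."""
    while True:
        if term[0] == "ret":
            _, k, value = term
            kont, result = value_of(k, env), value_of(value, env)
        else:  # "call"
            _, f, a, k = term
            fn, arg, kont = value_of(f, env), value_of(a, env), value_of(k, env)
            if not callable(fn):  # user closure: jump into its body
                _, param, kparam, body, fenv = fn
                term, env = body, {**fenv, param: arg, kparam: kont}
                continue
            result = fn(arg)  # primitive, modelled as a Python callable
        if kont is HALT:
            return result
        _, param, body, kenv = kont
        term, env = body, {**kenv, param: result}
```

As a usage example, converting `("app", ("lambda", "x", ("app", "succ", ("app", "succ", "x"))), 40)` with `cps(src, "halt")` and running it in an environment that binds `"halt"` to `HALT` and `"succ"` to an increment function yields 42. Because continuations are ordinary values in this representation, bounded-space tail calls fall out of the `while` loop rather than needing separate treatment.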
The HYDROGEN code generator will perform low-level peephole optimisations, but it assumes it will be driven by a high-level optimising compiler most of the time, so HYDROGEN is under no particular pressure to attempt to identify high-level language patterns and optimise them.