CHROME is the high-level programming language that the bulk of ARGON itself, and user applications written above ARGON, will be written in. HYDROGEN already provides a programming language, but it's terribly low level, and "unsafe"; HYDROGEN has a significant level of access to the low-level aspects of its own implementation. In HYDROGEN one can write code which writes random numbers to random locations in memory until the system falls over. Therefore, HYDROGEN code cannot be accepted from untrusted sources. Apart from low-level libraries, things involved in bootstrapping, and the occasional bit of hand-coded stuff that requires particularly high performance, as much HYDROGEN code as possible should be generated by higher-level language compilers - such as the CHROME compiler.
The design goals of CHROME are, in order of decreasing priority:
- Safety. It should be possible to run untrusted code, giving it access only to resources explicitly granted, with no way for the untrusted code to obtain access to other resources or interfere with the operation of other parts of the system (although issues such as the consumption of CPU time and memory are outside of the language's scope; HELIUM exists to control those).
- Ease of programming. The language should be simple to understand, and easy to reason about. There should be the minimum of surprises and gotchas. The language should be expressive enough to handle complex concepts naturally.
- Efficiency of implementation. The compiler should be able to easily convert the input source code to efficient HYDROGEN VM code with the minimum of complex, time-consuming analysis. There are two efficiencies at stake here - the efficiency of the compiler, and the efficiency of the resulting code - and I'm conflating time and space efficiency, too.
As a language, it's a purely functional lazy language with a strong LISPy flavour, representing code in the IRON data model rather than s-expressions. The IRON data model also serves as the fundamental data model of CHROME itself; CHROME programs manipulate a superset of IRON values (in particular, IRON cannot represent function objects directly, just as procedures in Scheme cannot be represented as s-expressions). Functions are first-class and bindings are lexically scoped, so functions close over the bindings in scope where they are defined.
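To make the value model concrete, here's a rough sketch in Haskell (which I'll use purely as illustration throughout these notes); the constructor set is a guess, not the real IRON data model:

```haskell
-- Hypothetical sketch: CHROME values are a superset of IRON values. Every
-- constructor here except VFunction is assumed to have an IRON representation;
-- function objects are the extra, non-IRON case.
data Value
  = VInteger  Integer
  | VString   String
  | VSymbol   String            -- really a CARBON-namespaced symbol
  | VList     [Value]
  | VFunction (Value -> Value)  -- first-class, but not representable in IRON
```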
Laziness can come with a performance cost, but the Haskell folks have done lots of good work on strictness analysis. In exchange, we get conditional constructs that don't need macros, streams and infinite data structures, some performance gains, and easier semantics for recursive expressions.
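For instance, laziness lets a conditional be an ordinary function and lets infinite structures be defined directly:

```haskell
-- A user-defined conditional: no macro needed, because the unused branch is
-- simply never evaluated under lazy evaluation.
myIf :: Bool -> a -> a -> a
myIf True  t _ = t
myIf False _ e = e

-- An infinite stream; only the parts that are demanded ever get computed.
naturals :: [Integer]
naturals = 0 : map (+ 1) naturals

firstTen :: [Integer]
firstTen = take 10 naturals   -- [0,1,2,3,4,5,6,7,8,9]
```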
Uniqueness typing has many advantages, so we'll use that instead of nasty monads. But we can do monads too, when they're useful.
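As a rough illustration of the style (using GHC's linear types as a close cousin of Clean-style uniqueness typing - they're not identical - and with every name here invented): state is threaded as a value that must be used exactly once, so no monad is needed to sequence effects.

```haskell
{-# LANGUAGE LinearTypes #-}

-- An abstract token standing in for "the state of the world".
data World

-- A stand-in primitive; the World token must be consumed exactly once.
writeLine :: String -> World %1 -> World
writeLine _ w = w   -- a real implementation would perform the output

-- Effects are sequenced simply by threading the unique value through.
greet :: World %1 -> World
greet w = writeLine "goodbye" (writeLine "hello" w)
```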
Haskell-style typeclasses and generic functions are a nice way to provide polymorphism, rather than class- or prototype-based object orientation.
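For example (again in Haskell), a generic function is just a class with per-type instances:

```haskell
-- A generic "describe" operation, with per-type implementations selected
-- by the type system rather than by an object's class.
class Describable a where
  describe :: a -> String

instance Describable Bool where
  describe True  = "yes"
  describe False = "no"

instance Describable Integer where
  describe n = "the number " ++ show n

-- Works for anything with a Describable instance.
report :: Describable a => a -> String
report x = "value: " ++ describe x
```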
Hygienic macros (with a module system based on the CARBON namespaces used for IRON symbols) are provided for syntactic extensibility. However, they are "second-class values": they are bound in the lexical environment just like functions and other values, but they're not fully first-class, because macros in code to be compiled must be statically deducible - every reference to them must be through an expression whose value can be found at compile time. Attempting to apply a macro at run time is an error which, in most cases, the type system will hopefully save us from at compile time!
Macros are arbitrary code, given access to the input expression as a "syntax object" that decorates the original source code with useful metadata, such as its position in the original input source (so error locations can be mapped back) and the lexical environment of the macro invocation.
But for simple cases, we'll implement syntax-rules style template rewriting macros as a library in terms of the core's low-level macros.
A lexical environment isn't just a mapping from symbols to values - it's a CARBON knowledge base containing such mappings as well as arbitrary additional metadata annotating those bindings.
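Roughly, then, a syntax object and its environment might look something like this (a Haskell sketch with invented field names; the real thing would be CARBON/IRON structures, not Haskell records):

```haskell
-- Each node of the input carries its source location and the environment it
-- is to be expanded in.
data SyntaxObject = SyntaxObject
  { stxDatum :: Datum          -- the underlying IRON node
  , stxPath  :: [Int]          -- path from the root of the original source
  , stxEnv   :: Environment    -- lexical environment at this node
  }

data Datum
  = DSymbol  String
  | DInteger Integer
  | DList    [SyntaxObject]

-- Not just symbol -> value: each binding can carry arbitrary extra metadata,
-- standing in for the CARBON knowledge base described above.
newtype Environment = Environment { envBindings :: [(String, Binding)] }

data Binding = Binding
  { bindValue    :: Maybe Datum         -- statically-known value, if any
  , bindMetadata :: [(String, String)]  -- arbitrary annotations about the binding
  }
```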
The language defines a range of primitives, for all the usual arithmetic and other core functions that can't practically be implemented otherwise. These primitives are function-like objects that core language symbols are bound to. They are a subtype of functions, in that they can be applied directly, but unlike functions in general, they allow equality comparison. A macro can tell if a value is a primitive, and which one it is.
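A cut-down sketch of how that might look (Haskell, invented names):

```haskell
-- A primitive carries an identity that supports equality, as well as its
-- implementation; ordinary closures have no such identity.
data Value = VInteger Integer | VApplicable Applicable

data PrimitiveId = PrimAdd | PrimSub | PrimEqual   -- illustrative set only
  deriving (Eq, Show)

data Applicable
  = Prim    PrimitiveId (Value -> Value)   -- comparable identity plus implementation
  | Closure (Value -> Value)               -- general functions: no equality

-- A macro (or anything else) can ask whether a value is a particular primitive.
isPrimitive :: PrimitiveId -> Applicable -> Bool
isPrimitive p (Prim q _) = p == q
isPrimitive _ _          = False
```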
OPEN QUESTION: Bounded-space tail calls and first-class partial continuations provide for flow control, but I need to think more about how that will interact with the uniqueness types. A continuation may contain many closed-over unique values, so the continuation itself must be a unique object. That's fine for implementing threading with continuations, but what about amb, exceptions, and so on? Perhaps we need a means of capturing a non-unique continuation that only works if the dynamic environment we close over in the continuation contains no unique values (which can only be checked at runtime, as it depends on the call chain). Code that handles mutation by passing a chain of unique values through it, but which contains try/throw/catch exceptions, would need the throw to embed in the exception object ALL unique values currently in scope (but not in scope at the catch - which is possible if the throw is within the lexical scope of the catch), to avoid leaving them stranded in the abandoned continuation. In other words, invoking a continuation (and thus abandoning the current continuation) would be forbidden if the current continuation closed over unique values (and was therefore itself unique); you'd need to explicitly destroy them, or pass them to the new continuation.
There's no need for unwind-protect or dynamic-wind in a pure language, thankfully.
Dynamically-scoped parameters are a boon in Scheme, so we'll have them, too. As they are part of the dynamic context of execution, they're logically contained inside the continuation. Since partial continuations allow slices of dynamic execution context to be shuffled around, this affects the binding of dynamic parameters too.
Static typing is important, as it can help to detect a wide range of programming errors. However, it can't become a straitjacket. It's useful in Scheme to be able to (read) arbitrary s-expressions. Haskell's typeclass generics go a long way, but it should be perfectly possible to have some bindings that we know nothing static about - which are entirely dynamically typed - alongside bindings with very narrowly defined static types, which can be implemented fully unboxed in the compiled code, with everything specialised, and type violations detected at compile time. So let's have a type hierarchy that's a pointed complete partial order, giving us an "Any" type we can use in those cases.
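As a loose analogy for the dynamic end of that spectrum, here are the two extremes side by side in Haskell, with Data.Dynamic playing the role of the "Any" type; in CHROME this would all be one gradual type hierarchy rather than two separate mechanisms:

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)

-- Narrowly typed: the compiler can specialise and unbox this completely.
double :: Int -> Int
double n = n * 2

-- "Any"-typed: nothing is known statically, so checks happen at run time.
describeAny :: Dynamic -> String
describeAny d =
  case fromDynamic d :: Maybe Int of
    Just n  -> "an Int: " ++ show n
    Nothing -> "something we know nothing about"

example :: (Int, String)
example = (double 21, describeAny (toDyn (42 :: Int)))
```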
Dependent types are awesome, too. Let's have those while we're at it. We're now well into "the type system is Turing complete" territory, so we'd better have function totality as part of the type system to keep our type expressions halting.
Idris does a pretty epic job of making such a type system work in practice, but I do find the Haskelly syntax for types a bit obtuse at times; it's too pattern-matchy for me. The way constraints are written in "design by contract" languages (e.g. Eiffel) is much more readable, to my eyes at least - and as far as I can see, identical in expressiveness. Design by Contract and dependent types are much the same idea, just expressed in different ways, so let's bring them together; we can statically check the constraints that are amenable to that, and leave anything else to run-time checking. It just becomes another aspect of gradual typing.
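As a taste of what a statically checked contract looks like, here's the classic non-empty-vector example, using Haskell GADTs as a stand-in for full dependent types; the precondition "the vector must not be empty" lives in the type, so violating it is a compile-time error rather than a run-time one:

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- A vector whose length is tracked in its type.
data Vec (n :: Nat) a where
  VNil  :: Vec 'Z a
  VCons :: a -> Vec n a -> Vec ('S n) a

-- Total: the empty case simply cannot be passed in, so no run-time check is needed.
vhead :: Vec ('S n) a -> a
vhead (VCons x _) = x
```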
Currying is sometimes handy, but (I find) it often obscures what's going on, so let's not put currying in the core syntax. Instead, the type system will contain only single-argument lambdas, but we'll pass tuples to them, thereby making it practical to get at the tuple as a first-class value to implement variadic functions. The tuples will normally be unique values, so the system will have little difficulty in keeping them off the heap.
In summary, it looks a lot like a Scheme/Clean/Idris hybrid written in IRON syntax, without mutation and with generic functions in the core; and with macros as second-class values.
Clearly, I need to write a lot more detail on this. Details matter tremendously in programming language design, but while major innovation happens at the large scale of a language design, the details are usually a matter of "not making mistakes" rather than "being clever" - so relatively tedious!
The CHROME implementation appears (to the rest of the system) to be an interpreter, accepting CHROME expressions encoded in IRON and returning the IRON value they evaluate to. However, closures are represented as executable code, so evaluating the definition of a function results in its compilation, something like how FORTH (and, therefore, HYDROGEN) interprets source code to produce compiled code. All the ARGON system components other than HYDROGEN (written in a platform-specific language), HELIUM and IRON (written in HYDROGEN), and the CHROME bootstrap are written in CHROME - even CHROME itself; the "CHROME bootstrap" is a very simple and naive CHROME interpreter written in HYDROGEN which is used to run the CHROME compiler, which then compiles itself and replaces the naive interpreter. The boot process of an ARGON node will involve loading the HYDROGEN source code for HELIUM, IRON, and the CHROME bootstrap, compiling the full CHROME compiler, then compiling the rest of the system components from CHROME source code. But the biggest user of CHROME will be LITHIUM, which reads application code as CHROME module definitions fetched from CARBON, compiles it with CHROME (using a persistent cache of compiled code where useful), and runs it.
Evaluation of CHROME code consists of two phases - macro-expansion into a primitive kernel language, then evaluation of the resulting kernel-language expression. Since, much of the time, the original expression being evaluated will be a full CHROME module definition, most of that evaluation will consist of compiling the function declarations inside the module and returning a module referring to a bunch of compiled functions.
As a general principle, the kernel language that macro-expansion results in is designed to "target the worst case"; it's basically a simple gradually-typed lambda calculus. Constructs like let are all unravelled into this form, so the final compiler has no special cases for them; by being designed to compile the minimal kernel language as efficiently as possible, it must itself identify situations where closure construction and application can be optimised away, without any hand-holding, thereby reducing the pressure on developers or macro authors to reach for particular constructs "because they're faster".
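My current guess at the shape of that kernel language, sketched as a Haskell data type (constructor names invented):

```haskell
-- A minimal, gradually-typed lambda calculus: everything else (let, etc.) is
-- macro-expanded into this.
data KernelExpr
  = KConst  KernelValue             -- a value already known at expansion time
  | KArg    Int                     -- reference to an enclosing lambda's argument
                                    -- (a de Bruijn index here; the real thing points
                                    -- at the lambda node directly)
  | KLambda KernelExpr              -- single-argument lambda; arguments are tuples
  | KApply  KernelExpr KernelExpr   -- application

data KernelValue
  = KVInteger Integer
  | KVTuple   [KernelValue]

-- e.g. (let ((x 1)) ...) unravels to the equivalent of
--   KApply (KLambda <body referring to its argument>) (KConst (KVInteger 1))
```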
System components will be compiled with an initial namespace enriched beyond the standard libraries with interfaces that allow them to inject raw HYDROGEN code, as a kind of "foreign function interface" allowing them to directly access hardware interfaces. User code will not be given this option - any access to hardware facilities beyond the standard library will have to be provided indirectly through system components that expose a "safe" access-controlled interface.
Looking in more detail, the structure of the implementation is a pipeline:
Input IRON source code is converted into syntax objects, node-for-node. Every node is annotated with the path to it from the root in the original source, and is bound to the initial lexical environment, which is filled with bindings from modules the source code imports, plus any additional bindings made available to this particular piece of code.
Macro-expansion occurs by walking the source tree from the top down, performing partial evaluation in the initial environment. Names are looked up in the environment and the values they refer to are substituted for them, and as expansion works down the tree, the environment is passed down too, being modified as binding forms are encountered. When an application turns out to be a macro application, the macro is applied to the body syntax object and the environment in which it appears, and the result must be a fully expanded expression in the kernel language. Macros may recursively invoke the expander on their arguments, in order to convert any embedded code into kernel-language expressions; they must pass in an environment when invoking the expander, so are free to expand their bodies in the environment of their choice. They are also free to modify the kernel code returned by the expander when using it to form the result of macro expansion.

Where a macro creates new kernel-language syntax objects afresh, they will inherit the location path from the macro application syntax object; where it substitutes values into a template defined in the macro body, the nodes in that template will keep their location path from the macro definition; and kernel-language code returned by recursive calls to the expander, based on expressions obtained from the inputs to the macro call, will of course preserve the metadata from the input. Therefore, every node in the kernel-language output has useful metadata.

The expander, through its function as a partial evaluator, will expand the application of known functions, but must be careful to avoid being caught in infinite or excessive recursions, by detecting looping expansions and, if necessary, leaving work to be done at run time. Any application of a macro that cannot be performed at expansion time is an error.
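In terms of interfaces, the expander and a macro might have signatures along these lines (a Haskell sketch with placeholder types; nothing here is a committed design):

```haskell
-- Placeholder types; see the earlier sketches for what they might contain.
data SyntaxObject = SyntaxObject   -- input-language node carrying path + environment metadata
data Environment  = Environment    -- bindings plus arbitrary annotations
data KernelExpr   = KernelExpr     -- fully-expanded kernel-language code

-- The expander: partial evaluation of a syntax object in a given environment,
-- yielding kernel-language code.
expand :: Environment -> SyntaxObject -> KernelExpr
expand = undefined   -- sketch only

-- A macro is ordinary code of this shape: it receives the environment of its
-- invocation and the body syntax object, and must return fully-expanded kernel
-- code, typically by calling the expander recursively on sub-forms.
type Macro = Environment -> SyntaxObject -> KernelExpr
```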
Most of what users see as "the language" will just be macros provided in the standard environment; in effect, the standard environment's macros are a compiler into the kernel language, and different initial environments will provide very different languages.
The kernel language no longer uses symbol bindings in a lexical environment. All symbolic variable names in the input code will have been replaced with the value bound to that symbol, if it's statically known in the environment; the only case where they aren't known statically is when they are the arguments to lambda expressions, and these are referenced (hygienically) by pointing directly at the specific lambda expression node rather than by name. Lexical environments are not returned by macro expansion, as they have no meaning in the kernel language.
The result of macro-expansion is an expression in the minimal kernel language (with embedded source location metadata). The partial evaluation process will have simplified the code to a great extent, by performing applications of known functions to known arguments and the like, and the code is at this point an arbitrary DAG rather than a tree: shared subexpressions may exist. Note that expressions in the kernel language are a different type from input expressions - expansion is a bridge between two different languages.
This can then be type-inferred and type-checked, annotating it with types (or rejecting it as incorrect). Type checking may involve arbitrary computation, so this stage might be aborted due to not completing in reasonable time - although the type system will attempt to deduce a subset of functions that can be shown to halt for all inputs, and will give an error, rather than trying, if type expressions can be shown to NOT halt for all inputs!
We can then perform strictness analysis and wrap lazy values in thunks, thereby converting the expression into a strict dialect of the kernel language.
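One way to picture that strict dialect (a Haskell sketch): laziness becomes explicit data, so a value is either already evaluated or a suspended computation to be forced on demand.

```haskell
-- After strictness analysis, only the positions that genuinely need laziness
-- are wrapped as thunks; everything else is passed as a plain forced value.
data Lazy a
  = Forced a          -- already evaluated (or proven safe to evaluate eagerly)
  | Thunk (() -> a)   -- still-suspended computation

force :: Lazy a -> a
force (Forced x) = x
force (Thunk f)  = f ()
```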
The end result, a type-annotated expression in the kernel language, can then be finally interpreted to produce a value. All reduction that's possible in advance will already be done - the only reduction that will still be possible is to replace lambda expressions with function objects, which (at the implementation's option) may involve some form of compilation to directly executable code. I'm leaning towards the CPS transformation as an implementation model, as documented elsewhere, for this final stage, which will involve an intermediate representation of the expression in a version of the minimal core language with explicit continuations.
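To give a flavour of the CPS form (Haskell again, purely illustrative): every function takes an explicit continuation, so control flow becomes ordinary data flow, which is what makes tail calls and captured partial continuations cheap to express.

```haskell
-- A continuation consumes an intermediate result and produces the final answer.
type Cont r a = a -> r

addCPS :: Int -> Int -> Cont r Int -> r
addCPS x y k = k (x + y)

squareCPS :: Int -> Cont r Int -> r
squareCPS x k = k (x * x)

-- (x + y)^2, with the continuation threaded explicitly between the two steps.
squareOfSum :: Int -> Int -> Cont r Int -> r
squareOfSum x y k = addCPS x y (\s -> squareCPS s k)
```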
The HYDROGEN code generator will perform low-level peephole optimisations, but it assumes it will be driven by a high-level optimising compiler most of the time - so HYDROGEN is under no particular pressure to attempt to identify high-level language patterns and optimise them.