The Myrddin Programming Language Draft, Mon Oct 26 11:32:05 EDT 2009 Ori Bernstein Abstract: Myrddin is designed to be a simple, low level programming language. It is designed to provide the programmer with a great deal of control over the generated object code, while at the same time providing many productivity boosts from things like strong type checking, type inference, and functional concepts. Myrddin is not a language exploring the forefront of type theory or of compiler technology, but instead aims to be practical, small, and easy to use for low level kernel and embedded work. Introduction: A Myrddin program is structured around declarations of variables and constants. Every named value is either a variable or a constant, with no distinction provided between functions, tuples, and other values. Closures are supported, Comments: A comment begins with '/*' and ends with '*/'. Comments may nest. Keywords: Keywords: Namespaces: pkg - Specifies the module a source file is in. Can only be defined once per module. use - Requests a module (or symbols within a file) to be loaded. If the word following it is a raw identifier, it searches the module in the system paths, otherwise it loads the file named, relative to the current directory. Visibility Modifiers: export - Specifies that symbols between this and the next visibility modifier should be exported. share - Specifies that the symbols between this and the next visibility modifier are module-local. This is the default scope. local - Specifies that the symbols between this and the next visibility modifier should be file-local Storage Class Modifiers: extern - The following symbol is declared outside this module. A place holder, if you will. Storage Types: const - The following symbol is a compile-time constant, and should be read-only. var - The following symbol is a variable, and should be modifiable to the program. Control flow: while - A standard while loop for - A standard for loop goto - A standard jump statement if - A standard if expression else - Joined to the if switch - Switch statement case - A match case. Unlike C, this does not fall through. default - If there are no matches, do this. User types: struct - A user-defined structure. union - A user-defined union, where all members start at offset 0. enum - A user defined enumeration. May be cast into an int. Declarations: In Myrddin, a declaration is either a const value or a variable value. A const value is immutable, and is fixed at compile time, ideally being placed into read-only memory. A declaration begins with any number of storage class modifiers, as enumerated in the previous section. Repetitions are allowed, and are simply no-ops. This is then followed by one of the two storage types, 'const' or 'var', followed by the name of the variable being declared. Optionally, this is followed '::' followed by the type. This is then optionally followed by an '=' and an initializer. Examples: extern const write :: (fd::int, buf::void*, len::uint) var a const b = 42 volatile const x :: int = 123 If a type is not provided, it is inferred. Types: Built-in types: void The unit type. This type cannot be assigned or otherwise manipulated. bool A Boolean type. The value of this is either 'true' (equivalent to any non-zero) or 'false', equivalent to a zero value. The size of this type is undefined. char A value representing a single code point in the default encoding. The encoding is undefined, and the value of the character is opaque. int8 int16 int32 int64 int uint8 uint16 uint32 uint64 uint Integer types. For the above types, the number at the end represents the size of the type. The ones without a number at the end are of undefined type. These values can be assumed to be in two's complement. The semantics of overflowing are yet to be specified. float32 float64 complex Floating-point types. The representation is yet to be defined. Complex is likely going to be dropped from the language. Compound types: Functions: Function types are specified with an initial '(', which may be omitted in certain special cases such as the parameter list of a function literal. This is then followed by a list of argument names, possibly with a type specifier. This is followed by an optional return type, and terminated with a ')' Examples: (a,b,c -> int) a function with 3 arguments named a, b, and c, with types to be inferred later, returning an int (a::int) a function with a single int argument, with a return type left to be inferred. (a::int -> int) a function with a single int-valued argument named 'a', returning an int Structs and Unions: A struct must be defined to be used. It is defined as follows: struct ident member :: type ;; The name (in this case, 'ident') may then be used at any place (including before its definition) to specify a type as such: var sinstance :: ident Unions follow the same rule, although the offsets of all members are zero, meaning that a union may not hold multiple values at once. The size of a union is determined by the largest element it holds, with padding to align, and the alignment is determined by the largest alignment. Enums: An enumeration is a list of names that represent integers. Tuples: Type inference: Constants: Integer constants: 0x[0-9A-F]+ - A hexadecimal integer constant. Type: int [0-9]+ - A decimal integer constant. Type: int [0-9]*\.[0-9]+ - A floating point constant. Type: float64 String constants: "stringchar*" - A string containing all stringchars between quotes. Stringchars are either single characters, or quoted escape codes (all escape codes compatible with C) 'c' - A character literal. Compound constants: {; } - A function constant. A function constant is started by an opening '{', which is immediately followed by a function prototype, possibly with the braces omitted. The function prototype is terminated by a newline. This is then followed by a block of code, and a closing '}' [a,b,..] - A tuple literal. Started by a '[', with a ',' delimited list of values, terminated by a ']' FIXME: - An array literal. Started by a SOMETHING, ARRAY CONSTANTS containing a ',' delimited list of values, and terminated by a SOMETHING Expressions: Operators (categories in order of precedence): FIXME: Specify results and rules precisely. Namespace: : Namespace lookup Postfix: :: Cast to . Member lookup ++ Postincrement -- Postdecrement [] Array at () Function call Prefix: ++ Preincrement -- Predecrement ! Logical negation ~ Bitwise negation & Address-of * Dereference Shift: << Shift left >> Shift right Multiplicative: * Multiply / Divide % Modulo Additive: + Add - Subtract Relational: < Less than <= Less than or equal to > Greater than >= Greater than or equal to != Not equal to == Equal to Bitwise And: & Bitwise and Bitwise Or: | Bitwise or ^ Bitwise exclusive or Logical And: && Short circuiting logical conjunction Logical Or: || Short circuiting logical disjunction Assignment: = Variable assignment <<= >>= *= /= Operator-assignment. = is shorthand %= &= |= ^= for lhs = lhs rhs Return: -> Returns a value from a function Examples: A Hello World Program: /* define the 'write' syscall, provided by libc */ extern const write :: (fd::int, buf::void*, len::int -> int) /* main needs to have its symbol exported */ export: /* define main */ const main = { /* let the type be inferred to be (->void); ie, no args, and returning void */ /* actually call write */ write(1, "Hello World\n"::void*, 12) } Grammar: Intro: When defining the grammar, the following conventions are used; Nonterminals are defined in lower-case, terminal tokens are defined in CamelCase, and "quoted strings" are used for literal tokens. Optional productions are surrounded by '[' and ']', and productions repeated 0 to N times are followed by a '*' Misc Tokens: Eof = EOF Ident = [a-zA-Z_][a-zA-Z0-9]* IntLit = 0x[0-9a-fA-F]+ | 0-9+ Char = [^\] | \\ | \" | \' | \0 | \n | \r | \a | \t | \b | \[0-0A-Fa-f][0-0A-Fa-f] StrLit = "Char*" CharLit = 'Char' RealLit = [0-9]*.[0-9]+ BoolLit = "true" | "false" Top Level: myrddin ::= toplevel* Eof toplevel ::= empty-line module use visibility decl struct-def union-def enum-def module ::= "module" IDENT use ::= "use" Ident "use" StrLit visibility ::= "export:" "share:" "local:" User Type Definitions: struct-def ::= "struct" Ident Newln struct-body EndBlk union-def ::= "union" Ident Newln struct-body EndBlk enum-def ::= "enum" Ident enum-body EndBlk struct-body ::= Ident "::" type enum-body ::= Ident ["=" expr] Declarations: decl ::= modifiers decltype decl-core ["=" expr] EndStmt decl-core ::= Ident ["::" type] modifiers ::= modifier* modifier ::= "extern" decltype ::= "const" "var" label ::= ":" Ident Types: Type ::= Ident type "*" type "[" [expr] "]" functype tupletype tupletype ::= "[" tupletype-body "]" tupletype-body ::= type tupletype-body "," type functype ::= "(" functype-body ")" functype-body ::= decl-core functype-body "," decl-core functype-body "->" type Expressions: ret-expr ::= "->" expr expr expr-list ::= expr expr-list "," expr expr ::= asn-expr asn-expr ::= orl-expr asnop asn-expr orl-expr orl-expr ::= orl-expr "||" andl-expr andl-expr andl-expr ::= andl-expr "&&" orb-expr orb-expr orb-expr ::= orb-expr "|" andb-expr andb-expr andb-expr ::= andb-expr "&" cmp-expr cmp-expr cmp-expr ::= cmp-expr cmp-op add-expr add-expr add-expr ::= add-expr add-op mul-expr mul-expr mul-expr ::= mul-expr mul-op shift-expr shift-expr shift-expr ::= shift-expr shift-op prefix-expr prefix-expr prefix-expr ::= prefix-op postfix-expr postfix-expr postfix-expr ::= toplev-expr "." Ident toplev-expr "::" type toplev-expr "++" toplev-expr "--" toplev-expr "[" expr "]" toplev-expr "(" expr-list ")" arglist ::= expr arglist "," expr toplev-expr ::= qual-name "(" expr ")" if-expr while-expr for-expr literal qual-name: Ident qual-name ":" Ident if-expr ::= "if" expr EndStmt block ["else" block endblk] "if" expr EndStmt block-body "else" block while-expr ::= "while" expr EndStmt block for-expr ::= "for" expr EndStmt expr EndStmt expr EndStmt block Literal Values: literal ::= IntLit CharLit BoolLit StrLit FloatLit tuple-lit func-lit tuple-lit ::= "[" expr-list "]" func-lit ::= "{" func-lit-sig EndStmt block-body "}" func-lit-sig ::= functype functype-body Blocks: block ::= block-body EndBlk block-body ::= block-item block-body block-item block-item ::= EndStmt ret-expr decl label Type Rules: FIXME: Fill this in. Structure of the Compiler: The Myrddin compiler is split into 2 different parts: There is the parser, which is a library independent of the backend. It is packaged separately as libmyrparse. The compiler proper then takes the AST produced by libmyrparse, and compiles it into assembly. Currently, it simply writes this assembly out to a file, leaving invocation of the assembler to the programmer or makefile. The parser is a standard, hand-coded recursive decent parser, contained in m-parse.c, producing a parse tree described in m-tree.h. The types of tree are defined in m-tree-types.def and m-tree-type.h. The semantics of this tree are checked within m-semcheck.c, and type inference is also performed there. The compiler proper (myrc/src) middle-end then lowers the generated AST to a simple three address code format (something that should eventually be converted to an SSE representation), defined in myrc-il.c, ignores the chance to do optimziations (again, something that should eventually be fixed), then converts it to a selection DAG, defined in myrc-seldag.c, and hands it off to the backend to be compiled. The backend is free to do whatever it wants to do, although normally it should canonicalize and fold the tree (fixing the sizes of unfixed-size operands, inserting emulation for 64-bit ints on a 32-bit machine, and all that jazz), and then selecting the instructions. Typically, this tree matching is done with a .md source file generated by myrparse/cgg, since it is far more pleasant to write pattern matches than to manually code them up. Then, the register allocator gets run on the result, and the assembly gets written out to a file. The backend is controlled by the frontend through a callback mechanism. To teach the compiler about a new backend, a MyrcBe struct (defined somewhere in the backend code) is added as an extern symbol to src/myrc-ctx.h, and inserted into the table of backends in src/myrc-ctx.c. The fields in this struct are as follows: name - A string that would be used to select this backend. This is what the compiler uses with the '-b' argument. fold - The canonicalization/fold function mentioned above fnasm - Assembles a single function. peep - does peephole optimiaztion regalloc - allocates registers. write - dumps assembly to a file data - any extra data you want to pass to the functions The control flow graph data structures are shared throughout the Bugs: We don't know how to deal with partially visible types. Assume you have an exported datatype (let's call it foo) which contains a private data structure (let's call it bar). How should the 'bar' be treated? eg: private: struct x ... ;; public: struct y foo::x* ;; And in another file, what does 'y.foo' do?: y.foo Some options include: - Disallowing public types to refer to private types. This causes difficulty in producing data structures without making everything public. - Forcing it to be an incomplete type (and thus only a pointer to it may be taken), and poisoning the type name. (or exporting it as a void*) This doesn't fragilize ABI, but it causes efficiency and elegance problems. Plus, it's an ugly hack. - Replacing the type with a "dummy type" This makes ABI fragile, as a change to internal data structures can theoretically break external programs, but allows easy and efficient use of the data types. - Export them, but as "hidden types". This also leads to a fragile ABI, but seems like it may be less dirty than the dummy type solution. This is probably the best. - Other options? Mechanisms for exporting symbols within a module are fragile and expensive Currently, to import symbols within a module, we have the use "filename.myr" directive. This is convenient, but means that we must fully parse and typecheck filename.myr, which may be expensive, but also causes problems with mutually recursive modules. Some fixes include: - Forcing exported symbols to have fully qualified types. This violates the "Don't repeat yourself" principle, and it looks esthetically hideous. - Disallowing mutual recursion. This makes modules inflexible. - Simply throwing an error if types don't check with mutually recursive modules. This complicates the usefile loader, but seems the least disgusting. Generics: I want to implement parametric polymorphism. I just need to figure out how to do so efficiently - Emit bytecode to the usefiles, and turn that into code when I build them? - "box" internally? Type Classes: Do I want to add type classes? Libmyrparse isn't suited for all situations: - It emits it's own error messages, meaning that errors are hard to trap and deal with sanely if you were, say, writing an editor. - It has poor error messages in many cases - The location mechanism is too complicated. It was done as a case of premature optimization, and should be torn out and replaced with a slightly nicer mechanism. Ideally, character ranges would be added to it. Implementation incomplete: - Currently, we don't cover much of the IR for the code, and we have lots of silly bugs. - There is only one target supported - x86. ARM, x86_64, and others should be added. - There is no floating point support implemented in the x86 port - The register allocator blows monkey balls - CGG produces really bad error messages - perhaps #line directives should be generated? - Swich (and/or match) statements aren't implemented. Documentation incomplete: - This is about all the documentation there is. We've got bugs galore: Open your bloody eyes, and look. Ceci n'est pas un production compiler.