Automatic C Binding Generation for Myrddin
The Myrddin C binding generator, mcbind, now exists.
Until recently, creating Myrddin bindings for C libraries was possible, but painful. The functions had to be wrapped up by hand, one by one, making sure that the extern declarations in Myr matched the ones in C. If there was a mismatch, nothing warned you: you'd just get a silent failure.
This approach is still possible, but I've been working on a way to generate this kind of code automatically.
Building on qc, the C compiler written in Myrddin by Andrew Chambers, I've put together code that parses C headers, and spits out the glue code.
It's very rough around the edges -- just barely past demoware -- but I can compile and link to significant modules like libsqlite3, as well as the basics like libc.
What Exists
The implementation lives here:
https://git.eigenstate.org/ori/mcbind.git/tree/
It installs a program named mcbind, which works like so:
mcbind [-h?] [-b pkg] [-I inc] [-D def] [-l lib] file.h...
-h print this help message
-? print this help message
-b pkg generate bindings
-I inc add 'inc' to your include path
-D def define a macro in the preprocessor e.g. -Dfoo=bar
-l lib link against library 'lib'
You can use it like this:
mcbind -lc -lsqlite3 \
-I /usr/include -I/usr/local/include \
-D__GNUCLIKE_BUILTIN_STDARG \
-b sqlite3 \
sqlite3.h
Or you can use it from a bld.proj file:
bin foo =
	# The code that uses the bindings.
	# This is what you write.
	main.myr
	# The generated source code, from
	# the gen rule below.
	stdio.myr
	stdio.glue.c
;;
# Generate the source files.
gen stdio.myr stdio.glue.c {dep=stdio-incs.h} =
	mcbind -lc -I /usr/include -b stdio stdio-incs.h
;;
The sqlite3 command shown earlier generates a giant module with all of the sqlite3 types (and everything included by sqlite3.h). You can call it like this:
use std
use "sqlite3"
const main = {
	rc = sqlite3.sqlite3_open(("db.sqlite\0" : byte#), &db)
	if rc != 0
		std.fatal("error: {}\n", rc)
	;;
	rc = sqlite3.sqlite3_exec(db,
		("SELECT * FROM table;\0" : byte#),
		(callback : sqlite3.cfunc#),
		(0 : void#), &errmsg)
	if rc != 0
		std.fatal("error: {} ({})\n",
			rc, std.cstrconvp(errmsg))
		sqlite3.sqlite3_free((errmsg : void#))
	;;
	sqlite3.sqlite3_close(db)
}
Unfortunately, the chosen example, Sqlite3, needs a small amount of tweaking: The header exports some functions that are not in the binary, which causes missing symbols -- these need to be removed by hand right now.
I've updated the mcbind example code to reflect the current state of the art, here: https://git.eigenstate.org/ori/cbind-example.git
It's been tested on one platform with a small handful of test cases, so expect bugs. However, barring bugs and warts, it's intended to work on all platforms except Plan 9, where the Myrddin ABI does not match up in any usable way with the C ABI.
Next Steps
There are a number of known rough edges. Strings need to explicitly be null terminated, since Myrddin allows slices into strings. They need to be cast to a C string explicitly. And we don't currently strip C namespace prefixes from functions, so you get sqlite3.sqlite3_foo, instead of the preferred sqlite3.foo. And macros are currently ignored entirely, which means defined constants are often missing. The generated C code contains far too many warnings, some of which would be trivial to fix, like paying attention to 'const' on parameters.
Some of these, such as getting basic defined constants, are shallow problems that only require a bit of typing to solve.
In addition, there are a number of defines that trigger other defines, which are needed in order for some code to compile correctly. Over time, we'll accumulate these.
The harder part of the problem comes down to namespacing. I'm thinking of writing something inspired by the C++ module proposal, with changes to deal with the fact that Myrddin namespaces are kind of coupled to the symbol name.
This means that to create a binding, you'd write something along the lines of:
bind zlib_c {
	/*
	   we don't want zlib_c.int32_t, we want cstd.int32_t
	   to be shared with all C binding code
	*/
	depend "cstd"
	include "zlib.h"
}
A more complicated example might go along these lines:
bind gtk {
	depend "cstd"
	/* use pkg-config to get the include path and headers */
	pkg "gtk+-3.0"
	header "gtk/gtk.h"
	/*
	 * GTK includes headers for a lot of deps,
	 * we should only export the ones we care
	 * about.
	 */
	export "@/gtk/*"
	/*
	 * we already have gtk in the namespace, strip off
	 * this prefix so we can do `gtk.foo` instead of
	 * `gtk.gtk_foo`
	 */
	prefix = gtk_
}
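The prefix rule in that sketch boils down to a small string operation. Here is a minimal illustration in C, assuming the rule is "drop the prefix if the name starts with it, otherwise leave the name alone". The function name strip_prefix is hypothetical and is not part of mcbind, which is written in Myrddin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from mcbind.
 * Returns the part of `name` that follows `prefix`. The name is
 * returned unchanged if it does not start with the prefix, or if
 * stripping would leave an empty identifier.
 */
const char *
strip_prefix(const char *name, const char *prefix)
{
	size_t n;

	assert(name != NULL && prefix != NULL);
	n = strlen(prefix);
	if (strncmp(name, prefix, n) == 0 && name[n] != '\0')
		return name + n;
	return name;
}
```

With this rule, strip_prefix("gtk_window_new", "gtk_") yields "window_new", while a dependency symbol such as "g_free" passes through untouched. A real implementation would also have to decide what to do when two stripped names collide.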
And possibly, we might want a way of mapping improved names and inserting automatic transformers between the types, so that we can automatically convert char* and Myrddin strings around call sites.
This may take the approach of pointing the code generator at Myrddin source that implements the more specific functions:
const strstr = {str : byte[:]
_strstr(cbindutil.cstr(str))
}
This leaves C strings, which I think are simply going to continue to be painful -- there's no good, general solution, so wrapping with functions is likely to be the best choice.
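To make the pain concrete: a Myrddin string is a pointer plus a length, so a wrapper has to copy it into a NUL-terminated buffer before C can use it. The following is a rough sketch in C of what such a conversion involves. The name slice_to_cstr is hypothetical, and this is not how the hypothetical cbindutil.cstr mentioned above is necessarily implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from mcbind.
 * Copies `len` bytes starting at `data` into a freshly allocated,
 * NUL-terminated buffer. Returns NULL if allocation fails.
 * The caller owns the result and must free() it.
 */
char *
slice_to_cstr(const char *data, size_t len)
{
	char *out;

	assert(data != NULL || len == 0);
	out = malloc(len + 1);
	if (out == NULL)
		return NULL;
	if (len > 0)
		memcpy(out, data, len);
	out[len] = '\0';
	return out;
}
```

The copy and the extra ownership rule are the cost of every call that crosses the boundary. A slice that contains an embedded zero byte will also look truncated from the C side, which is one reason there is no fully general conversion.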
How It Works
The binding generator is implemented in pure Myrddin code. It's conceptually composed of three parts:
- A C parser, which invokes some callbacks at significant places, such as a struct definition showing up. The definitions collected by these callbacks are pulled together and handed off to the glue generators.
- A C glue generator, which produces trivial mangled wrapper functions.
- A Myr glue generator, which produces less trivial wrapper functions, prototypes, namespacing, and papers over the differences between C and Myrddin.
The C generator outputs a file named pkg.glue.c. The Myrddin build system already knew about .glue.c files, which are C source, compiled with a C compiler, but which follow some conventions to simplify the compilation and linking process.
The first of these conventions is that these files are standalone. They don't depend on any other locally included headers, which means that there is no need for dependency tracking.
The second of these conventions is that they contain comments with the cflags and libraries that the C code uses:
/* CFLAGS: -I/usr/local/include -DFOO -DBAR */
/* LIBS: c x11 crypt */
The cflags are passed to the C compiler when compiling the .glue.c file to a .glue.o file. The libs are passed along to the final link, where they are added as dynamic dependencies.
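Pulling those comments out of a glue file takes very little code. Below is a minimal sketch in C of how a build tool might read one such line, assuming the directive occupies a whole line in exactly the form shown above. The function glue_directive is a hypothetical name, and this is not the Myrddin build system's actual parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the Myrddin build system.
 * If `line` has the form "/* KEY: value *" "/", copies the trimmed
 * value into `out` and returns 1. Returns 0 if the line does not
 * match or the value does not fit in `outsz` bytes.
 */
int
glue_directive(const char *line, const char *key, char *out, size_t outsz)
{
	const char *p, *end;
	size_t klen, n;

	assert(line != NULL && key != NULL && out != NULL && outsz > 0);
	p = line;
	/* Expect the comment opener, then optional blanks. */
	if (strncmp(p, "/*", 2) != 0)
		return 0;
	p += 2;
	while (*p == ' ' || *p == '\t')
		p++;
	/* Expect the key followed by a colon. */
	klen = strlen(key);
	if (strncmp(p, key, klen) != 0 || p[klen] != ':')
		return 0;
	p += klen + 1;
	/* The value runs up to the comment closer. */
	end = strstr(p, "*/");
	if (end == NULL)
		return 0;
	/* Trim whitespace on both sides of the value. */
	while (p < end && isspace((unsigned char)*p))
		p++;
	while (end > p && isspace((unsigned char)end[-1]))
		end--;
	n = (size_t)(end - p);
	if (n >= outsz)
		return 0;
	memcpy(out, p, n);
	out[n] = '\0';
	return 1;
}
```

Called with the key "CFLAGS" on the first example line, this fills the buffer with "-I/usr/local/include -DFOO -DBAR"; with the key "LIBS" on the second, it yields "c x11 crypt". Keeping the metadata in comments means the glue file remains ordinary C that any compiler accepts.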
Functions are wrapped up with name-mangled wrappers. The Myrddin name mangling scheme for functions is simply
namespace$function
So the output of the C generator looks something like:
int
sqlite3$sqlite3_open(char (*_a0), sqlite3 (**_a1))
{
	return sqlite3_open(_a0, _a1);
}
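A wrapper like this is mechanical enough to be produced by plain string formatting. The sketch below shows the idea in C, under the simplifying assumption that every parameter is a base type plus some number of pointer levels. The names struct param, append and emit_wrapper are hypothetical; mcbind itself is written in Myrddin and walks a real C AST.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parameter description: a base type plus pointer depth. */
struct param {
	const char *base;	/* e.g. "char" or "sqlite3" */
	int nptr;		/* number of '*' in the declarator */
};

/* Appends formatted text at *off. Returns 0 if it would not fit. */
static int
append(char *buf, size_t sz, size_t *off, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf + *off, sz - *off, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= sz - *off)
		return 0;
	*off += (size_t)n;
	return 1;
}

/* Writes a forwarding wrapper named ns$fn into buf. Returns 1 on success. */
int
emit_wrapper(char *buf, size_t sz, const char *ns, const char *ret,
    const char *fn, const struct param *ps, size_t nps)
{
	size_t off = 0, i;
	int j, ok;

	assert(buf != NULL && sz > 0);
	/* Signature: return type, then the mangled name namespace$function. */
	ok = append(buf, sz, &off, "%s\n%s$%s(", ret, ns, fn);
	for (i = 0; ok && i < nps; i++) {
		ok = append(buf, sz, &off, "%s%s ", i ? ", " : "", ps[i].base);
		if (ok && ps[i].nptr > 0) {
			/* Pointer parameters are written as "(**_aN)". */
			ok = append(buf, sz, &off, "(");
			for (j = 0; ok && j < ps[i].nptr; j++)
				ok = append(buf, sz, &off, "*");
			ok = ok && append(buf, sz, &off, "_a%zu)", i);
		} else if (ok) {
			ok = append(buf, sz, &off, "_a%zu", i);
		}
	}
	if (ok && nps == 0)
		ok = append(buf, sz, &off, "void");
	/* Body: forward every argument to the real C function. */
	ok = ok && append(buf, sz, &off, ")\n{\n\t%s%s(",
	    strcmp(ret, "void") == 0 ? "" : "return ", fn);
	for (i = 0; ok && i < nps; i++)
		ok = append(buf, sz, &off, "%s_a%zu", i ? ", " : "", i);
	ok = ok && append(buf, sz, &off, ");\n}\n");
	return ok;
}
```

Given the namespace "sqlite3", return type "int", function "sqlite3_open" and the parameters {"char", 1} and {"sqlite3", 2}, this reproduces the wrapper text shown above. Arrays, function pointers and qualifiers such as const would need a richer description than this two-field struct.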
The Myr generator creates a .myr file with extern declarations that forward to these implementations, something like:
pkg sqlite3 =
	extern const sqlite3_open : \
		(a0 : char#, a1 : sqlite3## -> int)
;;
The code for generating C declarations from the AST lives here. It is a fairly straightforward recursive walk of the AST.
The code for generating Myrddin glue is more complicated, due to C concepts that do not translate directly. The code needs to deal with incomplete types, tentative declarations, and colliding names. It lives here.
The C parser exists here, again implemented in pure Myrddin.
It's pretty gratifying to be able to parse C well enough to generate bindings for system libraries -- without a single external dependency or library.