CC - a self-hosting, bootstrappable, minimal C compiler
Introduction
In the never-ending quest for a minimal system, I found Swieros and C4 (the C compiler in 4 functions). Inspired and intrigued, I started to implement my own.
For abaos (a small operating system of mine, also in C) I cloned the minimal C library, so we can build a freestanding version of C4.
C4 serves as a test of whether my own CC is minimal enough and doesn't use silly functions. Additionally, both C4 and CC are compiled in a hosted version (on Linux) and in a freestanding version. We use a series of compilers (gcc, clang, tcc and pcc) to make sure that we are not using silly C constructs.
In order to be able to port easily, we make almost no use of system calls. The ones we need are:
- brk: for malloc/free; changes the end address (the program break) of the heap segment of the process. If the OS only assigns a single static space, brk becomes a NOP.
- exit: terminates the process. A plain return does not work in all combinations (for instance with pcc on Linux). Can be a NOP; we don't require any trickery such as atexit, and we don't use buffering anywhere (so there is, for instance, no need to flush stdout on exit).
- read/write: read from stdin linearly, write to stdout linearly; this is essentially a model using an input and an output tape. These two functions must really exist. This basically eliminates the need for a file system, which we might not have during early bootstrapping.
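The case where the OS hands out a single static space and brk is a NOP can be pictured with a bump allocator. This is only a minimal sketch under stated assumptions, not the allocator of the freestanding libc: the names arena_alloc, arena_free and ARENA_WORDS are hypothetical, and freeing is deliberately a no-op.

```c
/* Hypothetical sketch: malloc/free on top of one fixed memory region,
   as it could look when brk is a NOP. Not the project's actual code. */
#define ARENA_WORDS 1024

static int arena[ARENA_WORDS];  /* the single static space, word aligned */
static int arena_used = 0;      /* number of bytes handed out so far */

/* Return n bytes from the arena, or 0 if the request does not fit. */
char *arena_alloc(int n)
{
    int word = (int)sizeof(int);
    int size = (int)sizeof(arena);
    char *p;

    if (n <= 0 || n > size) return 0;
    n = (n + word - 1) / word * word;      /* round up to whole words */
    if (n > size - arena_used) return 0;   /* arena exhausted */
    p = (char *)arena + arena_used;
    arena_used += n;
    return p;
}

/* Nothing is ever given back in this sketch. */
void arena_free(char *p)
{
    (void)p;
}
```

A compiler that allocates its tables once and then exits can live with such an allocator, which is why a NOP brk is acceptable in that setting.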
Similarly, we simplify the C language by avoiding certain features which can cause trouble when bootstrapping:
- variable arguments: though simple in principle (just some pointers into the stack if you use a stack for function parameters), they are not typesafe. In practice, the only place where they are really heavily used is in printf-like functions.
- preprocessor: it needs a file system; we take this outside of the compiler by feeding it a (possibly concatenated) list of *.c files.
- types: only int and char, so we can interpret memory as words or as bytes.
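Dropping variable arguments means that printf-like calls are split into one call per value. The following is a minimal sketch of how such helpers could look using only int and char. The names putstring, putint and putnl appear in this document, but their signatures here are assumptions, and format_int is a hypothetical helper. The digit loop assumes that integer division truncates toward zero, as C99 and later guarantee.

```c
#include <stdio.h>

/* Hypothetical helper: write the decimal form of n into buf (at least
   12 bytes for a 32-bit int), NUL-terminate it, return its length. */
int format_int(int n, char *buf)
{
    char tmp[12];
    int len = 0;
    int i = 0;
    int neg = n < 0;

    /* Work on the non-positive value so the smallest int needs no special case. */
    if (!neg) n = -n;
    do {
        tmp[len++] = (char)('0' - n % 10);  /* n % 10 is in -9..0 here */
        n /= 10;
    } while (n != 0);
    if (neg) tmp[len++] = '-';
    while (len > 0) buf[i++] = tmp[--len];  /* digits were produced in reverse */
    buf[i] = 0;
    return i;
}

/* Assumed signatures; the real ones may differ. */
void putstring(char *s)
{
    while (*s) putchar(*s++);
}

void putint(int n)
{
    char buf[12];
    format_int(n, buf);
    putstring(buf);
}

void putnl(void)
{
    putchar('\n');
}
```

A call such as printf("%s: %d\n", name, count) then becomes putstring(name); putstring(": "); putint(count); putnl(); and every argument is type checked by the compiler.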
Local version of C4
The local version of C4 has the following adaptations and extensions:
- switch statement from the switch-and-structs branch; c4 itself was adapted to use switch statements instead of ifs (as in the switch-and-structs branch)
- struct support from switch-and-structs
- constants like EOF, EXIT_SUCCESS, NULL
- standard C block comments alongside C++ end-of-line ones
- negative enum initializers
- do/while loops
- more C functions like isspace, getc, strcmp
- some simplified functions for printing like putstring, putint, putnl replacing printf-like functions
- BSD-style string functions like strlcpy, strlcat
- strict C89 conformance, mainly use standard comment blocks, also removed some warnings
- some casts around malloc and memset to fit to non-void freestanding-libc
- converted printf to putstring/putint/putnl and some helper functions for error reporting like error()
- removed all memory leaks
- de-POSIX-ified, no open/read/close, use getchar from stdin only (don't assume the existence of a file system), this also means we had to create sort of an old style tape-file with FS markers to separate the files piped to c4.
Note: only too late did I discover that there is a C5 version of the same compiler, which might have served better as a basis.
Examples
Running on the host system using the host's C compiler
Compile in either hosted mode (host libc) or freestanding mode (our own libc, currently using IA-32 Linux kernel syscalls only):
./build.sh cc hostcc hosted d
./build.sh cc hostcc freestanding d
./cc < test1.c > test1.asm
Create a plain binary from the assembly code:
fasm test1.asm test1.bin
Disassemble it to verify its correctness:
ndisasm -b32 -o1000000h -a test1.bin
You can choose gcc, clang, tcc or pcc as host compiler (hostcc).
Running on the host in the C4 interpreter
Running in the C4 interpreter: again, the C4 program can be compiled in hosted or freestanding mode:
./build.sh c4 hostcc hosted d
./build.sh c4 hostcc freestanding d
Here again you can choose the host compiler for compiling C4.
Then we have to create the standard input for C4 using:
echo -n -e "\034" > EOF
cat cc.c EOF hello.c | ./c4
cat c4.c EOF cc.c EOF hello.c | ./c4
cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
EOF contains the traditional FS (file separator) character of the ASCII character set. Every time c4 is invoked, it reads exactly one input file up to the first FS character (or stops at the end of stdin).
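The reading rule can be sketched as follows. This is an illustration under assumptions, not the code in c4: struct tape, tape_getc and read_one_file are hypothetical names, and the tape is an in-memory buffer standing in for stdin so that the behaviour is easy to check. With getchar in place of tape_getc the loop stays the same.

```c
#define FS 28            /* ASCII file separator, octal 034 */
#define END_OF_TAPE (-1) /* stands in for EOF on stdin */

/* Hypothetical in-memory stand-in for the input tape. */
struct tape {
    const char *data;
    int len;
    int pos;
};

/* Return the next character of the tape, or END_OF_TAPE. */
int tape_getc(struct tape *t)
{
    if (t->pos >= t->len) return END_OF_TAPE;
    return (unsigned char)t->data[t->pos++];
}

/* Read one file: everything up to (and consuming) the next FS, or up to
   the end of the tape. Stores at most cap - 1 characters in out, always
   NUL-terminates, and returns the number of characters stored. */
int read_one_file(struct tape *t, char *out, int cap)
{
    int c;
    int n = 0;

    while ((c = tape_getc(t)) != END_OF_TAPE && c != FS) {
        if (n < cap - 1) out[n++] = (char)c;
    }
    out[n] = 0;
    return n;
}
```

Because each reader consumes its own file together with the separator, the next level of the interpreter chain finds the tape positioned at the start of its own input.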
We can also use -s or -d on every level as follows:
cat cc.c EOF hello.c | ./c4 -d