CC - a self-hosting, bootstrappable, minimal C compiler

Introduction

On the never-ending quest for a minimal system, I found Swieros and C4 (the C compiler in four functions). Inspired and intrigued, I started to implement my own.

For abaos (a small operating system of mine, also in C) I cloned the minimal C library, so we can build a freestanding version of C4.

C4 serves as a test of whether my own CC is minimal enough and doesn't use silly functions. Additionally, both C4 and CC are compiled in a hosted version (on Linux) and in a freestanding version. We build with a series of compilers (gcc, clang, tcc and pcc) to make sure that we are not using any other silly C constructs.

To keep porting easy, we make almost no use of system calls. The ones we need are:

Similarly we simplify the C language to not use certain features which can cause trouble when bootstrapping:

Local version of C4

The local version of C4 has the following adaptations and extensions:

The reason for all those adaptations is to minimize the dependency on the host system and to be able to use libc-freestanding.c.

Note: I discovered only too late that there is a C5 version of the same compiler, which might have served better as a basis.

Examples

Running on the host system using the host's C compiler

Compiled in either hosted mode (host libc) or freestanding mode (our own libc, currently IA-32 Linux kernel syscalls only):

./build.sh cc hostcc hosted d
./build.sh cc hostcc freestanding d
./cc < test1.c > test1.asm

Create a plain binary from the assembly code:

fasm test1.asm test1.bin

Disassemble it to verify its correctness:

ndisasm -b32 -o1000000h -a test1.bin

You can choose gcc, clang, tcc or pcc as host compiler (hostcc).

Running on the host in the C4 interpreter

To run in the C4 interpreter, C4 itself can again be compiled in hosted or freestanding mode:

./build.sh c4 hostcc hosted d
./build.sh c4 hostcc freestanding d

Here again you can choose the host compiler for compiling C4.

Then we have to create the standard input for C4 using:

echo -n -e "\034" > EOF
cat cc.c EOF hello.c | ./c4
cat c4.c EOF cc.c EOF hello.c | ./c4
cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4

The file EOF contains the traditional FS (file separator) character of the ASCII character set (octal 034). Every time c4 is invoked, it reads exactly one input file up to the first FS character (or stops at the end of stdin).
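The splitting rule can be sketched in a few lines of C. This is only an illustration, not the actual C4 code: the name first_source_length is hypothetical, and it works on an in-memory buffer instead of stdin so that it can be tried out in isolation.

```c
/* ASCII FS (file separator): octal 034, decimal 28. This is the byte that
   echo -n -e "\034" writes into the file called EOF. */
enum { FS_CHAR = 28 };

/* Hypothetical helper (not taken from C4): returns the length of the first
   source text in `input`, that is the number of bytes before the first FS
   character, or `len` if the input contains no FS at all. */
int first_source_length(char *input, int len)
{
    int n;

    n = 0;
    while (n < len && input[n] != FS_CHAR) {
        n = n + 1;
    }
    return n;
}
```

For the input built by "cat cc.c EOF hello.c", the first call would cover cc.c only. Everything after the FS byte stays unread and is left for the program that C4 is interpreting, which is what allows the nested invocations above.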

We can also use -s or -d on every level as follows:

cat cc.c EOF hello.c | ./c4 -d

Features and Requirements

We have to be careful about what to put in a bootstrapping compiler; there is a tradeoff between:

So we collect some ideas here about features we add or leave out, and why. We also note what implementing them implies.

We also have to be careful about what C4 can do for us: a feature that CC uses must either be added to C4 as well (but only if it is small enough) or be avoided, in order not to lose C4 as a test case.

Preprocessor for modularisation

Implementation status: no

Reasoning:

Alternative:

Counter arguments:

Preprocessor for conditional compilation

Implementation status: no

Reasoning:

Alternative:

Preprocessor for constant declarations

Implementation status: no

Reasoning:

Caveats:

Variable Initializers

Implementation status: no

Reasoning:

Counter arguments:

Inline Assembly

Implementation status: yes

Reasoning:

Counter arguments:

Alternative:

Some general notes:

The GNU inline asm statement has become the de facto standard (which is too complicated, IMHO). I would require only a sort of .byte 0xXX instruction, and for readability maybe a simple fasm-like syntax. We must be careful that whatever inline assembler we invent can be mapped somehow to the GNU inline asm version, so that we can use that one on the host with gcc/clang/tcc/pcc.

c.c in swieros (the c4 successor) has asm(NOP); this is something we could implement easily. u.h contains an enum with opcodes. This is most likely doable for an easy architecture like the one in swieros. I doubt it works for Intel opcodes in general, but we should check whether it works for our simplified Intel opcode subset.

There should, though, be only one single point of information for opcodes per architecture, so asm becomes a sort of inline string generator for the assembly output. Alternatively, we share a common C file with enums for the opcodes and cat it to both the assembler and the compiler during the build (this should not result in increased code size, as those are enums).
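A minimal sketch of the shared enum idea, assuming the compiler writes fasm text and turns each asm opcode into a "db" line. The names OP_NOP, OP_RET, OP_HLT and emit_db are hypothetical; only the three IA-32 opcode values themselves are standard.

```c
/* Hypothetical shared opcode file: the build would cat this in front of both
   the assembler and the compiler, so the opcodes are defined only once. */
enum {
    OP_NOP = 0x90,  /* IA-32 nop */
    OP_RET = 0xC3,  /* IA-32 near ret */
    OP_HLT = 0xF4   /* IA-32 hlt */
};

/* Hypothetical emitter: writes the fasm line "db 0xNN\n" for one opcode byte
   into `out` (needs room for 9 chars) and returns the number of characters
   written, not counting the terminating 0. Hand-rolled hex output, so no
   printf is needed in a freestanding build. */
int emit_db(char *out, int opcode)
{
    char *hex;

    hex = "0123456789ABCDEF";
    out[0] = 'd';
    out[1] = 'b';
    out[2] = ' ';
    out[3] = '0';
    out[4] = 'x';
    out[5] = hex[(opcode >> 4) & 15];
    out[6] = hex[opcode & 15];
    out[7] = '\n';
    out[8] = 0;
    return 8;
}
```

With this, a statement like asm(OP_NOP) in the source could be compiled to the output line "db 0x90", which fasm accepts as a raw byte.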

The asm(x) or asm(x,y) constructs can be mapped on the host compilers to asm __volatile__ .byte ugliness. In cc and c4 we can take the swieros approach. This should give us nice low-level inline assembly in a really simplified way (basically embedding bytes).
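On the host side the mapping could look roughly like the following sketch. The macro name HOST_ASM_BYTE and the function run_with_nop are hypothetical, and the sketch assumes a host compiler that understands GNU-style __asm__ (gcc and clang do; tcc and pcc would have to be checked). On non-x86 hosts the macro expands to nothing, so the example still compiles.

```c
/* Hypothetical host-side mapping of a one-byte asm(x) to GNU inline asm.
   The argument must be a literal such as 0x90: #x pastes the token text
   into the assembler string, so an enum name would not be resolved. */
#if defined(__i386__) || defined(__x86_64__)
#define HOST_ASM_BYTE(x) __asm__ __volatile__(".byte " #x)
#else
#define HOST_ASM_BYTE(x) ((void)0)
#endif

/* Embeds a single nop (0x90) in the instruction stream, then continues
   normally. A nop has no visible effect, so the result is just v + 1. */
int run_with_nop(int v)
{
    HOST_ASM_BYTE(0x90);
    return v + 1;
}
```

The literal-only restriction is a real limitation of this mapping: if the opcodes live in an enum, the host build would need the same names as preprocessor constants, or a different mapping.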

Not having inline assembly means you need compilation units written in assembly and linked to the program, which - well - adds a linker and calling conventions, and that might come too early in bootstrapping.

Object formats and linkers

Implementation status: no

Reasoning:

Alternative:

Forward declarations of function prototypes

Implementation status: yes (TODO)

Reasoning:

Caveats:

Counter arguments:

Functions with variable arguments

Implementation status: no

Reasoning:

Requirements:

Alternative:

Counter arguments:

FILE* and stderr

Implementation status: no

Reasoning:

Counter arguments:

Typedefs

Implementation status: no

Reasoning:

Counter arguments:

For-loops

Implementation status: no

Reasoning:

Counter arguments:

Passing arguments to main

Implementation status: yes

Reasoning:

Counter argument:

bool

Implementation status: no

Reasoning:

Union

Implementation status: no

Reasoning:

Counter arguments:

Dangling else

Implementation status: no

Reasoning:

Register Allocation

Implementation status: yes

Reasoning:

Abstract Syntax Trees

Implementation status: yes

Reasoning:

Caveats:

Counter arguments:

Builtin functions

Implementation status: yes

Reasoning:

Caveats:

References

Compiler construction in general:

Some special compiler building topics:

C4:

Other minimal compilers and systems:

Assembly:

Documentation: