diff options
Diffstat (limited to 'miniany/README.txt')
-rw-r--r-- | miniany/README.txt | 580 |
1 files changed, 580 insertions, 0 deletions
diff --git a/miniany/README.txt b/miniany/README.txt new file mode 100644 index 0000000..1234b4a --- /dev/null +++ b/miniany/README.txt @@ -0,0 +1,580 @@ + CC - a self-hosting, bootstrappable, minimal C compiler + +Introduction + + On the never-ending quest of a minimal system I found Swieros and C4 + (the C compiler in 4 functions). Inspired and intrigued I started to + implement my own. + + For abaos (a small operating system of mine, also in C) I cloned the + minimal C library, so we can build a freestanding version of C4. + + C4 serves as a test whether my own CC is minimal enough and doesn't use + silly functions. Additionally C4 as well as CC are compiled both in a + (on Linux) hosted version and a freestanding version. We use a series + of compilers like gcc, clang, tcc and pcc to make sure that we are not + using more silly C constructs. + + In order to be able to port easily we make almost no use of system + calls, the ones we need are: + * brk: for malloc/free, change the start address of the heap segment + of the process, if the OS only assigns a single static space, then + brk results in a NOP. + * exit: terminate the process, return does not always work in all + combinations (for instance with pcc on Linux). Can be a NOP, we + don't require any trickery as atexit and we don't use buffering + anywhere (for instance flushing stdout on exit). + * read/write: read from stdin linearly, write to stdout linearly, + this is essentially a model using an input and an output tape. + Those two functions must really exist. This basically eliminates + the need for a file system which we might not have during early + bootstrapping. + + Similarly we simplify the C language to not use certain features which + can cause trouble when bootstrapping: + * variable arguments: though simple in principle (just some pointers + into the stack if you use a stack for function parameters), it is + not typesafe. And the only example in practice it's really heavily + used for is in printf-like functions. + * preprocessor: it needs a filesystem, we take this outside of the + compiler by feeding it an (eventually) concatenated list of *.c + files. Note: in the hosted environment we (and glibc) can use as + many preprocessor features as they want, they just don't have to + get visible in our code. + * two types: int and char, so we can interpret memory as words or as + bytes. + +Local version of C4 + + The local version of C4 has the following adaoptions and extensions: + * switch statement from the switch-and-structs branch, adapted c4 + itself to use switch statements instead of if's (as in the + switch-and-structs branch) + * struct support from switch-and-structs + * constants like EOF, EXIT_SUCCESS, NULL + * standard C block comments along to c++ end of line ones + * negative enum initializers + * do/while loops + * more C functions like isspace, getc, strcmp + * some simplified functions for printing like putstring, putint, + putnl replacing printf-like functions + * BSD-style string functions like strlcpy, strlcat + * strict C89 conformance, mainly use standard comment blocks, also + removed some warnings + * some casts around malloc and memset to fit to non-void + freestanding-libc + * converted printf to putstring/putint/putnl and some helper + functions for error reporting like error() + * removed all memory leaks + * de-POSIX-ified, no open/read/close, use getchar from stdin only + (don't assume the existence of a file system), this also means we + had to create sort of an old style tape-file with FS markers to + separate the files piped to c4. + + The reason for all those adaptions is to minimize the dependency on the + host system and to be able to use libc-freestanding.c. + + Note: Only too late I discovered that there was a C5 version of the + same compiler, which would maybe have served better as a basis. + +Examples + + Running on the host system using the hosts C compiler + + Compiled in either hosted (host libc) or freestanding (our own libc, + currently IA-32 Linux kernel only syscalls): +./build.sh cc hostcc hosted d +./build.sh cc hostcc freestanding d +./cc < test1.c > test1.asm + + Create a plain binary from the assembly code: +fasm test1.asm test1.bin + + Disassemble it to verify it's correctness: +ndisasm -b32 -o1000000h -a test1.bin + + You can choose gcc, clang, tcc or pcc as host compiler (hostcc). + + Running on the host in the C4 interpreter + + Running in C4 interpreter, again, the C4 program can be compiled in + hosted or freestanding mode: +./build.sh c4 hostcc hosted d +./build.sh c4 hostcc freestanding d + + Here again you can choose the host compiler for compiling C4. + + Then we have to create the standard input for C4 using: +echo -n -e "\034" > EOF +cat cc.c EOF hello.c | ./c4 +cat c4.c EOF cc.c EOF hello.c | ./c4 +cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4 + + EOF contains the traditional FS (file separator) character in the ASCII + character set. Every time c4 is invoked it reads exacly one input file + up to the first FS character (or stops at the end of stdin). + + We can also use -s, or -d on every level as follows: +cat cc.c EOF hello.c | ./c4 -d + +Features and Requirements + + We have to careful what to put in a bootstrapping compiler, there is a + tradeoff between + * costs to implement it in the language: code complexity and size + must be kept at a minimun + * required features from the C runtime: for instance support of the + whole Unicde characters + * required features from the operating system: for instance the + requirement for a POSIX layer + * ugliness of the resulting code: it doesn't make sense to omit + features which are somewhat hard to implement but can render + readability of the code much better: best examples are the + switch/case vs. if/else and the funny array indexing vs. structures + in C4. + + So we collect some ideas here about features we add or do not add and + why. We also collect here the implications when we are implementing + them. + + We also have to be careful what C4 can do for us and either add it + there (but only if small enough) in order not to loose this test case. + + Preprocessor for modularisation + + Implementation status: no + + Reasoning: + * #include <filename.h>: requires a filename and thus an operating + system with a filesystem early on, needs directory functions and + open/close/read/write on files + * C4 ignores all prepocessor commands and treats them as comments + till the end of line + + Alternative: + * have a cat *.c concatenating all the source code + + Counter arguments: + * we could have sort of a special tape filesystem with a directory at + the beginning and offset to allow the preprocessor to find the + files for the early destination OS + * there is no reason not to use the preprocessor on the host + * We can end up in nasty hen-and-egg problems with functions needing + each other for different falvours of implementation (e. g. atoi in + hosted and freestading), so we might end up duplicating source + code. + + Preprocessor for conditional compilation + + Implementation status: no + + Reasoning: + * especially simple #ifdefs would be simple to do, conditional + compilation in the same file is questionable, there is nothing + wrong with including files like linux_386.c, linux_x86_64.c + * they are not part of C, they are an addon (though they are nowadays + sometimes treated as if they were part of C) + * C4 ignores all prepocessor commands and treats them as comments + till the end of line + + Alternative: + * Use special files and concatenate them at the right place, e. g. + _start-stub.c for tcc + + Preprocessor for constant declarations + + Implementation status: no + + Reasoning: + * enum constants serve the same purpose, we prefer those + * even C89 can work with enum constants + + Caveats: + * C4 has only positive enums, we need negative ones for EOF + * enum constants are signed integer, so we have to be careful when + using them for chars + * we don't use enum for enumeration types + + Variable Initializers + + Implementation status: no + + Reasoning: + * initializers of global and locals, not that important as we use C89 + anyway, forcing us to separate declaration and usage of variables + per scope + + Counter arguments: + * for example symname[t]: printing the symbol and not the number, + requires static initializers for array of char* to make the code + look nice. Of course we can also just initialize it in a piece of + code. + + Inline Assembly + + Implementation status: yes + + Reasoning: + * somehow we have to communicate to the operating system, it's either + inline assembly or external linking to assembly code + + Counter arguments: + * C4 has no inline assembly, we must add it there too + + Alternative: + * we could implement a linker and object format early on, we could + use an external assembler as for instance asm-i386 (which + understands a subset of fasm) + * we could implement an early dump format for symbols (function + signatures mainly) and use them during linking + + Some general notes: + + GNU inline asm statement has become the de-facto standard (which is too + complicated IMHO): I would require sort of a .byte 0xXX instruction + only, for readablility maybe simple fasm-like syntax. We must be + careful that our invention of an inline assembler can be mapped somehow + to the GNU inline asm version, so that we can use that one on the host + with gcc/clang/tcc/pcc.. + + c.c in swieros (the c4 successor) has asm(NOP), this is something we + could implement easily. u.h contains an enum with opcodes (most likely + doable or an easy architecture like the one in swieros, I doubt this + works for Intel opcodes, but we should check if it works for our + simplified Intel opcode subset). + + There should though be only one single point of information for opcodes + per architecture, so asm gets sort of an inline string generator for + the assembly output. Or we share a common C-file with enums for the + opcodes and cat it to both the assembler and the compiler during the + build (should not result in increaed code size, as those are enums). + + The asm(x) or asm(x,y) constructs can be mapped on the host compilers + to asm __volatile__ .byte ugliness. In cc and c4 we can take the + swieros approach. This should give us nice lowlevel inline assembly in + a really simplified way (basically embedding bytes). + + Not having inline assembly means you need compilation units written and + linked to the program in assembly, which - well - adds a linker and + calling conventions, which might be too early in bootstrapping. + + Object formats and linkers + + Implementation status: no + + Reasoning: + * requires file systems and linker formats like ELF + + Alternative: + * make compilation units from a bunch of source code, this results in + bigger binaries, as we cannot share too much. This might be + acceptable for early bootstrapping. Later on it would be nice to + add a linker as an optional addon (which can be used outside of + bootstrapping). + + Forward declarations of function prototypes + + Implementation status: yes + + Reasoning: + * Recursive descent parsing requires forward function declarations. + + Caveats: + * Forward function declarations are not that easy to implement, + because you have to generate a placeholder for the call address + before you get the whole definition of the forwarded function + (especially its entry address). Or we have to create sort of a + temporary jump into a jump table (sort of a GOT) which we patch + when we know the address of the implementation of the function. + Either way we loose the 1-pass output of the generated code. Having + a global table at one place scales easier, as we don't have to keep + the whole generated code around just for patching (remember, we + have tapes and memory, no seek of files). + * Must be added to c4 too, might not be that easy (TODO) + + Counter arguments: + * We could write a non-recursive parser using tables + + Functions with variable arguments + + Implementation status: no + + Reasoning: + * not type-safe + * only used for (s)printfs string manipulation and output functions + * C4 doesn't implement variable arguments for defining functions, so + we cannot bootstrap a freestanding version + + Requirements + * on the hosted Linux envorinment we need syscalls to syscall, int + 0x80, etc. + * we need inline assembly to create the syscalls + + Alternative: + * create simple functions like putchar, putint, putstring (one per + basic type) and some helpers like putnl + * use stlcat and stlcpy should be enought to compose label names + * instead of a generic syscall(..) we can easily work around this by + having syscall1, syscall2, syscall3, ... with a fixed number of + parameters + + Counter arguments: + * we deviate from the C standard, printf just belongs to C + * printf is actually not hard to implement in a type-unsafe-way + * syscalls have variable arguments + + FILE* and stderr + + Implementation status: no + + Reasoning: + * requires FILE * structures, requires various write channels from + the operating system + * we can write error messages into the output stream as comments like + ; ERROR in line 32, pos 2: generic error (TODO) + + Counter arguments: + * unbuffered FILE * is simple to implement + * almost all operating systems have the notion of stdout and stderr, + we can set them to the same channel if an OS doesn't have them. + + Typedefs + + Implementation status: no + + Reasoning: + * typedefs are syntactic sugar for typedef struct T as T, not + strictly necessary and they don't make our code look too ugly. + + Counter arguments: + * portability without preprocessor could make use of typedef ulong64 + unsigned long int and similar constructs + + For-loops + + Implementation status: no + + Reasoning: + * unless we start optimizing (SIMD) there is no real benefit for a + generic 'for', a strict for i=0 to N, i++ is easier to optimize, + when you have a grammatical construct to help recognizing it. + * the C for loop has a funny ending semicolon issue and you can add + arbitraty code blocks into the three parts of the for, this is way + too generic for a minimalistic language + + Counter arguments: + * for-loops are not hard to implement + + Passing arguments to main + + Implementation status: yes + + Reasoning: + * this is just passing an int (argc) and a char** (argv) from _start + to main. This is also a convention shared with the operating system + when it calls our process. Without that we would have a hard time + to parametrize the calls to our process both on the host and in the + target OS. + + Counter argument: + * Do really everything over the input stream, but this would feel a + little bit too mainframe-ish. + + Boolean Type + + Implementation status: no + + Reasoning: + * C89 has no bool type + * useful, but not strigtly necessary, we can live with int holding a + boolean value + + Counter arguments: + * boolean and integer values and variables form a nice little type + system for expressions so introducing booleans might have some + educational value. + + Union + + Implementation status: no + + Reasoning: + * in the compiler there is little benefit of compressing parts of e. + g. ASTs into unions + * unions allow accessing the same memory via different means, this + can also be achieved with code accessing the memory differentlyand + doing some byte/word conversions for instance. + + Counter arguments: + * for bootstrapping the operating system we might need unions (as + well as packing and alignment) + + Dangling else + + Implementation status: no + + Reasoning: + * Don't allow non-blocked if/else, just avoid the dangling else + problem, it's uninteresting and in C not as bad as in PASCAL-like + languages, as the block markers are just one character big. + + Short-circuit conditions + + Implementations status: yes + + Reasoning: + * Not really necessary as we could get away with if( a != null ) { + if( a->x == 7 ) { ... } }, but it makes code look rather ugly. + + Return statement + + Implemention status: yes, but.. + + Reasoning: + * There are good reasons for not allowing return everywhere in the + code, see newest Oberon revisions allowing RETURN only at the end + of the function declaration. There are benefits like easier + detection whether the function returns a value, easier flow + analysis (imagine returns in complicated if-else-statements). For + now we allow it everywhere, but we should try hard not to use it in + the middle of code blocks. + * There is an argument from the code correctness point of view as + return in the middle of code makes the code hard to reason about + (similar to too many if-else-statements) + * Allowing return only at the end of a function and nowhere else + makes tail-recursion easy. + * Error handling is really hard when return appears everywhere in the + body, it's much easier to check whether there is a return at the + end and whether the returned type matches the prototype of the + function, + + Register Allocation + + Implementation status: yes + + Reasoning: + * Compared to a stack machine even the simplest register allocation + algorithm produces much better code + + Counter arguments: + * Stack machine with the top of stack in EAX is also quite a simple + and efficient solution (see Write Your Own Compiler, Holm). + + Abstract Syntax Trees + + Implementation status: yes + + Reasoning: + * Delaying code generation is essential when doing an assignment + (rvalue must be evaluated before lvalue), in const folding (do no + generate code as you don't know the expression is a constant but + instead just compute the current value of the constant expression). + * Separate semantic operation array index evaluation from definition + of the size of an array for instance with the '[' character (we do + not want to react on the scanner symbol directly). + * When evaluating a boolean expression we don't know yet its context + (can be in a if/while condition, in which case we would generate a + conditional jump or it can be in an an assignment, in which case we + would store the value into a boolean variable). + + Caveats: + * Try to keep the scope of an AST as small as possible and as big as + necessary (the output of the parser should not be the complete + source code). The mininum we need is an expression and some + context, the maximum maybe is the scope of a function. + + Counter arguments: + * Readable AST code needs some trees and pointers to structs, but we + want those anyway for better readable code (C4 for instance is not + that readable in it's original array indexing version). + + Builtin functions + + Implementation status: yes + + Reasoning: + * Adding putint/getchar style of functions as elements of the + language is tempting, as it allows early debugging and testing (as + in PASCAL). The fundemental conflict here is that bootstrapping is + better with stdout and stdin in the language (no function calls, no + linker, etc. needed). But later on we want those functions be part + of a language library and not of the language itself. + + Caveats: + * Avoid code duplication (inline assembly in the compiler for the + keyword implementation and with inline assembly in the language + library). (TODO) + +References + + Compiler construction in general: + * "Compiler Construction"", Niklaus Wirth + * [1]https://github.com/DoctorWkt/acwj: a nice series on building a C + compiler, step by step with lots of good explanations + * [2]https://github.com/lotabout/write-a-C-interpreter/blob/master/tu + torial/en/, tutorial based on C4 how to build a C interpreter, + explains nicely details in C4. + + Some special compiler building topics: + * [3]https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing, + [4]https://en.wikipedia.org/wiki/Operator-precedence_parser#Precede + nce_climbing_method + * [5]https://en.wikipedia.org/wiki/Strahler_number: justification for + register numbers for register alloation (TODO: clarify) + + C4: + * [6]https://github.com/rswier/c4.git, C4 - C in four functions, + Robert Swierczek, minimalistic C compiler running on an emulator on + the IR, inspiration for this project + * [7]https://github.com/rswier/c4/blob/switch-and-structs/c4.c, c4 + adaptions to provide switch and structs + * [8]https://github.com/EarlGray/c4: a X86 JIT version of c4 + * [9]https://github.com/jserv/amacc: based on C4, JIT or native code, + for ARM, quite well documented, also very nice list of compiler + resources on Github page + + Other minimal compilers and systems: + * [10]http://selfie.cs.uni-salzburg.at/: Christoph Kirsch, C* + self-hosting C compiler (also emulator, hypervisor) for RISCV, + inspiration for what makes up a minimal C language + * [11]http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complem + ents/tinyc.c, Marc Feeley, really easy and much more readable, + meant as educational compiler + * [12]https://github.com/rswier/swieros.git: Robert Swierczek, c.c in + swieros + * [13]https://github.com/ras52/boostrap: Richard Smith, bootstrapping + experiment + * [14]http://t3x.org/t3x: Nils M Holm, the T3X programming language, + especially the bootstrapping version T3X9 + + Assembly: + * [15]https://github.com/felipensp/assembly/blob/master/x86/itoa.s, + for putint (early debugging keyword) + * [16]https://baptiste-wicht.com/posts/2011/11/print-strings-integers + -intel-assembly.htm (early debugging keyword) + + Documentation: + * [17]http://cowlark.com/wordgrinder/index.html: the fabulous editor + which just does what it should do + * [18]https://github.com/mity/md4c: markdown to HTML in C + +References + + 1. https://github.com/DoctorWkt/acwj + 2. https://github.com/lotabout/write-a-C-interpreter/blob/master/tutorial/en/ + 3. https://www.engr.mun.ca/%7Etheo/Misc/exp + 4. https://en.wikipedia.org/wiki/Operator-precedence + 5. https://en.wikipedia.org/wiki/Strahler + 6. https://github.com/rswier/c4.git + 7. https://github.com/rswier/c4/blob/switch-and-structs/c4.c + 8. https://github.com/EarlGray/c4 + 9. https://github.com/jserv/amacc + 10. http://selfie.cs.uni-salzburg.at/ + 11. http://www.iro.umontreal.ca/%7Efelipe/IFT2030-Automne2002/Complements/tinyc.c + 12. https://github.com/rswier/swieros.git + 13. https://github.com/ras52/boostrap + 14. http://t3x.org/t3x + 15. https://github.com/felipensp/assembly/blob/master/x86/itoa.s + 16. https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm + 17. http://cowlark.com/wordgrinder/index.html + 18. https://github.com/mity/md4c |