summaryrefslogtreecommitdiff
path: root/miniany/README.txt
diff options
context:
space:
mode:
Diffstat (limited to 'miniany/README.txt')
-rw-r--r--miniany/README.txt580
1 files changed, 580 insertions, 0 deletions
diff --git a/miniany/README.txt b/miniany/README.txt
new file mode 100644
index 0000000..1234b4a
--- /dev/null
+++ b/miniany/README.txt
@@ -0,0 +1,580 @@
+ CC - a self-hosting, bootstrappable, minimal C compiler
+
+Introduction
+
+ On the never-ending quest of a minimal system I found Swieros and C4
+ (the C compiler in 4 functions). Inspired and intrigued I started to
+ implement my own.
+
+ For abaos (a small operating system of mine, also in C) I cloned the
+ minimal C library, so we can build a freestanding version of C4.
+
+ C4 serves as a test whether my own CC is minimal enough and doesn't use
+ silly functions. Additionally C4 as well as CC are compiled both in a
+ (on Linux) hosted version and a freestanding version. We use a series
+ of compilers like gcc, clang, tcc and pcc to make sure that we are not
+ using more silly C constructs.
+
+ In order to be able to port easily we make almost no use of system
+ calls, the ones we need are:
+ * brk: for malloc/free, change the start address of the heap segment
+ of the process, if the OS only assigns a single static space, then
+ brk results in a NOP.
+ * exit: terminate the process, return does not always work in all
+ combinations (for instance with pcc on Linux). Can be a NOP, we
+ don't require any trickery as atexit and we don't use buffering
+ anywhere (for instance flushing stdout on exit).
+ * read/write: read from stdin linearly, write to stdout linearly,
+ this is essentially a model using an input and an output tape.
+ Those two functions must really exist. This basically eliminates
+ the need for a file system which we might not have during early
+ bootstrapping.
+
+ Similarly we simplify the C language to not use certain features which
+ can cause trouble when bootstrapping:
+ * variable arguments: though simple in principle (just some pointers
+ into the stack if you use a stack for function parameters), it is
+ not typesafe. And the only example in practice it's really heavily
+ used for is in printf-like functions.
+ * preprocessor: it needs a filesystem, we take this outside of the
+ compiler by feeding it an (eventually) concatenated list of *.c
+ files. Note: in the hosted environment we (and glibc) can use as
+ many preprocessor features as they want, they just don't have to
+ get visible in our code.
+ * two types: int and char, so we can interpret memory as words or as
+ bytes.
+
+Local version of C4
+
+ The local version of C4 has the following adaoptions and extensions:
+ * switch statement from the switch-and-structs branch, adapted c4
+ itself to use switch statements instead of if's (as in the
+ switch-and-structs branch)
+ * struct support from switch-and-structs
+ * constants like EOF, EXIT_SUCCESS, NULL
+ * standard C block comments along to c++ end of line ones
+ * negative enum initializers
+ * do/while loops
+ * more C functions like isspace, getc, strcmp
+ * some simplified functions for printing like putstring, putint,
+ putnl replacing printf-like functions
+ * BSD-style string functions like strlcpy, strlcat
+ * strict C89 conformance, mainly use standard comment blocks, also
+ removed some warnings
+ * some casts around malloc and memset to fit to non-void
+ freestanding-libc
+ * converted printf to putstring/putint/putnl and some helper
+ functions for error reporting like error()
+ * removed all memory leaks
+ * de-POSIX-ified, no open/read/close, use getchar from stdin only
+ (don't assume the existence of a file system), this also means we
+ had to create sort of an old style tape-file with FS markers to
+ separate the files piped to c4.
+
+ The reason for all those adaptions is to minimize the dependency on the
+ host system and to be able to use libc-freestanding.c.
+
+ Note: Only too late I discovered that there was a C5 version of the
+ same compiler, which would maybe have served better as a basis.
+
+Examples
+
+ Running on the host system using the hosts C compiler
+
+ Compiled in either hosted (host libc) or freestanding (our own libc,
+ currently IA-32 Linux kernel only syscalls):
+./build.sh cc hostcc hosted d
+./build.sh cc hostcc freestanding d
+./cc < test1.c > test1.asm
+
+ Create a plain binary from the assembly code:
+fasm test1.asm test1.bin
+
+ Disassemble it to verify it's correctness:
+ndisasm -b32 -o1000000h -a test1.bin
+
+ You can choose gcc, clang, tcc or pcc as host compiler (hostcc).
+
+ Running on the host in the C4 interpreter
+
+ Running in C4 interpreter, again, the C4 program can be compiled in
+ hosted or freestanding mode:
+./build.sh c4 hostcc hosted d
+./build.sh c4 hostcc freestanding d
+
+ Here again you can choose the host compiler for compiling C4.
+
+ Then we have to create the standard input for C4 using:
+echo -n -e "\034" > EOF
+cat cc.c EOF hello.c | ./c4
+cat c4.c EOF cc.c EOF hello.c | ./c4
+cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
+
+ EOF contains the traditional FS (file separator) character in the ASCII
+ character set. Every time c4 is invoked it reads exacly one input file
+ up to the first FS character (or stops at the end of stdin).
+
+ We can also use -s, or -d on every level as follows:
+cat cc.c EOF hello.c | ./c4 -d
+
+Features and Requirements
+
+ We have to careful what to put in a bootstrapping compiler, there is a
+ tradeoff between
+ * costs to implement it in the language: code complexity and size
+ must be kept at a minimun
+ * required features from the C runtime: for instance support of the
+ whole Unicde characters
+ * required features from the operating system: for instance the
+ requirement for a POSIX layer
+ * ugliness of the resulting code: it doesn't make sense to omit
+ features which are somewhat hard to implement but can render
+ readability of the code much better: best examples are the
+ switch/case vs. if/else and the funny array indexing vs. structures
+ in C4.
+
+ So we collect some ideas here about features we add or do not add and
+ why. We also collect here the implications when we are implementing
+ them.
+
+ We also have to be careful what C4 can do for us and either add it
+ there (but only if small enough) in order not to loose this test case.
+
+ Preprocessor for modularisation
+
+ Implementation status: no
+
+ Reasoning:
+ * #include <filename.h>: requires a filename and thus an operating
+ system with a filesystem early on, needs directory functions and
+ open/close/read/write on files
+ * C4 ignores all prepocessor commands and treats them as comments
+ till the end of line
+
+ Alternative:
+ * have a cat *.c concatenating all the source code
+
+ Counter arguments:
+ * we could have sort of a special tape filesystem with a directory at
+ the beginning and offset to allow the preprocessor to find the
+ files for the early destination OS
+ * there is no reason not to use the preprocessor on the host
+ * We can end up in nasty hen-and-egg problems with functions needing
+ each other for different falvours of implementation (e. g. atoi in
+ hosted and freestading), so we might end up duplicating source
+ code.
+
+ Preprocessor for conditional compilation
+
+ Implementation status: no
+
+ Reasoning:
+ * especially simple #ifdefs would be simple to do, conditional
+ compilation in the same file is questionable, there is nothing
+ wrong with including files like linux_386.c, linux_x86_64.c
+ * they are not part of C, they are an addon (though they are nowadays
+ sometimes treated as if they were part of C)
+ * C4 ignores all prepocessor commands and treats them as comments
+ till the end of line
+
+ Alternative:
+ * Use special files and concatenate them at the right place, e. g.
+ _start-stub.c for tcc
+
+ Preprocessor for constant declarations
+
+ Implementation status: no
+
+ Reasoning:
+ * enum constants serve the same purpose, we prefer those
+ * even C89 can work with enum constants
+
+ Caveats:
+ * C4 has only positive enums, we need negative ones for EOF
+ * enum constants are signed integer, so we have to be careful when
+ using them for chars
+ * we don't use enum for enumeration types
+
+ Variable Initializers
+
+ Implementation status: no
+
+ Reasoning:
+ * initializers of global and locals, not that important as we use C89
+ anyway, forcing us to separate declaration and usage of variables
+ per scope
+
+ Counter arguments:
+ * for example symname[t]: printing the symbol and not the number,
+ requires static initializers for array of char* to make the code
+ look nice. Of course we can also just initialize it in a piece of
+ code.
+
+ Inline Assembly
+
+ Implementation status: yes
+
+ Reasoning:
+ * somehow we have to communicate to the operating system, it's either
+ inline assembly or external linking to assembly code
+
+ Counter arguments:
+ * C4 has no inline assembly, we must add it there too
+
+ Alternative:
+ * we could implement a linker and object format early on, we could
+ use an external assembler as for instance asm-i386 (which
+ understands a subset of fasm)
+ * we could implement an early dump format for symbols (function
+ signatures mainly) and use them during linking
+
+ Some general notes:
+
+ GNU inline asm statement has become the de-facto standard (which is too
+ complicated IMHO): I would require sort of a .byte 0xXX instruction
+ only, for readablility maybe simple fasm-like syntax. We must be
+ careful that our invention of an inline assembler can be mapped somehow
+ to the GNU inline asm version, so that we can use that one on the host
+ with gcc/clang/tcc/pcc..
+
+ c.c in swieros (the c4 successor) has asm(NOP), this is something we
+ could implement easily. u.h contains an enum with opcodes (most likely
+ doable or an easy architecture like the one in swieros, I doubt this
+ works for Intel opcodes, but we should check if it works for our
+ simplified Intel opcode subset).
+
+ There should though be only one single point of information for opcodes
+ per architecture, so asm gets sort of an inline string generator for
+ the assembly output. Or we share a common C-file with enums for the
+ opcodes and cat it to both the assembler and the compiler during the
+ build (should not result in increaed code size, as those are enums).
+
+ The asm(x) or asm(x,y) constructs can be mapped on the host compilers
+ to asm __volatile__ .byte ugliness. In cc and c4 we can take the
+ swieros approach. This should give us nice lowlevel inline assembly in
+ a really simplified way (basically embedding bytes).
+
+ Not having inline assembly means you need compilation units written and
+ linked to the program in assembly, which - well - adds a linker and
+ calling conventions, which might be too early in bootstrapping.
+
+ Object formats and linkers
+
+ Implementation status: no
+
+ Reasoning:
+ * requires file systems and linker formats like ELF
+
+ Alternative:
+ * make compilation units from a bunch of source code, this results in
+ bigger binaries, as we cannot share too much. This might be
+ acceptable for early bootstrapping. Later on it would be nice to
+ add a linker as an optional addon (which can be used outside of
+ bootstrapping).
+
+ Forward declarations of function prototypes
+
+ Implementation status: yes
+
+ Reasoning:
+ * Recursive descent parsing requires forward function declarations.
+
+ Caveats:
+ * Forward function declarations are not that easy to implement,
+ because you have to generate a placeholder for the call address
+ before you get the whole definition of the forwarded function
+ (especially its entry address). Or we have to create sort of a
+ temporary jump into a jump table (sort of a GOT) which we patch
+ when we know the address of the implementation of the function.
+ Either way we loose the 1-pass output of the generated code. Having
+ a global table at one place scales easier, as we don't have to keep
+ the whole generated code around just for patching (remember, we
+ have tapes and memory, no seek of files).
+ * Must be added to c4 too, might not be that easy (TODO)
+
+ Counter arguments:
+ * We could write a non-recursive parser using tables
+
+ Functions with variable arguments
+
+ Implementation status: no
+
+ Reasoning:
+ * not type-safe
+ * only used for (s)printfs string manipulation and output functions
+ * C4 doesn't implement variable arguments for defining functions, so
+ we cannot bootstrap a freestanding version
+
+ Requirements
+ * on the hosted Linux envorinment we need syscalls to syscall, int
+ 0x80, etc.
+ * we need inline assembly to create the syscalls
+
+ Alternative:
+ * create simple functions like putchar, putint, putstring (one per
+ basic type) and some helpers like putnl
+ * use stlcat and stlcpy should be enought to compose label names
+ * instead of a generic syscall(..) we can easily work around this by
+ having syscall1, syscall2, syscall3, ... with a fixed number of
+ parameters
+
+ Counter arguments:
+ * we deviate from the C standard, printf just belongs to C
+ * printf is actually not hard to implement in a type-unsafe-way
+ * syscalls have variable arguments
+
+ FILE* and stderr
+
+ Implementation status: no
+
+ Reasoning:
+ * requires FILE * structures, requires various write channels from
+ the operating system
+ * we can write error messages into the output stream as comments like
+ ; ERROR in line 32, pos 2: generic error (TODO)
+
+ Counter arguments:
+ * unbuffered FILE * is simple to implement
+ * almost all operating systems have the notion of stdout and stderr,
+ we can set them to the same channel if an OS doesn't have them.
+
+ Typedefs
+
+ Implementation status: no
+
+ Reasoning:
+ * typedefs are syntactic sugar for typedef struct T as T, not
+ strictly necessary and they don't make our code look too ugly.
+
+ Counter arguments:
+ * portability without preprocessor could make use of typedef ulong64
+ unsigned long int and similar constructs
+
+ For-loops
+
+ Implementation status: no
+
+ Reasoning:
+ * unless we start optimizing (SIMD) there is no real benefit for a
+ generic 'for', a strict for i=0 to N, i++ is easier to optimize,
+ when you have a grammatical construct to help recognizing it.
+ * the C for loop has a funny ending semicolon issue and you can add
+ arbitraty code blocks into the three parts of the for, this is way
+ too generic for a minimalistic language
+
+ Counter arguments:
+ * for-loops are not hard to implement
+
+ Passing arguments to main
+
+ Implementation status: yes
+
+ Reasoning:
+ * this is just passing an int (argc) and a char** (argv) from _start
+ to main. This is also a convention shared with the operating system
+ when it calls our process. Without that we would have a hard time
+ to parametrize the calls to our process both on the host and in the
+ target OS.
+
+ Counter argument:
+ * Do really everything over the input stream, but this would feel a
+ little bit too mainframe-ish.
+
+ Boolean Type
+
+ Implementation status: no
+
+ Reasoning:
+ * C89 has no bool type
+ * useful, but not strigtly necessary, we can live with int holding a
+ boolean value
+
+ Counter arguments:
+ * boolean and integer values and variables form a nice little type
+ system for expressions so introducing booleans might have some
+ educational value.
+
+ Union
+
+ Implementation status: no
+
+ Reasoning:
+ * in the compiler there is little benefit of compressing parts of e.
+ g. ASTs into unions
+ * unions allow accessing the same memory via different means, this
+ can also be achieved with code accessing the memory differentlyand
+ doing some byte/word conversions for instance.
+
+ Counter arguments:
+ * for bootstrapping the operating system we might need unions (as
+ well as packing and alignment)
+
+ Dangling else
+
+ Implementation status: no
+
+ Reasoning:
+ * Don't allow non-blocked if/else, just avoid the dangling else
+ problem, it's uninteresting and in C not as bad as in PASCAL-like
+ languages, as the block markers are just one character big.
+
+ Short-circuit conditions
+
+ Implementations status: yes
+
+ Reasoning:
+ * Not really necessary as we could get away with if( a != null ) {
+ if( a->x == 7 ) { ... } }, but it makes code look rather ugly.
+
+ Return statement
+
+ Implemention status: yes, but..
+
+ Reasoning:
+ * There are good reasons for not allowing return everywhere in the
+ code, see newest Oberon revisions allowing RETURN only at the end
+ of the function declaration. There are benefits like easier
+ detection whether the function returns a value, easier flow
+ analysis (imagine returns in complicated if-else-statements). For
+ now we allow it everywhere, but we should try hard not to use it in
+ the middle of code blocks.
+ * There is an argument from the code correctness point of view as
+ return in the middle of code makes the code hard to reason about
+ (similar to too many if-else-statements)
+ * Allowing return only at the end of a function and nowhere else
+ makes tail-recursion easy.
+ * Error handling is really hard when return appears everywhere in the
+ body, it's much easier to check whether there is a return at the
+ end and whether the returned type matches the prototype of the
+ function,
+
+ Register Allocation
+
+ Implementation status: yes
+
+ Reasoning:
+ * Compared to a stack machine even the simplest register allocation
+ algorithm produces much better code
+
+ Counter arguments:
+ * Stack machine with the top of stack in EAX is also quite a simple
+ and efficient solution (see Write Your Own Compiler, Holm).
+
+ Abstract Syntax Trees
+
+ Implementation status: yes
+
+ Reasoning:
+ * Delaying code generation is essential when doing an assignment
+ (rvalue must be evaluated before lvalue), in const folding (do no
+ generate code as you don't know the expression is a constant but
+ instead just compute the current value of the constant expression).
+ * Separate semantic operation array index evaluation from definition
+ of the size of an array for instance with the '[' character (we do
+ not want to react on the scanner symbol directly).
+ * When evaluating a boolean expression we don't know yet its context
+ (can be in a if/while condition, in which case we would generate a
+ conditional jump or it can be in an an assignment, in which case we
+ would store the value into a boolean variable).
+
+ Caveats:
+ * Try to keep the scope of an AST as small as possible and as big as
+ necessary (the output of the parser should not be the complete
+ source code). The mininum we need is an expression and some
+ context, the maximum maybe is the scope of a function.
+
+ Counter arguments:
+ * Readable AST code needs some trees and pointers to structs, but we
+ want those anyway for better readable code (C4 for instance is not
+ that readable in it's original array indexing version).
+
+ Builtin functions
+
+ Implementation status: yes
+
+ Reasoning:
+ * Adding putint/getchar style of functions as elements of the
+ language is tempting, as it allows early debugging and testing (as
+ in PASCAL). The fundemental conflict here is that bootstrapping is
+ better with stdout and stdin in the language (no function calls, no
+ linker, etc. needed). But later on we want those functions be part
+ of a language library and not of the language itself.
+
+ Caveats:
+ * Avoid code duplication (inline assembly in the compiler for the
+ keyword implementation and with inline assembly in the language
+ library). (TODO)
+
+References
+
+ Compiler construction in general:
+ * "Compiler Construction"", Niklaus Wirth
+ * [1]https://github.com/DoctorWkt/acwj: a nice series on building a C
+ compiler, step by step with lots of good explanations
+ * [2]https://github.com/lotabout/write-a-C-interpreter/blob/master/tu
+ torial/en/, tutorial based on C4 how to build a C interpreter,
+ explains nicely details in C4.
+
+ Some special compiler building topics:
+ * [3]https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing,
+ [4]https://en.wikipedia.org/wiki/Operator-precedence_parser#Precede
+ nce_climbing_method
+ * [5]https://en.wikipedia.org/wiki/Strahler_number: justification for
+ register numbers for register alloation (TODO: clarify)
+
+ C4:
+ * [6]https://github.com/rswier/c4.git, C4 - C in four functions,
+ Robert Swierczek, minimalistic C compiler running on an emulator on
+ the IR, inspiration for this project
+ * [7]https://github.com/rswier/c4/blob/switch-and-structs/c4.c, c4
+ adaptions to provide switch and structs
+ * [8]https://github.com/EarlGray/c4: a X86 JIT version of c4
+ * [9]https://github.com/jserv/amacc: based on C4, JIT or native code,
+ for ARM, quite well documented, also very nice list of compiler
+ resources on Github page
+
+ Other minimal compilers and systems:
+ * [10]http://selfie.cs.uni-salzburg.at/: Christoph Kirsch, C*
+ self-hosting C compiler (also emulator, hypervisor) for RISCV,
+ inspiration for what makes up a minimal C language
+ * [11]http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complem
+ ents/tinyc.c, Marc Feeley, really easy and much more readable,
+ meant as educational compiler
+ * [12]https://github.com/rswier/swieros.git: Robert Swierczek, c.c in
+ swieros
+ * [13]https://github.com/ras52/boostrap: Richard Smith, bootstrapping
+ experiment
+ * [14]http://t3x.org/t3x: Nils M Holm, the T3X programming language,
+ especially the bootstrapping version T3X9
+
+ Assembly:
+ * [15]https://github.com/felipensp/assembly/blob/master/x86/itoa.s,
+ for putint (early debugging keyword)
+ * [16]https://baptiste-wicht.com/posts/2011/11/print-strings-integers
+ -intel-assembly.htm (early debugging keyword)
+
+ Documentation:
+ * [17]http://cowlark.com/wordgrinder/index.html: the fabulous editor
+ which just does what it should do
+ * [18]https://github.com/mity/md4c: markdown to HTML in C
+
+References
+
+ 1. https://github.com/DoctorWkt/acwj
+ 2. https://github.com/lotabout/write-a-C-interpreter/blob/master/tutorial/en/
+ 3. https://www.engr.mun.ca/%7Etheo/Misc/exp
+ 4. https://en.wikipedia.org/wiki/Operator-precedence
+ 5. https://en.wikipedia.org/wiki/Strahler
+ 6. https://github.com/rswier/c4.git
+ 7. https://github.com/rswier/c4/blob/switch-and-structs/c4.c
+ 8. https://github.com/EarlGray/c4
+ 9. https://github.com/jserv/amacc
+ 10. http://selfie.cs.uni-salzburg.at/
+ 11. http://www.iro.umontreal.ca/%7Efelipe/IFT2030-Automne2002/Complements/tinyc.c
+ 12. https://github.com/rswier/swieros.git
+ 13. https://github.com/ras52/boostrap
+ 14. http://t3x.org/t3x
+ 15. https://github.com/felipensp/assembly/blob/master/x86/itoa.s
+ 16. https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm
+ 17. http://cowlark.com/wordgrinder/index.html
+ 18. https://github.com/mity/md4c