summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorAndreas Baumann <mail@andreasbaumann.cc>2022-05-13 10:57:30 +0200
committerAndreas Baumann <mail@andreasbaumann.cc>2022-05-13 10:57:30 +0200
commit79610f324861f32fff88ac34c97e6bc5a7f3bda8 (patch)
tree0cb5aa7731fe4b948807c0083ae7a97f101ea951
parent5a1bc4fc3c5990f7e71ccf5191bc33c5874dea04 (diff)
downloadcompilertests-79610f324861f32fff88ac34c97e6bc5a7f3bda8.tar.gz
compilertests-79610f324861f32fff88ac34c97e6bc5a7f3bda8.tar.bz2
fixed main documentation
-rw-r--r--miniany/README.html58
-rw-r--r--miniany/README.txt580
-rw-r--r--miniany/cc.wg72
3 files changed, 652 insertions, 58 deletions
diff --git a/miniany/README.html b/miniany/README.html
index a736885..ab8a828 100644
--- a/miniany/README.html
+++ b/miniany/README.html
@@ -73,8 +73,8 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<li>required features from the operating system: for instance the requirement for a POSIX layer</li>
<li>ugliness of the resulting code: it doesn't make sense to omit features which are somewhat hard to implement but can render readability of the code much better: best examples are the <i>switch/case </i>vs. <i>if/else </i>and the funny array indexing vs. structures in C4.</li>
</ul>
-<p>So we collect some ideas here about features we add or do not add and why. We also collect here what their implications are, when we are implementing them.</p>
-<p>We also have to be careful what C4 can do for us and either add it there (but only if small enough) in order no to loose this test case.</p>
+<p>So we collect some ideas here about features we add or do not add and why. We also collect here the implications when we are implementing them.</p>
+<p>We also have to be careful what C4 can do for us and either add it there (but only if small enough) in order not to loose this test case.</p>
<h3>Preprocessor for modularisation</h3>
<p>Implementation status: no</p>
<p>Reasoning:</p>
@@ -90,6 +90,7 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<ul>
<li>we could have sort of a special tape filesystem with a directory at the beginning and offset to allow the preprocessor to find the files for the early destination OS</li>
<li>there is no reason not to use the preprocessor on the host</li>
+<li>We can end up in nasty hen-and-egg problems with functions needing each other for different falvours of implementation (e. g. <i>atoi </i>in hosted and freestading), so we might end up duplicating source code.</li>
</ul>
<h3>Preprocessor for conditional compilation</h3>
<p>Implementation status: no</p>
@@ -144,8 +145,8 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<p>Some general notes:</p>
<p>GNU inline asm statement has become the de-facto standard (which is too complicated IMHO): I would require sort of a <i>.byte </i><i>0xX</i>X instruction only, for readablility maybe simple fasm-like syntax. We must be careful that our invention of an inline assembler can be mapped somehow to the GNU inline asm version, so that we can use that one on the host with gcc/clang/tcc/pcc..</p>
<p>c.c in swieros (the c4 successor) has <i>asm(NOP)</i>, this is something we could implement easily. u.h contains an enum with opcodes (most likely doable or an easy architecture like the one in swieros, I doubt this works for Intel opcodes, but we should check if it works for our simplified Intel opcode subset).</p>
-<p>There should though be only one single point of information for opcodes per architecture, so asm gets sort of an inline string generator for the assembly output. Or we share a common C-file with enums for the opcodes and cat it to both the assembler and the compiler during the build (should not result in increaed code size, as those are enums).</p>
-<p>The asm(x) or asm(x,y) constructs can be mapped on the host compilers to asm __volatile__ .byte ugliness. In cc and c4 we can take the swieros approach. This should give us nice lowlevel inline assembly in a really simplified way (basically embedding bytes).</p>
+<p>There should though be only one single point of information for opcodes per architecture, so <i>asm </i>gets sort of an inline string generator for the assembly output. Or we share a common C-file with enums for the opcodes and cat it to both the assembler and the compiler during the build (should not result in increaed code size, as those are enums).</p>
+<p>The <i>asm(x) </i>or <i>asm(x,y) </i>constructs can be mapped on the host compilers to <i>asm </i><i>__volatile__ </i><i>.byte </i>ugliness. In cc and c4 we can take the swieros approach. This should give us nice lowlevel inline assembly in a really simplified way (basically embedding bytes).</p>
<p>Not having inline assembly means you need compilation units written and linked to the program in assembly, which - well - adds a linker and calling conventions, which might be too early in bootstrapping.</p>
<h3>Object formats and linkers</h3>
<p>Implementation status: no</p>
@@ -158,7 +159,7 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<li>make compilation units from a bunch of source code, this results in bigger binaries, as we cannot share too much. This might be acceptable for early bootstrapping. Later on it would be nice to add a linker as an optional addon (which can be used outside of bootstrapping).</li>
</ul>
<h3>Forward declarations of function prototypes</h3>
-<p>Implementation status: yes (<b>TODO</b>)</p>
+<p>Implementation status: yes</p>
<p>Reasoning:</p>
<ul>
<li>Recursive descent parsing requires forward function declarations.</li>
@@ -183,7 +184,7 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<p>Requirements</p>
<ul>
<li>on the hosted Linux envorinment we need syscalls to <i>syscall</i>, <i>int </i><i>0x80</i>, etc.</li>
-<li>we need inline assembly to create to syscalls</li>
+<li>we need inline assembly to create the syscalls</li>
</ul>
<p>Alternative:</p>
<ul>
@@ -193,16 +194,16 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
</ul>
<p>Counter arguments:</p>
<ul>
-<li>we deviate from the C standard, printf just belongs to C</li>
-<li>printf is actually not hard to implement in a type-unsafe-way</li>
-<li>syscalls are variable arguments</li>
+<li>we deviate from the C standard, <i>printf </i>just belongs to C</li>
+<li><i>printf </i>is actually not hard to implement in a type-unsafe-way</li>
+<li>syscalls have variable arguments</li>
</ul>
<h3>FILE* and stderr</h3>
<p>Implementation status: no</p>
<p>Reasoning:</p>
<ul>
<li>requires FILE * structures, requires various write channels from the operating system</li>
-<li>we can write error messages into the output stream as comment <i>; </i><i>ERROR </i><i>in </i><i>line </i><i>32, </i><i>pos </i><i>2: </i><i>generic </i><i>error </i>(<b>TODO</b>)</li>
+<li>we can write error messages into the output stream as comments like <i>; </i><i>ERROR </i><i>in </i><i>line </i><i>32, </i><i>pos </i><i>2: </i><i>generic </i><i>error </i>(<b>TODO</b>)</li>
</ul>
<p>Counter arguments:</p>
<ul>
@@ -213,7 +214,7 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<p>Implementation status: no</p>
<p>Reasoning:</p>
<ul>
-<li>typedefs are syntactic sugar for <i>typedef </i><i>struct </i><i>T </i>as T, not strictly necessary and they don't make our code look too ugly.</li>
+<li>typedefs are syntactic sugar for <i>typedef </i><i>struct </i><i>T </i>as <i>T</i>, not strictly necessary and they don't make our code look too ugly.</li>
</ul>
<p>Counter arguments:</p>
<ul>
@@ -223,12 +224,12 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<p>Implementation status: no</p>
<p>Reasoning:</p>
<ul>
-<li>unless we start optimizing (SIMD) there is no real benefit, for a generic 'for', a strict for i=0 to N, i++ is easier to optimize, when you have a grammatical construct to help recognizing it.</li>
+<li>unless we start optimizing (SIMD) there is no real benefit for a generic 'for', a strict for i=0 to N, i++ is easier to optimize, when you have a grammatical construct to help recognizing it.</li>
<li>the C for loop has a funny ending semicolon issue and you can add arbitraty code blocks into the three parts of the for, this is way too generic for a minimalistic language</li>
</ul>
<p>Counter arguments:</p>
<ul>
-<li>for loops are not hard to implement</li>
+<li>for-loops are not hard to implement</li>
</ul>
<h3>Passing arguments to main</h3>
<p>Implementation status: yes</p>
@@ -240,18 +241,23 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<ul>
<li>Do really everything over the input stream, but this would feel a little bit too mainframe-ish.</li>
</ul>
-<h3>bool</h3>
+<h3>Boolean Type</h3>
<p>Implementation status: no</p>
<p>Reasoning:</p>
<ul>
-<li>C89 has no boolean</li>
+<li>C89 has no <i>bool </i>type</li>
<li>useful, but not strigtly necessary, we can live with <i>int </i>holding a boolean value</li>
</ul>
+<p>Counter arguments:</p>
+<ul>
+<li>boolean and integer values and variables form a nice little type system for expressions so introducing booleans might have some educational value.</li>
+</ul>
<h3>Union</h3>
<p>Implementation status: no</p>
<p>Reasoning:</p>
<ul>
<li>in the compiler there is little benefit of compressing parts of e. g. ASTs into unions</li>
+<li>unions allow accessing the same memory via different means, this can also be achieved with code accessing the memory differentlyand doing some byte/word conversions for instance.</li>
</ul>
<p>Counter arguments:</p>
<ul>
@@ -273,7 +279,7 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<p>Implemention status: yes, but..</p>
<p>Reasoning:</p>
<ul>
-<li>There are good reasons not to allow <i>return </i>everywhere in the code, see newest Oberon revisions allowing <i>RETURN </i>only at the end of the function declaration. There are benefits like easier detection whether the function returns a function, easier flow analysis (image <i>returns </i>in complicated <i>if-else</i>-statements). For now we allow it everywhere, but we should try hard not to use it in the middle of code blocks.</li>
+<li>There are good reasons for not allowing <i>return </i>everywhere in the code, see newest Oberon revisions allowing <i>RETURN </i>only at the end of the function declaration. There are benefits like easier detection whether the function returns a value, easier flow analysis (imagine <i>returns </i>in complicated <i>if-else</i>-statements). For now we allow it everywhere, but we should try hard not to use it in the middle of code blocks.</li>
<li>There is an argument from the code correctness point of view as <i>return </i>in the middle of code makes the code hard to reason about (similar to too many if-else-statements)</li>
<li>Allowing <i>return </i>only at the end of a function and nowhere else makes tail-recursion easy.</li>
<li>Error handling is really hard when <i>return </i>appears everywhere in the body, it's much easier to check whether there is a <i>return </i>at the end and whether the returned type matches the prototype of the function,</li>
@@ -286,7 +292,7 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
</ul>
<p>Counter arguments:</p>
<ul>
-<li>Stack machine with the top of stack in EAX is also quite a simple solution.</li>
+<li>Stack machine with the top of stack in EAX is also quite a simple and efficient solution (see <i>Write </i><i>Your </i><i>Own </i><i>Compiler, </i>Holm).</li>
</ul>
<h3>Abstract Syntax Trees</h3>
<p>Implementation status: yes</p>
@@ -294,11 +300,11 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<ul>
<li>Delaying code generation is essential when doing an assignment (rvalue must be evaluated before lvalue), in const folding (do no generate code as you don't know the expression is a constant but instead just compute the current value of the constant expression).</li>
<li>Separate semantic operation <i>array </i><i>index </i><i>evaluation </i>from <i>definition </i><i>of </i><i>the </i><i>size </i><i>of </i><i>an </i><i>array </i>for instance with the '[' character (we do not want to react on the scanner symbol directly).</li>
-<li>When evaluating a boolean expression we don't know yet it's context (can be in a <i>if/while </i>condition, in which case we would geneate a conditional jump or whether in an assignment where we would store the value to the boolean variable).</li>
+<li>When evaluating a boolean expression we don't know yet its context (can be in a <i>if/while </i>condition, in which case we would generate a conditional jump or it can be in an an <i>assignment</i>, in which case we would store the value into a boolean variable).</li>
</ul>
<p>Caveats:</p>
<ul>
-<li>Try to keep the scope of an AST as small as possible and as big as necessary (the output of the parser should not be the complete source code). Minimal is an expression and some context, maybe maximal is the scope of a function.</li>
+<li>Try to keep the scope of an AST as small as possible and as big as necessary (the output of the parser should not be the complete source code). The mininum we need is an expression and some context, the maximum maybe is the scope of a function.</li>
</ul>
<p>Counter arguments:</p>
<ul>
@@ -308,11 +314,11 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
<p>Implementation status: yes</p>
<p>Reasoning:</p>
<ul>
-<li>Adding <i>putint/getchar </i>style of functions as elements of the language is tempting, as it allows early debugging and testing (as in PASCAL). The fundemental conflict here is betweet Bootstrapping is better with stdout and stdin in the language (no function calls, no linker, etc. needed). But later on we want those functions be part of a language library and not of the langyage itself.</li>
+<li>Adding <i>putint/getchar </i>style of functions as elements of the language is tempting, as it allows early debugging and testing (as in PASCAL). The fundemental conflict here is that bootstrapping is better with stdout and stdin in the language (no function calls, no linker, etc. needed). But later on we want those functions be part of a language library and not of the language itself.</li>
</ul>
<p>Caveats:</p>
<ul>
-<li>Avoid code duplication (inline assembly in the compiler for the keyword implementation and with intline assembly in the language library). (<b>TODO</b>)</li>
+<li>Avoid code duplication (inline assembly in the compiler for the keyword implementation and with inline assembly in the language library). (<b>TODO</b>)</li>
</ul>
<h2>References</h2>
<p>Compiler construction in general:</p>
@@ -335,14 +341,16 @@ cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
</ul>
<p>Other minimal compilers and systems:</p>
<ul>
-<li><a href="http://selfie.cs.uni-salzburg.at/">http://selfie.cs.uni-salzburg.at/</a>: C* self-hosting C compiler (also emulator, hypervisor) for RISCV, inspiration for what makes up a minimal C language</li>
-<li><a href="http://www.iro.umontreal.ca/%7Efelipe/IFT2030-Automne2002/Complements/tinyc.c">http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c</a>, Marc Feeley, really easy and much more readable, meant as educational compiler</li>
-<li><a href="https://github.com/rswier/swieros.git">https://github.com/rswier/swieros.git</a>: c.c in swieros, Robert Swierczek</li>
+<li><a href="http://selfie.cs.uni-salzburg.at/">http://selfie.cs.uni-salzburg.at/</a>: <i>Christoph </i><i>Kirsch</i>, C* self-hosting C compiler (also emulator, hypervisor) for RISCV, inspiration for what makes up a minimal C language</li>
+<li><a href="http://www.iro.umontreal.ca/%7Efelipe/IFT2030-Automne2002/Complements/tinyc.c">http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c</a>, <i>Marc </i><i>Feeley</i>, really easy and much more readable, meant as educational compiler</li>
+<li><a href="https://github.com/rswier/swieros.git">https://github.com/rswier/swieros.git</a>: <i>Robert </i><i>Swierczek</i>, c.c in swieros</li>
+<li><a href="https://github.com/ras52/boostrap">https://github.com/ras52/boostrap</a>: <i>Richard </i><i>Smith</i>, bootstrapping experiment</li>
+<li><a href="http://t3x.org/t3x">http://t3x.org/t3x</a>: <i>Nils </i><i>M </i><i>Holm</i>, the T3X programming language, especially the bootstrapping version T3X9</li>
</ul>
<p>Assembly:</p>
<ul>
<li><a href="https://github.com/felipensp/assembly/blob/master/x86/itoa.s">https://github.com/felipensp/assembly/blob/master/x86/itoa.s</a>, for putint (early debugging keyword)</li>
-<li><a href="https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm">https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm</a> (earldy debugging keyword)</li>
+<li><a href="https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm">https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm</a> (early debugging keyword)</li>
</ul>
<p>Documentation:</p>
<ul>
diff --git a/miniany/README.txt b/miniany/README.txt
new file mode 100644
index 0000000..1234b4a
--- /dev/null
+++ b/miniany/README.txt
@@ -0,0 +1,580 @@
+ CC - a self-hosting, bootstrappable, minimal C compiler
+
+Introduction
+
+ On the never-ending quest of a minimal system I found Swieros and C4
+ (the C compiler in 4 functions). Inspired and intrigued I started to
+ implement my own.
+
+ For abaos (a small operating system of mine, also in C) I cloned the
+ minimal C library, so we can build a freestanding version of C4.
+
+ C4 serves as a test whether my own CC is minimal enough and doesn't use
+ silly functions. Additionally C4 as well as CC are compiled both in a
+ (on Linux) hosted version and a freestanding version. We use a series
+ of compilers like gcc, clang, tcc and pcc to make sure that we are not
+ using more silly C constructs.
+
+ In order to be able to port easily we make almost no use of system
+ calls, the ones we need are:
+ * brk: for malloc/free, change the start address of the heap segment
+ of the process, if the OS only assigns a single static space, then
+ brk results in a NOP.
+ * exit: terminate the process, return does not always work in all
+ combinations (for instance with pcc on Linux). Can be a NOP, we
+ don't require any trickery as atexit and we don't use buffering
+ anywhere (for instance flushing stdout on exit).
+ * read/write: read from stdin linearly, write to stdout linearly,
+ this is essentially a model using an input and an output tape.
+ Those two functions must really exist. This basically eliminates
+ the need for a file system which we might not have during early
+ bootstrapping.
+
+ Similarly we simplify the C language to not use certain features which
+ can cause trouble when bootstrapping:
+ * variable arguments: though simple in principle (just some pointers
+ into the stack if you use a stack for function parameters), it is
+ not typesafe. And the only example in practice it's really heavily
+ used for is in printf-like functions.
+ * preprocessor: it needs a filesystem, we take this outside of the
+ compiler by feeding it an (eventually) concatenated list of *.c
+ files. Note: in the hosted environment we (and glibc) can use as
+ many preprocessor features as they want, they just don't have to
+ get visible in our code.
+ * two types: int and char, so we can interpret memory as words or as
+ bytes.
+
+Local version of C4
+
+ The local version of C4 has the following adaoptions and extensions:
+ * switch statement from the switch-and-structs branch, adapted c4
+ itself to use switch statements instead of if's (as in the
+ switch-and-structs branch)
+ * struct support from switch-and-structs
+ * constants like EOF, EXIT_SUCCESS, NULL
+ * standard C block comments along to c++ end of line ones
+ * negative enum initializers
+ * do/while loops
+ * more C functions like isspace, getc, strcmp
+ * some simplified functions for printing like putstring, putint,
+ putnl replacing printf-like functions
+ * BSD-style string functions like strlcpy, strlcat
+ * strict C89 conformance, mainly use standard comment blocks, also
+ removed some warnings
+ * some casts around malloc and memset to fit to non-void
+ freestanding-libc
+ * converted printf to putstring/putint/putnl and some helper
+ functions for error reporting like error()
+ * removed all memory leaks
+ * de-POSIX-ified, no open/read/close, use getchar from stdin only
+ (don't assume the existence of a file system), this also means we
+ had to create sort of an old style tape-file with FS markers to
+ separate the files piped to c4.
+
+ The reason for all those adaptions is to minimize the dependency on the
+ host system and to be able to use libc-freestanding.c.
+
+ Note: Only too late I discovered that there was a C5 version of the
+ same compiler, which would maybe have served better as a basis.
+
+Examples
+
+ Running on the host system using the hosts C compiler
+
+ Compiled in either hosted (host libc) or freestanding (our own libc,
+ currently IA-32 Linux kernel only syscalls):
+./build.sh cc hostcc hosted d
+./build.sh cc hostcc freestanding d
+./cc < test1.c > test1.asm
+
+ Create a plain binary from the assembly code:
+fasm test1.asm test1.bin
+
+ Disassemble it to verify it's correctness:
+ndisasm -b32 -o1000000h -a test1.bin
+
+ You can choose gcc, clang, tcc or pcc as host compiler (hostcc).
+
+ Running on the host in the C4 interpreter
+
+ Running in C4 interpreter, again, the C4 program can be compiled in
+ hosted or freestanding mode:
+./build.sh c4 hostcc hosted d
+./build.sh c4 hostcc freestanding d
+
+ Here again you can choose the host compiler for compiling C4.
+
+ Then we have to create the standard input for C4 using:
+echo -n -e "\034" > EOF
+cat cc.c EOF hello.c | ./c4
+cat c4.c EOF cc.c EOF hello.c | ./c4
+cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4
+
+ EOF contains the traditional FS (file separator) character in the ASCII
+ character set. Every time c4 is invoked it reads exacly one input file
+ up to the first FS character (or stops at the end of stdin).
+
+ We can also use -s, or -d on every level as follows:
+cat cc.c EOF hello.c | ./c4 -d
+
+Features and Requirements
+
+ We have to careful what to put in a bootstrapping compiler, there is a
+ tradeoff between
+ * costs to implement it in the language: code complexity and size
+ must be kept at a minimun
+ * required features from the C runtime: for instance support of the
+ whole Unicde characters
+ * required features from the operating system: for instance the
+ requirement for a POSIX layer
+ * ugliness of the resulting code: it doesn't make sense to omit
+ features which are somewhat hard to implement but can render
+ readability of the code much better: best examples are the
+ switch/case vs. if/else and the funny array indexing vs. structures
+ in C4.
+
+ So we collect some ideas here about features we add or do not add and
+ why. We also collect here the implications when we are implementing
+ them.
+
+ We also have to be careful what C4 can do for us and either add it
+ there (but only if small enough) in order not to loose this test case.
+
+ Preprocessor for modularisation
+
+ Implementation status: no
+
+ Reasoning:
+ * #include <filename.h>: requires a filename and thus an operating
+ system with a filesystem early on, needs directory functions and
+ open/close/read/write on files
+ * C4 ignores all prepocessor commands and treats them as comments
+ till the end of line
+
+ Alternative:
+ * have a cat *.c concatenating all the source code
+
+ Counter arguments:
+ * we could have sort of a special tape filesystem with a directory at
+ the beginning and offset to allow the preprocessor to find the
+ files for the early destination OS
+ * there is no reason not to use the preprocessor on the host
+ * We can end up in nasty hen-and-egg problems with functions needing
+ each other for different falvours of implementation (e. g. atoi in
+ hosted and freestading), so we might end up duplicating source
+ code.
+
+ Preprocessor for conditional compilation
+
+ Implementation status: no
+
+ Reasoning:
+ * especially simple #ifdefs would be simple to do, conditional
+ compilation in the same file is questionable, there is nothing
+ wrong with including files like linux_386.c, linux_x86_64.c
+ * they are not part of C, they are an addon (though they are nowadays
+ sometimes treated as if they were part of C)
+ * C4 ignores all prepocessor commands and treats them as comments
+ till the end of line
+
+ Alternative:
+ * Use special files and concatenate them at the right place, e. g.
+ _start-stub.c for tcc
+
+ Preprocessor for constant declarations
+
+ Implementation status: no
+
+ Reasoning:
+ * enum constants serve the same purpose, we prefer those
+ * even C89 can work with enum constants
+
+ Caveats:
+ * C4 has only positive enums, we need negative ones for EOF
+ * enum constants are signed integer, so we have to be careful when
+ using them for chars
+ * we don't use enum for enumeration types
+
+ Variable Initializers
+
+ Implementation status: no
+
+ Reasoning:
+ * initializers of global and locals, not that important as we use C89
+ anyway, forcing us to separate declaration and usage of variables
+ per scope
+
+ Counter arguments:
+ * for example symname[t]: printing the symbol and not the number,
+ requires static initializers for array of char* to make the code
+ look nice. Of course we can also just initialize it in a piece of
+ code.
+
+ Inline Assembly
+
+ Implementation status: yes
+
+ Reasoning:
+ * somehow we have to communicate to the operating system, it's either
+ inline assembly or external linking to assembly code
+
+ Counter arguments:
+ * C4 has no inline assembly, we must add it there too
+
+ Alternative:
+ * we could implement a linker and object format early on, we could
+ use an external assembler as for instance asm-i386 (which
+ understands a subset of fasm)
+ * we could implement an early dump format for symbols (function
+ signatures mainly) and use them during linking
+
+ Some general notes:
+
+ GNU inline asm statement has become the de-facto standard (which is too
+ complicated IMHO): I would require sort of a .byte 0xXX instruction
+ only, for readablility maybe simple fasm-like syntax. We must be
+ careful that our invention of an inline assembler can be mapped somehow
+ to the GNU inline asm version, so that we can use that one on the host
+ with gcc/clang/tcc/pcc..
+
+ c.c in swieros (the c4 successor) has asm(NOP), this is something we
+ could implement easily. u.h contains an enum with opcodes (most likely
+ doable or an easy architecture like the one in swieros, I doubt this
+ works for Intel opcodes, but we should check if it works for our
+ simplified Intel opcode subset).
+
+ There should though be only one single point of information for opcodes
+ per architecture, so asm gets sort of an inline string generator for
+ the assembly output. Or we share a common C-file with enums for the
+ opcodes and cat it to both the assembler and the compiler during the
+ build (should not result in increaed code size, as those are enums).
+
+ The asm(x) or asm(x,y) constructs can be mapped on the host compilers
+ to asm __volatile__ .byte ugliness. In cc and c4 we can take the
+ swieros approach. This should give us nice lowlevel inline assembly in
+ a really simplified way (basically embedding bytes).
+
+ Not having inline assembly means you need compilation units written and
+ linked to the program in assembly, which - well - adds a linker and
+ calling conventions, which might be too early in bootstrapping.
+
+ Object formats and linkers
+
+ Implementation status: no
+
+ Reasoning:
+ * requires file systems and linker formats like ELF
+
+ Alternative:
+ * make compilation units from a bunch of source code, this results in
+ bigger binaries, as we cannot share too much. This might be
+ acceptable for early bootstrapping. Later on it would be nice to
+ add a linker as an optional addon (which can be used outside of
+ bootstrapping).
+
+ Forward declarations of function prototypes
+
+ Implementation status: yes
+
+ Reasoning:
+ * Recursive descent parsing requires forward function declarations.
+
+ Caveats:
+ * Forward function declarations are not that easy to implement,
+ because you have to generate a placeholder for the call address
+ before you get the whole definition of the forwarded function
+ (especially its entry address). Or we have to create sort of a
+ temporary jump into a jump table (sort of a GOT) which we patch
+ when we know the address of the implementation of the function.
+ Either way we loose the 1-pass output of the generated code. Having
+ a global table at one place scales easier, as we don't have to keep
+ the whole generated code around just for patching (remember, we
+ have tapes and memory, no seek of files).
+ * Must be added to c4 too, might not be that easy (TODO)
+
+ Counter arguments:
+ * We could write a non-recursive parser using tables
+
+ Functions with variable arguments
+
+ Implementation status: no
+
+ Reasoning:
+ * not type-safe
+ * only used for (s)printfs string manipulation and output functions
+ * C4 doesn't implement variable arguments for defining functions, so
+ we cannot bootstrap a freestanding version
+
+ Requirements
+ * on the hosted Linux envorinment we need syscalls to syscall, int
+ 0x80, etc.
+ * we need inline assembly to create the syscalls
+
+ Alternative:
+ * create simple functions like putchar, putint, putstring (one per
+ basic type) and some helpers like putnl
+ * use stlcat and stlcpy should be enought to compose label names
+ * instead of a generic syscall(..) we can easily work around this by
+ having syscall1, syscall2, syscall3, ... with a fixed number of
+ parameters
+
+ Counter arguments:
+ * we deviate from the C standard, printf just belongs to C
+ * printf is actually not hard to implement in a type-unsafe-way
+ * syscalls have variable arguments
+
+ FILE* and stderr
+
+ Implementation status: no
+
+ Reasoning:
+ * requires FILE * structures, requires various write channels from
+ the operating system
+ * we can write error messages into the output stream as comments like
+ ; ERROR in line 32, pos 2: generic error (TODO)
+
+ Counter arguments:
+ * unbuffered FILE * is simple to implement
+ * almost all operating systems have the notion of stdout and stderr,
+ we can set them to the same channel if an OS doesn't have them.
+
+ Typedefs
+
+ Implementation status: no
+
+ Reasoning:
+ * typedefs are syntactic sugar for typedef struct T as T, not
+ strictly necessary and they don't make our code look too ugly.
+
+ Counter arguments:
+ * portability without preprocessor could make use of typedef ulong64
+ unsigned long int and similar constructs
+
+ For-loops
+
+ Implementation status: no
+
+ Reasoning:
+ * unless we start optimizing (SIMD) there is no real benefit for a
+ generic 'for', a strict for i=0 to N, i++ is easier to optimize,
+ when you have a grammatical construct to help recognizing it.
+ * the C for loop has a funny ending semicolon issue and you can add
+ arbitraty code blocks into the three parts of the for, this is way
+ too generic for a minimalistic language
+
+ Counter arguments:
+ * for-loops are not hard to implement
+
+ Passing arguments to main
+
+ Implementation status: yes
+
+ Reasoning:
+ * this is just passing an int (argc) and a char** (argv) from _start
+ to main. This is also a convention shared with the operating system
+ when it calls our process. Without that we would have a hard time
+ to parametrize the calls to our process both on the host and in the
+ target OS.
+
+ Counter argument:
+ * Do really everything over the input stream, but this would feel a
+ little bit too mainframe-ish.
+
+ Boolean Type
+
+ Implementation status: no
+
+ Reasoning:
+ * C89 has no bool type
+ * useful, but not strigtly necessary, we can live with int holding a
+ boolean value
+
+ Counter arguments:
+ * boolean and integer values and variables form a nice little type
+ system for expressions so introducing booleans might have some
+ educational value.
+
+ Union
+
+ Implementation status: no
+
+ Reasoning:
+ * in the compiler there is little benefit of compressing parts of e.
+ g. ASTs into unions
+ * unions allow accessing the same memory via different means, this
+ can also be achieved with code accessing the memory differentlyand
+ doing some byte/word conversions for instance.
+
+ Counter arguments:
+ * for bootstrapping the operating system we might need unions (as
+ well as packing and alignment)
+
+ Dangling else
+
+ Implementation status: no
+
+ Reasoning:
+ * Don't allow non-blocked if/else, just avoid the dangling else
+ problem, it's uninteresting and in C not as bad as in PASCAL-like
+ languages, as the block markers are just one character big.
+
+ Short-circuit conditions
+
+ Implementations status: yes
+
+ Reasoning:
+ * Not really necessary as we could get away with if( a != null ) {
+ if( a->x == 7 ) { ... } }, but it makes code look rather ugly.
+
+ Return statement
+
+ Implemention status: yes, but..
+
+ Reasoning:
+ * There are good reasons for not allowing return everywhere in the
+ code, see newest Oberon revisions allowing RETURN only at the end
+ of the function declaration. There are benefits like easier
+ detection whether the function returns a value, easier flow
+ analysis (imagine returns in complicated if-else-statements). For
+ now we allow it everywhere, but we should try hard not to use it in
+ the middle of code blocks.
+ * There is an argument from the code correctness point of view as
+ return in the middle of code makes the code hard to reason about
+ (similar to too many if-else-statements)
+ * Allowing return only at the end of a function and nowhere else
+ makes tail-recursion easy.
+ * Error handling is really hard when return appears everywhere in the
+ body, it's much easier to check whether there is a return at the
+ end and whether the returned type matches the prototype of the
+ function,
+
+ Register Allocation
+
+ Implementation status: yes
+
+ Reasoning:
+ * Compared to a stack machine even the simplest register allocation
+ algorithm produces much better code
+
+ Counter arguments:
+ * Stack machine with the top of stack in EAX is also quite a simple
+ and efficient solution (see Write Your Own Compiler, Holm).
+
+ Abstract Syntax Trees
+
+ Implementation status: yes
+
+ Reasoning:
+ * Delaying code generation is essential when doing an assignment
+ (rvalue must be evaluated before lvalue), in const folding (do no
+ generate code as you don't know the expression is a constant but
+ instead just compute the current value of the constant expression).
+ * Separate semantic operation array index evaluation from definition
+ of the size of an array for instance with the '[' character (we do
+ not want to react on the scanner symbol directly).
+ * When evaluating a boolean expression we don't know yet its context
+ (can be in a if/while condition, in which case we would generate a
+ conditional jump or it can be in an an assignment, in which case we
+ would store the value into a boolean variable).
+
+ Caveats:
+ * Try to keep the scope of an AST as small as possible and as big as
+ necessary (the output of the parser should not be the complete
+ source code). The mininum we need is an expression and some
+ context, the maximum maybe is the scope of a function.
+
+ Counter arguments:
+ * Readable AST code needs some trees and pointers to structs, but we
+ want those anyway for better readable code (C4 for instance is not
+ that readable in it's original array indexing version).
+
+ Builtin functions
+
+ Implementation status: yes
+
+ Reasoning:
+ * Adding putint/getchar style of functions as elements of the
+ language is tempting, as it allows early debugging and testing (as
+ in PASCAL). The fundemental conflict here is that bootstrapping is
+ better with stdout and stdin in the language (no function calls, no
+ linker, etc. needed). But later on we want those functions be part
+ of a language library and not of the language itself.
+
+ Caveats:
+ * Avoid code duplication (inline assembly in the compiler for the
+ keyword implementation and with inline assembly in the language
+ library). (TODO)
+
+References
+
+ Compiler construction in general:
+ * "Compiler Construction"", Niklaus Wirth
+ * [1]https://github.com/DoctorWkt/acwj: a nice series on building a C
+ compiler, step by step with lots of good explanations
+ * [2]https://github.com/lotabout/write-a-C-interpreter/blob/master/tu
+ torial/en/, tutorial based on C4 how to build a C interpreter,
+ explains nicely details in C4.
+
+ Some special compiler building topics:
+ * [3]https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing,
+ [4]https://en.wikipedia.org/wiki/Operator-precedence_parser#Precede
+ nce_climbing_method
+ * [5]https://en.wikipedia.org/wiki/Strahler_number: justification for
+ register numbers for register alloation (TODO: clarify)
+
+ C4:
+ * [6]https://github.com/rswier/c4.git, C4 - C in four functions,
+ Robert Swierczek, minimalistic C compiler running on an emulator on
+ the IR, inspiration for this project
+ * [7]https://github.com/rswier/c4/blob/switch-and-structs/c4.c, c4
+ adaptions to provide switch and structs
+ * [8]https://github.com/EarlGray/c4: a X86 JIT version of c4
+ * [9]https://github.com/jserv/amacc: based on C4, JIT or native code,
+ for ARM, quite well documented, also very nice list of compiler
+ resources on Github page
+
+ Other minimal compilers and systems:
+ * [10]http://selfie.cs.uni-salzburg.at/: Christoph Kirsch, C*
+ self-hosting C compiler (also emulator, hypervisor) for RISCV,
+ inspiration for what makes up a minimal C language
+ * [11]http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complem
+ ents/tinyc.c, Marc Feeley, really easy and much more readable,
+ meant as educational compiler
+ * [12]https://github.com/rswier/swieros.git: Robert Swierczek, c.c in
+ swieros
+ * [13]https://github.com/ras52/boostrap: Richard Smith, bootstrapping
+ experiment
+ * [14]http://t3x.org/t3x: Nils M Holm, the T3X programming language,
+ especially the bootstrapping version T3X9
+
+ Assembly:
+ * [15]https://github.com/felipensp/assembly/blob/master/x86/itoa.s,
+ for putint (early debugging keyword)
+ * [16]https://baptiste-wicht.com/posts/2011/11/print-strings-integers
+ -intel-assembly.htm (early debugging keyword)
+
+ Documentation:
+ * [17]http://cowlark.com/wordgrinder/index.html: the fabulous editor
+ which just does what it should do
+ * [18]https://github.com/mity/md4c: markdown to HTML in C
+
+References
+
+ 1. https://github.com/DoctorWkt/acwj
+ 2. https://github.com/lotabout/write-a-C-interpreter/blob/master/tutorial/en/
+ 3. https://www.engr.mun.ca/%7Etheo/Misc/exp
+ 4. https://en.wikipedia.org/wiki/Operator-precedence
+ 5. https://en.wikipedia.org/wiki/Strahler
+ 6. https://github.com/rswier/c4.git
+ 7. https://github.com/rswier/c4/blob/switch-and-structs/c4.c
+ 8. https://github.com/EarlGray/c4
+ 9. https://github.com/jserv/amacc
+ 10. http://selfie.cs.uni-salzburg.at/
+ 11. http://www.iro.umontreal.ca/%7Efelipe/IFT2030-Automne2002/Complements/tinyc.c
+ 12. https://github.com/rswier/swieros.git
+ 13. https://github.com/ras52/boostrap
+ 14. http://t3x.org/t3x
+ 15. https://github.com/felipensp/assembly/blob/master/x86/itoa.s
+ 16. https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm
+ 17. http://cowlark.com/wordgrinder/index.html
+ 18. https://github.com/mity/md4c
diff --git a/miniany/cc.wg b/miniany/cc.wg
index e686dd4..b546c7c 100644
--- a/miniany/cc.wg
+++ b/miniany/cc.wg
@@ -28,17 +28,17 @@ WordGrinder dumpfile v3: this is a text file; diff me!
.clipboard.cw: 1
.clipboard.margin: 0
.clipboard.viewmode: 1
-.clipboard.wordcount: 6
-.documents.1.co: 9
-.documents.1.cp: 134
-.documents.1.cw: 10
+.clipboard.wordcount: 3
+.documents.1.co: 6
+.documents.1.cp: 244
+.documents.1.cw: 2
.documents.1.margin: 0
.documents.1.name: "main"
.documents.1.sticky_selection: false
.documents.1.viewmode: 1
-.documents.1.wordcount: 3073
+.documents.1.wordcount: 3192
.fileformat: 8
-.findtext: "syscal,"
+.findtext: "geneate"
.menu.accelerators.^@: "ZM"
.menu.accelerators.^B: "SB"
.menu.accelerators.BACKSPACE: "ZDPC"
@@ -146,7 +146,7 @@ WordGrinder dumpfile v3: this is a text file; diff me!
.statusbar: true
.current: 1
#clipboard
-P TODO: add rest of REQUIREMENTS here!
+LB c.c in swieros
.
#1
H1 CC - a self-hosting, bootstrappable, minimal C compiler
@@ -210,8 +210,8 @@ LB costs to implement it in the language: code complexity and size must be kept
LB required features from the C runtime: for instance support of the whole Unicde characters
LB required features from the operating system: for instance the requirement for a POSIX layer
LB ugliness of the resulting code: it doesn't make sense to omit features which are somewhat hard to implement but can render readability of the code much better: best examples are the switch/case vs. if/else and the funny array indexing vs. structures in C4.
-P So we collect some ideas here about features we add or do not add and why. We also collect here what their implications are, when we are implementing them.
-P We also have to be careful what C4 can do for us and either add it there (but only if small enough) in order no to loose this test case.
+P So we collect some ideas here about features we add or do not add and why. We also collect here the implications when we are implementing them.
+P We also have to be careful what C4 can do for us and either add it there (but only if small enough) in order not to loose this test case.
H3 Preprocessor for modularisation
P Implementation status: no
P Reasoning:
@@ -222,6 +222,7 @@ LB have a cat *.c concatenating all the source code
P Counter arguments:
LB we could have sort of a special tape filesystem with a directory at the beginning and offset to allow the preprocessor to find the files for the early destination OS
LB there is no reason not to use the preprocessor on the host
+LB We can end up in nasty hen-and-egg problems with functions needing each other for different falvours of implementation (e. g. atoi in hosted and freestading), so we might end up duplicating source code.
H3 Preprocessor for conditional compilation
P Implementation status: no
P Reasoning:
@@ -257,8 +258,8 @@ LB we could implement an early dump format for symbols (function signatures main
P Some general notes:
P GNU inline asm statement has become the de-facto standard (which is too complicated IMHO): I would require sort of a .byte 0xXX instruction only, for readablility maybe simple fasm-like syntax. We must be careful that our invention of an inline assembler can be mapped somehow to the GNU inline asm version, so that we can use that one on the host with gcc/clang/tcc/pcc..
P c.c in swieros (the c4 successor) has asm(NOP), this is something we could implement easily. u.h contains an enum with opcodes (most likely doable or an easy architecture like the one in swieros, I doubt this works for Intel opcodes, but we should check if it works for our simplified Intel opcode subset).
-P There should though be only one single point of information for opcodes per architecture, so asm gets sort of an inline string generator for the assembly output. Or we share a common C-file with enums for the opcodes and cat it to both the assembler and the compiler during the build (should not result in increaed code size, as those are enums).
-P The asm(x) or asm(x,y) constructs can be mapped on the host compilers to asm __volatile__ .byte ugliness. In cc and c4 we can take the swieros approach. This should give us nice lowlevel inline assembly in a really simplified way (basically embedding bytes).
+P There should though be only one single point of information for opcodes per architecture, so asm gets sort of an inline string generator for the assembly output. Or we share a common C-file with enums for the opcodes and cat it to both the assembler and the compiler during the build (should not result in increaed code size, as those are enums).
+P The asm(x) or asm(x,y) constructs can be mapped on the host compilers to asm __volatile__ .byte ugliness. In cc and c4 we can take the swieros approach. This should give us nice lowlevel inline assembly in a really simplified way (basically embedding bytes).
P Not having inline assembly means you need compilation units written and linked to the program in assembly, which - well - adds a linker and calling conventions, which might be too early in bootstrapping.
H3 Object formats and linkers
P Implementation status: no
@@ -267,7 +268,7 @@ LB requires file systems and linker formats like ELF
P Alternative:
LB make compilation units from a bunch of source code, this results in bigger binaries, as we cannot share too much. This might be acceptable for early bootstrapping. Later on it would be nice to add a linker as an optional addon (which can be used outside of bootstrapping).
H3 Forward declarations of function prototypes
-P Implementation status: yes (TODO)
+P Implementation status: yes
P Reasoning:
LB Recursive descent parsing requires forward function declarations.
P Caveats:
@@ -283,51 +284,54 @@ LB only used for (s)printfs string manipulation and output functions
LB C4 doesn't implement variable arguments for defining functions, so we cannot bootstrap a freestanding version
P Requirements
LB on the hosted Linux envorinment we need syscalls to syscall, int 0x80, etc.
-LB we need inline assembly to create to syscalls
+LB we need inline assembly to create the syscalls
P Alternative:
LB create simple functions like putchar, putint, putstring (one per basic type) and some helpers like putnl
LB use stlcat and stlcpy should be enought to compose label names
LB instead of a generic syscall(..) we can easily work around this by having syscall1, syscall2, syscall3, ... with a fixed number of parameters
P Counter arguments:
-LB we deviate from the C standard, printf just belongs to C
-LB printf is actually not hard to implement in a type-unsafe-way
-LB syscalls are variable arguments
+LB we deviate from the C standard, printf just belongs to C
+LB printf is actually not hard to implement in a type-unsafe-way
+LB syscalls have variable arguments
H3 FILE* and stderr
P Implementation status: no
P Reasoning:
LB requires FILE * structures, requires various write channels from the operating system
-LB we can write error messages into the output stream as comment ; ERROR in line 32, pos 2: generic error (TODO)
+LB we can write error messages into the output stream as comments like ; ERROR in line 32, pos 2: generic error (TODO)
P Counter arguments:
LB unbuffered FILE * is simple to implement
LB almost all operating systems have the notion of stdout and stderr, we can set them to the same channel if an OS doesn't have them.
H3 Typedefs
P Implementation status: no
P Reasoning:
-LB typedefs are syntactic sugar for typedef struct T as T, not strictly necessary and they don't make our code look too ugly.
+LB typedefs are syntactic sugar for typedef struct T as T, not strictly necessary and they don't make our code look too ugly.
P Counter arguments:
LB portability without preprocessor could make use of typedef ulong64 unsigned long int and similar constructs
H3 For-loops
P Implementation status: no
P Reasoning:
-LB unless we start optimizing (SIMD) there is no real benefit, for a generic 'for', a strict for i=0 to N, i++ is easier to optimize, when you have a grammatical construct to help recognizing it.
+LB unless we start optimizing (SIMD) there is no real benefit for a generic 'for', a strict for i=0 to N, i++ is easier to optimize, when you have a grammatical construct to help recognizing it.
LB the C for loop has a funny ending semicolon issue and you can add arbitraty code blocks into the three parts of the for, this is way too generic for a minimalistic language
P Counter arguments:
-LB for loops are not hard to implement
+LB for-loops are not hard to implement
H3 Passing arguments to main
P Implementation status: yes
P Reasoning:
LB this is just passing an int (argc) and a char** (argv) from _start to main. This is also a convention shared with the operating system when it calls our process. Without that we would have a hard time to parametrize the calls to our process both on the host and in the target OS.
P Counter argument:
LB Do really everything over the input stream, but this would feel a little bit too mainframe-ish.
-H3 bool
+H3 Boolean Type
P Implementation status: no
P Reasoning:
-LB C89 has no boolean
+LB C89 has no bool type
LB useful, but not strigtly necessary, we can live with int holding a boolean value
-H3 Union
+P Counter arguments:
+LB boolean and integer values and variables form a nice little type system for expressions so introducing booleans might have some educational value.
+H3 Union
P Implementation status: no
P Reasoning:
LB in the compiler there is little benefit of compressing parts of e. g. ASTs into unions
+LB unions allow accessing the same memory via different means, this can also be achieved with code accessing the memory differentlyand doing some byte/word conversions for instance.
P Counter arguments:
LB for bootstrapping the operating system we might need unions (as well as packing and alignment)
H3 Dangling else
@@ -341,7 +345,7 @@ LB Not really necessary as we could get away with if( a != null ) { if( 
H3 Return statement
P Implemention status: yes, but..
P Reasoning:
-LB There are good reasons not to allow return everywhere in the code, see newest Oberon revisions allowing RETURN only at the end of the function declaration. There are benefits like easier detection whether the function returns a function, easier flow analysis (image returns in complicated if-else-statements). For now we allow it everywhere, but we should try hard not to use it in the middle of code blocks.
+LB There are good reasons for not allowing return everywhere in the code, see newest Oberon revisions allowing RETURN only at the end of the function declaration. There are benefits like easier detection whether the function returns a value, easier flow analysis (imagine returns in complicated if-else-statements). For now we allow it everywhere, but we should try hard not to use it in the middle of code blocks.
LB There is an argument from the code correctness point of view as return in the middle of code makes the code hard to reason about (similar to too many if-else-statements)
LB Allowing return only at the end of a function and nowhere else makes tail-recursion easy.
LB Error handling is really hard when return appears everywhere in the body, it's much easier to check whether there is a return at the end and whether the returned type matches the prototype of the function,
@@ -350,23 +354,23 @@ P Implementation status: yes
P Reasoning:
LB Compared to a stack machine even the simplest register allocation algorithm produces much better code
P Counter arguments:
-LB Stack machine with the top of stack in EAX is also quite a simple solution.
+LB Stack machine with the top of stack in EAX is also quite a simple and efficient solution (see Write Your Own Compiler, Holm).
H3 Abstract Syntax Trees
P Implementation status: yes
P Reasoning:
LB Delaying code generation is essential when doing an assignment (rvalue must be evaluated before lvalue), in const folding (do no generate code as you don't know the expression is a constant but instead just compute the current value of the constant expression).
LB Separate semantic operation array index evaluation from definition of the size of an array for instance with the '[' character (we do not want to react on the scanner symbol directly).
-LB When evaluating a boolean expression we don't know yet it's context (can be in a if/while condition, in which case we would geneate a conditional jump or whether in an assignment where we would store the value to the boolean variable).
+LB When evaluating a boolean expression we don't know yet its context (can be in a if/while condition, in which case we would generate a conditional jump or it can be in an an assignment, in which case we would store the value into a boolean variable).
P Caveats:
-LB Try to keep the scope of an AST as small as possible and as big as necessary (the output of the parser should not be the complete source code). Minimal is an expression and some context, maybe maximal is the scope of a function.
+LB Try to keep the scope of an AST as small as possible and as big as necessary (the output of the parser should not be the complete source code). The mininum we need is an expression and some context, the maximum maybe is the scope of a function.
P Counter arguments:
LB Readable AST code needs some trees and pointers to structs, but we want those anyway for better readable code (C4 for instance is not that readable in it's original array indexing version).
H3 Builtin functions
P Implementation status: yes
P Reasoning:
-LB Adding putint/getchar style of functions as elements of the language is tempting, as it allows early debugging and testing (as in PASCAL). The fundemental conflict here is betweet Bootstrapping is better with stdout and stdin in the language (no function calls, no linker, etc. needed). But later on we want those functions be part of a language library and not of the langyage itself.
+LB Adding putint/getchar style of functions as elements of the language is tempting, as it allows early debugging and testing (as in PASCAL). The fundemental conflict here is that bootstrapping is better with stdout and stdin in the language (no function calls, no linker, etc. needed). But later on we want those functions be part of a language library and not of the language itself.
P Caveats:
-LB Avoid code duplication (inline assembly in the compiler for the keyword implementation and with intline assembly in the language library). (TODO)
+LB Avoid code duplication (inline assembly in the compiler for the keyword implementation and with inline assembly in the language library). (TODO)
H2 References
P Compiler construction in general:
LB "Compiler Construction"", Niklaus Wirth
@@ -381,12 +385,14 @@ LB https://github.com/rswier/c4/blob/switch-and-structs/c4.c, c4 adaptions to p
LB https://github.com/EarlGray/c4: a X86 JIT version of c4
LB https://github.com/jserv/amacc: based on C4, JIT or native code, for ARM, quite well documented, also very nice list of compiler resources on Github page
P Other minimal compilers and systems:
-LB http://selfie.cs.uni-salzburg.at/: C* self-hosting C compiler (also emulator, hypervisor) for RISCV, inspiration for what makes up a minimal C language
-LB http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c, Marc Feeley, really easy and much more readable, meant as educational compiler
-LB https://github.com/rswier/swieros.git: c.c in swieros, Robert Swierczek
+LB http://selfie.cs.uni-salzburg.at/: Christoph Kirsch, C* self-hosting C compiler (also emulator, hypervisor) for RISCV, inspiration for what makes up a minimal C language
+LB http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c, Marc Feeley, really easy and much more readable, meant as educational compiler
+LB https://github.com/rswier/swieros.git: Robert Swierczek, c.c in swieros
+LB https://github.com/ras52/boostrap: Richard Smith, bootstrapping experiment
+LB http://t3x.org/t3x: Nils M Holm, the T3X programming language, especially the bootstrapping version T3X9
P Assembly:
LB https://github.com/felipensp/assembly/blob/master/x86/itoa.s, for putint (early debugging keyword)
-LB https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm (earldy debugging keyword)
+LB https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm (early debugging keyword)
P Documentation:
LB http://cowlark.com/wordgrinder/index.html: the fabulous editor which just does what it should do
LB https://github.com/mity/md4c: markdown to HTML in C