summaryrefslogtreecommitdiff
path: root/miniany/README.txt
blob: d321f720f0591bcfd6257a23f7ef7004a716ab67 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
            CC - a self-hosting, bootstrappable, minimal C compiler

Introduction

   On the never-ending quest of a minimal system I found Swieros and C4
   (the C compiler in 4 functions). Inspired and intrigued I started to
   implement my own.

   For abaos (a small operating system of mine, also in C) I cloned the
   minimal C library, so we can build a freestanding version of C4.

   C4 serves as a test whether my own CC is minimal enough and doesn't use
   silly functions. Additionally C4 as well as CC are compiled both in a
   (on Linux) hosted version and a freestanding version. We use a series
   of compilers like gcc, clang, tcc and pcc to make sure that we are not
   using more silly C constructs.

   In order to be able to port easily we make almost no use of system
   calls, the ones we need are:
     * brk: for malloc/free, change the start address of the heap segment
       of the process, if the OS only assigns a single static space, then
       brk results in a NOP.
     * exit: terminate the process, return does not always work in all
       combinations (for instance with pcc on Linux). Can be a NOP, we
       don't require any trickery as atexit and we don't use buffering
       anywhere (for instance flushing stdout on exit).
     * read/write: read from stdin linearly, write to stdout linearly,
       this is essentially a model using an input and an output tape.
       Those two functions must really exist. This basically eliminates
       the need for a file system which we might not have during early
       bootstrapping.

   Similarly we simplify the C language to not use certain features which
   can cause trouble when bootstrapping:
     * variable arguments: though simple in principle (just some pointers
       into the stack if you use a stack for function parameters), it is
       not typesafe. And the only example in practice it's really heavily
       used for is in printf-like functions.
     * preprocessor: it needs a filesystem, we take this outside of the
       compiler by feeding it an (eventually) concatenated list of *.c
       files. Note: in the hosted environment we (and glibc) can use as
       many preprocessor features as they want, they just don't have to
       get visible in our code.
     * two types: int and char, so we can interpret memory as words or as
       bytes.

Local version of C4

   The local version of C4 has the following adaoptions and extensions:
     * switch statement from the switch-and-structs branch, adapted c4
       itself to use switch statements instead of if's (as in the
       switch-and-structs branch)
     * struct support from switch-and-structs
     * constants like EOF, EXIT_SUCCESS, NULL
     * standard C block comments along to c++ end of line ones
     * negative enum initializers
     * do/while loops
     * more C functions like isspace, getc, strcmp
     * some simplified functions for printing like putstring, putint,
       putnl replacing printf-like functions
     * BSD-style string functions like strlcpy, strlcat
     * strict C89 conformance, mainly use standard comment blocks, also
       removed some warnings
     * some casts around malloc and memset to fit to non-void
       freestanding-libc
     * converted printf to putstring/putint/putnl and some helper
       functions for error reporting like error()
     * removed all memory leaks
     * de-POSIX-ified, no open/read/close, use getchar from stdin only
       (don't assume the existence of a file system), this also means we
       had to create sort of an old style tape-file with FS markers to
       separate the files piped to c4.

   The reason for all those adaptions is to minimize the dependency on the
   host system and to be able to use libc-freestanding.c.

   Note: Only too late I discovered that there was a C5 version of the
   same compiler, which would maybe have served better as a basis.

Examples

  Running on the host system using the hosts C compiler

   Compiled in either hosted (host libc) or freestanding (our own libc,
   currently IA-32 Linux kernel only syscalls):
./build.sh cc hostcc hosted d
./build.sh cc hostcc freestanding d
./cc < test1.c > test1.asm

   Create a plain binary from the assembly code:
fasm test1.asm test1.bin

   Disassemble it to verify it's correctness:
ndisasm -b32 -o1000000h -a test1.bin

   You can choose gcc, clang, tcc or pcc as host compiler (hostcc).

  Running on the host in the C4 interpreter

   Running in C4 interpreter, again, the C4 program can be compiled in
   hosted or freestanding mode:
./build.sh c4 hostcc hosted d
./build.sh c4 hostcc freestanding d

   Here again you can choose the host compiler for compiling C4.

   Then we have to create the standard input for C4 using:
echo -n -e "\034" > EOF
cat cc.c EOF hello.c | ./c4
cat c4.c EOF cc.c EOF hello.c | ./c4
cat c4.c4 EOF c4.c EOF cc.c EOF hello.c | ./c4

   EOF contains the traditional FS (file separator) character in the ASCII
   character set. Every time c4 is invoked it reads exacly one input file
   up to the first FS character (or stops at the end of stdin).

   We can also use -s, or -d on every level as follows:
cat cc.c EOF hello.c | ./c4 -d

Features and Requirements

   We have to careful what to put in a bootstrapping compiler, there is a
   tradeoff between
     * costs to implement it in the language: code complexity and size
       must be kept at a minimun
     * required features from the C runtime: for instance support of the
       whole Unicde characters
     * required features from the operating system: for instance the
       requirement for a POSIX layer
     * ugliness of the resulting code: it doesn't make sense to omit
       features which are somewhat hard to implement but can render
       readability of the code much better: best examples are the
       switch/case vs. if/else and the funny array indexing vs. structures
       in C4.

   So we collect some ideas here about features we add or do not add and
   why. We also collect here the implications when we are implementing
   them.

   We also have to be careful what C4 can do for us and either add it
   there (but only if small enough) in order not to loose this test case.

  Preprocessor for modularisation

   Implementation status: no

   Reasoning:
     * #include <filename.h>: requires a filename and thus an operating
       system with a filesystem early on, needs directory functions and
       open/close/read/write on files
     * C4 ignores all prepocessor commands and treats them as comments
       till the end of line

   Alternative:
     * have a cat *.c concatenating all the source code

   Counter arguments:
     * we could have sort of a special tape filesystem with a directory at
       the beginning and offset to allow the preprocessor to find the
       files for the early destination OS
     * there is no reason not to use the preprocessor on the host
     * We can end up in nasty hen-and-egg problems with functions needing
       each other for different falvours of implementation (e. g. atoi in
       hosted and freestading), so we might end up duplicating source
       code.

  Preprocessor for conditional compilation

   Implementation status: no

   Reasoning:
     * especially simple #ifdefs would be simple to do, conditional
       compilation in the same file is questionable, there is nothing
       wrong with including files like linux_386.c, linux_x86_64.c
     * they are not part of C, they are an addon (though they are nowadays
       sometimes treated as if they were part of C)
     * C4 ignores all prepocessor commands and treats them as comments
       till the end of line

   Alternative:
     * Use special files and concatenate them at the right place, e. g.
       _start-stub.c for tcc

  Preprocessor for constant declarations

   Implementation status: no

   Reasoning:
     * enum constants serve the same purpose, we prefer those
     * even C89 can work with enum constants

   Caveats:
     * C4 has only positive enums, we need negative ones for EOF
     * enum constants are signed integer, so we have to be careful when
       using them for chars
     * we don't use enum for enumeration types

  Variable Initializers

   Implementation status: no

   Reasoning:
     * initializers of global and locals, not that important as we use C89
       anyway, forcing us to separate declaration and usage of variables
       per scope

   Counter arguments:
     * for example symname[t]: printing the symbol and not the number,
       requires static initializers for array of char* to make the code
       look nice. Of course we can also just initialize it in a piece of
       code.

  Inline Assembly

   Implementation status: yes

   Reasoning:
     * somehow we have to communicate to the operating system, it's either
       inline assembly or external linking to assembly code

   Counter arguments:
     * C4 has no inline assembly, we must add it there too

   Alternative:
     * we could implement a linker and object format early on, we could
       use an external assembler as for instance asm-i386 (which
       understands a subset of fasm)
     * we could implement an early dump format for symbols (function
       signatures mainly) and use them during linking

   Some general notes:

   GNU inline asm statement has become the de-facto standard (which is too
   complicated IMHO): I would require sort of a .byte 0xXX instruction
   only, for readablility maybe simple fasm-like syntax. We must be
   careful that our invention of an inline assembler can be mapped somehow
   to the GNU inline asm version, so that we can use that one on the host
   with gcc/clang/tcc/pcc..

   c.c in swieros (the c4 successor) has asm(NOP), this is something we
   could implement easily. u.h contains an enum with opcodes (most likely
   doable or an easy architecture like the one in swieros, I doubt this
   works for Intel opcodes, but we should check if it works for our
   simplified Intel opcode subset).

   There should though be only one single point of information for opcodes
   per architecture, so asm gets sort of an inline string generator for
   the assembly output. Or we share a common C-file with enums for the
   opcodes and cat it to both the assembler and the compiler during the
   build (should not result in increaed code size, as those are enums).

   The asm(x) or asm(x,y) constructs can be mapped on the host compilers
   to asm __volatile__ .byte ugliness. In cc and c4 we can take the
   swieros approach. This should give us nice lowlevel inline assembly in
   a really simplified way (basically embedding bytes).

   Not having inline assembly means you need compilation units written and
   linked to the program in assembly, which - well - adds a linker and
   calling conventions, which might be too early in bootstrapping.

  Object formats and linkers

   Implementation status: no

   Reasoning:
     * requires file systems and linker formats like ELF

   Alternative:
     * make compilation units from a bunch of source code, this results in
       bigger binaries, as we cannot share too much. This might be
       acceptable for early bootstrapping. Later on it would be nice to
       add a linker as an optional addon (which can be used outside of
       bootstrapping).

  Forward declarations of function prototypes

   Implementation status: yes

   Reasoning:
     * Recursive descent parsing requires forward function declarations.

   Caveats:
     * Forward function declarations are not that easy to implement,
       because you have to generate a placeholder for the call address
       before you get the whole definition of the forwarded function
       (especially its entry address). Or we have to create sort of a
       temporary jump into a jump table (sort of a GOT) which we patch
       when we know the address of the implementation of the function.
       Either way we loose the 1-pass output of the generated code. Having
       a global table at one place scales easier, as we don't have to keep
       the whole generated code around just for patching (remember, we
       have tapes and memory, no seek of files).
     * Must be added to c4 too, might not be that easy (TODO)

   Counter arguments:
     * We could write a non-recursive parser using tables

  Functions with variable arguments

   Implementation status: no

   Reasoning:
     * not type-safe
     * only used for (s)printfs string manipulation and output functions
     * C4 doesn't implement variable arguments for defining functions, so
       we cannot bootstrap a freestanding version

   Requirements
     * on the hosted Linux envorinment we need syscalls to syscall, int
       0x80, etc.
     * we need inline assembly to create the syscalls

   Alternative:
     * create simple functions like putchar, putint, putstring (one per
       basic type) and some helpers like putnl
     * use stlcat and stlcpy should be enought to compose label names
     * instead of a generic syscall(..) we can easily work around this by
       having syscall1, syscall2, syscall3, ... with a fixed number of
       parameters

   Counter arguments:
     * we deviate from the C standard, printf just belongs to C
     * printf is actually not hard to implement in a type-unsafe-way
     * syscalls have variable arguments

  FILE* and stderr

   Implementation status: no

   Reasoning:
     * requires FILE * structures, requires various write channels from
       the operating system
     * we can write error messages into the output stream as comments like
       ; ERROR in line 32, pos 2: generic error (TODO)

   Counter arguments:
     * unbuffered FILE * is simple to implement
     * almost all operating systems have the notion of stdout and stderr,
       we can set them to the same channel if an OS doesn't have them.

  Typedefs

   Implementation status: no

   Reasoning:
     * typedefs are syntactic sugar for typedef struct T as T, not
       strictly necessary and they don't make our code look too ugly.

   Counter arguments:
     * portability without preprocessor could make use of typedef ulong64
       unsigned long int and similar constructs

  For-loops

   Implementation status: no

   Reasoning:
     * unless we start optimizing (SIMD) there is no real benefit for a
       generic 'for', a strict for i=0 to N, i++ is easier to optimize,
       when you have a grammatical construct to help recognizing it.
     * the C for loop has a funny ending semicolon issue and you can add
       arbitraty code blocks into the three parts of the for, this is way
       too generic for a minimalistic language

   Counter arguments:
     * for-loops are not hard to implement

  Passing arguments to main

   Implementation status: yes

   Reasoning:
     * this is just passing an int (argc) and a char** (argv) from _start
       to main. This is also a convention shared with the operating system
       when it calls our process. Without that we would have a hard time
       to parametrize the calls to our process both on the host and in the
       target OS.

   Counter argument:
     * Do really everything over the input stream, but this would feel a
       little bit too mainframe-ish.

  Boolean Type

   Implementation status: no

   Reasoning:
     * C89 has no bool type
     * useful, but not strigtly necessary, we can live with int holding a
       boolean value

   Counter arguments:
     * boolean and integer values and variables form a nice little type
       system for expressions so introducing booleans might have some
       educational value.

  Union

   Implementation status: no

   Reasoning:
     * in the compiler there is little benefit of compressing parts of e.
       g. ASTs into unions
     * unions allow accessing the same memory via different means, this
       can also be achieved with code accessing the memory differentlyand
       doing some byte/word conversions for instance.

   Counter arguments:
     * for bootstrapping the operating system we might need unions (as
       well as packing and alignment)

  Dangling else

   Implementation status: no

   Reasoning:
     * Don't allow non-blocked if/else, just avoid the dangling else
       problem, it's uninteresting and in C not as bad as in PASCAL-like
       languages, as the block markers are just one character big.

  Short-circuit conditions

   Implementations status: yes

   Reasoning:
     * Not really necessary as we could get away with if( a != null ) {
       if( a->x == 7 ) { ... } }, but it makes code look rather ugly.

  Return statement

   Implemention status: yes, but..

   Reasoning:
     * There are good reasons for not allowing return everywhere in the
       code, see newest Oberon revisions allowing RETURN only at the end
       of the function declaration. There are benefits like easier
       detection whether the function returns a value, easier flow
       analysis (imagine returns in complicated if-else-statements). For
       now we allow it everywhere, but we should try hard not to use it in
       the middle of code blocks.
     * There is an argument from the code correctness point of view as
       return in the middle of code makes the code hard to reason about
       (similar to too many if-else-statements)
     * Allowing return only at the end of a function and nowhere else
       makes tail-recursion easy.
     * Error handling is really hard when return appears everywhere in the
       body, it's much easier to check whether there is a return at the
       end and whether the returned type matches the prototype of the
       function,

  Register Allocation

   Implementation status: yes

   Reasoning:
     * Compared to a stack machine even the simplest register allocation
       algorithm produces much better code

   Counter arguments:
     * Stack machine with the top of stack in EAX is also quite a simple
       and efficient solution (see Write Your Own Compiler, Holm).

  Abstract Syntax Trees

   Implementation status: yes

   Reasoning:
     * Delaying code generation is essential when doing an assignment
       (rvalue must be evaluated before lvalue), in const folding (do no
       generate code as you don't know the expression is a constant but
       instead just compute the current value of the constant expression).
     * Separate semantic operation array index evaluation from definition
       of the size of an array for instance with the '[' character (we do
       not want to react on the scanner symbol directly).
     * When evaluating a boolean expression we don't know yet its context
       (can be in a if/while condition, in which case we would generate a
       conditional jump or it can be in an an assignment, in which case we
       would store the value into a boolean variable).

   Caveats:
     * Try to keep the scope of an AST as small as possible and as big as
       necessary (the output of the parser should not be the complete
       source code). The mininum we need is an expression and some
       context, the maximum maybe is the scope of a function.

   Counter arguments:
     * Readable AST code needs some trees and pointers to structs, but we
       want those anyway for better readable code (C4 for instance is not
       that readable in it's original array indexing version).

  Builtin functions

   Implementation status: yes

   Reasoning:
     * Adding putint/getchar style of functions as elements of the
       language is tempting, as it allows early debugging and testing (as
       in PASCAL). The fundemental conflict here is that bootstrapping is
       better with stdout and stdin in the language (no function calls, no
       linker, etc. needed). But later on we want those functions be part
       of a language library and not of the language itself.

   Caveats:
     * Avoid code duplication (inline assembly in the compiler for the
       keyword implementation and with inline assembly in the language
       library). (TODO)

  Function calling conventions

   Implementation status: yes

   Reasoning:
     * Calling conventions, EAX for int or pointer returns, stack as in
       Pascal in calling order (otherwise we need an AST of the parameters
       if we want to push them in reverse order). Reverse order is there
       so that the first parameter is on top of the stack and we now the
       start of the stack frame, this helps implementing varargs, which we
       don't want to support.
     * Currently we only have ints, chars and pointers, which should fit
       nicely into simple memory models where pointers and integers are
       not completely different. char/byte arguments can be pushed as
       4-bytes, we could do some stack alignment to simplify things.

   Caveats:
     * Operating syscalls follow the EAX, EBX, ECX, EDX, .. int 80h calls
       on the 32-bit host. There might be our own operating syscalls we
       want to support later. The compiler should not know about that and
       a thin layer in the standard library can do the conversion.

   Counter arguments:
     * thiscall conventions would be handy if we had some limited C++ this
       pointer support.

References

   Compiler construction in general:
     * "Compiler Construction"", Niklaus Wirth
     * [1]https://github.com/DoctorWkt/acwj: a nice series on building a C
       compiler, step by step with lots of good explanations
     * [2]https://github.com/lotabout/write-a-C-interpreter/blob/master/tu
       torial/en/, tutorial based on C4 how to build a C interpreter,
       explains nicely details in C4.

   Some special compiler building topics:
     * [3]https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing,
       [4]https://en.wikipedia.org/wiki/Operator-precedence_parser#Precede
       nce_climbing_method
     * [5]https://en.wikipedia.org/wiki/Strahler_number: justification for
       register numbers for register alloation (TODO: clarify)
     * [6]https://en.wikipedia.org/wiki/X86_calling_conventions: calling
       conventions on the IA-32 architecture

   C4:
     * [7]https://github.com/rswier/c4.git, C4 - C in four functions,
       Robert Swierczek, minimalistic C compiler running on an emulator on
       the IR, inspiration for this project
     * [8]https://github.com/rswier/c4/blob/switch-and-structs/c4.c, c4
       adaptions to provide switch and structs
     * [9]https://github.com/EarlGray/c4: a X86 JIT version of c4
     * [10]https://github.com/jserv/amacc: based on C4, JIT or native
       code, for ARM, quite well documented, also very nice list of
       compiler resources on Github page

   Other minimal compilers and systems:
     * [11]http://selfie.cs.uni-salzburg.at/: Christoph Kirsch, C*
       self-hosting C compiler (also emulator, hypervisor) for RISCV,
       inspiration for what makes up a minimal C language
     * [12]http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complem
       ents/tinyc.c, Marc Feeley, really easy and much more readable,
       meant as educational compiler
     * [13]https://github.com/rswier/swieros.git: Robert Swierczek, c.c in
       swieros
     * [14]https://github.com/ras52/boostrap: Richard Smith, bootstrapping
       experiment
     * [15]http://t3x.org/t3x: Nils M Holm, the T3X programming language,
       especially the bootstrapping version T3X9

   Assembly:
     * [16]https://github.com/felipensp/assembly/blob/master/x86/itoa.s,
       for putint (early debugging keyword)
     * [17]https://baptiste-wicht.com/posts/2011/11/print-strings-integers
       -intel-assembly.htm (early debugging keyword)

   Documentation:
     * [18]http://cowlark.com/wordgrinder/index.html: the fabulous editor
       which just does what it should do
     * [19]https://github.com/mity/md4c: markdown to HTML in C

References

   1. https://github.com/DoctorWkt/acwj
   2. https://github.com/lotabout/write-a-C-interpreter/blob/master/tutorial/en/
   3. https://www.engr.mun.ca/~theo/Misc/exp
   4. https://en.wikipedia.org/wiki/Operator-precedence
   5. https://en.wikipedia.org/wiki/Strahler
   6. https://en.wikipedia.org/wiki/X86
   7. https://github.com/rswier/c4.git
   8. https://github.com/rswier/c4/blob/switch-and-structs/c4.c
   9. https://github.com/EarlGray/c4
  10. https://github.com/jserv/amacc
  11. http://selfie.cs.uni-salzburg.at/
  12. http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c
  13. https://github.com/rswier/swieros.git
  14. https://github.com/ras52/boostrap
  15. http://t3x.org/t3x
  16. https://github.com/felipensp/assembly/blob/master/x86/itoa.s
  17. https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.htm
  18. http://cowlark.com/wordgrinder/index.html
  19. https://github.com/mity/md4c