debugging
---------

    set disassemble-next-line on
    display $eax
    display $ebx
    display $ecx
    info registers
    start
    si

run in emulator (also for unit testing)
---------------

Non-ELF, linear layout, start address 0x01000000, stack starts at 0x01800000
growing downwards. No segments or protections for now.

    gcc -g -Wall -std=c99 -o emul emul.c -lunicorn -lcapstone -pthread
    fpc main.pas
    ./main < test.prog > test.asm
    nasm -o test.bin -f bin test.asm
    ./emul test.bin

TODO: assemble the code in-process instead of calling nasm? Capstone only
disassembles, so this would need a separate assembler library (Keystone, for
instance). Currently we use nasm to produce a binary and Capstone to decode
it. What's better?

TODO: architecture-independent 'hlt' instruction to stop emulation at the
right point?

TODO: we would like to set flavours of the CPU, a real i486 instruction set
or enable/disable certain features like SSE2.

links
-----

https://en.wikibooks.org/wiki/68000_Assembly
https://en.wikipedia.org/wiki/X86_instruction_listings
file:///media/sd/compilertests/doc/Intel 80x86 Assembly Language OpCodes.html#div
https://github.com/Rewzilla/asemu.git
http://www.unicorn-engine.org/
http://www.capstone-engine.org/
https://github.com/lotabout/Let-s-build-a-compiler.git

findings
--------

general:

Doesn't follow the normal path. I personally don't like the omission of a
scanner, as a scanner helps to deal with weird issues I encountered, like
when parsing conditions.

tutor2:

The expression and data stacks should be better explained and linked to the
abstract description of Dijkstra's Two-Stack Algorithm. The trick here is
that the stack of operators is the function call stack of the recursive
descent parser, while the other one lives on the machine stack (or could
live in registers if you have many of them, or mixed: registers for < 8
operands, the rest on the stack). Conceptually it is still a stack of
operands.

tutor3, local variables:

Local variables can live at addresses after the code, letting the assembler
fill in the dword ptr [X] addresses.
The compiler merely adds [X] references and X: dw 0 labels and data
initializers.

https://stackoverflow.com/questions/18447627/what-is-pc-relative-addressing-and-how-can-i-use-it-in-masm
(stackoverflow_what-is-pc-relative-addressing-and-how-can-i-use-it-in-masm.txt)

Emulate the (PC) relative addressing of the MIPS:

    CALL x
    x:
    ; Now address "x" is on the stack
    POP EDI
    ; Now EDI contains address of "x"
    ; Now we can do (pseudo-)PC-Relative addressing:
    MOV EAX,[EDI+1234]

But we still have to calculate relative addresses. On the other hand we lose
a register, which becomes a sort of address base register.

PC-relative addressing is a good thing, as it makes code relocatable for
free.

Some survey on which CPUs have/had PC-relative addressing:
https://www10.dict.cc/wp_examples.php?lp_id=1&lang=en&s=relative
(tutor3-wp_examples_relative.txt)

There is no initializer yet, so we can just address the initialized variable
in the expression for now.

=> there is a tradeoff here: how much do we do in the compiler, how much in
   the assembler and linker?
=> security comes into play later: we don't want code and variables to mix!
   We don't want constants (like string literals, variable constants) to be
   in the same segment as mutable variables. So PC-relative addressing is
   something from the past from this point of view.

tutor3, functions

We could make the parser deterministic by having one special leading
character, for instance:

    4     digits
    Vxxx  variables
    Fxxx  functions

Then we could decide on the LookAhead character alone. But that's hardly a
benefit for the people using the language. The alternative is to decide
AFTER reading a symbol (in this case Ident) and then use the lookahead '('.
Or we know after reading what type it is, because it was declared. The
approach here is that 'x' is the variable and 'x()' is the function call.

The generated code is not complete: we also need to generate at least some
function stubs with a 'ret'. This would also be a better place to introduce
the distinction between variable and function names.
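
The "decide after reading Ident, using the lookahead '('" approach can be
sketched as follows. This is a minimal sketch in Python, not the tutorial's
Pascal code: the names compile_ident and function_stub and the emitted
NASM-style instructions are assumptions made for illustration only.

```python
def compile_ident(src, pos=0):
    """Parse an identifier at src[pos:] and emit a variable load or a call.

    Returns (emitted_lines, new_pos). Hypothetical helper, not from the tutorial.
    """
    if pos >= len(src) or not src[pos].isalpha():
        raise SyntaxError(f"identifier expected at position {pos}")
    start = pos
    while pos < len(src) and src[pos].isalnum():
        pos += 1
    name = src[start:pos]
    # Decide AFTER reading the name: a lookahead '(' marks a function call.
    if pos < len(src) and src[pos] == "(":
        pos += 1
        if pos >= len(src) or src[pos] != ")":
            raise SyntaxError("')' expected (this sketch has no arguments)")
        pos += 1
        return [f"call {name}"], pos
    # Otherwise the identifier is a variable: load it into the accumulator.
    return [f"mov eax, dword [{name}]"], pos


def function_stub(name):
    """Emit the minimal body needed so that 'call name' has a target."""
    return [f"{name}:", "    ret"]
```

For example, compile_ident("x") yields a load of [x], while
compile_ident("x()") yields a call; function_stub("x") gives the 'ret' stub
that the generated code is still missing.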

tutor3, getchar/white space handling

Interestingly he starts with a non-scanner, parser-only approach and
introduces lexing stuff afterwards. Checking for LF feels hacky. He now
mentions the lexer. The main reason for a lexer is to keep the parser
simple: for instance it can work on one lookahead 'character' like
GREATER_EQUALS instead of the individual characters '>' and '='.

tutor4

Good point: even an expression parser can use an interpreter to simplify
constant expressions before generating code for them. The other option is to
let the programmer write the final constant and add a comment for it. Sort
of an extreme here would be modern C++, where constant expressions are THE
thing nowadays.
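
The idea of simplifying constant expressions inside the expression parser
can be sketched like this. It is a minimal sketch in Python under stated
assumptions, not the tutorial's code: the names tokenize, fold, Parser and
parse are hypothetical, integers are folded at parse time, and anything
involving a variable stays a small (op, left, right) tuple for a later code
generator. The tokenizer silently skips characters it does not know.

```python
import re


def tokenize(text):
    # Numbers, identifiers and single-character operators; the rest is skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)


def fold(op, left, right):
    """Evaluate op at compile time if both operands are integer constants."""
    if isinstance(left, int) and isinstance(right, int):
        if op == "+":
            return left + right
        if op == "-":
            return left - right
        if op == "*":
            return left * right
        if right != 0:
            # Truncate toward zero, like x86 idiv does.
            q = abs(left) // abs(right)
            return q if (left < 0) == (right < 0) else -q
    # Not foldable (variable operand or division by zero): keep the node.
    return (op, left, right)


class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def factor(self):
        tok = self.next()
        if tok == "(":
            node = self.expression()
            if self.next() != ")":
                raise SyntaxError("')' expected")
            return node
        if tok is not None and tok.isdigit():
            return int(tok)
        if tok is not None and (tok[0].isalpha() or tok[0] == "_"):
            return tok  # variable name, value unknown at compile time
        raise SyntaxError(f"unexpected token {tok!r}")

    def binary(self, operand, ops):
        left = operand()
        while self.peek() in ops:
            op = self.next()
            left = fold(op, left, operand())
        return left

    def term(self):
        return self.binary(self.factor, ("*", "/"))

    def expression(self):
        return self.binary(self.term, ("+", "-"))


def parse(text):
    parser = Parser(text)
    node = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node
```

With this, parse("2*(3+4)") collapses to the single constant 14, so the code
generator would only need one load, while parse("x+2*3") keeps the variable
and folds just the constant part. Folding is purely local: "x+2+3" is not
simplified, because the left operand of the second '+' is not a constant.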