debugging
---------

set disassemble-next-line on 
display $eax
display $ebx
display $ecx
info registers
start
si

run in emulator (also for unit testing)
---------------------------------------

Non-ELF, linear layout, start address 0x01000000,
stack starts at 0x01800000 growing downwards.
No segments or protections for now.

gcc -g -Wall -std=c99 -o emul emul.c -lunicorn -lcapstone -pthread
fpc main.pas 
./main < test.prog > test.asm
nasm -o test.bin -f bin test.asm
./emul test.bin

TODO: use a library to actually assemble the code (Capstone only
disassembles; its sibling project Keystone is the assembler).
Currently we use nasm to produce a binary and capstone to decode it.
What's better?

TODO: an architecture-independent 'hlt' instruction to stop
emulation at the right point?

TODO: we would like to select flavours of the CPU: a real i486
instruction set, or enabling/disabling certain features like SSE2.

links
-----

https://en.wikibooks.org/wiki/68000_Assembly
https://en.wikipedia.org/wiki/X86_instruction_listings
file:///media/sd/compilertests/doc/Intel 80x86 Assembly Language OpCodes.html#div

https://github.com/Rewzilla/asemu.git
http://www.unicorn-engine.org/
http://www.capstone-engine.org/

https://github.com/lotabout/Let-s-build-a-compiler.git

findings
--------

general: the tutorial doesn't follow the conventional compiler structure

I personally don't like the omission of a scanner, as a scanner helps
to deal with weird issues I encountered, like when parsing conditions.

tutor2:

The expression and data stacks should be better explained and linked
to the abstract description of Dijkstra's two-stack algorithm. The
trick here is that one stack, holding the operators, is the function
call stack of the recursive descent parser, while the other, holding
the operands, lives on the machine stack (or in registers if you have
many of them, or mixed: registers for fewer than 8 operands, the rest
spilled to the stack) - but conceptually it is a stack of operands.

tutor3, local variables:

Local variables can be placed after the code, letting the assembler
fill in the dword ptr [X] addresses. The compiler merely emits [X]
references plus 'X: dd 0' labels and data initializers.

https://stackoverflow.com/questions/18447627/what-is-pc-relative-addressing-and-how-can-i-use-it-in-masm
(stackoverflow_what-is-pc-relative-addressing-and-how-can-i-use-it-in-masm.txt)

emulate the (PC) relative addressing of the MIPS:

CALL x
x:
  ; Now address "x" is on the stack
  POP EDI
  ; Now EDI contains address of "x"
  ; Now we can do (pseudo-)PC-Relative addressing:
  MOV EAX,[EDI+1234]

But we still have to calculate relative addresses. On the other hand,
we lose a register to serve as a sort of address base register.

PC-relative addressing is a good thing, as it makes
code relocatable for free.

A survey of which CPUs have/had PC-relative addressing:

https://www10.dict.cc/wp_examples.php?lp_id=1&lang=en&s=relative
(tutor3-wp_examples_relative.txt)

There is no initializer yet, so we can just address the initialized
variable in the expression for now.

=> there is a tradeoff here: how much do we do in the compiler, how
   much in the assembler/linker?

=> security comes into play later: we don't want code and variables
   to mix! We don't want constants (like string literals or constant
   variables) to be in the same segment as mutable variables. From
   this point of view, PC-relative addressing is something of the
   past.

tutor3, functions:

We could make the parser deterministic by giving each symbol class a
distinguishing first character, for instance:

4    digits
Vxxx variables
Fxxx functions

Then we could decide on the lookahead character alone. But that's
hardly a benefit for the people using the language. The alternatives
are to decide AFTER reading a symbol (here an Ident) using the
lookahead '(', or to know its type from the declaration once the
symbol has been read.

The approach taken here is that 'x' is a variable reference and 'x()'
is a function call.

The generated code is not complete; at minimum we also need to
generate function stubs with a 'ret'.

Also, the distinction between variable and function names would be
better introduced here.

tutor3, getchar/whitespace handling:

Interestingly, he starts with a scanner-less, parser-only approach and
introduces the lexing machinery afterwards.

Checking for LF feels hacky.

He now brings in the lexer. The main reason for a lexer is to keep the
parser simple: it can work on a single lookahead 'character' like
GREATER_EQUALS instead of handling the individual characters '>' and
'=' itself.

tutor4:

Good point: even an expression parser can use an interpreter to
simplify constant expressions before generating code for them. The
other option is to let the programmer write the final constant and
add a comment deriving it. The extreme end of this is modern C++,
where constexpr is THE thing nowadays.