Cool (the course language)
spim (MIPS simulator)
string -> lexical analysis -> token = <class, string>
single-character token classes (brackets, semicolon)
= is an assignment token, but == belongs to the operator class
lexemes are the "words" of a program
<token class, lexeme> -> token
lexical analyzers should need minimal lookahead (1 character is normal)
- lookahead examples: distinguishing == from =
- C++ templates: the closing >> in V<W<T>> vs. the shift operator >>
regular languages
-> Sigma: the alphabet, Sigma*: all strings over it (Kleene closure)
- regular expressions: epsilon, single characters, union, concatenation, iteration (Kleene star)
- regular languages describe the sets of strings that make up tokens
  - base cases: the epsilon language and single-character languages
  - built up by union, concatenation, iteration
- formal languages
- meaning function L(e) = M
  - maps something in syntax to its semantics
  - L: expression -> set of strings
  - same semantics, different syntax: Arabic vs. Roman numerals, very good example
  - multiple expressions usually have the same meaning/semantics
  - L is an N:1 function, the basis of optimization (see the sketch below)
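  A minimal sketch (my own illustration, not from the notes) of a meaning function
  over a tiny regex AST in Python; Kleene star is truncated at a maximum length so
  the resulting set stays finite:

    # L maps a regex expression to the set of strings it denotes (truncated star).
    def L(e, max_len=3):
        kind = e[0]
        if kind == "eps":                    # epsilon: only the empty string
            return {""}
        if kind == "char":                   # a single character, e.g. ("char", "a")
            return {e[1]}
        if kind == "union":                  # A + B
            return L(e[1], max_len) | L(e[2], max_len)
        if kind == "concat":                 # AB, keep only strings up to max_len
            return {x + y for x in L(e[1], max_len) for y in L(e[2], max_len)
                    if len(x + y) <= max_len}
        if kind == "star":                   # A*, iterate until no new strings appear
            result, frontier = {""}, {""}
            while frontier:
                frontier = {x + y for x in frontier for y in L(e[1], max_len)
                            if len(x + y) <= max_len} - result
                result |= frontier
            return result
        raise ValueError(kind)

    # two different expressions with the same meaning (N:1):
    assert L(("union", ("char", "a"), ("char", "b"))) == \
           L(("union", ("char", "b"), ("char", "a")))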
- regex for parsing
- deducelt
- lexical specification
  - one "monster" regex that is the union of all token rules
  - maximal munch: always choose the longest matching token
  - ambiguities: keyword vs. identifier
    => priority ordering (the rule defined first wins)
  - if no rule matches: add an error production
  - finite automata: the implementation mechanism for regular expressions
  - implementation of lexical analyzers using regular expressions
    - epsilon-move: a state transition without consuming input
      DFAs have no choice: no epsilon-moves, one transition per input per state,
      deterministic
      NFA: non-deterministic
    - DFA table: a 2-dimensional array indexed by state and input, with the next
      state as the matrix element; duplicate rows can be shared by keeping, per state,
      pointers to vectors of transitions
    - table-driven DFA (see the sketch after this block)
    - alternative: perform no NFA-to-DFA conversion at all
    - table-driven NFA, also a 2-dim matrix (entries are sets of states)
    - Coco/R: can work with a DFA or an NFA and switches between them
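    A compact sketch (my own, with hypothetical token classes) of a table-driven DFA
    lexer that applies maximal munch: it keeps running the automaton, remembers the
    last accepting state, and emits the longest match; anything unmatched becomes an
    error token:

      import string

      # states: 0 start, 1 identifier, 2 "=", 3 "=="; accepting states -> token classes
      def build_table():
          delta = {}                               # (state, char) -> next state
          for c in string.ascii_letters:
              delta[(0, c)] = 1                    # identifier start
              delta[(1, c)] = 1                    # identifier continues
          delta[(0, "=")] = 2                      # "=" assignment so far
          delta[(2, "=")] = 3                      # "==" operator
          return delta

      ACCEPT = {1: "IDENT", 2: "ASSIGN", 3: "EQ"}

      def lex(text):
          delta, tokens, i = build_table(), [], 0
          while i < len(text):
              state, j, last = 0, i, None
              while j < len(text) and (state, text[j]) in delta:
                  state = delta[(state, text[j])]
                  j += 1
                  if state in ACCEPT:
                      last = (j, ACCEPT[state])    # longest accepting prefix so far
              if last is None:                     # no rule matches: error token
                  tokens.append(("ERROR", text[i])); i += 1
              else:
                  tokens.append((last[1], text[i:last[0]])); i = last[0]
          return tokens

      # lex("x==y") -> [("IDENT", "x"), ("EQ", "=="), ("IDENT", "y")]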
- chapter 5
  - regular languages can only count mod k
  - context-free grammars (CFG)
  - CFG gives us: program valid or not, nothing more
  - there is an N:1 relation from grammars to languages, so it is quite likely
    that we can fit a grammar to the specific tools at hand
  - parse tree: derived with the production rules
    a sequence of N derivation steps defines a parse tree of the input
    left-most derivation, right-most derivation
  - ambiguous grammars
    - option 1: rewrite the grammar
    - prioritize production rules
    - example: stratify expressions into term and factor levels (see the grammar
      after this block)
    - classic problem of optional else with nested ifs (dangling else):
      the else binds to the closest then
    - option 2: keep the ambiguous grammar and declare precedence and associativity
      (%left '+')
      -> reads more naturally, used in tools
    - Oberon and hand-written parsers use the first approach
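    The standard stratified expression grammar (a textbook example, not copied from
    the lectures) that makes * bind tighter than + and both left-associative:

      E -> E + T | T
      T -> T * F | F
      F -> ( E ) | id

    Note that this grammar is still left-recursive, so a recursive descent parser
    would additionally need the left-recursion elimination described below.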
  - errors: lexical, syntax, semantic
    recovery strategies currently used:
    - panic mode (bison: the error token, skip input until re-synchronized)
    - error productions (e.g. gcc warnings about problematic but parseable constructs)
  - we don't want the parse tree, we need AST, abstract syntax tree
    AST is simpler than the parse tree, algorithms get easier on an AST
  - top-down
  - bottom-up
  - recursive descent (RD) parser: the most common one, very easy to
    write by hand, top-down (see the sketch after this block)
    - very easy to implement
    - left recursion -> use right recursion (rewrite the grammar); productions must
      consume terminals before recursing, otherwise the parser loops forever
      S -> S a | b1 | b2 | ... | bn
      =>
      S -> b1 S' | b2 S' | ... | bn S'
      S' -> a S' | epsilon
      -> the general case is cycles of non-terminals inducing left recursion
      => parser generators can do this rewrite automatically, keeping the
         grammar human-readable!
      => nevertheless often done by hand to keep control over the semantic actions
    - gcc used bison/flex, now uses a hand-written recursive descent parser
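    A minimal recursive descent parser sketch in Python (my own, not from the notes)
    for the grammar  E -> T ('+' T)*,  T -> F ('*' F)*,  F -> '(' E ')' | NUM; the
    while-loops stand in for the right-recursive rewrite of the left-recursive rules:

      def parse(tokens):
          pos = [0]                                # mutable index shared by the helpers

          def peek():
              return tokens[pos[0]] if pos[0] < len(tokens) else None

          def eat(expected=None):
              tok = peek()
              if tok is None or (expected is not None and tok != expected):
                  raise SyntaxError(f"unexpected {tok!r}")
              pos[0] += 1
              return tok

          def E():                                 # E -> T ('+' T)*
              node = T()
              while peek() == "+":
                  eat("+"); node = ("+", node, T())
              return node

          def T():                                 # T -> F ('*' F)*
              node = F()
              while peek() == "*":
                  eat("*"); node = ("*", node, F())
              return node

          def F():                                 # F -> '(' E ')' | NUM
              if peek() == "(":
                  eat("("); node = E(); eat(")"); return node
              return ("num", eat())                # any other token is taken as a number

          tree = E()
          if peek() is not None:
              raise SyntaxError("trailing input")
          return tree

      # parse(["1", "+", "2", "*", "3"])
      #   -> ('+', ('num', '1'), ('*', ('num', '2'), ('num', '3')))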
  - predictive parsing: top-down, no backtracking, LL(k) grammars
    left-to-right scan, left-most derivation, k tokens of look-ahead
    in practice k=1, i.e. LL(1)
    - needs left-factoring
    - the parser can be driven by a table
    - Coco/R is LL(k)
    - LL(1) table construction
  - things we cannot do with parsers alone:
    - checking whether identifiers are declared before used
    - checking that function arguments are valid and match the
      parameter types of the declaration
  - First sets First(A) and Follow sets Follow(A)
    => construct the LL(1) parsing table (see the sketch after this item)
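    A small sketch (mine) of the fixed-point computation of First sets for a toy
    grammar; terminals are their own First set, and the empty production makes a
    non-terminal nullable:

      # grammar: E -> T E',  E' -> '+' T E' | eps,  T -> '(' E ')' | 'id'
      GRAMMAR = {
          "E":  [["T", "E'"]],
          "E'": [["+", "T", "E'"], []],            # [] is the epsilon production
          "T":  [["(", "E", ")"], ["id"]],
      }

      def first_sets(grammar):
          first = {nt: set() for nt in grammar}
          changed = True
          while changed:                           # iterate until a fixed point
              changed = False
              for nt, prods in grammar.items():
                  for prod in prods:
                      nullable = True
                      for sym in prod:
                          syms = first[sym] if sym in grammar else {sym}
                          new = (syms - {""}) - first[nt]
                          if new:
                              first[nt] |= new; changed = True
                          if sym in grammar and "" in first[sym]:
                              continue             # sym can vanish, look further right
                          nullable = False
                          break
                      if nullable and "" not in first[nt]:
                          first[nt].add(""); changed = True
          return first

      # first_sets(GRAMMAR)["E"] == {"(", "id"}, first_sets(GRAMMAR)["E'"] == {"+", ""}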
  - whether a grammar is LL(1) can be checked by looking for duplicate entries
    in the parsing table; typical causes of failure:
    - not left-factored
    - left-recursive
    - ambiguous
    - needs k>1 lookahead
    - ...
    - most programming-language grammars are not LL(1)
  - bottom-up parsing: just as efficient and more general, which could explain why
    LR and LALR are more popular today
    - works on left-recursive grammars
    - uses reductions instead of productions
    - right-most derivation in reverse
    - shift-reduce parsing is the general form (see the trace after this block)
      - shift-reduce conflicts: happen
      - reduce-reduce conflicts: a grammar problem
    - reduce only at handles, otherwise shift
    - handles are places at the top of the stack from which n reductions still
      lead back to the start symbol
    - the algorithm is very easy compared to LL(1), but how to
      find the handles in the grammar is the tricky part
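    A worked shift-reduce trace (my own example) for the stratified grammar above
    (E -> E + T | T, T -> T * F | F, F -> id) on the input  id * id + id; note the
    step with T on the stack and * as next input, where we shift instead of reducing
    T to E because T is not a handle there:

      stack        | remaining input | action
                   | id * id + id    | shift
      id           | * id + id       | reduce F -> id
      F            | * id + id       | reduce T -> F
      T            | * id + id       | shift
      T *          | id + id         | shift
      T * id       | + id            | reduce F -> id
      T * F        | + id            | reduce T -> T * F
      T            | + id            | reduce E -> T
      E            | + id            | shift
      E +          | id              | shift
      E + id       |                 | reduce F -> id
      E + F        |                 | reduce T -> F
      E + T        |                 | reduce E -> E + T
      E            |                 | accept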
  - LR(k) grammars; in practice LALR(k) grammars are used,
    SLR(k) grammars are even more restrictive
  - the set of viable prefixes (prefixes that can still grow into handles) is a
    regular language
  - LR(0) items (productions with a dot marking how far we have read; see the
    example below)
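    For example, the single production E -> E + T gives four LR(0) items:

      E -> . E + T
      E -> E . + T
      E -> E + . T
      E -> E + T .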
  - the stack contains stacked prefixes of right-hand sides of the productions
    we are currently working on, so if one of them completes we are
    guaranteed not to get stuck and can append it to
    the stack of prefixes
  - an NFA runs over the stack of the shift/reduce parser; every state
    is accepting, epsilon-transitions are taken when we hope for another
    production to produce a valid reduction, and seeing a terminal
    or non-terminal on the stack leads to the state where that
    symbol has been consumed (the dot has moved past it).
  - canonical collection of LR(0) items: the DFA built from that NFA
  - LR(0) parsing
    - if the DFA state of the stack contains a complete item X -> alpha .  -> reduce by X -> alpha
    - if it contains an item with the dot before terminal t  -> shift, consume t
    - reduce/reduce conflict: N choices of reduction to pick from
    - shift/reduce conflict: cannot decide whether to shift or reduce
    - LR(0) -> SLR: avoid conflicts by seeing beyond the stack,
      i.e. do not reduce by X -> alpha if the next terminal is not in Follow(X)
    - an SLR grammar is defined by this heuristic algorithm succeeding, not by a
      precise characterization
    - SLR precedence declarations: E * E has higher precedence than E + E, so we can
      reduce the * instead of shifting the +.
      => should really be called shift/reduce conflict resolution rules
    - the tools can show the parsing automaton, so we see where we
      have conflicts in the grammar
    - in practice SLR(1)
    - having end-of-input always in Follow(X) means that if we are out of input,
      our only hope is to reduce.
  - SLR improvements
    - remember the automaton state for each stack prefix instead of re-running it
  - SLR(1) again cannot handle certain situations in real grammars:
    - LR(1) is used instead: the lookahead is carried per item (one item + one
      input symbol) instead of just the next symbol and the global follow sets;
      those are the items we see in yacc
- semantic analysis
  - scoping
    - static scoping won over dynamic scoping
  - types
    - classes == types in modern OOP languages
    - statically typed (C, C++, Java; well, mostly, unsafe casts are the exception)
    - dynamically typed (Python)
    - untyped (machine code)
    - declarations of types for every identifier
    - infer types for all expressions
    - type checking and type inference
  - logical rules of inference
    - find the best (most precise) valid type: "34" is Int is better than "34" is Object,
      though the latter is also valid
    - no external data may be used to prove an inference rule
    - type environment: the types of the free variables
    - least upper bound (lub) of two types, e.g. for the branches of a conditional
      (see the sketch below)
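    A small sketch (mine, with a hypothetical class hierarchy) of computing the
    least upper bound of two classes under single inheritance by walking both
    ancestor chains:

      PARENT = {"Object": None, "A": "Object", "B": "A", "C": "A"}

      def ancestors(cls):
          chain = []
          while cls is not None:                   # walk up to the root
              chain.append(cls)
              cls = PARENT[cls]
          return chain                             # e.g. ["B", "A", "Object"]

      def lub(x, y):
          xs = set(ancestors(x))
          for cls in ancestors(y):                 # first ancestor of y that is
              if cls in xs:                        # also an ancestor of x
                  return cls
          return "Object"

      # lub("B", "C") == "A": the static type of  if cond then new B else new C fi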
  - dynamic type of an object exists at runtime, static type at compile time
    x : A, later x = new B (static type A, dynamic type B)
  - SELF_TYPE: the return type can vary with the dynamic type of self
    - the type checking environment becomes O, M, C (C is the current class, needed for SELF_TYPE)
    - x@T.f(...): static dispatch
    - academic, to show the concepts; in practice not that important
  - type errors:
    - assigning Object to expressions of unknown type => cascading, hard-to-understand errors
    - No_type: an internal type that is a subtype of all classes; every operation is
      defined on it and results in No_type => the hierarchy is no longer a tree, harder to implement
- front-end, lexer, parser, semantic analysis
- back-end
  - correlation between compile-time decisions and runtime behaviour
  - code and data arrangement
  - code generation
    - must be correct, should be fast
    - method activations
      - complications: exceptions, coroutines, call/cc
  - data
    - stack of activations (the frame of the currently running procedure),
      activation record (AR) or frame
    - AR: no control link needed since we know where we are (the stack is an array); return
      values are usually passed in registers
    - in real compilers much of the frame contents lives in registers
    - globals are statically allocated; new objects go on the heap
    - classic layout: code (text), data, stack and heap
    - alignment is important: misaligned access means SIGBUS on SPARC, configurable behaviour
      on ARM, and is allowed on Intel but with a performance penalty
      - padding (see the layout sketch after this block)
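    One possible activation record layout (an illustration only; real layouts depend
    on the calling convention and on how many values live in registers):

      |  argument n          |  <- pushed by the caller
      |  ...                 |
      |  argument 1          |
      |  return address      |
      |  locals, temporaries |  <- callee's part of the frame
      |                      |  <- stack grows this way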
  - code generation
    - stack machines
    - register machines
    - 1-register stack machine with an accumulator; the accumulator is a 1-slot
      extension of the top of the stack (see the code generation sketch below)
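    A sketch (mine) of accumulator-style code generation for + expressions, emitting
    MIPS-like assembly for spim; it follows the usual stack discipline: evaluate the
    left operand into the accumulator, push it, evaluate the right operand, then add:

      # $a0 is the accumulator, $sp the stack pointer
      def cgen(e, out):
          if e[0] == "num":                        # constant: load into the accumulator
              out.append(f"    li    $a0, {e[1]}")
          elif e[0] == "+":
              cgen(e[1], out)                      # left operand ends up in $a0
              out.append("    sw    $a0, 0($sp)")  # push the accumulator
              out.append("    addiu $sp, $sp, -4")
              cgen(e[2], out)                      # right operand ends up in $a0
              out.append("    lw    $t1, 4($sp)")  # reload the saved left operand
              out.append("    add   $a0, $t1, $a0")
              out.append("    addiu $sp, $sp, 4")  # pop
          else:
              raise ValueError(e[0])

      code = []
      cgen(("+", ("num", 1), ("+", ("num", 2), ("num", 3))), code)
      print("\n".join(code))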