diff options
author | Andreas Baumann <mail@andreasbaumann.cc> | 2017-01-01 19:31:12 +0100 |
---|---|---|
committer | Andreas Baumann <mail@andreasbaumann.cc> | 2017-01-01 19:31:12 +0100 |
commit | 69fe7b182a1eedfb75c611f7dd35fa60200426f4 (patch) | |
tree | 329a0c6cc9b06c23d8782ece09f0f7dfa9b16b13 /README.Aitken | |
download | compilertests-69fe7b182a1eedfb75c611f7dd35fa60200426f4.tar.gz compilertests-69fe7b182a1eedfb75c611f7dd35fa60200426f4.tar.bz2 |
initial checkin
Diffstat (limited to 'README.Aitken')
-rw-r--r-- | README.Aitken | 205 |
1 files changed, 205 insertions, 0 deletions
diff --git a/README.Aitken b/README.Aitken new file mode 100644 index 0000000..35648e5 --- /dev/null +++ b/README.Aitken @@ -0,0 +1,205 @@ +Cool +spim +string -> lexical -> token = <class,string> +single characters token classes (brakets, semicolon) += assignment token, but == operator class +words of a program. lexems +<token class, lexeme> -> token +lexical systems should have a minimal lookahead (1 character is normal) +- ahead examples: == and = +- T<T> vs >>. V<W<T>> +regular languages +-> sigma, alphabet, sigma* (kleen) +- regular expression: e, 1 char, union, concatenation, iteration (kleen) +- regular language for the set of strings in tokens + - epsilon and on character-only language + - union, concat, iteration +- format languages +- formal lanugae +- meaning function L(e)=M + - map something in syntax to semantics + - L: expression -> sets(string) + - semantic the same, syntax is arabic vs roman numbers, very good example + - mulitple expression usually have the same meaning/semantic + - L: N:1 function, basis of optimization +- regex for parsing +- deducelt +- lexical specification + - monster regex + - maximal munch, also choose the longer token + - ambiguities: keyword vs. identifier + => priority ordering (firstly defined) + - no rules matches: make an error production + - finite automata, implementation for regular expressions + - implementation of lexical analyzers using regular expressions + - epsilon-move, state transition without input + DFA have no choice, no epsilon-moves, one transition per input per state + deterministic + NFA: non-deterministic + - DFA: 2-dim array with state and input as dimensions, resulting state + as matrix element, share duplicate state transitions in lists of + vectors lists per input, + - table-driven DFA + - no NFA to DFA transition at all + - table-driven NFA, 2-dim matrix + - CoCo/R: DFA or NFA, which switches +- chapter 5 + - regular languages can only count mod k + - context-free grammars (CFG) + - CFG gives us: program valid or not, nothing more + - there is a N:1 relation from grammar to language, quite likely + to fit the grammar to the specific tools at hand + - parse tree derivation with production rules + N derivations define a parse tree of a language + left-mode derivation, right-mode derivation + - ambiguous grammer + - rewrite grammar + - prioritize production rules + - example: expression -> term and factor + - classical optional else and nested if problem + else -> closest then + - ambiguous grammar, precedence and associativity (%left +) + -> more natural language, used in tools + - Oberon and hand-written parsers use the first approach + - errors: lexical, syntax, semantical + currently used: + - panic mode (bison: error, skip) + - error productions (e.g gcc warnings about problematic constructs) + - we don't want the parse tree, we need AST, abstract syntax tree + AST is simpler than the parse tree, algorithms get easier on an AST + - top-down + - bottom-up + - rd parser (recursive descendant): most common one, very easy to + write by hand. top-down + - very easy to implement + - left recursion -> use right recursion (rewrite grammar), must + have terminals in productions to make the problem decidabale + S -> Sa | b1 | b2 | .. | bn + => + S -> b1 S' | b2 S' | ... | bn S' + S' -> a S' + -> generic case is loops of non-terminals inducing left-recusion + => parser generators can do this automatically making the + grammar human-readable! + => nevertheless done by hand to keep control over semantics + - gcc used bison/flex, now a hand-written recursive decent + - predictive parsing, top-down, no backtracking, LL(k)-grammar + left-to-right-scan, left-most-derivation, k: k tokens look-ahead + practice k=1, LL(1) + - needs left-factoring + - table-constructable + - Coco/R is LL(k) + - LL(1) table construction + - things we cannot do with parsers alone: + - checking whether identifiers are declared before used + - checking validity of function parameters and if they fit to + their declaration + - first sets first(A) + t follow(A) + => construct LL(1) parsing table + - LL(1) can be checked by double entries in the parsing table + - non-left factored + - left-recursive + - ambiguous + - k>1 lookahead + - ... + - most programming languages are not LL(1) + - bottom-up, as efficient and more general, this could explain why + LR and LALR are more popular today + - work on left-recrusve grammars + - reduction instead of productions + - right-most derivation in reverse + - shift-reduce-parsing is the general form + - shift-reduce conflicts, happens + - reduce-reduce conflicts, grammar problem + - reduce only at handles, otherwise shift + - handle are places on the top of the stack which still + lead to the start symbol by n reductions + - algorithm is very easy compared to LL(1), but how to + find the handles in the grammar is tricky + - LR(k) grammars, used in practice LALR(k) grammars, + SLR(k) grammar is even more restrictive + - set of viable prefixes (handles) is a regular language + - LR(0) items (dots in the grammar) + - stack contains stacked prefixes of rhs of productions + we are currently working on, so if one matches we are + guarnateed we don't get stuck and can append it to + the stack of prefixes + - NFA on the stack of the shift/reduce parser, every state + is accepting, e-transitions are when we hope for another + production to produce a valid reduction, seeing terminals + or non-terminals on the stack leads to a state, where the + symbol has been consumed. + - canonical collections of LR(0) items, DFA of the NFA + - LR(0) parsing + - if X->S in DFA(stack) -> reduce, S is terminal or non-termonal + - if X->t in DFA(stack) -> shift, consume t + - reduce/reduce conflict: N choices for a reduction to pick + - shift/reduce conflict: cannot decide whether to shift or reduce + - LR(0) -> SLR, avoid conflict by seing beyong he stack, + don't recude if the terminal is not in a follow statte of + a production + - SLR grammar is defined by heuristical algorithm, not precise + - SLR precedence declarations, E*E higher than E+E so we can + reduce the * instead of shifting the +. + => should be named shift/reduce conflict resolution rules + - the tools show the parsing automaton, so we see were we + have conflicts in the grammar + - in practice SLR(1) + - having eos always in follow(X) means if we are out of output + our only hope is to reduce. + - SLR improvements + - remember state of each stack prefix + - SLR(1) again cannot handle certain situations in real grammars: + - LR(1) is used: lookahead of one is one item + one symbol in + input instead of just the one symbol and the follow sets + those are the things we see in yacc +- semantic analysis + - scoping + - static scoping won over dynamic scoping + - types + - classes == types in modern OOP languages + - statically typed (C, C++, Java, well mostly, unsafe casts for exceptions) + - dynamically typed (Python) + - untyped (machine code) + - declarations of types for every identifier + - infer types for all expressions + - type checking and type inference + - logical rules of inference + - find best valid rule, "34" is int is better than "34" is object, though latter is also valid + - no external data for proving an inference rule + - type environment , types for free variables + - least upper bound of two subtypes + - dynamic type of an object at runtime, compile time type (static time) + x:A, later x = new B (static type A, dynamic type B) + - SELF_TYPE, return value can change at runtime + - type checking environemnt is O,M,C (C for SELF_TYPE) + - x@T: static dispatch + - academic to show the concepts, in practice not important + - type errors: + - assign Object to unknown types => cascading not-understandable errors + - No_type: internal type, subtype of all classes, every operation defined and + results in No_type => not a tree, harder to implemnent +- front-end, lexer, parser, semantic analysis +- back-end + - compile-time and runtime correlation + - code, data arangment + - code genreation + - correct and fast + - method activations + - exceptions, corooutines, call/cc + - data + - stack of activations (currently running procedure frame) + activation record (AR) or frame + - AR: no control link as we know where we are (array as stack), also return + values are usually passed in registers + - frame contents is in registers for real compilers + - global, statically allocated, also new objects (heap) + - classic code (text), data, stack and heap + - alignment is important, SIBGUS SPARC, ARM we choose, Intel allowed but performance penalty + - padding + - code generation + - stack machines + - register machine + - 1-register stack machine, accumulator, acc is the 1-position extensions of the top of the stack +
\ No newline at end of file |