summaryrefslogtreecommitdiff
path: root/README.Aitken
diff options
context:
space:
mode:
authorAndreas Baumann <mail@andreasbaumann.cc>2017-01-01 19:31:12 +0100
committerAndreas Baumann <mail@andreasbaumann.cc>2017-01-01 19:31:12 +0100
commit69fe7b182a1eedfb75c611f7dd35fa60200426f4 (patch)
tree329a0c6cc9b06c23d8782ece09f0f7dfa9b16b13 /README.Aitken
downloadcompilertests-69fe7b182a1eedfb75c611f7dd35fa60200426f4.tar.gz
compilertests-69fe7b182a1eedfb75c611f7dd35fa60200426f4.tar.bz2
initial checkin
Diffstat (limited to 'README.Aitken')
-rw-r--r--README.Aitken205
1 files changed, 205 insertions, 0 deletions
diff --git a/README.Aitken b/README.Aitken
new file mode 100644
index 0000000..35648e5
--- /dev/null
+++ b/README.Aitken
@@ -0,0 +1,205 @@
+Cool
+spim
+string -> lexical -> token = <class,string>
+single characters token classes (brakets, semicolon)
+= assignment token, but == operator class
+words of a program. lexems
+<token class, lexeme> -> token
+lexical systems should have a minimal lookahead (1 character is normal)
+- ahead examples: == and =
+- T<T> vs >>. V<W<T>>
+regular languages
+-> sigma, alphabet, sigma* (kleen)
+- regular expression: e, 1 char, union, concatenation, iteration (kleen)
+- regular language for the set of strings in tokens
+ - epsilon and on character-only language
+ - union, concat, iteration
+- format languages
+- formal lanugae
+- meaning function L(e)=M
+ - map something in syntax to semantics
+ - L: expression -> sets(string)
+ - semantic the same, syntax is arabic vs roman numbers, very good example
+ - mulitple expression usually have the same meaning/semantic
+ - L: N:1 function, basis of optimization
+- regex for parsing
+- deducelt
+- lexical specification
+ - monster regex
+ - maximal munch, also choose the longer token
+ - ambiguities: keyword vs. identifier
+ => priority ordering (firstly defined)
+ - no rules matches: make an error production
+ - finite automata, implementation for regular expressions
+ - implementation of lexical analyzers using regular expressions
+ - epsilon-move, state transition without input
+ DFA have no choice, no epsilon-moves, one transition per input per state
+ deterministic
+ NFA: non-deterministic
+ - DFA: 2-dim array with state and input as dimensions, resulting state
+ as matrix element, share duplicate state transitions in lists of
+ vectors lists per input,
+ - table-driven DFA
+ - no NFA to DFA transition at all
+ - table-driven NFA, 2-dim matrix
+ - CoCo/R: DFA or NFA, which switches
+- chapter 5
+ - regular languages can only count mod k
+ - context-free grammars (CFG)
+ - CFG gives us: program valid or not, nothing more
+ - there is a N:1 relation from grammar to language, quite likely
+ to fit the grammar to the specific tools at hand
+ - parse tree derivation with production rules
+ N derivations define a parse tree of a language
+ left-mode derivation, right-mode derivation
+ - ambiguous grammer
+ - rewrite grammar
+ - prioritize production rules
+ - example: expression -> term and factor
+ - classical optional else and nested if problem
+ else -> closest then
+ - ambiguous grammar, precedence and associativity (%left +)
+ -> more natural language, used in tools
+ - Oberon and hand-written parsers use the first approach
+ - errors: lexical, syntax, semantical
+ currently used:
+ - panic mode (bison: error, skip)
+ - error productions (e.g gcc warnings about problematic constructs)
+ - we don't want the parse tree, we need AST, abstract syntax tree
+ AST is simpler than the parse tree, algorithms get easier on an AST
+ - top-down
+ - bottom-up
+ - rd parser (recursive descendant): most common one, very easy to
+ write by hand. top-down
+ - very easy to implement
+ - left recursion -> use right recursion (rewrite grammar), must
+ have terminals in productions to make the problem decidabale
+ S -> Sa | b1 | b2 | .. | bn
+ =>
+ S -> b1 S' | b2 S' | ... | bn S'
+ S' -> a S'
+ -> generic case is loops of non-terminals inducing left-recusion
+ => parser generators can do this automatically making the
+ grammar human-readable!
+ => nevertheless done by hand to keep control over semantics
+ - gcc used bison/flex, now a hand-written recursive decent
+ - predictive parsing, top-down, no backtracking, LL(k)-grammar
+ left-to-right-scan, left-most-derivation, k: k tokens look-ahead
+ practice k=1, LL(1)
+ - needs left-factoring
+ - table-constructable
+ - Coco/R is LL(k)
+ - LL(1) table construction
+ - things we cannot do with parsers alone:
+ - checking whether identifiers are declared before used
+ - checking validity of function parameters and if they fit to
+ their declaration
+ - first sets first(A)
+ t follow(A)
+ => construct LL(1) parsing table
+ - LL(1) can be checked by double entries in the parsing table
+ - non-left factored
+ - left-recursive
+ - ambiguous
+ - k>1 lookahead
+ - ...
+ - most programming languages are not LL(1)
+ - bottom-up, as efficient and more general, this could explain why
+ LR and LALR are more popular today
+ - work on left-recrusve grammars
+ - reduction instead of productions
+ - right-most derivation in reverse
+ - shift-reduce-parsing is the general form
+ - shift-reduce conflicts, happens
+ - reduce-reduce conflicts, grammar problem
+ - reduce only at handles, otherwise shift
+ - handle are places on the top of the stack which still
+ lead to the start symbol by n reductions
+ - algorithm is very easy compared to LL(1), but how to
+ find the handles in the grammar is tricky
+ - LR(k) grammars, used in practice LALR(k) grammars,
+ SLR(k) grammar is even more restrictive
+ - set of viable prefixes (handles) is a regular language
+ - LR(0) items (dots in the grammar)
+ - stack contains stacked prefixes of rhs of productions
+ we are currently working on, so if one matches we are
+ guarnateed we don't get stuck and can append it to
+ the stack of prefixes
+ - NFA on the stack of the shift/reduce parser, every state
+ is accepting, e-transitions are when we hope for another
+ production to produce a valid reduction, seeing terminals
+ or non-terminals on the stack leads to a state, where the
+ symbol has been consumed.
+ - canonical collections of LR(0) items, DFA of the NFA
+ - LR(0) parsing
+ - if X->S in DFA(stack) -> reduce, S is terminal or non-termonal
+ - if X->t in DFA(stack) -> shift, consume t
+ - reduce/reduce conflict: N choices for a reduction to pick
+ - shift/reduce conflict: cannot decide whether to shift or reduce
+ - LR(0) -> SLR, avoid conflict by seing beyong he stack,
+ don't recude if the terminal is not in a follow statte of
+ a production
+ - SLR grammar is defined by heuristical algorithm, not precise
+ - SLR precedence declarations, E*E higher than E+E so we can
+ reduce the * instead of shifting the +.
+ => should be named shift/reduce conflict resolution rules
+ - the tools show the parsing automaton, so we see were we
+ have conflicts in the grammar
+ - in practice SLR(1)
+ - having eos always in follow(X) means if we are out of output
+ our only hope is to reduce.
+ - SLR improvements
+ - remember state of each stack prefix
+ - SLR(1) again cannot handle certain situations in real grammars:
+ - LR(1) is used: lookahead of one is one item + one symbol in
+ input instead of just the one symbol and the follow sets
+ those are the things we see in yacc
+- semantic analysis
+ - scoping
+ - static scoping won over dynamic scoping
+ - types
+ - classes == types in modern OOP languages
+ - statically typed (C, C++, Java, well mostly, unsafe casts for exceptions)
+ - dynamically typed (Python)
+ - untyped (machine code)
+ - declarations of types for every identifier
+ - infer types for all expressions
+ - type checking and type inference
+ - logical rules of inference
+ - find best valid rule, "34" is int is better than "34" is object, though latter is also valid
+ - no external data for proving an inference rule
+ - type environment , types for free variables
+ - least upper bound of two subtypes
+ - dynamic type of an object at runtime, compile time type (static time)
+ x:A, later x = new B (static type A, dynamic type B)
+ - SELF_TYPE, return value can change at runtime
+ - type checking environemnt is O,M,C (C for SELF_TYPE)
+ - x@T: static dispatch
+ - academic to show the concepts, in practice not important
+ - type errors:
+ - assign Object to unknown types => cascading not-understandable errors
+ - No_type: internal type, subtype of all classes, every operation defined and
+ results in No_type => not a tree, harder to implemnent
+- front-end, lexer, parser, semantic analysis
+- back-end
+ - compile-time and runtime correlation
+ - code, data arangment
+ - code genreation
+ - correct and fast
+ - method activations
+ - exceptions, corooutines, call/cc
+ - data
+ - stack of activations (currently running procedure frame)
+ activation record (AR) or frame
+ - AR: no control link as we know where we are (array as stack), also return
+ values are usually passed in registers
+ - frame contents is in registers for real compilers
+ - global, statically allocated, also new objects (heap)
+ - classic code (text), data, stack and heap
+ - alignment is important, SIBGUS SPARC, ARM we choose, Intel allowed but performance penalty
+ - padding
+ - code generation
+ - stack machines
+ - register machine
+ - 1-register stack machine, accumulator, acc is the 1-position extensions of the top of the stack
+ \ No newline at end of file