Cool, spim

- lexical analysis: string -> lexemes -> tokens
  - token classes group the words of a program: single-character tokens (brackets, semicolon), the assignment token = vs. the == operator, keywords, identifiers
- lexical systems should need minimal lookahead (1 character is normal); examples that need more: = vs. ==, > vs. >>
- regular languages: alphabet sigma, sigma* (Kleene closure)
- regular expressions: epsilon, single-character, union, concatenation, iteration (Kleene star)
- regular languages describe the sets of strings that form the token classes
- formal languages: the meaning function L(e) = M maps syntax to semantics, L: expression -> set of strings
  - semantics can stay the same while syntax differs: Arabic vs. Roman numerals is a very good example
  - multiple expressions usually have the same meaning, so L is an N:1 function -- the basis of optimization
- lexical specification: one "monster" regex built from the token-class regexes
  - maximal munch: always choose the longer token
  - ambiguities (keyword vs. identifier): resolve by priority ordering (first defined wins)
  - no rule matches: add an error production
- finite automata: the implementation of regular expressions, and hence of lexical analyzers
  - epsilon-move: state transition without consuming input
  - DFA: deterministic, no choice, no epsilon-moves, one transition per input per state
  - NFA: non-deterministic
  - DFA implementation: 2-dimensional array with state and input as dimensions and the resulting state as the matrix element; duplicate state-transition rows can be shared via per-input lists of vectors
  - table-driven DFA; alternatively, skip the NFA-to-DFA conversion entirely and run a table-driven NFA (2-dimensional matrix over sets of states)
  - Coco/R: DFA or NFA, it switches between them
- chapter 5
- regular languages can only count mod k
- context-free grammars (CFG)
  - a CFG only tells us whether a program is valid or not, nothing more
  - there is an N:1 relation from grammars to a language, so it is quite common to fit the grammar to the specific tools at hand
- parse tree: derivation with production rules; N derivations define a parse tree; left-most derivation, right-most derivation
- ambiguous grammars:
  - rewrite the grammar, prioritize production rules (example: expression -> term -> factor)
  - the classic optional-else / nested-if problem: else binds to the closest then
  - or keep the ambiguous grammar and add precedence and associativity declarations (%left '+') -- closer to natural language, used in tools
  - Oberon and hand-written parsers use the first approach
- errors: lexical, syntactic, semantic; currently used:
  - panic mode (bison: error token, skip)
  - error productions (e.g. gcc warnings about problematic constructs)
- we don't want the parse tree, we need the AST (abstract syntax tree); the AST is simpler than the parse tree, and algorithms get easier on an AST
- parsing: top-down vs. bottom-up
- recursive-descent parser: the most common one, top-down, very easy to write and implement by hand
  - left recursion must be rewritten as right recursion; productions must contain terminals to keep the problem decidable:
    S -> Sa | b1 | b2 | ... | bn
    becomes
    S -> b1 S' | b2 S' | ... | bn S'
    S' -> a S' | epsilon
  - the general case is loops of non-terminals inducing left recursion => parser generators can do this rewrite automatically, keeping the grammar human-readable! nevertheless it is often done by hand to keep control over the semantics
  - gcc used bison/flex, now has a hand-written recursive-descent parser
- predictive parsing: top-down, no backtracking, LL(k) grammars
  - left-to-right scan, left-most derivation, k tokens of lookahead; in practice k=1, LL(1)
  - needs left-factoring
  - table-constructable; Coco/R is LL(k)
- LL(1) table construction
- things we cannot do with parsers alone:
  - checking whether identifiers are declared before use
  - checking the validity of function arguments against their declaration
- first sets first(A) and follow sets follow(A) => construct the LL(1) parsing table
- whether a grammar is LL(1) can be checked by looking for double entries in the parsing table; causes of double entries:
  - not left-factored
  - left-recursive
  - ambiguous
  - needs k>1 lookahead
  - ...
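The left-recursion rewrite and predictive parsing above can be seen in a hand-written recursive-descent parser. A minimal Python sketch, assuming the classic expression grammar (the grammar, token set, and class names are illustrative, not from any particular tool); the left-recursive `E -> E + T` is implemented as a loop, which is exactly the `S' -> a S'` rewrite:

```python
import re

# Grammar (left recursion already rewritten as iteration):
#   E -> T (('+'|'-') T)*    T -> F (('*'|'/') F)*    F -> num | '(' E ')'

def tokenize(s):
    # toy lexer: integers and single-character operator/bracket tokens
    return re.findall(r"\d+|[+\-*/()]", s)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):                      # LL(1): one token of lookahead
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, t=None):
        tok = self.peek()
        if tok is None or (t is not None and tok != t):
            raise SyntaxError(f"expected {t!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):                      # E -> T (('+'|'-') T)*
        v = self.term()
        while self.peek() in ('+', '-'):
            op = self.eat()
            v = v + self.term() if op == '+' else v - self.term()
        return v

    def term(self):                      # T -> F (('*'|'/') F)*
        v = self.factor()
        while self.peek() in ('*', '/'):
            op = self.eat()
            v = v * self.factor() if op == '*' else v // self.factor()
        return v

    def factor(self):                    # F -> num | '(' E ')'
        if self.peek() == '(':
            self.eat('(')
            v = self.expr()
            self.eat(')')
            return v
        return int(self.eat())

print(Parser(tokenize("2+3*(4-1)")).expr())  # 11
```

Note how the loops keep `+` and `-` left-associative even though the grammar was rewritten away from left recursion: the accumulation order in the loop, not the grammar shape, decides associativity.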
- most programming languages are not LL(1)
- bottom-up parsing: as efficient and more general, which may explain why LR and LALR are more popular today
  - works on left-recursive grammars
  - reductions instead of productions: a right-most derivation in reverse
- shift-reduce parsing is the general form
  - shift/reduce conflicts happen; reduce/reduce conflicts indicate a grammar problem
  - reduce only at handles, otherwise shift
  - handles are places on top of the stack from which n reductions still lead back to the start symbol
  - the algorithm is very easy compared to LL(1), but finding the handles in the grammar is the tricky part
- LR(k) grammars; in practice LALR(k) grammars are used; SLR(k) grammars are even more restrictive
- the set of viable prefixes (prefixes of handles) is a regular language
- LR(0) items (productions with a dot)
  - the stack contains prefixes of right-hand sides of the productions we are currently working on, so if one matches we are guaranteed not to get stuck and can append to the stack of prefixes
  - an NFA runs over the stack of the shift-reduce parser: every state is accepting; epsilon-transitions are taken when we hope another production will produce a valid reduction; seeing a terminal or non-terminal on the stack leads to a state where that symbol has been consumed
  - canonical collection of LR(0) items: the DFA of that NFA
- LR(0) parsing
  - if the DFA state for the stack contains a complete item X -> s. : reduce by it (s may contain terminals and non-terminals)
  - if it contains an item with the dot before a terminal t : shift, consume t
  - reduce/reduce conflict: N possible reductions to pick from
  - shift/reduce conflict: cannot decide whether to shift or to reduce
- LR(0) -> SLR: avoid conflicts by seeing beyond the stack; don't reduce if the next terminal is not in the follow set of the production's left-hand side
  - SLR grammars are defined by this heuristic algorithm, not by a precise characterization
  - SLR precedence declarations: E*E higher than E+E, so we reduce the * instead of shifting the +.
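The shift-reduce loop above can be sketched for a toy grammar. This is a minimal illustration, not a real LR parser: the handle test is hard-coded for the left-recursive grammar `E -> E '+' 'int' | 'int'`, whereas a real parser would find handles with the DFA over viable prefixes:

```python
# Shift-reduce recognizer sketch for E -> E '+' 'int' | 'int'.
# Reduce at a handle on top of the stack when possible, otherwise shift.

def shift_reduce(tokens):
    stack = []
    toks = list(tokens)
    while toks or stack != ['E']:
        if stack[-3:] == ['E', '+', 'int']:
            stack[-3:] = ['E']            # reduce E -> E + int
        elif stack[-1:] == ['int']:
            stack[-1:] = ['E']            # reduce E -> int
        elif toks:
            stack.append(toks.pop(0))     # shift next input token
        else:
            return False                  # stuck: no handle, no input left
    return True                           # stack is exactly the start symbol

print(shift_reduce(['int', '+', 'int', '+', 'int']))  # True
print(shift_reduce(['int', '+', '+']))                # False
```

Note that the left-recursive rule is no problem here: the stack grows to `E + int` and is reduced back to `E`, which is the right-most derivation run in reverse.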
  => these should really be named shift/reduce conflict resolution rules
  - the tools show the parsing automaton, so we can see where the grammar has conflicts
  - in practice: SLR(1)
  - having the end-of-input symbol always in follow(X) means: if we are out of input, our only hope is to reduce
  - SLR improvement: remember the state for each stack prefix
  - SLR(1) still cannot handle certain situations in real grammars, so LR(1) is used: an LR(1) item is an LR(0) item plus one lookahead symbol in the input, instead of just the symbol and the follow sets; these are the items we see in yacc
- semantic analysis
- scoping: static scoping won over dynamic scoping
- types
  - classes == types in modern OOP languages
  - statically typed (C, C++, Java -- well, mostly: unsafe casts are the exception)
  - dynamically typed (Python)
  - untyped (machine code)
- declarations give types for every identifier; the compiler infers types for all expressions
- type checking and type inference use logical rules of inference
  - find the best valid rule: "34" is int is better than "34" is Object, though the latter is also valid
  - no external data may be used in proving an inference rule
- type environment: types for the free variables
- least upper bound of two subtypes
- dynamic type of an object at runtime vs. static type at compile time: after x : A; ... x = new B, the static type is A, the dynamic type is B
- SELF_TYPE: the return type can change at runtime
- the type-checking environment is O, M, C (C for SELF_TYPE)
- x@T: static dispatch
- academic, to show the concepts; in practice not that important
- type errors:
  - assigning Object to unknown types => cascading, hard-to-understand errors
  - No_type: internal type, subtype of all classes; every operation is defined on it and results in No_type => the class hierarchy is no longer a tree, harder to implement
- front-end: lexer, parser, semantic analysis; back-end: code generation
- compile-time and runtime correlation: code and data arrangement
- code generation must be correct, then fast
- method activations; exceptions, coroutines, call/cc
- data
- stack of activations (the frame of the currently running procedure): activation record (AR) or frame
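The least-upper-bound operation above can be sketched over a single-inheritance class tree, as used e.g. when typing the two branches of a conditional. The class names and parent table below are invented for illustration:

```python
# lub(t1, t2): closest common ancestor in a single-inheritance tree.

PARENT = {                # child -> parent; 'Object' is the root
    'Object': None,
    'A': 'Object', 'B': 'A', 'C': 'A', 'D': 'Object',
}

def ancestors(t):
    chain = []
    while t is not None:  # walk t, parent(t), ..., Object
        chain.append(t)
        t = PARENT[t]
    return chain

def lub(t1, t2):
    up = set(ancestors(t1))
    for t in ancestors(t2):   # first of t2's ancestors on t1's chain
        if t in up:
            return t

print(lub('B', 'C'))   # A       (closest common ancestor)
print(lub('B', 'D'))   # Object  (only common ancestor is the root)
```

Single inheritance is what makes this well-defined: every pair of types has exactly one closest common ancestor, so an if-expression whose branches have types B and C can be given the static type lub(B, C).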
- AR: no control link needed, as we know where we are (array used as stack); return values are usually passed in registers
- in real compilers, frame contents live in registers
- globals are statically allocated; new objects go on the heap
- classic layout: code (text), data, stack and heap
- alignment is important: misaligned access raises SIGBUS on SPARC, is configurable on ARM, and is allowed on Intel but with a performance penalty
- padding
- code generation
  - stack machines vs. register machines
  - 1-register stack machine: the accumulator; acc is a 1-position extension of the top of the stack
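The 1-register stack-machine scheme can be sketched as a tiny code generator following the classic recipe for `e1 + e2`: cgen(e1); push acc; cgen(e2); combine acc with the top of the stack; pop. The instruction mnemonics below are invented for the sketch, not from any real machine:

```python
# Accumulator-style code generation for expression trees.
# An expression is either an int literal or a tuple (op, e1, e2).

def cgen(e, out):
    if isinstance(e, int):
        out.append(f"li acc {e}")        # load immediate into acc
    else:
        op, e1, e2 = e
        cgen(e1, out)                    # left operand ends up in acc
        out.append("push acc")           # save it on the stack
        cgen(e2, out)                    # right operand ends up in acc
        out.append(f"{op} acc top acc")  # acc <- top OP acc
        out.append("pop")                # discard the saved operand

code = []
cgen(('add', 1, ('add', 2, 3)), code)    # 1 + (2 + 3)
print("\n".join(code))
```

The invariant that makes this work: cgen leaves its result in acc and the stack exactly as it found it, so nested subexpressions compose without the generator tracking register state -- acc really is just a 1-position extension of the stack top.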