1 files changed, 564 insertions, 0 deletions
diff --git a/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt b/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt
new file mode 100644
index 0000000..b1b1b7d
--- /dev/null
+++ b/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt
@@ -0,0 +1,564 @@
+   #[1]RSS 2.0 [2]Atom 1.0
+
+[3]Pratt Parsers: Expression Parsing Made Easy
+
+   [4]&#8617; [5]&#8618;
+
+[6]March 19, 2011 [7]code [8]java [9]js [10]language [11]magpie [12]parsing
+
+   Every now and then, I stumble onto some algorithm or idea that's so
+   clever and such a perfect solution to a problem that I feel like I got
+   smarter or gained [13]a new superpower just by learning it. [14]Heaps
+   (just about the only thing I got out of my truncated CS education) were
+   one thing like this. I recently stumbled onto another: [15]Pratt
+   parsers.
+
+   When you're writing a parser, [16]recursive descent is as easy as
+   spreading peanut butter. It excels when you can figure out what to do
+   next based on the next chunk of code you're parsing. That's usually
+   true at the top level of a language where things like classes are and
+   also for statements since most start with something that uniquely
+   identifies them (if, for, while, etc.).
+
+   But it gets tricky when you get to expressions. When it comes to infix
+   operators like +, postfix ones like ++, and even mixfix expressions
+   like ?:, it can be hard to tell what kind of expression you're parsing
+   until you're halfway through it. You can do this with recursive
+   descent, but it's a chore. You have to write separate functions for
+   each level of precedence (JavaScript has 17 of them, for example),
+   manually handle associativity, and smear your grammar across a bunch of
+   parsing code until it's hard to see.
+
+PB & J, The Secret Weapon
+
+   Pratt parsing fixes just that. If recursive descent is peanut butter,
+   Pratt parsing is jelly. When you mix the two together, you get a
+   simple, terse, readable parser that can handle any grammar you throw at
+   it.
+
+   Pratt's technique for handling operator precedence and infix
+   expressions is so simple and effective it's a mystery why almost no one
+   knows about it. After the seventies, top down operator precedence
+   parsers seem to have fallen off the Earth. Douglas Crockford's
+   [17]JSLint uses one to [18]parse JavaScript, but his treatment is one
+   of the [19]very few remotely modern articles about it.
+
+   Part of the problem, I think, is that Pratt's terminology is opaque,
+   and Crockford's article is itself rather murky. Pratt uses terms like
+   "null denominator" and Crockford mixes in extra stuff like tracking
+   lexical scope that obscures the core idea.
+
+   This is where I come in. I won't do anything revolutionary. I'll just
+   try to get the core concepts behind top down operator precedence
+   parsers and present them as clearly as I can. I'll switch out some
+   terms to (I hope) clarify things. Hopefully I won't offend anyone's
+   purist sensibilities. I'll be coding in Java, the vulgar Latin of
+   programming languages. I figure if you can write it in Java, you can
+   write it in anything.
+
+What We'll Be Making
+
+   I'm a learn-by-doing person, which means I'm also a teach-by-doing one.
+   So to show how Pratt parsers work, we'll build a parser for a [20]tiny
+   little toy language called Bantam. It just has expressions since that's
+   where Pratt parsing is really helpful, but that should be enough to
+   convince of its usefulness.
+
+   Even though it's simple, it has a full gamut of operators: prefix (+,
+   -, ~, !), postfix (!), infix (+, -, *, /, ^), and even a mixfix
+   conditional operator (?:). It has multiple precedence levels and both
+   right and left associative operators. It also has assignment, function
+   calls and parentheses for grouping. If we can parse this, we can parse
+   anything.
+
+What We'll Start With
+
+   All we care about is parsing, so we'll ignore the tokenizing phase. I
+   slapped together [21]a crude lexer that works and we'll just pretend
+   that tokens are raining down from heaven or something.
+
+   A token is just a chunk of meaningful code with a type and a string
+   associated with it. Given a + b(c), the tokens would be:
+NAME "a"
+PLUS "+"
+NAME "b"
+LEFT_PAREN "("
+NAME "c"
+RIGHT_PAREN ")"
+
+   Likewise, we won't be interpreting or compiling this code. We just want
+   to parse it to a nice data structure. For our purposes, that means our
+   parser should chew up a bunch of [22]Token objects and spit out an
+   instance of some class that implements [23]Expression. To give you an
+   idea, here's a simplified version of the class for a [24]conditional
+   expression:
+class ConditionalExpression implements Expression {
+  public ConditionalExpression(
+      Expression condition,
+      Expression thenArm,
+      Expression elseArm) {
+    this.condition = condition;
+    this.thenArm   = thenArm;
+    this.elseArm   = elseArm;
+  }
+
+  public final Expression condition;
+  public final Expression thenArm;
+  public final Expression elseArm;
+}
+
+   (You gotta love Java's "please sign it in quadruplicate" level of
+   bureaucracy here. Like I said, if you can do this in Java, you can do
+   it in any language.)
+
+   We'll be building this starting from a simple [25]Parser class. This
+   owns the token stream, handles lookahead and provides the basic methods
+   you'll need to write a top-down recursive descent parser with a single
+   token of lookahead (i.e. it's LL(1)). This is enough to get us going.
+   If we need more later, it's easy to extend it.
+
+   OK, let's build ourselves a parser!
+
+First Things First
+
+   Even though a "full" Pratt parser is pretty tiny, I found it to be a
+   bit hard to decipher. Sort of like [26]quicksort, the implementation is
+   a deceptively-simple handful of deeply intertwined code. To untangle
+   it, we'll build it up one tiny step at a time.
+
+   The simplest expressions to parse are prefix operators and single-token
+   ones. For those, the current token tells us all that we need to do.
+   Bantam has one single-token expression, named variables, and four
+   prefix operators: +, -, ~, and !. The simplest possible code to parse
+   that would be:
+Expression parseExpression() {
+  if (match(TokenType.NAME))       // return NameExpression...
+  else if (match(TokenType.PLUS))  // return prefix + operator...
+  else if (match(TokenType.MINUS)) // return prefix - operator...
+  else if (match(TokenType.TILDE)) // return prefix ~ operator...
+  else if (match(TokenType.BANG))  // return prefix ! operator...
+  else throw new ParseException();
+}
+
+   But that's a bit monolithic. As you can see, we're switching off of a
+   TokenType to branch to different parsing behavior. Let's encode that
+   directly by making a Map from TokenTypes to chunks of parsing code.
+   We'll call these chunks "parselets", and they will implement this:
+interface PrefixParselet {
+  Expression parse(Parser parser, Token token);
+}
+
+   An implementation of this to parse variable names is just:
+class NameParselet implements PrefixParselet {
+  public Expression parse(Parser parser, Token token) {
+    return new NameExpression(token.getText());
+  }
+}
+
+   We can use a single class for all of the prefix operators since they
+   only differ in the actual operator token itself:
+class PrefixOperatorParselet implements PrefixParselet {
+  public Expression parse(Parser parser, Token token) {
+    Expression operand = parser.parseExpression();
+    return new PrefixExpression(token.getType(), operand);
+  }
+}
+
+   You'll note that it calls back into parseExpression() to parse the
+   operand that appears after the operator (i.e. to parse the a in -a).
+   This recursion takes care of nested operators like -+~!a.
+
+   Back in Parser, the chained if statements are replaced with a cleaner
+   map:
+class Parser {
+  public void register(TokenType token, PrefixParselet parselet) {
+    mPrefixParselets.put(token, parselet);
+  }
+
+  public Expression parseExpression() {
+    Token token = consume();
+    PrefixParselet prefix = mPrefixParselets.get(token.getType());
+
+    if (prefix == null) throw new ParseException(
+        "Could not parse \"" + token.getText() + "\".");
+
+    return prefix.parse(this, token);
+  }
+
+  // Other stuff...
+
+  private final Map<TokenType, PrefixParselet> mPrefixParselets =
+      new HashMap<TokenType, PrefixParselet>();
+}
+
+   To define the grammar we have so far (variables and the four prefix
+   operators), we'll make this helper method:
+void prefix(TokenType token) {
+  register(token, new PrefixOperatorParselet());
+}
+
+   And now we can define the grammar like:
+register(TokenType.NAME, new NameParselet());
+prefix(TokenType.PLUS);
+prefix(TokenType.MINUS);
+prefix(TokenType.TILDE);
+prefix(TokenType.BANG);
+
+   This is already an improvement over a recursive descent parser because
+   our grammar is now more declarative instead of being spread out over a
+   few imperative functions, and we can see the actual grammar all in one
+   place. Even better, we can extend the grammar just by registering new
+   parselets. We don't have to change the Parser class itself.
+
+   If we only had prefix expressions, we'd be done now. Alas, we don't.
+
+Stuck In the Middle
+
+   What we have so far only works if the first token tells us what kind of
+   expression we're parsing, but that isn't always the case. With an
+   expression like a + b, we don't know we have an add expression until
+   after we parse the a and get to +. We'll have to extend the parser to
+   support that.
+
+   Fortunately, we're in a good place to do so. Our current
+   parseExpression() method will parse a complete prefix expression
+   including any nested prefix expressions and then stop. So, if we throw
+   this at it:
+-a + b
+
+   It will parse -a and leave us sitting on +. That's exactly the token we
+   need to tell what infix expression we're parsing. The only difference
+   between an infix expression and a prefix one here is that there's
+   another expression before the infix operator that it needs to have as
+   an argument. Let's define a parselet that supports that:
+interface InfixParselet {
+  Expression parse(Parser parser, Expression left, Token token);
+}
+
+   The only difference is that left argument, which is just the expression
+   we parsed before we got to the infix token. We'll wire this up to our
+   parser by having another table of infix parselets.
+
+   Having separate tables for prefix and infix expressions is important
+   because we'll often have both a prefix and infix parselet for a single
+   TokenType. For example, the prefix parselet for ( handles grouping in
+   an expression like a * (b + c). Meanwhile, the infix parselet handles
+   function calls like a(b).
+
+   Now, after we parse the prefix expression, we hand it off to any infix
+   one that subsumes it:
+class Parser {
+  public void register(TokenType token, InfixParselet parselet) {
+    mInfixParselets.put(token, parselet);
+  }
+
+  public Expression parseExpression() {
+    Token token = consume();
+    PrefixParselet prefix = mPrefixParselets.get(token.getType());
+
+    if (prefix == null) throw new ParseException(
+        "Could not parse \"" + token.getText() + "\".");
+
+    Expression left = prefix.parse(this, token);
+
+    token = lookAhead(0);
+    InfixParselet infix = mInfixParselets.get(token.getType());
+
+    // No infix expression at this point, so we're done.
+    if (infix == null) return left;
+
+    consume();
+    return infix.parse(this, left, token);
+  }
+
+  // Other stuff...
+
+  private final Map<TokenType, InfixParselet> mInfixParselets =
+      new HashMap<TokenType, InfixParselet>();
+}
+
+   Pretty straightforward. We can implement an infix parselet for binary
+   arithmetic operators like + using something like:
+class BinaryOperatorParselet implements InfixParselet {
+  public Expression parse(Parser parser,
+      Expression left, Token token) {
+    Expression right = parser.parseExpression();
+    return new OperatorExpression(left, token.getType(), right);
+  }
+}
+
+   This also works for postfix operators. I'm calling them "infix"
+   parselets, but they're really "anything but prefix". If there's some
+   expression that comes before the token, it will be handled by an infix
+   parselet, and that includes postfix expressions and mixfix ones like
+   ?:.
+
+   Postfix is as easy as a single-token prefix parselet: it just takes the
+   left expression and wraps it in another expression:
+class PostfixOperatorParselet implements InfixParselet {
+  public Expression parse(Parser parser, Expression left,
+      Token token) {
+    return new PostfixExpression(left, token.getType());
+  }
+}
+
+   Mixfix is easy too. It's pretty much a familiar recursive descent
+   parser:
+class ConditionalParselet implements InfixParselet {
+  public Expression parse(Parser parser, Expression left,
+      Token token) {
+    Expression thenArm = parser.parseExpression();
+    parser.consume(TokenType.COLON);
+    Expression elseArm = parser.parseExpression();
+
+    return new ConditionalExpression(left, thenArm, elseArm);
+  }
+}
+
+   Now we can parse prefix expressions, postfix, infix, and even mixfix.
+   With a pretty small amount of code, we can parse expressions like a +
+   (b ? c! : -d). We're done, right? Well... almost.
+
+Excuse You, Aunt Sally
+
+   Our parser can parse all of this stuff, but it doesn't parse it with
+   the right precedence or associativity. If you throw a - b - c at it, it
+   will parse it like a - (b - c), which isn't right. (Well, actually it
+   is right--associative that is. We need it to be left.)
+
+   And this last step is where Pratt parsers go from pretty nice to
+   totally radical. We'll make two simple changes. We'll extend
+   parseExpression() to take a precedence--a number that tells which
+   expressions can be parsed by that call. If it encounters an expression
+   whose precedence is lower than we allow, it just stops parsing and
+   returns what it has so far.
+
+   To make that check we need to know the precedence of any given infix
+   expression. We'll do that by letting the parselet specify it:
+public interface InfixParselet {
+  Expression parse(Parser parser, Expression left, Token token);
+  int getPrecedence();
+}
+
+   Using that, our core expression parser becomes:
+public Expression parseExpression(int precedence) {
+  Token token = consume();
+  PrefixParselet prefix = mPrefixParselets.get(token.getType());
+
+  if (prefix == null) throw new ParseException(
+      "Could not parse \"" + token.getText() + "\".");
+
+  Expression left = prefix.parse(this, token);
+
+  while (precedence < getPrecedence()) {
+    token = consume();
+
+    InfixParselet infix = mInfixParselets.get(token.getType());
+    left = infix.parse(this, left, token);
+  }
+
+  return left;
+}
+
+   That relies on a tiny helper function to get the precedence of the
+   current token or default if there's no infix parselet for it:
+private int getPrecedence() {
+  InfixParselet parser = mInfixParselets.get(
+      lookAhead(0).getType());
+  if (parser != null) return parser.getPrecedence();
+
+  return 0;
+}
+
+   And that's it. To use this, we'll set up a little precedence table:
+public class Precedence {
+  public static final int ASSIGNMENT  = 1;
+  public static final int CONDITIONAL = 2;
+  public static final int SUM         = 3;
+  public static final int PRODUCT     = 4;
+  public static final int EXPONENT    = 5;
+  public static final int PREFIX      = 6;
+  public static final int POSTFIX     = 7;
+  public static final int CALL        = 8;
+}
+
+   To make our operators correctly handle precedence, they'll pass an
+   appropriate value back into parseExpression() when they call it
+   recursively. For example, the [27]BinaryOperatorParselet instance that
+   handles the + operator will pass in Precedence.SUM when it parses its
+   right-hand operand.
+
+   Associativity is easy too. If an infix parselet calls parseExpression()
+   with the same precedence that it returns for its own getPrecedence()
+   call, you'll get left associativity. To be right-associative, it just
+   needs to pass in one less than that instead.
+
+Go Forth and Multiply
+
+   I've rewritten the [28]parser for Magpie using this and it worked like
+   a charm. I'm also working on a JavaScript parser based on this and
+   again it's been a great fit.
+
+   I find parsing like this to be simple, terse, extensible (Magpie, for
+   example, uses this to [29]let you extend its own syntax at runtime),
+   and easy to read. I'm at the point where I can't imagine writing a
+   parser any other way. I never thought I'd say this, but parsers are
+   easy now.
+
+   To see for yourself, just take a look at [30]the complete program.
+   Please enable JavaScript to view the [31]comments powered by Disqus.
+
+   [32][dogshot_square.jpg]
+
+   Hi! I'm Bob Nystrom, the one on the left.
+
+   I wrote a book called [33]Game Programming Patterns. I'm working on
+   another book called [34]Crafting Interpreters.
+
+   You can email me at robert at this site or follow me on twitter at
+   [35]@munificentbob.
+
+Elsewhere
+
+     * Code at [36]github
+     * Tweets at [37]twitter
+     * Photos at [38]500px
+     * Photos at [39]flickr
+
+Categories
+
+     * [40]code 67
+     * [41]language 43
+     * [42]magpie 24
+     * [43]c-sharp 13
+     * [44]dart 13
+     * [45]game-dev 12
+     * [46]java 10
+     * [47]cpp 8
+     * [48]design 7
+     * [49]game-patterns 6
+     * [50]parsing 6
+     * [51]roguelike 6
+     * [52]book 5
+     * [53]go 5
+     * [54]js 4
+     * [55]personal 4
+     * [56]c 3
+     * [57]finch 3
+     * [58]python 3
+     * [59]ruby 3
+     * [60]blog 2
+     * [61]f-sharp 2
+     * [62]lua 2
+     * [63]music 2
+     * [64]ai 1
+     * [65]beta 1
+     * [66]blogofile 1
+     * [67]game 1
+     * [68]jasic 1
+     * [69]javascript 1
+     * [70]oop 1
+     * [71]optimization 1
+     * [72]oscon 1
+     * [73]politics 1
+     * [74]scheme 1
+     * [75]typescript 1
+     * [76]visualization 1
+
+   All [77]76 articles...
+
+   This blog is built using [78]jekyll. The source repo for it is
+   [79]here.
+
+   � 2008-2021 Robert Nystrom
+
+References
+
+   Visible links:
+   1. https://journal.stuffwithstuff.com/rss.xml
+   2. https://journal.stuffwithstuff.com/atom.xml
+   3. https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+   4. https://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
+   5. https://journal.stuffwithstuff.com/2011/04/21/multimethods-multiple-inheritance-multiawesome/
+   6. https://journal.stuffwithstuff.com/archive
+   7. https://journal.stuffwithstuff.com/category/code
+   8. https://journal.stuffwithstuff.com/category/java
+   9. https://journal.stuffwithstuff.com/category/js
+  10. https://journal.stuffwithstuff.com/category/language
+  11. https://journal.stuffwithstuff.com/category/magpie
+  12. https://journal.stuffwithstuff.com/category/parsing
+  13. http://xkcd.com/208/
+  14. http://en.wikipedia.org/wiki/Heap_%28data_structure%29
+  15. http://en.wikipedia.org/wiki/Vaughan_Pratt
+  16. http://en.wikipedia.org/wiki/Recursive_descent
+  17. http://www.jslint.com/
+  18. http://javascript.crockford.com/tdop/tdop.html
+  19. http://effbot.org/zone/simple-top-down-parsing.htm
+  20. https://github.com/munificent/bantam
+  21. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Lexer.java
+  22. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Token.java
+  23. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/Expression.java
+  24. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/ConditionalExpression.java
+  25. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Parser.java
+  26. http://en.wikipedia.org/wiki/Quicksort
+  27. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/parselets/BinaryOperatorParselet.java
+  28. https://github.com/munificent/magpie/blob/master/src/com/stuffwithstuff/magpie/parser/MagpieParser.java
+  29. http://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
+  30. http://github.com/munificent/bantam
+  31. http://disqus.com/?ref_noscript
+  32. https://journal.stuffwithstuff.com/
+  33. http://gameprogrammingpatterns.com/
+  34. http://craftinginterpreters.com/
+  35. https://twitter.com/intent/user?screen_name=munificentbob
+  36. http://github.com/munificent
+  37. http://twitter.com/munificentbob
+  38. https://500px.com/munificent
+  39. http://www.flickr.com/photos/bobisbob/
+  40. https://journal.stuffwithstuff.com/category/code
+  41. https://journal.stuffwithstuff.com/category/language
+  42. https://journal.stuffwithstuff.com/category/magpie
+  43. https://journal.stuffwithstuff.com/category/c-sharp
+  44. https://journal.stuffwithstuff.com/category/dart
+  45. https://journal.stuffwithstuff.com/category/game-dev
+  46. https://journal.stuffwithstuff.com/category/java
+  47. https://journal.stuffwithstuff.com/category/cpp
+  48. https://journal.stuffwithstuff.com/category/design
+  49. https://journal.stuffwithstuff.com/category/game-patterns
+  50. https://journal.stuffwithstuff.com/category/parsing
+  51. https://journal.stuffwithstuff.com/category/roguelike
+  52. https://journal.stuffwithstuff.com/category/book
+  53. https://journal.stuffwithstuff.com/category/go
+  54. https://journal.stuffwithstuff.com/category/js
+  55. https://journal.stuffwithstuff.com/category/personal
+  56. https://journal.stuffwithstuff.com/category/c
+  57. https://journal.stuffwithstuff.com/category/finch
+  58. https://journal.stuffwithstuff.com/category/python
+  59. https://journal.stuffwithstuff.com/category/ruby
+  60. https://journal.stuffwithstuff.com/category/blog
+  61. https://journal.stuffwithstuff.com/category/f-sharp
+  62. https://journal.stuffwithstuff.com/category/lua
+  63. https://journal.stuffwithstuff.com/category/music
+  64. https://journal.stuffwithstuff.com/category/ai
+  65. https://journal.stuffwithstuff.com/category/beta
+  66. https://journal.stuffwithstuff.com/category/blogofile
+  67. https://journal.stuffwithstuff.com/category/game
+  68. https://journal.stuffwithstuff.com/category/jasic
+  69. https://journal.stuffwithstuff.com/category/javascript
+  70. https://journal.stuffwithstuff.com/category/oop
+  71. https://journal.stuffwithstuff.com/category/optimization
+  72. https://journal.stuffwithstuff.com/category/oscon
+  73. https://journal.stuffwithstuff.com/category/politics
+  74. https://journal.stuffwithstuff.com/category/scheme
+  75. https://journal.stuffwithstuff.com/category/typescript
+  76. https://journal.stuffwithstuff.com/category/visualization
+  77. https://journal.stuffwithstuff.com/archive
+  78. http://jekyllrb.com/
+  79. https://github.com/munificent/journal
+
+   Hidden links:
+  81. https://www.reddit.com/submit?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//
+  82. https://news.ycombinator.com/submitlink?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//&t=Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy
+  83. http://twitter.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/&text=%22Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy%22%20%40munificentbob
+  84. http://www.facebook.com/share.php?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+  85. https://plus.google.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+  86. https://journal.stuffwithstuff.com/rss.xml