summaryrefslogtreecommitdiff
path: root/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt
diff options
context:
space:
mode:
Diffstat (limited to 'miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt')
-rw-r--r--miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt564
1 files changed, 564 insertions, 0 deletions
diff --git a/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt b/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt
new file mode 100644
index 0000000..b1b1b7d
--- /dev/null
+++ b/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt
@@ -0,0 +1,564 @@
+ #[1]RSS 2.0 [2]Atom 1.0
+
+[3]Pratt Parsers: Expression Parsing Made Easy
+
+ [4]↩ [5]↪
+
+[6]March 19, 2011 [7]code [8]java [9]js [10]language [11]magpie [12]parsing
+
+ Every now and then, I stumble onto some algorithm or idea that's so
+ clever and such a perfect solution to a problem that I feel like I got
+ smarter or gained [13]a new superpower just by learning it. [14]Heaps
+ (just about the only thing I got out of my truncated CS education) were
+ one thing like this. I recently stumbled onto another: [15]Pratt
+ parsers.
+
+ When you're writing a parser, [16]recursive descent is as easy as
+ spreading peanut butter. It excels when you can figure out what to do
+ next based on the next chunk of code you're parsing. That's usually
+ true at the top level of a language where things like classes are and
+ also for statements since most start with something that uniquely
+ identifies them (if, for, while, etc.).
+
+ But it gets tricky when you get to expressions. When it comes to infix
+ operators like +, postfix ones like ++, and even mixfix expressions
+ like ?:, it can be hard to tell what kind of expression you're parsing
+ until you're halfway through it. You can do this with recursive
+ descent, but it's a chore. You have to write separate functions for
+ each level of precedence (JavaScript has 17 of them, for example),
+ manually handle associativity, and smear your grammar across a bunch of
+ parsing code until it's hard to see.
+
+PB & J, The Secret Weapon
+
+ Pratt parsing fixes just that. If recursive descent is peanut butter,
+ Pratt parsing is jelly. When you mix the two together, you get a
+ simple, terse, readable parser that can handle any grammar you throw at
+ it.
+
+ Pratt's technique for handling operator precedence and infix
+ expressions is so simple and effective it's a mystery why almost no one
+ knows about it. After the seventies, top down operator precedence
+ parsers seem to have fallen off the Earth. Douglas Crockford's
+ [17]JSLint uses one to [18]parse JavaScript, but his treatment is one
+ of the [19]very few remotely modern articles about it.
+
+ Part of the problem, I think, is that Pratt's terminology is opaque,
+ and Crockford's article is itself rather murky. Pratt uses terms like
+ "null denominator" and Crockford mixes in extra stuff like tracking
+ lexical scope that obscures the core idea.
+
+ This is where I come in. I won't do anything revolutionary. I'll just
+ try to get the core concepts behind top down operator precedence
+ parsers and present them as clearly as I can. I'll switch out some
+ terms to (I hope) clarify things. Hopefully I won't offend anyone's
+ purist sensibilities. I'll be coding in Java, the vulgar Latin of
+ programming languages. I figure if you can write it in Java, you can
+ write it in anything.
+
+What We'll Be Making
+
+ I'm a learn-by-doing person, which means I'm also a teach-by-doing one.
+ So to show how Pratt parsers work, we'll build a parser for a [20]tiny
+ little toy language called Bantam. It just has expressions since that's
+ where Pratt parsing is really helpful, but that should be enough to
+ convince of its usefulness.
+
+ Even though it's simple, it has a full gamut of operators: prefix (+,
+ -, ~, !), postfix (!), infix (+, -, *, /, ^), and even a mixfix
+ conditional operator (?:). It has multiple precedence levels and both
+ right and left associative operators. It also has assignment, function
+ calls and parentheses for grouping. If we can parse this, we can parse
+ anything.
+
+What We'll Start With
+
+ All we care about is parsing, so we'll ignore the tokenizing phase. I
+ slapped together [21]a crude lexer that works and we'll just pretend
+ that tokens are raining down from heaven or something.
+
+ A token is just a chunk of meaningful code with a type and a string
+ associated with it. Given a + b(c), the tokens would be:
+NAME "a"
+PLUS "+"
+NAME "b"
+LEFT_PAREN "("
+NAME "c"
+RIGHT_PAREN ")"
+
+ Likewise, we won't be interpreting or compiling this code. We just want
+ to parse it to a nice data structure. For our purposes, that means our
+ parser should chew up a bunch of [22]Token objects and spit out an
+ instance of some class that implements [23]Expression. To give you an
+ idea, here's a simplified version of the class for a [24]conditional
+ expression:
+class ConditionalExpression implements Expression {
+ public ConditionalExpression(
+ Expression condition,
+ Expression thenArm,
+ Expression elseArm) {
+ this.condition = condition;
+ this.thenArm = thenArm;
+ this.elseArm = elseArm;
+ }
+
+ public final Expression condition;
+ public final Expression thenArm;
+ public final Expression elseArm;
+}
+
+ (You gotta love Java's "please sign it in quadruplicate" level of
+ bureaucracy here. Like I said, if you can do this in Java, you can do
+ it in any language.)
+
+ We'll be building this starting from a simple [25]Parser class. This
+ owns the token stream, handles lookahead and provides the basic methods
+ you'll need to write a top-down recursive descent parser with a single
+ token of lookahead (i.e. it's LL(1)). This is enough to get us going.
+ If we need more later, it's easy to extend it.
+
+ OK, let's build ourselves a parser!
+
+First Things First
+
+ Even though a "full" Pratt parser is pretty tiny, I found it to be a
+ bit hard to decipher. Sort of like [26]quicksort, the implementation is
+ a deceptively-simple handful of deeply intertwined code. To untangle
+ it, we'll build it up one tiny step at a time.
+
+ The simplest expressions to parse are prefix operators and single-token
+ ones. For those, the current token tells us all that we need to do.
+ Bantam has one single-token expression, named variables, and four
+ prefix operators: +, -, ~, and !. The simplest possible code to parse
+ that would be:
+Expression parseExpression() {
+ if (match(TokenType.NAME)) // return NameExpression...
+ else if (match(TokenType.PLUS)) // return prefix + operator...
+ else if (match(TokenType.MINUS)) // return prefix - operator...
+ else if (match(TokenType.TILDE)) // return prefix ~ operator...
+ else if (match(TokenType.BANG)) // return prefix ! operator...
+ else throw new ParseException();
+}
+
+ But that's a bit monolithic. As you can see, we're switching off of a
+ TokenType to branch to different parsing behavior. Let's encode that
+ directly by making a Map from TokenTypes to chunks of parsing code.
+ We'll call these chunks "parselets", and they will implement this:
+interface PrefixParselet {
+ Expression parse(Parser parser, Token token);
+}
+
+ An implementation of this to parse variable names is just:
+class NameParselet implements PrefixParselet {
+ public Expression parse(Parser parser, Token token) {
+ return new NameExpression(token.getText());
+ }
+}
+
+ We can use a single class for all of the prefix operators since they
+ only differ in the actual operator token itself:
+class PrefixOperatorParselet implements PrefixParselet {
+ public Expression parse(Parser parser, Token token) {
+ Expression operand = parser.parseExpression();
+ return new PrefixExpression(token.getType(), operand);
+ }
+}
+
+ You'll note that it calls back into parseExpression() to parse the
+ operand that appears after the operator (i.e. to parse the a in -a).
+ This recursion takes care of nested operators like -+~!a.
+
+ Back in Parser, the chained if statements are replaced with a cleaner
+ map:
+class Parser {
+ public void register(TokenType token, PrefixParselet parselet) {
+ mPrefixParselets.put(token, parselet);
+ }
+
+ public Expression parseExpression() {
+ Token token = consume();
+ PrefixParselet prefix = mPrefixParselets.get(token.getType());
+
+ if (prefix == null) throw new ParseException(
+ "Could not parse \"" + token.getText() + "\".");
+
+ return prefix.parse(this, token);
+ }
+
+ // Other stuff...
+
+ private final Map<TokenType, PrefixParselet> mPrefixParselets =
+ new HashMap<TokenType, PrefixParselet>();
+}
+
+ To define the grammar we have so far (variables and the four prefix
+ operators), we'll make this helper method:
+void prefix(TokenType token) {
+ register(token, new PrefixOperatorParselet());
+}
+
+ And now we can define the grammar like:
+register(TokenType.NAME, new NameParselet());
+prefix(TokenType.PLUS);
+prefix(TokenType.MINUS);
+prefix(TokenType.TILDE);
+prefix(TokenType.BANG);
+
+ This is already an improvement over a recursive descent parser because
+ our grammar is now more declarative instead of being spread out over a
+ few imperative functions, and we can see the actual grammar all in one
+ place. Even better, we can extend the grammar just by registering new
+ parselets. We don't have to change the Parser class itself.
+
+ If we only had prefix expressions, we'd be done now. Alas, we don't.
+
+Stuck In the Middle
+
+ What we have so far only works if the first token tells us what kind of
+ expression we're parsing, but that isn't always the case. With an
+ expression like a + b, we don't know we have an add expression until
+ after we parse the a and get to +. We'll have to extend the parser to
+ support that.
+
+ Fortunately, we're in a good place to do so. Our current
+ parseExpression() method will parse a complete prefix expression
+ including any nested prefix expressions and then stop. So, if we throw
+ this at it:
+-a + b
+
+ It will parse -a and leave us sitting on +. That's exactly the token we
+ need to tell what infix expression we're parsing. The only difference
+ between an infix expression and a prefix one here is that there's
+ another expression before the infix operator that it needs to have as
+ an argument. Let's define a parselet that supports that:
+interface InfixParselet {
+ Expression parse(Parser parser, Expression left, Token token);
+}
+
+ The only difference is that left argument, which is just the expression
+ we parsed before we got to the infix token. We'll wire this up to our
+ parser by having another table of infix parselets.
+
+ Having separate tables for prefix and infix expressions is important
+ because we'll often have both a prefix and infix parselet for a single
+ TokenType. For example, the prefix parselet for ( handles grouping in
+ an expression like a * (b + c). Meanwhile, the infix parselet handles
+ function calls like a(b).
+
+ Now, after we parse the prefix expression, we hand it off to any infix
+ one that subsumes it:
+class Parser {
+ public void register(TokenType token, InfixParselet parselet) {
+ mInfixParselets.put(token, parselet);
+ }
+
+ public Expression parseExpression() {
+ Token token = consume();
+ PrefixParselet prefix = mPrefixParselets.get(token.getType());
+
+ if (prefix == null) throw new ParseException(
+ "Could not parse \"" + token.getText() + "\".");
+
+ Expression left = prefix.parse(this, token);
+
+ token = lookAhead(0);
+ InfixParselet infix = mInfixParselets.get(token.getType());
+
+ // No infix expression at this point, so we're done.
+ if (infix == null) return left;
+
+ consume();
+ return infix.parse(this, left, token);
+ }
+
+ // Other stuff...
+
+ private final Map<TokenType, InfixParselet> mInfixParselets =
+ new HashMap<TokenType, InfixParselet>();
+}
+
+ Pretty straightforward. We can implement an infix parselet for binary
+ arithmetic operators like + using something like:
+class BinaryOperatorParselet implements InfixParselet {
+ public Expression parse(Parser parser,
+ Expression left, Token token) {
+ Expression right = parser.parseExpression();
+ return new OperatorExpression(left, token.getType(), right);
+ }
+}
+
+ This also works for postfix operators. I'm calling them "infix"
+ parselets, but they're really "anything but prefix". If there's some
+ expression that comes before the token, it will be handled by an infix
+ parselet, and that includes postfix expressions and mixfix ones like
+ ?:.
+
+ Postfix is as easy as a single-token prefix parselet: it just takes the
+ left expression and wraps it in another expression:
+class PostfixOperatorParselet implements InfixParselet {
+ public Expression parse(Parser parser, Expression left,
+ Token token) {
+ return new PostfixExpression(left, token.getType());
+ }
+}
+
+ Mixfix is easy too. It's pretty much a familiar recursive descent
+ parser:
+class ConditionalParselet implements InfixParselet {
+ public Expression parse(Parser parser, Expression left,
+ Token token) {
+ Expression thenArm = parser.parseExpression();
+ parser.consume(TokenType.COLON);
+ Expression elseArm = parser.parseExpression();
+
+ return new ConditionalExpression(left, thenArm, elseArm);
+ }
+}
+
+ Now we can parse prefix expressions, postfix, infix, and even mixfix.
+ With a pretty small amount of code, we can parse expressions like a +
+ (b ? c! : -d). We're done, right? Well... almost.
+
+Excuse You, Aunt Sally
+
+ Our parser can parse all of this stuff, but it doesn't parse it with
+ the right precedence or associativity. If you throw a - b - c at it, it
+ will parse it like a - (b - c), which isn't right. (Well, actually it
+ is right--associative that is. We need it to be left.)
+
+ And this last step is where Pratt parsers go from pretty nice to
+ totally radical. We'll make two simple changes. We'll extend
+ parseExpression() to take a precedence--a number that tells which
+ expressions can be parsed by that call. If it encounters an expression
+ whose precedence is lower than we allow, it just stops parsing and
+ returns what it has so far.
+
+ To make that check we need to know the precedence of any given infix
+ expression. We'll do that by letting the parselet specify it:
+public interface InfixParselet {
+ Expression parse(Parser parser, Expression left, Token token);
+ int getPrecedence();
+}
+
+ Using that, our core expression parser becomes:
+public Expression parseExpression(int precedence) {
+ Token token = consume();
+ PrefixParselet prefix = mPrefixParselets.get(token.getType());
+
+ if (prefix == null) throw new ParseException(
+ "Could not parse \"" + token.getText() + "\".");
+
+ Expression left = prefix.parse(this, token);
+
+ while (precedence < getPrecedence()) {
+ token = consume();
+
+ InfixParselet infix = mInfixParselets.get(token.getType());
+ left = infix.parse(this, left, token);
+ }
+
+ return left;
+}
+
+ That relies on a tiny helper function to get the precedence of the
+ current token or default if there's no infix parselet for it:
+private int getPrecedence() {
+ InfixParselet parser = mInfixParselets.get(
+ lookAhead(0).getType());
+ if (parser != null) return parser.getPrecedence();
+
+ return 0;
+}
+
+ And that's it. To use this, we'll set up a little precedence table:
+public class Precedence {
+ public static final int ASSIGNMENT = 1;
+ public static final int CONDITIONAL = 2;
+ public static final int SUM = 3;
+ public static final int PRODUCT = 4;
+ public static final int EXPONENT = 5;
+ public static final int PREFIX = 6;
+ public static final int POSTFIX = 7;
+ public static final int CALL = 8;
+}
+
+ To make our operators correctly handle precedence, they'll pass an
+ appropriate value back into parseExpression() when they call it
+ recursively. For example, the [27]BinaryOperatorParselet instance that
+ handles the + operator will pass in Precedence.SUM when it parses its
+ right-hand operand.
+
+ Associativity is easy too. If an infix parselet calls parseExpression()
+ with the same precedence that it returns for its own getPrecedence()
+ call, you'll get left associativity. To be right-associative, it just
+ needs to pass in one less than that instead.
+
+Go Forth and Multiply
+
+ I've rewritten the [28]parser for Magpie using this and it worked like
+ a charm. I'm also working on a JavaScript parser based on this and
+ again it's been a great fit.
+
+ I find parsing like this to be simple, terse, extensible (Magpie, for
+ example, uses this to [29]let you extend its own syntax at runtime),
+ and easy to read. I'm at the point where I can't imagine writing a
+ parser any other way. I never thought I'd say this, but parsers are
+ easy now.
+
+ To see for yourself, just take a look at [30]the complete program.
+ Please enable JavaScript to view the [31]comments powered by Disqus.
+
+ [32][dogshot_square.jpg]
+
+ Hi! I'm Bob Nystrom, the one on the left.
+
+ I wrote a book called [33]Game Programming Patterns. I'm working on
+ another book called [34]Crafting Interpreters.
+
+ You can email me at robert at this site or follow me on twitter at
+ [35]@munificentbob.
+
+Elsewhere
+
+ * Code at [36]github
+ * Tweets at [37]twitter
+ * Photos at [38]500px
+ * Photos at [39]flickr
+
+Categories
+
+ * [40]code 67
+ * [41]language 43
+ * [42]magpie 24
+ * [43]c-sharp 13
+ * [44]dart 13
+ * [45]game-dev 12
+ * [46]java 10
+ * [47]cpp 8
+ * [48]design 7
+ * [49]game-patterns 6
+ * [50]parsing 6
+ * [51]roguelike 6
+ * [52]book 5
+ * [53]go 5
+ * [54]js 4
+ * [55]personal 4
+ * [56]c 3
+ * [57]finch 3
+ * [58]python 3
+ * [59]ruby 3
+ * [60]blog 2
+ * [61]f-sharp 2
+ * [62]lua 2
+ * [63]music 2
+ * [64]ai 1
+ * [65]beta 1
+ * [66]blogofile 1
+ * [67]game 1
+ * [68]jasic 1
+ * [69]javascript 1
+ * [70]oop 1
+ * [71]optimization 1
+ * [72]oscon 1
+ * [73]politics 1
+ * [74]scheme 1
+ * [75]typescript 1
+ * [76]visualization 1
+
+ All [77]76 articles...
+
+ This blog is built using [78]jekyll. The source repo for it is
+ [79]here.
+
+ © 2008-2021 Robert Nystrom
+
+References
+
+ Visible links:
+ 1. https://journal.stuffwithstuff.com/rss.xml
+ 2. https://journal.stuffwithstuff.com/atom.xml
+ 3. https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+ 4. https://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
+ 5. https://journal.stuffwithstuff.com/2011/04/21/multimethods-multiple-inheritance-multiawesome/
+ 6. https://journal.stuffwithstuff.com/archive
+ 7. https://journal.stuffwithstuff.com/category/code
+ 8. https://journal.stuffwithstuff.com/category/java
+ 9. https://journal.stuffwithstuff.com/category/js
+ 10. https://journal.stuffwithstuff.com/category/language
+ 11. https://journal.stuffwithstuff.com/category/magpie
+ 12. https://journal.stuffwithstuff.com/category/parsing
+ 13. http://xkcd.com/208/
+ 14. http://en.wikipedia.org/wiki/Heap_%28data_structure%29
+ 15. http://en.wikipedia.org/wiki/Vaughan_Pratt
+ 16. http://en.wikipedia.org/wiki/Recursive_descent
+ 17. http://www.jslint.com/
+ 18. http://javascript.crockford.com/tdop/tdop.html
+ 19. http://effbot.org/zone/simple-top-down-parsing.htm
+ 20. https://github.com/munificent/bantam
+ 21. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Lexer.java
+ 22. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Token.java
+ 23. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/Expression.java
+ 24. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/ConditionalExpression.java
+ 25. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Parser.java
+ 26. http://en.wikipedia.org/wiki/Quicksort
+ 27. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/parselets/BinaryOperatorParselet.java
+ 28. https://github.com/munificent/magpie/blob/master/src/com/stuffwithstuff/magpie/parser/MagpieParser.java
+ 29. http://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
+ 30. http://github.com/munificent/bantam
+ 31. http://disqus.com/?ref_noscript
+ 32. https://journal.stuffwithstuff.com/
+ 33. http://gameprogrammingpatterns.com/
+ 34. http://craftinginterpreters.com/
+ 35. https://twitter.com/intent/user?screen_name=munificentbob
+ 36. http://github.com/munificent
+ 37. http://twitter.com/munificentbob
+ 38. https://500px.com/munificent
+ 39. http://www.flickr.com/photos/bobisbob/
+ 40. https://journal.stuffwithstuff.com/category/code
+ 41. https://journal.stuffwithstuff.com/category/language
+ 42. https://journal.stuffwithstuff.com/category/magpie
+ 43. https://journal.stuffwithstuff.com/category/c-sharp
+ 44. https://journal.stuffwithstuff.com/category/dart
+ 45. https://journal.stuffwithstuff.com/category/game-dev
+ 46. https://journal.stuffwithstuff.com/category/java
+ 47. https://journal.stuffwithstuff.com/category/cpp
+ 48. https://journal.stuffwithstuff.com/category/design
+ 49. https://journal.stuffwithstuff.com/category/game-patterns
+ 50. https://journal.stuffwithstuff.com/category/parsing
+ 51. https://journal.stuffwithstuff.com/category/roguelike
+ 52. https://journal.stuffwithstuff.com/category/book
+ 53. https://journal.stuffwithstuff.com/category/go
+ 54. https://journal.stuffwithstuff.com/category/js
+ 55. https://journal.stuffwithstuff.com/category/personal
+ 56. https://journal.stuffwithstuff.com/category/c
+ 57. https://journal.stuffwithstuff.com/category/finch
+ 58. https://journal.stuffwithstuff.com/category/python
+ 59. https://journal.stuffwithstuff.com/category/ruby
+ 60. https://journal.stuffwithstuff.com/category/blog
+ 61. https://journal.stuffwithstuff.com/category/f-sharp
+ 62. https://journal.stuffwithstuff.com/category/lua
+ 63. https://journal.stuffwithstuff.com/category/music
+ 64. https://journal.stuffwithstuff.com/category/ai
+ 65. https://journal.stuffwithstuff.com/category/beta
+ 66. https://journal.stuffwithstuff.com/category/blogofile
+ 67. https://journal.stuffwithstuff.com/category/game
+ 68. https://journal.stuffwithstuff.com/category/jasic
+ 69. https://journal.stuffwithstuff.com/category/javascript
+ 70. https://journal.stuffwithstuff.com/category/oop
+ 71. https://journal.stuffwithstuff.com/category/optimization
+ 72. https://journal.stuffwithstuff.com/category/oscon
+ 73. https://journal.stuffwithstuff.com/category/politics
+ 74. https://journal.stuffwithstuff.com/category/scheme
+ 75. https://journal.stuffwithstuff.com/category/typescript
+ 76. https://journal.stuffwithstuff.com/category/visualization
+ 77. https://journal.stuffwithstuff.com/archive
+ 78. http://jekyllrb.com/
+ 79. https://github.com/munificent/journal
+
+ Hidden links:
+ 81. https://www.reddit.com/submit?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//
+ 82. https://news.ycombinator.com/submitlink?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//&t=Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy
+ 83. http://twitter.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/&text=%22Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy%22%20%40munificentbob
+ 84. http://www.facebook.com/share.php?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+ 85. https://plus.google.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+ 86. https://journal.stuffwithstuff.com/rss.xml