summaryrefslogtreecommitdiff
path: root/miniany/doc/journal.stuffwithstuff.com_2011_03_19_pratt-parsers-expression-parsing-made-easy.txt
blob: b1b1b7d02a545bbb6701805090d847cba106dcbd (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
   #[1]RSS 2.0 [2]Atom 1.0

[3]Pratt Parsers: Expression Parsing Made Easy

   [4]↩ [5]↪

[6]March 19, 2011 [7]code [8]java [9]js [10]language [11]magpie [12]parsing

   Every now and then, I stumble onto some algorithm or idea that's so
   clever and such a perfect solution to a problem that I feel like I got
   smarter or gained [13]a new superpower just by learning it. [14]Heaps
   (just about the only thing I got out of my truncated CS education) were
   one thing like this. I recently stumbled onto another: [15]Pratt
   parsers.

   When you're writing a parser, [16]recursive descent is as easy as
   spreading peanut butter. It excels when you can figure out what to do
   next based on the next chunk of code you're parsing. That's usually
   true at the top level of a language where things like classes are and
   also for statements since most start with something that uniquely
   identifies them (if, for, while, etc.).

   But it gets tricky when you get to expressions. When it comes to infix
   operators like +, postfix ones like ++, and even mixfix expressions
   like ?:, it can be hard to tell what kind of expression you're parsing
   until you're halfway through it. You can do this with recursive
   descent, but it's a chore. You have to write separate functions for
   each level of precedence (JavaScript has 17 of them, for example),
   manually handle associativity, and smear your grammar across a bunch of
   parsing code until it's hard to see.

PB & J, The Secret Weapon

   Pratt parsing fixes just that. If recursive descent is peanut butter,
   Pratt parsing is jelly. When you mix the two together, you get a
   simple, terse, readable parser that can handle any grammar you throw at
   it.

   Pratt's technique for handling operator precedence and infix
   expressions is so simple and effective it's a mystery why almost no one
   knows about it. After the seventies, top down operator precedence
   parsers seem to have fallen off the Earth. Douglas Crockford's
   [17]JSLint uses one to [18]parse JavaScript, but his treatment is one
   of the [19]very few remotely modern articles about it.

   Part of the problem, I think, is that Pratt's terminology is opaque,
   and Crockford's article is itself rather murky. Pratt uses terms like
   "null denominator" and Crockford mixes in extra stuff like tracking
   lexical scope that obscures the core idea.

   This is where I come in. I won't do anything revolutionary. I'll just
   try to get the core concepts behind top down operator precedence
   parsers and present them as clearly as I can. I'll switch out some
   terms to (I hope) clarify things. Hopefully I won't offend anyone's
   purist sensibilities. I'll be coding in Java, the vulgar Latin of
   programming languages. I figure if you can write it in Java, you can
   write it in anything.

What We'll Be Making

   I'm a learn-by-doing person, which means I'm also a teach-by-doing one.
   So to show how Pratt parsers work, we'll build a parser for a [20]tiny
   little toy language called Bantam. It just has expressions since that's
   where Pratt parsing is really helpful, but that should be enough to
   convince of its usefulness.

   Even though it's simple, it has a full gamut of operators: prefix (+,
   -, ~, !), postfix (!), infix (+, -, *, /, ^), and even a mixfix
   conditional operator (?:). It has multiple precedence levels and both
   right and left associative operators. It also has assignment, function
   calls and parentheses for grouping. If we can parse this, we can parse
   anything.

What We'll Start With

   All we care about is parsing, so we'll ignore the tokenizing phase. I
   slapped together [21]a crude lexer that works and we'll just pretend
   that tokens are raining down from heaven or something.

   A token is just a chunk of meaningful code with a type and a string
   associated with it. Given a + b(c), the tokens would be:
NAME "a"
PLUS "+"
NAME "b"
LEFT_PAREN "("
NAME "c"
RIGHT_PAREN ")"

   Likewise, we won't be interpreting or compiling this code. We just want
   to parse it to a nice data structure. For our purposes, that means our
   parser should chew up a bunch of [22]Token objects and spit out an
   instance of some class that implements [23]Expression. To give you an
   idea, here's a simplified version of the class for a [24]conditional
   expression:
class ConditionalExpression implements Expression {
  public ConditionalExpression(
      Expression condition,
      Expression thenArm,
      Expression elseArm) {
    this.condition = condition;
    this.thenArm   = thenArm;
    this.elseArm   = elseArm;
  }

  public final Expression condition;
  public final Expression thenArm;
  public final Expression elseArm;
}

   (You gotta love Java's "please sign it in quadruplicate" level of
   bureaucracy here. Like I said, if you can do this in Java, you can do
   it in any language.)

   We'll be building this starting from a simple [25]Parser class. This
   owns the token stream, handles lookahead and provides the basic methods
   you'll need to write a top-down recursive descent parser with a single
   token of lookahead (i.e. it's LL(1)). This is enough to get us going.
   If we need more later, it's easy to extend it.

   OK, let's build ourselves a parser!

First Things First

   Even though a "full" Pratt parser is pretty tiny, I found it to be a
   bit hard to decipher. Sort of like [26]quicksort, the implementation is
   a deceptively-simple handful of deeply intertwined code. To untangle
   it, we'll build it up one tiny step at a time.

   The simplest expressions to parse are prefix operators and single-token
   ones. For those, the current token tells us all that we need to do.
   Bantam has one single-token expression, named variables, and four
   prefix operators: +, -, ~, and !. The simplest possible code to parse
   that would be:
Expression parseExpression() {
  if (match(TokenType.NAME))       // return NameExpression...
  else if (match(TokenType.PLUS))  // return prefix + operator...
  else if (match(TokenType.MINUS)) // return prefix - operator...
  else if (match(TokenType.TILDE)) // return prefix ~ operator...
  else if (match(TokenType.BANG))  // return prefix ! operator...
  else throw new ParseException();
}

   But that's a bit monolithic. As you can see, we're switching off of a
   TokenType to branch to different parsing behavior. Let's encode that
   directly by making a Map from TokenTypes to chunks of parsing code.
   We'll call these chunks "parselets", and they will implement this:
interface PrefixParselet {
  Expression parse(Parser parser, Token token);
}

   An implementation of this to parse variable names is just:
class NameParselet implements PrefixParselet {
  public Expression parse(Parser parser, Token token) {
    return new NameExpression(token.getText());
  }
}

   We can use a single class for all of the prefix operators since they
   only differ in the actual operator token itself:
class PrefixOperatorParselet implements PrefixParselet {
  public Expression parse(Parser parser, Token token) {
    Expression operand = parser.parseExpression();
    return new PrefixExpression(token.getType(), operand);
  }
}

   You'll note that it calls back into parseExpression() to parse the
   operand that appears after the operator (i.e. to parse the a in -a).
   This recursion takes care of nested operators like -+~!a.

   Back in Parser, the chained if statements are replaced with a cleaner
   map:
class Parser {
  public void register(TokenType token, PrefixParselet parselet) {
    mPrefixParselets.put(token, parselet);
  }

  public Expression parseExpression() {
    Token token = consume();
    PrefixParselet prefix = mPrefixParselets.get(token.getType());

    if (prefix == null) throw new ParseException(
        "Could not parse \"" + token.getText() + "\".");

    return prefix.parse(this, token);
  }

  // Other stuff...

  private final Map<TokenType, PrefixParselet> mPrefixParselets =
      new HashMap<TokenType, PrefixParselet>();
}

   To define the grammar we have so far (variables and the four prefix
   operators), we'll make this helper method:
void prefix(TokenType token) {
  register(token, new PrefixOperatorParselet());
}

   And now we can define the grammar like:
register(TokenType.NAME, new NameParselet());
prefix(TokenType.PLUS);
prefix(TokenType.MINUS);
prefix(TokenType.TILDE);
prefix(TokenType.BANG);

   This is already an improvement over a recursive descent parser because
   our grammar is now more declarative instead of being spread out over a
   few imperative functions, and we can see the actual grammar all in one
   place. Even better, we can extend the grammar just by registering new
   parselets. We don't have to change the Parser class itself.

   If we only had prefix expressions, we'd be done now. Alas, we don't.

Stuck In the Middle

   What we have so far only works if the first token tells us what kind of
   expression we're parsing, but that isn't always the case. With an
   expression like a + b, we don't know we have an add expression until
   after we parse the a and get to +. We'll have to extend the parser to
   support that.

   Fortunately, we're in a good place to do so. Our current
   parseExpression() method will parse a complete prefix expression
   including any nested prefix expressions and then stop. So, if we throw
   this at it:
-a + b

   It will parse -a and leave us sitting on +. That's exactly the token we
   need to tell what infix expression we're parsing. The only difference
   between an infix expression and a prefix one here is that there's
   another expression before the infix operator that it needs to have as
   an argument. Let's define a parselet that supports that:
interface InfixParselet {
  Expression parse(Parser parser, Expression left, Token token);
}

   The only difference is that left argument, which is just the expression
   we parsed before we got to the infix token. We'll wire this up to our
   parser by having another table of infix parselets.

   Having separate tables for prefix and infix expressions is important
   because we'll often have both a prefix and infix parselet for a single
   TokenType. For example, the prefix parselet for ( handles grouping in
   an expression like a * (b + c). Meanwhile, the infix parselet handles
   function calls like a(b).

   Now, after we parse the prefix expression, we hand it off to any infix
   one that subsumes it:
class Parser {
  public void register(TokenType token, InfixParselet parselet) {
    mInfixParselets.put(token, parselet);
  }

  public Expression parseExpression() {
    Token token = consume();
    PrefixParselet prefix = mPrefixParselets.get(token.getType());

    if (prefix == null) throw new ParseException(
        "Could not parse \"" + token.getText() + "\".");

    Expression left = prefix.parse(this, token);

    token = lookAhead(0);
    InfixParselet infix = mInfixParselets.get(token.getType());

    // No infix expression at this point, so we're done.
    if (infix == null) return left;

    consume();
    return infix.parse(this, left, token);
  }

  // Other stuff...

  private final Map<TokenType, InfixParselet> mInfixParselets =
      new HashMap<TokenType, InfixParselet>();
}

   Pretty straightforward. We can implement an infix parselet for binary
   arithmetic operators like + using something like:
class BinaryOperatorParselet implements InfixParselet {
  public Expression parse(Parser parser,
      Expression left, Token token) {
    Expression right = parser.parseExpression();
    return new OperatorExpression(left, token.getType(), right);
  }
}

   This also works for postfix operators. I'm calling them "infix"
   parselets, but they're really "anything but prefix". If there's some
   expression that comes before the token, it will be handled by an infix
   parselet, and that includes postfix expressions and mixfix ones like
   ?:.

   Postfix is as easy as a single-token prefix parselet: it just takes the
   left expression and wraps it in another expression:
class PostfixOperatorParselet implements InfixParselet {
  public Expression parse(Parser parser, Expression left,
      Token token) {
    return new PostfixExpression(left, token.getType());
  }
}

   Mixfix is easy too. It's pretty much a familiar recursive descent
   parser:
class ConditionalParselet implements InfixParselet {
  public Expression parse(Parser parser, Expression left,
      Token token) {
    Expression thenArm = parser.parseExpression();
    parser.consume(TokenType.COLON);
    Expression elseArm = parser.parseExpression();

    return new ConditionalExpression(left, thenArm, elseArm);
  }
}

   Now we can parse prefix expressions, postfix, infix, and even mixfix.
   With a pretty small amount of code, we can parse expressions like a +
   (b ? c! : -d). We're done, right? Well... almost.

Excuse You, Aunt Sally

   Our parser can parse all of this stuff, but it doesn't parse it with
   the right precedence or associativity. If you throw a - b - c at it, it
   will parse it like a - (b - c), which isn't right. (Well, actually it
   is right--associative that is. We need it to be left.)

   And this last step is where Pratt parsers go from pretty nice to
   totally radical. We'll make two simple changes. We'll extend
   parseExpression() to take a precedence--a number that tells which
   expressions can be parsed by that call. If it encounters an expression
   whose precedence is lower than we allow, it just stops parsing and
   returns what it has so far.

   To make that check we need to know the precedence of any given infix
   expression. We'll do that by letting the parselet specify it:
public interface InfixParselet {
  Expression parse(Parser parser, Expression left, Token token);
  int getPrecedence();
}

   Using that, our core expression parser becomes:
public Expression parseExpression(int precedence) {
  Token token = consume();
  PrefixParselet prefix = mPrefixParselets.get(token.getType());

  if (prefix == null) throw new ParseException(
      "Could not parse \"" + token.getText() + "\".");

  Expression left = prefix.parse(this, token);

  while (precedence < getPrecedence()) {
    token = consume();

    InfixParselet infix = mInfixParselets.get(token.getType());
    left = infix.parse(this, left, token);
  }

  return left;
}

   That relies on a tiny helper function to get the precedence of the
   current token or default if there's no infix parselet for it:
private int getPrecedence() {
  InfixParselet parser = mInfixParselets.get(
      lookAhead(0).getType());
  if (parser != null) return parser.getPrecedence();

  return 0;
}

   And that's it. To use this, we'll set up a little precedence table:
public class Precedence {
  public static final int ASSIGNMENT  = 1;
  public static final int CONDITIONAL = 2;
  public static final int SUM         = 3;
  public static final int PRODUCT     = 4;
  public static final int EXPONENT    = 5;
  public static final int PREFIX      = 6;
  public static final int POSTFIX     = 7;
  public static final int CALL        = 8;
}

   To make our operators correctly handle precedence, they'll pass an
   appropriate value back into parseExpression() when they call it
   recursively. For example, the [27]BinaryOperatorParselet instance that
   handles the + operator will pass in Precedence.SUM when it parses its
   right-hand operand.

   Associativity is easy too. If an infix parselet calls parseExpression()
   with the same precedence that it returns for its own getPrecedence()
   call, you'll get left associativity. To be right-associative, it just
   needs to pass in one less than that instead.

Go Forth and Multiply

   I've rewritten the [28]parser for Magpie using this and it worked like
   a charm. I'm also working on a JavaScript parser based on this and
   again it's been a great fit.

   I find parsing like this to be simple, terse, extensible (Magpie, for
   example, uses this to [29]let you extend its own syntax at runtime),
   and easy to read. I'm at the point where I can't imagine writing a
   parser any other way. I never thought I'd say this, but parsers are
   easy now.

   To see for yourself, just take a look at [30]the complete program.
   Please enable JavaScript to view the [31]comments powered by Disqus.

   [32][dogshot_square.jpg]

   Hi! I'm Bob Nystrom, the one on the left.

   I wrote a book called [33]Game Programming Patterns. I'm working on
   another book called [34]Crafting Interpreters.

   You can email me at robert at this site or follow me on twitter at
   [35]@munificentbob.

Elsewhere

     * Code at [36]github
     * Tweets at [37]twitter
     * Photos at [38]500px
     * Photos at [39]flickr

Categories

     * [40]code 67
     * [41]language 43
     * [42]magpie 24
     * [43]c-sharp 13
     * [44]dart 13
     * [45]game-dev 12
     * [46]java 10
     * [47]cpp 8
     * [48]design 7
     * [49]game-patterns 6
     * [50]parsing 6
     * [51]roguelike 6
     * [52]book 5
     * [53]go 5
     * [54]js 4
     * [55]personal 4
     * [56]c 3
     * [57]finch 3
     * [58]python 3
     * [59]ruby 3
     * [60]blog 2
     * [61]f-sharp 2
     * [62]lua 2
     * [63]music 2
     * [64]ai 1
     * [65]beta 1
     * [66]blogofile 1
     * [67]game 1
     * [68]jasic 1
     * [69]javascript 1
     * [70]oop 1
     * [71]optimization 1
     * [72]oscon 1
     * [73]politics 1
     * [74]scheme 1
     * [75]typescript 1
     * [76]visualization 1

   All [77]76 articles...

   This blog is built using [78]jekyll. The source repo for it is
   [79]here.

   © 2008-2021 Robert Nystrom

References

   Visible links:
   1. https://journal.stuffwithstuff.com/rss.xml
   2. https://journal.stuffwithstuff.com/atom.xml
   3. https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
   4. https://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
   5. https://journal.stuffwithstuff.com/2011/04/21/multimethods-multiple-inheritance-multiawesome/
   6. https://journal.stuffwithstuff.com/archive
   7. https://journal.stuffwithstuff.com/category/code
   8. https://journal.stuffwithstuff.com/category/java
   9. https://journal.stuffwithstuff.com/category/js
  10. https://journal.stuffwithstuff.com/category/language
  11. https://journal.stuffwithstuff.com/category/magpie
  12. https://journal.stuffwithstuff.com/category/parsing
  13. http://xkcd.com/208/
  14. http://en.wikipedia.org/wiki/Heap_%28data_structure%29
  15. http://en.wikipedia.org/wiki/Vaughan_Pratt
  16. http://en.wikipedia.org/wiki/Recursive_descent
  17. http://www.jslint.com/
  18. http://javascript.crockford.com/tdop/tdop.html
  19. http://effbot.org/zone/simple-top-down-parsing.htm
  20. https://github.com/munificent/bantam
  21. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Lexer.java
  22. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Token.java
  23. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/Expression.java
  24. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/ConditionalExpression.java
  25. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Parser.java
  26. http://en.wikipedia.org/wiki/Quicksort
  27. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/parselets/BinaryOperatorParselet.java
  28. https://github.com/munificent/magpie/blob/master/src/com/stuffwithstuff/magpie/parser/MagpieParser.java
  29. http://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
  30. http://github.com/munificent/bantam
  31. http://disqus.com/?ref_noscript
  32. https://journal.stuffwithstuff.com/
  33. http://gameprogrammingpatterns.com/
  34. http://craftinginterpreters.com/
  35. https://twitter.com/intent/user?screen_name=munificentbob
  36. http://github.com/munificent
  37. http://twitter.com/munificentbob
  38. https://500px.com/munificent
  39. http://www.flickr.com/photos/bobisbob/
  40. https://journal.stuffwithstuff.com/category/code
  41. https://journal.stuffwithstuff.com/category/language
  42. https://journal.stuffwithstuff.com/category/magpie
  43. https://journal.stuffwithstuff.com/category/c-sharp
  44. https://journal.stuffwithstuff.com/category/dart
  45. https://journal.stuffwithstuff.com/category/game-dev
  46. https://journal.stuffwithstuff.com/category/java
  47. https://journal.stuffwithstuff.com/category/cpp
  48. https://journal.stuffwithstuff.com/category/design
  49. https://journal.stuffwithstuff.com/category/game-patterns
  50. https://journal.stuffwithstuff.com/category/parsing
  51. https://journal.stuffwithstuff.com/category/roguelike
  52. https://journal.stuffwithstuff.com/category/book
  53. https://journal.stuffwithstuff.com/category/go
  54. https://journal.stuffwithstuff.com/category/js
  55. https://journal.stuffwithstuff.com/category/personal
  56. https://journal.stuffwithstuff.com/category/c
  57. https://journal.stuffwithstuff.com/category/finch
  58. https://journal.stuffwithstuff.com/category/python
  59. https://journal.stuffwithstuff.com/category/ruby
  60. https://journal.stuffwithstuff.com/category/blog
  61. https://journal.stuffwithstuff.com/category/f-sharp
  62. https://journal.stuffwithstuff.com/category/lua
  63. https://journal.stuffwithstuff.com/category/music
  64. https://journal.stuffwithstuff.com/category/ai
  65. https://journal.stuffwithstuff.com/category/beta
  66. https://journal.stuffwithstuff.com/category/blogofile
  67. https://journal.stuffwithstuff.com/category/game
  68. https://journal.stuffwithstuff.com/category/jasic
  69. https://journal.stuffwithstuff.com/category/javascript
  70. https://journal.stuffwithstuff.com/category/oop
  71. https://journal.stuffwithstuff.com/category/optimization
  72. https://journal.stuffwithstuff.com/category/oscon
  73. https://journal.stuffwithstuff.com/category/politics
  74. https://journal.stuffwithstuff.com/category/scheme
  75. https://journal.stuffwithstuff.com/category/typescript
  76. https://journal.stuffwithstuff.com/category/visualization
  77. https://journal.stuffwithstuff.com/archive
  78. http://jekyllrb.com/
  79. https://github.com/munificent/journal

   Hidden links:
  81. https://www.reddit.com/submit?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//
  82. https://news.ycombinator.com/submitlink?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//&t=Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy
  83. http://twitter.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/&text=%22Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy%22%20%40munificentbob
  84. http://www.facebook.com/share.php?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
  85. https://plus.google.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
  86. https://journal.stuffwithstuff.com/rss.xml