1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
|
#[1]RSS 2.0 [2]Atom 1.0
[3]Pratt Parsers: Expression Parsing Made Easy
[4]↩ [5]↪
[6]March 19, 2011 [7]code [8]java [9]js [10]language [11]magpie [12]parsing
Every now and then, I stumble onto some algorithm or idea that's so
clever and such a perfect solution to a problem that I feel like I got
smarter or gained [13]a new superpower just by learning it. [14]Heaps
(just about the only thing I got out of my truncated CS education) were
one thing like this. I recently stumbled onto another: [15]Pratt
parsers.
When you're writing a parser, [16]recursive descent is as easy as
spreading peanut butter. It excels when you can figure out what to do
next based on the next chunk of code you're parsing. That's usually
true at the top level of a language where things like classes are and
also for statements since most start with something that uniquely
identifies them (if, for, while, etc.).
But it gets tricky when you get to expressions. When it comes to infix
operators like +, postfix ones like ++, and even mixfix expressions
like ?:, it can be hard to tell what kind of expression you're parsing
until you're halfway through it. You can do this with recursive
descent, but it's a chore. You have to write separate functions for
each level of precedence (JavaScript has 17 of them, for example),
manually handle associativity, and smear your grammar across a bunch of
parsing code until it's hard to see.
PB & J, The Secret Weapon
Pratt parsing fixes just that. If recursive descent is peanut butter,
Pratt parsing is jelly. When you mix the two together, you get a
simple, terse, readable parser that can handle any grammar you throw at
it.
Pratt's technique for handling operator precedence and infix
expressions is so simple and effective it's a mystery why almost no one
knows about it. After the seventies, top down operator precedence
parsers seem to have fallen off the Earth. Douglas Crockford's
[17]JSLint uses one to [18]parse JavaScript, but his treatment is one
of the [19]very few remotely modern articles about it.
Part of the problem, I think, is that Pratt's terminology is opaque,
and Crockford's article is itself rather murky. Pratt uses terms like
"null denominator" and Crockford mixes in extra stuff like tracking
lexical scope that obscures the core idea.
This is where I come in. I won't do anything revolutionary. I'll just
try to get the core concepts behind top down operator precedence
parsers and present them as clearly as I can. I'll switch out some
terms to (I hope) clarify things. Hopefully I won't offend anyone's
purist sensibilities. I'll be coding in Java, the vulgar Latin of
programming languages. I figure if you can write it in Java, you can
write it in anything.
What We'll Be Making
I'm a learn-by-doing person, which means I'm also a teach-by-doing one.
So to show how Pratt parsers work, we'll build a parser for a [20]tiny
little toy language called Bantam. It just has expressions since that's
where Pratt parsing is really helpful, but that should be enough to
convince of its usefulness.
Even though it's simple, it has a full gamut of operators: prefix (+,
-, ~, !), postfix (!), infix (+, -, *, /, ^), and even a mixfix
conditional operator (?:). It has multiple precedence levels and both
right and left associative operators. It also has assignment, function
calls and parentheses for grouping. If we can parse this, we can parse
anything.
What We'll Start With
All we care about is parsing, so we'll ignore the tokenizing phase. I
slapped together [21]a crude lexer that works and we'll just pretend
that tokens are raining down from heaven or something.
A token is just a chunk of meaningful code with a type and a string
associated with it. Given a + b(c), the tokens would be:
NAME "a"
PLUS "+"
NAME "b"
LEFT_PAREN "("
NAME "c"
RIGHT_PAREN ")"
Likewise, we won't be interpreting or compiling this code. We just want
to parse it to a nice data structure. For our purposes, that means our
parser should chew up a bunch of [22]Token objects and spit out an
instance of some class that implements [23]Expression. To give you an
idea, here's a simplified version of the class for a [24]conditional
expression:
class ConditionalExpression implements Expression {
public ConditionalExpression(
Expression condition,
Expression thenArm,
Expression elseArm) {
this.condition = condition;
this.thenArm = thenArm;
this.elseArm = elseArm;
}
public final Expression condition;
public final Expression thenArm;
public final Expression elseArm;
}
(You gotta love Java's "please sign it in quadruplicate" level of
bureaucracy here. Like I said, if you can do this in Java, you can do
it in any language.)
We'll be building this starting from a simple [25]Parser class. This
owns the token stream, handles lookahead and provides the basic methods
you'll need to write a top-down recursive descent parser with a single
token of lookahead (i.e. it's LL(1)). This is enough to get us going.
If we need more later, it's easy to extend it.
OK, let's build ourselves a parser!
First Things First
Even though a "full" Pratt parser is pretty tiny, I found it to be a
bit hard to decipher. Sort of like [26]quicksort, the implementation is
a deceptively-simple handful of deeply intertwined code. To untangle
it, we'll build it up one tiny step at a time.
The simplest expressions to parse are prefix operators and single-token
ones. For those, the current token tells us all that we need to do.
Bantam has one single-token expression, named variables, and four
prefix operators: +, -, ~, and !. The simplest possible code to parse
that would be:
Expression parseExpression() {
if (match(TokenType.NAME)) // return NameExpression...
else if (match(TokenType.PLUS)) // return prefix + operator...
else if (match(TokenType.MINUS)) // return prefix - operator...
else if (match(TokenType.TILDE)) // return prefix ~ operator...
else if (match(TokenType.BANG)) // return prefix ! operator...
else throw new ParseException();
}
But that's a bit monolithic. As you can see, we're switching off of a
TokenType to branch to different parsing behavior. Let's encode that
directly by making a Map from TokenTypes to chunks of parsing code.
We'll call these chunks "parselets", and they will implement this:
interface PrefixParselet {
Expression parse(Parser parser, Token token);
}
An implementation of this to parse variable names is just:
class NameParselet implements PrefixParselet {
public Expression parse(Parser parser, Token token) {
return new NameExpression(token.getText());
}
}
We can use a single class for all of the prefix operators since they
only differ in the actual operator token itself:
class PrefixOperatorParselet implements PrefixParselet {
public Expression parse(Parser parser, Token token) {
Expression operand = parser.parseExpression();
return new PrefixExpression(token.getType(), operand);
}
}
You'll note that it calls back into parseExpression() to parse the
operand that appears after the operator (i.e. to parse the a in -a).
This recursion takes care of nested operators like -+~!a.
Back in Parser, the chained if statements are replaced with a cleaner
map:
class Parser {
public void register(TokenType token, PrefixParselet parselet) {
mPrefixParselets.put(token, parselet);
}
public Expression parseExpression() {
Token token = consume();
PrefixParselet prefix = mPrefixParselets.get(token.getType());
if (prefix == null) throw new ParseException(
"Could not parse \"" + token.getText() + "\".");
return prefix.parse(this, token);
}
// Other stuff...
private final Map<TokenType, PrefixParselet> mPrefixParselets =
new HashMap<TokenType, PrefixParselet>();
}
To define the grammar we have so far (variables and the four prefix
operators), we'll make this helper method:
void prefix(TokenType token) {
register(token, new PrefixOperatorParselet());
}
And now we can define the grammar like:
register(TokenType.NAME, new NameParselet());
prefix(TokenType.PLUS);
prefix(TokenType.MINUS);
prefix(TokenType.TILDE);
prefix(TokenType.BANG);
This is already an improvement over a recursive descent parser because
our grammar is now more declarative instead of being spread out over a
few imperative functions, and we can see the actual grammar all in one
place. Even better, we can extend the grammar just by registering new
parselets. We don't have to change the Parser class itself.
If we only had prefix expressions, we'd be done now. Alas, we don't.
Stuck In the Middle
What we have so far only works if the first token tells us what kind of
expression we're parsing, but that isn't always the case. With an
expression like a + b, we don't know we have an add expression until
after we parse the a and get to +. We'll have to extend the parser to
support that.
Fortunately, we're in a good place to do so. Our current
parseExpression() method will parse a complete prefix expression
including any nested prefix expressions and then stop. So, if we throw
this at it:
-a + b
It will parse -a and leave us sitting on +. That's exactly the token we
need to tell what infix expression we're parsing. The only difference
between an infix expression and a prefix one here is that there's
another expression before the infix operator that it needs to have as
an argument. Let's define a parselet that supports that:
interface InfixParselet {
Expression parse(Parser parser, Expression left, Token token);
}
The only difference is that left argument, which is just the expression
we parsed before we got to the infix token. We'll wire this up to our
parser by having another table of infix parselets.
Having separate tables for prefix and infix expressions is important
because we'll often have both a prefix and infix parselet for a single
TokenType. For example, the prefix parselet for ( handles grouping in
an expression like a * (b + c). Meanwhile, the infix parselet handles
function calls like a(b).
Now, after we parse the prefix expression, we hand it off to any infix
one that subsumes it:
class Parser {
public void register(TokenType token, InfixParselet parselet) {
mInfixParselets.put(token, parselet);
}
public Expression parseExpression() {
Token token = consume();
PrefixParselet prefix = mPrefixParselets.get(token.getType());
if (prefix == null) throw new ParseException(
"Could not parse \"" + token.getText() + "\".");
Expression left = prefix.parse(this, token);
token = lookAhead(0);
InfixParselet infix = mInfixParselets.get(token.getType());
// No infix expression at this point, so we're done.
if (infix == null) return left;
consume();
return infix.parse(this, left, token);
}
// Other stuff...
private final Map<TokenType, InfixParselet> mInfixParselets =
new HashMap<TokenType, InfixParselet>();
}
Pretty straightforward. We can implement an infix parselet for binary
arithmetic operators like + using something like:
class BinaryOperatorParselet implements InfixParselet {
public Expression parse(Parser parser,
Expression left, Token token) {
Expression right = parser.parseExpression();
return new OperatorExpression(left, token.getType(), right);
}
}
This also works for postfix operators. I'm calling them "infix"
parselets, but they're really "anything but prefix". If there's some
expression that comes before the token, it will be handled by an infix
parselet, and that includes postfix expressions and mixfix ones like
?:.
Postfix is as easy as a single-token prefix parselet: it just takes the
left expression and wraps it in another expression:
class PostfixOperatorParselet implements InfixParselet {
public Expression parse(Parser parser, Expression left,
Token token) {
return new PostfixExpression(left, token.getType());
}
}
Mixfix is easy too. It's pretty much a familiar recursive descent
parser:
class ConditionalParselet implements InfixParselet {
public Expression parse(Parser parser, Expression left,
Token token) {
Expression thenArm = parser.parseExpression();
parser.consume(TokenType.COLON);
Expression elseArm = parser.parseExpression();
return new ConditionalExpression(left, thenArm, elseArm);
}
}
Now we can parse prefix expressions, postfix, infix, and even mixfix.
With a pretty small amount of code, we can parse expressions like a +
(b ? c! : -d). We're done, right? Well... almost.
Excuse You, Aunt Sally
Our parser can parse all of this stuff, but it doesn't parse it with
the right precedence or associativity. If you throw a - b - c at it, it
will parse it like a - (b - c), which isn't right. (Well, actually it
is right--associative that is. We need it to be left.)
And this last step is where Pratt parsers go from pretty nice to
totally radical. We'll make two simple changes. We'll extend
parseExpression() to take a precedence--a number that tells which
expressions can be parsed by that call. If it encounters an expression
whose precedence is lower than we allow, it just stops parsing and
returns what it has so far.
To make that check we need to know the precedence of any given infix
expression. We'll do that by letting the parselet specify it:
public interface InfixParselet {
Expression parse(Parser parser, Expression left, Token token);
int getPrecedence();
}
Using that, our core expression parser becomes:
public Expression parseExpression(int precedence) {
Token token = consume();
PrefixParselet prefix = mPrefixParselets.get(token.getType());
if (prefix == null) throw new ParseException(
"Could not parse \"" + token.getText() + "\".");
Expression left = prefix.parse(this, token);
while (precedence < getPrecedence()) {
token = consume();
InfixParselet infix = mInfixParselets.get(token.getType());
left = infix.parse(this, left, token);
}
return left;
}
That relies on a tiny helper function to get the precedence of the
current token or default if there's no infix parselet for it:
private int getPrecedence() {
InfixParselet parser = mInfixParselets.get(
lookAhead(0).getType());
if (parser != null) return parser.getPrecedence();
return 0;
}
And that's it. To use this, we'll set up a little precedence table:
public class Precedence {
public static final int ASSIGNMENT = 1;
public static final int CONDITIONAL = 2;
public static final int SUM = 3;
public static final int PRODUCT = 4;
public static final int EXPONENT = 5;
public static final int PREFIX = 6;
public static final int POSTFIX = 7;
public static final int CALL = 8;
}
To make our operators correctly handle precedence, they'll pass an
appropriate value back into parseExpression() when they call it
recursively. For example, the [27]BinaryOperatorParselet instance that
handles the + operator will pass in Precedence.SUM when it parses its
right-hand operand.
Associativity is easy too. If an infix parselet calls parseExpression()
with the same precedence that it returns for its own getPrecedence()
call, you'll get left associativity. To be right-associative, it just
needs to pass in one less than that instead.
Go Forth and Multiply
I've rewritten the [28]parser for Magpie using this and it worked like
a charm. I'm also working on a JavaScript parser based on this and
again it's been a great fit.
I find parsing like this to be simple, terse, extensible (Magpie, for
example, uses this to [29]let you extend its own syntax at runtime),
and easy to read. I'm at the point where I can't imagine writing a
parser any other way. I never thought I'd say this, but parsers are
easy now.
To see for yourself, just take a look at [30]the complete program.
Please enable JavaScript to view the [31]comments powered by Disqus.
[32][dogshot_square.jpg]
Hi! I'm Bob Nystrom, the one on the left.
I wrote a book called [33]Game Programming Patterns. I'm working on
another book called [34]Crafting Interpreters.
You can email me at robert at this site or follow me on twitter at
[35]@munificentbob.
Elsewhere
* Code at [36]github
* Tweets at [37]twitter
* Photos at [38]500px
* Photos at [39]flickr
Categories
* [40]code 67
* [41]language 43
* [42]magpie 24
* [43]c-sharp 13
* [44]dart 13
* [45]game-dev 12
* [46]java 10
* [47]cpp 8
* [48]design 7
* [49]game-patterns 6
* [50]parsing 6
* [51]roguelike 6
* [52]book 5
* [53]go 5
* [54]js 4
* [55]personal 4
* [56]c 3
* [57]finch 3
* [58]python 3
* [59]ruby 3
* [60]blog 2
* [61]f-sharp 2
* [62]lua 2
* [63]music 2
* [64]ai 1
* [65]beta 1
* [66]blogofile 1
* [67]game 1
* [68]jasic 1
* [69]javascript 1
* [70]oop 1
* [71]optimization 1
* [72]oscon 1
* [73]politics 1
* [74]scheme 1
* [75]typescript 1
* [76]visualization 1
All [77]76 articles...
This blog is built using [78]jekyll. The source repo for it is
[79]here.
© 2008-2021 Robert Nystrom
References
Visible links:
1. https://journal.stuffwithstuff.com/rss.xml
2. https://journal.stuffwithstuff.com/atom.xml
3. https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
4. https://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
5. https://journal.stuffwithstuff.com/2011/04/21/multimethods-multiple-inheritance-multiawesome/
6. https://journal.stuffwithstuff.com/archive
7. https://journal.stuffwithstuff.com/category/code
8. https://journal.stuffwithstuff.com/category/java
9. https://journal.stuffwithstuff.com/category/js
10. https://journal.stuffwithstuff.com/category/language
11. https://journal.stuffwithstuff.com/category/magpie
12. https://journal.stuffwithstuff.com/category/parsing
13. http://xkcd.com/208/
14. http://en.wikipedia.org/wiki/Heap_%28data_structure%29
15. http://en.wikipedia.org/wiki/Vaughan_Pratt
16. http://en.wikipedia.org/wiki/Recursive_descent
17. http://www.jslint.com/
18. http://javascript.crockford.com/tdop/tdop.html
19. http://effbot.org/zone/simple-top-down-parsing.htm
20. https://github.com/munificent/bantam
21. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Lexer.java
22. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Token.java
23. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/Expression.java
24. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/expressions/ConditionalExpression.java
25. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/Parser.java
26. http://en.wikipedia.org/wiki/Quicksort
27. https://github.com/munificent/bantam/blob/master/src/com/stuffwithstuff/bantam/parselets/BinaryOperatorParselet.java
28. https://github.com/munificent/magpie/blob/master/src/com/stuffwithstuff/magpie/parser/MagpieParser.java
29. http://journal.stuffwithstuff.com/2011/02/13/extending-syntax-from-within-a-language/
30. http://github.com/munificent/bantam
31. http://disqus.com/?ref_noscript
32. https://journal.stuffwithstuff.com/
33. http://gameprogrammingpatterns.com/
34. http://craftinginterpreters.com/
35. https://twitter.com/intent/user?screen_name=munificentbob
36. http://github.com/munificent
37. http://twitter.com/munificentbob
38. https://500px.com/munificent
39. http://www.flickr.com/photos/bobisbob/
40. https://journal.stuffwithstuff.com/category/code
41. https://journal.stuffwithstuff.com/category/language
42. https://journal.stuffwithstuff.com/category/magpie
43. https://journal.stuffwithstuff.com/category/c-sharp
44. https://journal.stuffwithstuff.com/category/dart
45. https://journal.stuffwithstuff.com/category/game-dev
46. https://journal.stuffwithstuff.com/category/java
47. https://journal.stuffwithstuff.com/category/cpp
48. https://journal.stuffwithstuff.com/category/design
49. https://journal.stuffwithstuff.com/category/game-patterns
50. https://journal.stuffwithstuff.com/category/parsing
51. https://journal.stuffwithstuff.com/category/roguelike
52. https://journal.stuffwithstuff.com/category/book
53. https://journal.stuffwithstuff.com/category/go
54. https://journal.stuffwithstuff.com/category/js
55. https://journal.stuffwithstuff.com/category/personal
56. https://journal.stuffwithstuff.com/category/c
57. https://journal.stuffwithstuff.com/category/finch
58. https://journal.stuffwithstuff.com/category/python
59. https://journal.stuffwithstuff.com/category/ruby
60. https://journal.stuffwithstuff.com/category/blog
61. https://journal.stuffwithstuff.com/category/f-sharp
62. https://journal.stuffwithstuff.com/category/lua
63. https://journal.stuffwithstuff.com/category/music
64. https://journal.stuffwithstuff.com/category/ai
65. https://journal.stuffwithstuff.com/category/beta
66. https://journal.stuffwithstuff.com/category/blogofile
67. https://journal.stuffwithstuff.com/category/game
68. https://journal.stuffwithstuff.com/category/jasic
69. https://journal.stuffwithstuff.com/category/javascript
70. https://journal.stuffwithstuff.com/category/oop
71. https://journal.stuffwithstuff.com/category/optimization
72. https://journal.stuffwithstuff.com/category/oscon
73. https://journal.stuffwithstuff.com/category/politics
74. https://journal.stuffwithstuff.com/category/scheme
75. https://journal.stuffwithstuff.com/category/typescript
76. https://journal.stuffwithstuff.com/category/visualization
77. https://journal.stuffwithstuff.com/archive
78. http://jekyllrb.com/
79. https://github.com/munificent/journal
Hidden links:
81. https://www.reddit.com/submit?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//
82. https://news.ycombinator.com/submitlink?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy//&t=Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy
83. http://twitter.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/&text=%22Pratt%20Parsers:%20Expression%20Parsing%20Made%20Easy%22%20%40munificentbob
84. http://www.facebook.com/share.php?u=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
85. https://plus.google.com/share?url=http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
86. https://journal.stuffwithstuff.com/rss.xml
|