miniany/doc/Writing a C Compiler, Part 3.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584

<!DOCTYPE html>
<!-- saved from url=(0058)https://norasandler.com/2017/12/15/Write-a-Compiler-3.html -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>Writing a C Compiler, Part 3</title>
  <meta name="description" content="This is the third post in a series. Read part 1 here.">

  <link rel="stylesheet" href="./Writing a C Compiler, Part 3_files/main.css">
  <link rel="canonical" href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html">
  <link rel="alternate" type="application/rss+xml" title="Nora Sandler" href="https://norasandler.com/feed.xml">
  
</head>


  <body>

    <header class="site-header" role="banner">

  <div class="wrapper">
    
    
    <a class="site-title" href="https://norasandler.com/">Nora Sandler</a>
  
    
      <nav class="site-nav">
        <input type="checkbox" id="nav-trigger" class="nav-trigger">
        <label for="nav-trigger">
          <span class="menu-icon">
            <svg viewBox="0 0 18 15" width="18px" height="15px">
              <path fill="#424242" d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.031C17.335,0,18,0.665,18,1.484L18,1.484z"></path>
              <path fill="#424242" d="M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0c0-0.82,0.665-1.484,1.484-1.484 h15.031C17.335,6.031,18,6.696,18,7.516L18,7.516z"></path>
              <path fill="#424242" d="M18,13.516C18,14.335,17.335,15,16.516,15H1.484C0.665,15,0,14.335,0,13.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.031C17.335,12.031,18,12.696,18,13.516L18,13.516z"></path>
            </svg>
          </span>
        </label>

        <div class="trigger">
          
            
              <a class="page-link" href="https://norasandler.com/about/">About</a>
              
            
              <a class="page-link" href="https://norasandler.com/archive/">Archive</a>
              
            
          <a class="page-link" href="https://github.com/nlsandler">Github</a>
          <a href="https://norasandler.com/feed.xml"><img id="rss" height="20" width="20" src="./Writing a C Compiler, Part 3_files/rss.png"></a>

        </div>
      </nav>
    
  </div>
</header>


    <main class="page-content" aria-label="Content">
      <div class="wrapper">
        <article class="post h-entry" itemscope="" itemtype="http://schema.org/BlogPosting">

  <header class="post-header">
    <h1 class="post-title p-name" itemprop="name headline">Writing a C Compiler, Part 3</h1>
    <p class="post-meta">
      <time class="dt-published" datetime="2017-12-15T20:30:00+00:00" itemprop="datePublished">Dec 15, 2017
      </time></p>
  </header>

  <div class="post-content e-content" itemprop="articleBody">
    <p><em>This is the third post in a series. Read part 1 <a href="https://norasandler.com/2017/11/29/Write-a-Compiler.html">here</a>.</em></p>

<p>This week we’ll add binary operations to support basic arithmetic. We’ll figure out how to correctly handle operator precedence and associativity. You can find the accompanying tests <a href="https://github.com/nlsandler/write_a_c_compiler">here</a>.</p>

<h1 id="week-3-binary-operators">Week 3: Binary Operators</h1>

<p>This week we’re adding several binary operations (operators that take two values):</p>

<ul>
  <li>Addition <code class="language-plaintext highlighter-rouge">+</code></li>
  <li>Subtraction <code class="language-plaintext highlighter-rouge">-</code></li>
  <li>Multiplication <code class="language-plaintext highlighter-rouge">*</code></li>
  <li>Division <code class="language-plaintext highlighter-rouge">/</code></li>
</ul>

<p>As usual, we’ll update each stage of the compiler to support these operations.</p>

<h2 id="lexing">Lexing</h2>

<p>Each of the operators above will require a new token, except for subtraction – we already have a <code class="language-plaintext highlighter-rouge">-</code> token. It gets tokenized the same way whether it’s a subtraction or negation operator; we’ll figure out how to interpret it during the parsing stage. Arithmetic expressions can also contain parentheses, but we already have tokens for those too, so we don’t need to change our lexer at all to handle them.</p>

<p>Here’s the full list of tokens we need to support. Tokens from previous weeks are at the top, new tokens are bolded at the bottom:</p>

<ul>
  <li>Open brace <code class="language-plaintext highlighter-rouge">{</code></li>
  <li>Close brace <code class="language-plaintext highlighter-rouge">}</code></li>
  <li>Open parenthesis <code class="language-plaintext highlighter-rouge">(</code></li>
  <li>Close parenthesis <code class="language-plaintext highlighter-rouge">)</code></li>
  <li>Semicolon <code class="language-plaintext highlighter-rouge">;</code></li>
  <li>Int keyword <code class="language-plaintext highlighter-rouge">int</code></li>
  <li>Return keyword <code class="language-plaintext highlighter-rouge">return</code></li>
  <li>Identifier <code class="language-plaintext highlighter-rouge">[a-zA-Z]\w*</code></li>
  <li>Integer literal <code class="language-plaintext highlighter-rouge">[0-9]+</code></li>
  <li>Minus <code class="language-plaintext highlighter-rouge">-</code></li>
  <li>Bitwise complement <code class="language-plaintext highlighter-rouge">~</code></li>
  <li>Logical negation <code class="language-plaintext highlighter-rouge">!</code></li>
  <li><strong>Addition <code class="language-plaintext highlighter-rouge">+</code></strong></li>
  <li><strong>Multiplication <code class="language-plaintext highlighter-rouge">*</code></strong></li>
  <li><strong>Division <code class="language-plaintext highlighter-rouge">/</code></strong></li>
</ul>

<h4 id="-task">☑ Task:</h4>
<p>Update the <em>lex</em> function to handle the new tokens. It should work for all stage 1, 2, and 3 examples in the test suite, including the invalid ones.</p>

<h2 id="parsing">Parsing</h2>

<p>This week we’ll need to add another expression type to our AST: binary operations. Here’s the latest set of definitions for our AST nodes; only the definition of <code class="language-plaintext highlighter-rouge">exp</code> has changed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program = Program(function_declaration)
function_declaration = Function(string, statement) //string is the function name
statement = Return(exp)
exp = BinOp(binary_operator, exp, exp)
    | UnOp(unary_operator, exp)
    | Constant(int)
</code></pre></div></div>

<p>Note that we now distinguish between binary and unary operators in our AST definition. For example, <code class="language-plaintext highlighter-rouge">-</code> as in negation and <code class="language-plaintext highlighter-rouge">-</code> as in subtraction would be different types/classes/whatever in our AST. Here’s how we might construct the AST for <code class="language-plaintext highlighter-rouge">2 - (-3)</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>two = Const(2)
three = Const(3)
neg_three = UnOp(NEG, three)
exp = BinOp(MINUS, two, neg_three)
</code></pre></div></div>

<p>We also need to change the definition of <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> in our grammar. The most obvious definition is something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;exp&gt; &lt;binary_op&gt; &lt;exp&gt; | &lt;unary_op&gt; &lt;exp&gt; | "(" &lt;exp&gt; ")" | &lt;int&gt;
</code></pre></div></div>

<p>But there are several related problems with this grammar:</p>

<ol>
  <li>
    <p><strong>It doesn’t handle operator precedence</strong>. Consider the expression <code class="language-plaintext highlighter-rouge">2 + 3 * 4</code>. Using the grammar above, you can construct two possible parse trees:</p>

    <div class="img-wrapper">
     <div>
         <img src="./Writing a C Compiler, Part 3_files/exp1.svg" aria-describedby="tree-1" alt="Parse tree for expression interpreted as (2 + 3) * 4, outline below.">
         <p class="caption">Tree #1</p>
     </div>
     <div>
         <img src="./Writing a C Compiler, Part 3_files/exp2.svg" aria-describedby="tree-2" alt="Parse tree for expression interpreted as 2 + (3 * 4), outline below.">
         <p class="caption">Tree #2</p>
     </div>
 </div>

    <p>Using the first parse tree, this expression evalutes to <code class="language-plaintext highlighter-rouge">(2 + 3) * 4 = 24</code>. Using the second, it’s <code class="language-plaintext highlighter-rouge">2 + (3 * 4) = 14</code>. According to the C standard and mathematical convention, <code class="language-plaintext highlighter-rouge">*</code> has higher precendence than <code class="language-plaintext highlighter-rouge">+</code>, so the second parse tree is correct. Our grammar has to encode this precedence somehow.</p>

    <p>This is a problem with our unary operations too – according to this grammar, <code class="language-plaintext highlighter-rouge">~2 + 3</code> could be parsed as <code class="language-plaintext highlighter-rouge">~(2 + 3)</code>, which is of course wrong.</p>
  </li>
</ol>

<div id="tree-1" class="screen-reader-only">
  <p>AST tree for (2 + 3) * 4</p>
  <ul>
    <li>expression: <code class="language-plaintext highlighter-rouge">&lt;exp&gt; &lt;binop&gt; &lt;exp&gt;</code>
      <ul>
        <li>expression: <code class="language-plaintext highlighter-rouge">&lt;exp&gt; &lt;binop&gt; &lt;exp&gt;</code>
          <ul>
            <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
              <ul>
                <li>constant: 2</li>
              </ul>
            </li>
            <li>operator: +</li>
            <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
              <ul>
                <li>constant: 3</li>
              </ul>
            </li>
          </ul>
        </li>
        <li>operator: *</li>
        <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
          <ul>
            <li>constant: 4</li>
          </ul>
        </li>
      </ul>
    </li>
  </ul>
</div>
<div id="tree-2" class="screen-reader-only">
  <p>AST tree for 2 + (3 * 4)</p>
  <ul>
    <li>expression: <code class="language-plaintext highlighter-rouge">&lt;exp&gt; &lt;binop&gt; &lt;exp&gt;</code>
      <ul>
        <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
          <ul>
            <li>constant: 2</li>
          </ul>
        </li>
        <li>operator: +</li>
        <li>expression: <code class="language-plaintext highlighter-rouge">&lt;exp&gt; &lt;binop&gt; &lt;exp&gt;</code>
          <ul>
            <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
              <ul>
                <li>constant: 3</li>
              </ul>
            </li>
            <li>operator: *</li>
            <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
              <ul>
                <li>constant: 4</li>
              </ul>
            </li>
          </ul>
        </li>
      </ul>
    </li>
  </ul>
</div>

<ol start="2">
  <li>
    <p><strong>It doesn’t handle associativity</strong>. Operations at the same precedence level should be evaluated left-to-right<sup id="anchor1"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn1">1</a></sup>. For example <code class="language-plaintext highlighter-rouge">1 - 2 - 3</code> should be parsed as <code class="language-plaintext highlighter-rouge">(1 - 2) - 3</code>. But, according to the grammar above, parsing it as <code class="language-plaintext highlighter-rouge">1 - (2 - 3)</code> is also valid.</p>
  </li>
  <li>
    <p><strong>It’s <a href="https://en.wikipedia.org/wiki/Left_recursion">left-recursive</a>.</strong> In the grammar above, one of the production rules for <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> is:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> &lt;exp&gt; &lt;binary_op&gt; &lt;exp&gt;
</code></pre></div>    </div>

    <p>In this production rule, the left-most (i.e. first) symbol is also <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> — that’s what left-recursive means. Left-recursive grammars aren’t <em>incorrect</em>, but recursive descent (RD) parsers can’t handle them<sup id="anchor2"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn2">2</a></sup>. We’ll talk about <em>why</em> this is a problem later in this post.</p>
  </li>
</ol>

<p>Let’s start by tackling problem #1, precedence<sup id="anchor3"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn3">3</a></sup>. We’ll handle unary operators first – they always have higher precedence than binary operators. A unary operator should only be applied to a whole expression if:</p>

<ul>
  <li>the expression is a single integer (e.g. <code class="language-plaintext highlighter-rouge">~4</code>)</li>
  <li>the expression is wrapped in parentheses (e.g. <code class="language-plaintext highlighter-rouge">~(1+1)</code>), or</li>
  <li>the expression is itself a unary operation (e.g. <code class="language-plaintext highlighter-rouge">~!8</code>, <code class="language-plaintext highlighter-rouge">-~(2+2)</code>).</li>
</ul>

<p>To express this, we’re going to need another symbol in our grammar to refer to “an expression a unary operator can be applied to”. We’ll call it a factor. We’ll rewrite our grammar like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;exp&gt; &lt;binary_op&gt; &lt;exp&gt; | &lt;factor&gt;
&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt;
</code></pre></div></div>

<p>We’ve now created two levels of precedence: one for binary operations, one for unary operations. We’re also handling parentheses correctly – putting an expression inside parentheses forces higher precedence.</p>

<p>We can make a similar change to force <code class="language-plaintext highlighter-rouge">*</code> and <code class="language-plaintext highlighter-rouge">/</code> to be higher precedence than <code class="language-plaintext highlighter-rouge">+</code> and <code class="language-plaintext highlighter-rouge">-</code>. We added a <code class="language-plaintext highlighter-rouge">&lt;factor&gt;</code> symbol before, representing the operands of unary operations. Now we’ll add a <code class="language-plaintext highlighter-rouge">&lt;term&gt;</code> symbol, representing the operands of multiplication and division<sup id="anchor4"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn4">4</a></sup>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;exp&gt; ("+" | "-") &lt;exp&gt; | &lt;term&gt;
&lt;term&gt; ::= &lt;term&gt; ("*" | "/") &lt;term&gt; | &lt;factor&gt;
&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt;
</code></pre></div></div>

<p>This grammar encodes operator precedence correctly. There’s now only one possible parse tree for <code class="language-plaintext highlighter-rouge">2 + 3 * 4</code>:</p>

<p><img src="./Writing a C Compiler, Part 3_files/exp_fixed.svg" alt="Parse tree according to new grammar for 2 + (3 * 4), outline below"></p>

<div class="screen-reader-only">
  <ul>
    <li>expression: <code class="language-plaintext highlighter-rouge">&lt;exp&gt; ("+"|"-") &lt;term&gt;</code>
      <ul>
        <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
          <ul>
            <li>constant: 2</li>
          </ul>
        </li>
        <li>operator: +</li>
        <li>expression: <code class="language-plaintext highlighter-rouge">&lt;term&gt; &lt;binop&gt; &lt;factor&gt;</code>
          <ul>
            <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
              <ul>
                <li>constant: 2</li>
              </ul>
            </li>
            <li>operator: *</li>
            <li>expression: <code class="language-plaintext highlighter-rouge">&lt;int&gt;</code>
              <ul>
                <li>constant: 3</li>
              </ul>
            </li>
          </ul>
        </li>
      </ul>
    </li>
  </ul>
</div>

<p>That’s problem #1 solved. Problem #2 was that this grammar didn’t handle associativity.
<em>If</em> you’re not using an RD parser, you generally use left recursive production rules for left-associative operations, and right recursive rules for right-associative operations. In that case, we could rewrite the rule for <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;exp&gt; ("+" | "-") &lt;term&gt; | &lt;term&gt;
</code></pre></div></div>

<p>This would force addition and subtraction to be left-associative; you can’t parse <code class="language-plaintext highlighter-rouge">1 - 2 - 3</code> as <code class="language-plaintext highlighter-rouge">1 - (2 - 3)</code> because <code class="language-plaintext highlighter-rouge">2 - 3</code> isn’t a term.</p>

<p>But we are using an RD parser, so we can’t handle this left recursive rule. To understand why this won’t work, let’s try writing a function to parse expressions according to this rule.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def parse_expression(tokens):
    //determine which of two production rules applies:
    //  * &lt;exp&gt; ("+" | "-") &lt;term&gt;
    //  * &lt;term&gt;
    if is_term(tokens): //how do we figure this out???
        return parse_term(tokens)
    else:
        //recursively call parse_expression to handle it
        e1 = parse_expression(tokens) //recurse forever ☠️
</code></pre></div></div>

<p>To figure out which production rule to use, we can’t just look at the first token or two – we need to know if there’s a <code class="language-plaintext highlighter-rouge">+</code> or <code class="language-plaintext highlighter-rouge">-</code> operation anywhere in this expression. And if we determine that this expression is a sum or difference, we’re going to call <code class="language-plaintext highlighter-rouge">parse_expression</code> recursively forever. In order to not do that, we’d need to find the last <code class="language-plaintext highlighter-rouge">&lt;term&gt;</code> at the end of the expression, parse and remove <em>that</em>, then go back and parse the rest. Both of these problems (figuring out which production rule to use, and parsing the last term at the end) require us to look ahead an arbitrary number of tokens until we hit the end of the expression. You might be able to get this approach to work – I’m not sure if there are existing parsing algorithms similar to this – but it will be complicated, and it definitely won’t be a recursive descent parser. So we’re not going to do that.</p>

<p>If, on the other hand, we just switched around <code class="language-plaintext highlighter-rouge">&lt;term&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> to avoid left recursion, we’d have this rule:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;term&gt; ("+" | "-") &lt;exp&gt; | &lt;term&gt;
</code></pre></div></div>

<p>This is easy to parse but wrong – it’s <em>right</em>-associative. Using this grammar, you would have to parse <code class="language-plaintext highlighter-rouge">1 - 2 - 3</code> as <code class="language-plaintext highlighter-rouge">1 - (2 - 3)</code>.</p>

<p>So our options seem to be an unparseable left recursive grammar, or an incorrect right recursive grammar. Luckily, there’s another solution. We’ll introduce repetition into our grammar, so we can define an expression as a term, possibly plus or minus a term, possibly plus or minus another term…and so on forever. In EBNF notation, wrapping something in curly braces (<code class="language-plaintext highlighter-rouge">{}</code>) means it can be repeated zero or more times. Here’s the <strong>final</strong> grammar we’ll be using for expressions this week:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;exp&gt; ::= &lt;term&gt; { ("+" | "-") &lt;term&gt; }
&lt;term&gt; ::= &lt;factor&gt; { ("*" | "/") &lt;factor&gt; }
&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt;
</code></pre></div></div>

<p>This grammar handles precedence correctly, it isn’t left recursive and it isn’t right-associative. However, it’s not really left-associative either. Before, an <code class="language-plaintext highlighter-rouge">&lt;exp&gt;</code> was a binary operation with two terms – now it has an arbitrary number of terms. If you have a bunch of operations at the same precendence level (like in <code class="language-plaintext highlighter-rouge">1 - 2 - 3</code>), this grammar doesn’t provide any way to group them into sub-expressions.</p>

<p>That’s okay, though! Our grammar doesn’t need to correspond perfectly with our AST. We can still build our AST in a left-associative way. We’ll parse the first term, and then, if there are any more terms, we process them in a loop, constructing a new BinOp node at each iteration. Here’s the pseudocode for this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def parse_expression(toks):
    term = parse_term(toks) //pops off some tokens
    next = toks.peek() //check the next token, but don't pop it off the list yet
    while next == PLUS or next == MINUS: //there's another term!
        op = convert_to_op(toks.next())
        next_term = parse_term(toks) //pops off some more tokens
        term = BinOp(op, term, next_term)
        next = toks.peek()

    return t1
</code></pre></div></div>

<p>We can use exactly the same approach in <code class="language-plaintext highlighter-rouge">parse_term</code>.</p>

<p><code class="language-plaintext highlighter-rouge">parse_factor</code> is straightforward, since it doesn’t have to handle associativity. We’ll look at the first token to identify which production rule to use; pop off constants and make sure they have the value we expect; and call other functions to handle non-terminal symbols.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def parse_factor(toks)
    next = toks.next()
    if next == OPEN_PAREN:
        //&lt;factor&gt; ::= "(" &lt;exp&gt; ")"
        exp = parse_exp(toks) //parse expression inside parens
        if toks.next() != CLOSE_PAREN: //make sure parens are balanced
            fail()
        return exp
    else if is_unop(next)
        //&lt;factor&gt; ::= &lt;unary_op&gt; &lt;factor&gt;
        op = convert_to_op(next)
        factor = parse_factor(toks)
        return UnOp(op, factor)
    else if next.type == "INT":
        //&lt;factor&gt; ::= &lt;int&gt;
        return Const(convert_to_int(next))
    else:
        fail()
</code></pre></div></div>

<p>Just for the sake of completeness, here’s this week’s complete grammar, including the new stuff above (expressions, terms, factors) and stuff that hasn’t changed since last week (functions, statements, etc.):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;program&gt; ::= &lt;function&gt;
&lt;function&gt; ::= "int" &lt;id&gt; "(" ")" "{" &lt;statement&gt; "}"
&lt;statement&gt; ::= "return" &lt;exp&gt; ";"
&lt;exp&gt; ::= &lt;term&gt; { ("+" | "-") &lt;term&gt; }
&lt;term&gt; ::= &lt;factor&gt; { ("*" | "/") &lt;factor&gt; }
&lt;factor&gt; ::= "(" &lt;exp&gt; ")" | &lt;unary_op&gt; &lt;factor&gt; | &lt;int&gt;
</code></pre></div></div>

<h4 id="-task-1">☑ Task:</h4>
<p>Update your expression-parsing code to handle addition, subtraction, multiplication, and division. It should successfully parse all valid stage 1, 2, and 3 examples in the test suite, and fail on all invalid stage 1, 2, and 3 examples. Manually inspect the AST for each example to make sure it handles associativity and operator precedence correctly. If you haven’t written a pretty printer yet, you’ll probably need to do that now.</p>

<h2 id="code-generation">Code Generation</h2>

<p>There’s a new challenge in the code generation stage this week. To handle a binary expression, like <code class="language-plaintext highlighter-rouge">e1 + e2</code>, our generated assembly needs to:</p>

<ul>
  <li>Calculate <code class="language-plaintext highlighter-rouge">e1</code> and save it somewhere.</li>
  <li>Calculate <code class="language-plaintext highlighter-rouge">e2</code>.</li>
  <li>Add <code class="language-plaintext highlighter-rouge">e1</code> to <code class="language-plaintext highlighter-rouge">e2</code>, and store the result in EAX.</li>
</ul>

<p>So, we need somewhere to save the first operand. Saving it in a register would be complicated; the second operand can itself contain subexpressions, so it might also need to save intermediate results in a register, potentially overwriting <code class="language-plaintext highlighter-rouge">e1</code><sup id="anchor5"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn5">5</a></sup>. Instead, we’ll save the first operand on the stack.</p>

<p>Let’s talk about the stack briefly. Every process on a computer has some memory. This memory is divided into several segments, one of which is the <strong>call stack</strong>, or just the stack. The address of the top of the stack is stored in the ESP register, aka the stack pointer. Like with most stacks, you can push things onto the top, or pop things off the top; x86 includes <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code> instructions to do just that. One confusing thing about the stack is that it grows towards <em>lower</em> memory addresses – when you push something onto the stack, you <em>decrement</em> ESP. The processor relies on ESP to figure out where the top of the stack is. So, <code class="language-plaintext highlighter-rouge">pushl val</code> does the following<sup id="anchor6"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn6">6</a></sup>:</p>

<ul>
  <li>Writes <code class="language-plaintext highlighter-rouge">val</code> to the next empty spot on the stack (i.e. ESP - 4)</li>
  <li>Decrements ESP by 4, so it contains the memory address of <code class="language-plaintext highlighter-rouge">val</code>.</li>
</ul>

<p><img src="./Writing a C Compiler, Part 3_files/push_val.svg" alt="On the left, a diagram of the stack value a is on top at memory address 0xbffff980. Other values below a have higher memory addresses. The ESP register holds value 0xbffff980, the address of a. On the right, a diagram representing the stack after &quot;val&quot; is pushed onto it. Now &quot;val&quot; is on top of the stack at address 0xbffff97c, and ESP holds value 0xbffff97c. &quot;a&quot; is just below &quot;val&quot;, at the same address as before."></p>

<p>Along the same lines, <code class="language-plaintext highlighter-rouge">popl dest</code> does the following:</p>

<ul>
  <li>Reads value from the top of the stack (i.e. the value at the memory address in ESP).</li>
  <li>Copies that value into <code class="language-plaintext highlighter-rouge">dest</code>, which is a register or other memory location</li>
  <li>Increments ESP by 4, so it points to the value just below <code class="language-plaintext highlighter-rouge">val</code>.</li>
</ul>

<p>But right now, all you really need to know is that there’s a stack, <code class="language-plaintext highlighter-rouge">push</code> puts things on it and <code class="language-plaintext highlighter-rouge">pop</code> takes things off it.
So, here’s our assembly for <code class="language-plaintext highlighter-rouge">e1 + e2</code>:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="err">&lt;</span><span class="nf">CODE</span> <span class="nv">FOR</span> <span class="nv">e1</span> <span class="nv">GOES</span> <span class="nv">HERE</span><span class="o">&gt;</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">eax</span> <span class="c1">; save value of e1 on the stack</span>
    <span class="err">&lt;</span><span class="nf">CODE</span> <span class="nv">FOR</span> <span class="nv">e2</span> <span class="nv">GOES</span> <span class="nv">HERE</span><span class="o">&gt;</span>
    <span class="nf">pop</span> <span class="o">%</span><span class="nb">ecx</span> <span class="c1">; pop e1 from the stack into ecx</span>
    <span class="nf">addl</span> <span class="o">%</span><span class="nb">ecx</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span> <span class="c1">; add e1 to e2, save results in eax</span>
</code></pre></div></div>

<p>You can handle <code class="language-plaintext highlighter-rouge">e1 * e2</code> exactly the same way, using <code class="language-plaintext highlighter-rouge">imul</code> instead of <code class="language-plaintext highlighter-rouge">addl</code>. Subtraction is a bit more complicated because order of operands matters; <code class="language-plaintext highlighter-rouge">subl src, dst</code> computes <code class="language-plaintext highlighter-rouge">dst - src</code>, and saves the result in <code class="language-plaintext highlighter-rouge">dst</code>. You’ll need to make sure <code class="language-plaintext highlighter-rouge">e1</code> is in <code class="language-plaintext highlighter-rouge">dst</code> and <code class="language-plaintext highlighter-rouge">e2</code> is in <code class="language-plaintext highlighter-rouge">src</code> – and, of course, that the result ends up in EAX. Division is even trickier: <code class="language-plaintext highlighter-rouge">idivl dst</code> treats EDX and EAX as a single, 64-bit register and calculates [EDX:EAX] / <code class="language-plaintext highlighter-rouge">dst</code>. It then stores the quotient in EAX and the remainder in EDX. To make it work, you’ll need to first move <code class="language-plaintext highlighter-rouge">e1</code> into EAX, and then sign-extend it into EDX using the <code class="language-plaintext highlighter-rouge">cdq</code> instruction before issuing the <code class="language-plaintext highlighter-rouge">idivl</code> instruction<sup id="anchor7"><a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#fn7">7</a></sup>. You can check the <a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf">Intel Software Developer’s Manual</a> for more details on any of these instructions.</p>

<h4 id="-task-2">☑ Task:</h4>
<p>Update your code-generation pass to emit correct code for addition, subtraction, division, and multiplication. It should succeed on all valid examples and fail on all invalid examples for stages 1, 2, and 3.</p>

<h2 id="using-gdb">Using GDB</h2>

<p>If you run into trouble during code generation, you may want to step through it with GDB or LLDB. I’m not going to cover how to use GDB here (I list some tutorials under “Further Reading”), but here are a few tips specifically for stepping through assembly without a symbol table:</p>

<ul>
  <li>You can use <code class="language-plaintext highlighter-rouge">nexti</code> and <code class="language-plaintext highlighter-rouge">stepi</code> to step through one assembly instruction at a time.</li>
  <li>In GDB, but not LLDB, <code class="language-plaintext highlighter-rouge">layout asm</code> displays the assembly as you step through it, and <code class="language-plaintext highlighter-rouge">layout regs</code> displays the registers.</li>
  <li>You can set breakpoints at functions (e.g. <code class="language-plaintext highlighter-rouge">b main</code>) even when your binary doesn’t have debug symbols.</li>
</ul>

<p>Also, running GDB on OS X is a pain in the butt. You can find instructions about how to get it working <a href="https://sourceware.org/gdb/wiki/BuildingOnDarwin">here</a>, or you can use LLDB instead. I kind of hate LLDB but maybe I’m just not used to it.</p>

<h2 id="up-next">Up Next</h2>

<p>Next week, I’ll be on vacation. <a href="https://norasandler.com/2017/12/28/Write-a-Compiler-4.html">The week after that</a>, we’ll add more binary operators to support comparisons and boolean logic. See you then!</p>

<p><em>If you have any questions, corrections, or other feedback, you can <a href="mailto:nora@norasandler.com">email me</a> or <a href="https://github.com/nlsandler/write_a_c_compiler/issues">open an issue</a>.</em></p>

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li><a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf">Intel® 64 and IA-32 Architectures Software Developer’s Manual</a> – the definitive reference on all x86 instructions, which will be useful for the rest of the series. Also a good way to impress people at parties when they ask what you’re reading.</li>
  <li><a href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers">Some problems of recursive descent parsers</a> – a nice explanation of how RD parsers handle associativity. Several other articles on the same blog are also worth a read:
    <ul>
      <li><a href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers">Recursive descent, LL and predictive parsers</a> – an overview of RD parsing.</li>
      <li><a href="https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing">Top-Down operator precedence parsing</a> – an intro to Pratt parsers.</li>
      <li><a href="https://eli.thegreenplace.net/2012/08/02/parsing-expressions-by-precedence-climbing">Parsing expressions by precedence climbing</a> – an intro to precedence climbing.</li>
    </ul>
  </li>
  <li><a href="https://courses.cs.washington.edu/courses/cse378/97au/help/gdb-intro.html">An Introduction to GDB</a> – a helpful reference if you haven’t used GDB before.</li>
</ul>

<div class="footnote">
  <p><sup id="fn1">1</sup>
Not all operations in C are left-associative, but all of this week’s binary operations are.<a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor1">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn2">2</sup>
Other types of parsers, like <a href="https://en.wikipedia.org/wiki/LALR_parser">LALR parsers</a>, can handle left recursive grammars just fine.<a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor2">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn3">3</sup>
Alternatively, you can handle operator precedence and associativity is <a href="https://eli.thegreenplace.net/2012/08/02/parsing-expressions-by-precedence-climbing">precedence climbing</a> or <a href="http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/">Pratt parsing</a> – they’re <a href="https://www.oilshell.org/blog/2016/11/01.html">basically the same thing</a>.
Precedence climbing is pretty easy to understand, and it’s a nice option because it requires less boilerplate code and makes it easier to add more operators later on. I didn’t use it in this blog post because I wanted to present the simplest, most vanilla possible solution, but I will probably use it the next time I write a recursive descent parser.
If you decide to use precedence parsing in your own compiler, it shouldn’t impact your ability to follow along with the rest of this series.<a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor3">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn4">4</sup>
Don’t let the notation trip you up here. <code class="language-plaintext highlighter-rouge">()</code> indicates a grouping of tokens, so <code class="language-plaintext highlighter-rouge">("*" | "/")</code> means “either a plus token or a minus token.” <code class="language-plaintext highlighter-rouge">"("</code> or <code class="language-plaintext highlighter-rouge">")"</code>, in quotes, means a literal parenthesis token.<a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor4">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn5">5</sup>
Real compilers usually do save intermediate values in registers, because it’s much faster than saving them on the stack. We won’t do that because allocating registers is complicated and saving them onto the stack is simple.<a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor5">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn6">6</sup>
This all assumes you’re working with 4-byte operands. On 64-bit architectures this instruction typically pushes 8-byte operands, by writing the operand to RSP - 8 and then decrementing RSP by 8. (RSP is the 64-bit equivalent to ESP). <a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor6">↩</a></p>
</div>

<div class="footnote">
  <p><sup id="fn7">7</sup>
An earlier version of this post incorrectly said that you should zero out EDX before the <code class="language-plaintext highlighter-rouge">idivl</code> instruction. Thanks to Borna Cafuk for pointing out the error! <a href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html#anchor7">↩</a></p>
</div>

  </div><a class="u-url" href="https://norasandler.com/2017/12/15/Write-a-Compiler-3.html" hidden=""></a>
</article>

      </div>
    </main>

    <footer class="site-footer">

  <div class="wrapper">
      <div class="footer-col-wrapper">
        <div class="footer-col footer-col-1">
            <div class="rc-scout" data-scout-rendered="true"><p class="rc-scout__text"><i class="rc-scout__logo"></i> Want to become a better programmer? <a class="rc-scout__link" href="https://www.recurse.com/scout/click?t=8f520efbc4be09fb83a71920f53a07b7">Join the Recurse Center!</a></p></div><script async="" defer="" src="./Writing a C Compiler, Part 3_files/loader.js"></script>
        </div>
      </div>
    <div class="footer-col-wrapper">
      <div class="footer-col footer-col-1">
        © 2023 Nora Sandler.
      </div>
    </div>
  </div>

</footer>


<script async="" src="./Writing a C Compiler, Part 3_files/scout-176be16681a03cbddd686f3cc96694d3aab338ea5cb65452f83d989309810528.js"></script><style class="rc-scout__style" type="text/css">.rc-scout {
  display: block;
  padding: 0;
  border: 0;
  margin: 0;
}
.rc-scout__text {
  display: block;
  padding: 0;
  border: 0;
  margin: 0;
  height: 100%;
  font-size: 100%;
}
.rc-scout__logo {
  display: inline-block;
  padding: 0;
  border: 0;
  margin: 0;
  width: 0.85em;
  height: 0.85em;
  background: no-repeat center url('data:image/svg+xml;utf8,%3Csvg%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20viewBox%3D%220%200%2012%2015%22%3E%3Crect%20x%3D%220%22%20y%3D%220%22%20width%3D%2212%22%20height%3D%2210%22%20fill%3D%22%23000%22%3E%3C%2Frect%3E%3Crect%20x%3D%221%22%20y%3D%221%22%20width%3D%2210%22%20height%3D%228%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%222%22%20y%3D%222%22%20width%3D%228%22%20height%3D%226%22%20fill%3D%22%23000%22%3E%3C%2Frect%3E%3Crect%20x%3D%222%22%20y%3D%223%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%233dc06c%22%3E%3C%2Frect%3E%3Crect%20x%3D%224%22%20y%3D%223%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%233dc06c%22%3E%3C%2Frect%3E%3Crect%20x%3D%226%22%20y%3D%223%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%233dc06c%22%3E%3C%2Frect%3E%3Crect%20x%3D%223%22%20y%3D%225%22%20width%3D%222%22%20height%3D%221%22%20fill%3D%22%233dc06c%22%3E%3C%2Frect%3E%3Crect%20x%3D%226%22%20y%3D%225%22%20width%3D%222%22%20height%3D%221%22%20fill%3D%22%233dc06c%22%3E%3C%2Frect%3E%3Crect%20x%3D%224%22%20y%3D%229%22%20width%3D%224%22%20height%3D%223%22%20fill%3D%22%23000%22%3E%3C%2Frect%3E%3Crect%20x%3D%221%22%20y%3D%2211%22%20width%3D%2210%22%20height%3D%224%22%20fill%3D%22%23000%22%3E%3C%2Frect%3E%3Crect%20x%3D%220%22%20y%3D%2212%22%20width%3D%2212%22%20height%3D%223%22%20fill%3D%22%23000%22%3E%3C%2Frect%3E%3Crect%20x%3D%222%22%20y%3D%2213%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%223%22%20y%3D%2212%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%224%22%20y%3D%2213%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%225%22%20y%3D%2212%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%226%22%20y%3D%2213%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%227%22%20y%3D%2212%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%228%22%20y%3D%2213%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3Crect%20x%3D%229%22%20y%3D%2212%22%20width%3D%221%22%20height%3D%221%22%20fill%3D%22%23fff%22%3E%3C%2Frect%3E%3C%2Fsvg%3E');
}
.rc-scout__link:link, .rc-scout__link:visited {
  color: #3dc06c;
  text-decoration: underline;
}
.rc-scout__link:hover, .rc-scout__link:active {
  color: #4e8b1d;
}
</style></body></html>