miniany/doc/zserge.com_blog_cucu-part1.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412

<!DOCTYPE html>
<html lang="en-US">
	<head><script type="text/javascript" src="https://web-static.archive.org/_static/js/bundle-playback.js?v=t1Bf4PY_" charset="utf-8"></script>
<script type="text/javascript" src="https://web-static.archive.org/_static/js/wombat.js?v=txqj7nKC" charset="utf-8"></script>
<script>window.RufflePlayer=window.RufflePlayer||{};window.RufflePlayer.config={"autoplay":"on","unmuteOverlay":"hidden"};</script>
<script type="text/javascript" src="https://web-static.archive.org/_static/js/ruffle.js"></script>
<script type="text/javascript">
  __wm.init("http://web.archive.org/web");
  __wm.wombat("http://zserge.com:80/blog/cucu-part1.html","20160518154155","http://web.archive.org/","web","https://web-static.archive.org/_static/",
	      "1463586115");
</script>
<link rel="stylesheet" type="text/css" href="https://web-static.archive.org/_static/css/banner-styles.css?v=S1zqJCYt" />
<link rel="stylesheet" type="text/css" href="https://web-static.archive.org/_static/css/iconochive.css?v=qtvMKcIJ" />
<!-- End Wayback Rewrite JS Include -->

		<meta charset="UTF-8"/>
		<title>cucu: a compiler you can understand (1/3)</title>
		<meta content="Compilers is fun. Want to write your own one?" name="description"/>
		<meta content="Serge Zaitsev" name="author"/>
		<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
		<meta content="width=device-width" name="viewport"/>
		<link href="http://web.archive.org/web/20160518154155/http://zserge.com/rss.xml" rel="alternate" title="RSS" type="application/rss+xml"/>
		<link href="blog/cucu-part1.html" rel="canonical"/>		<!-- OpenGraph data -->
		<meta content="cucu: a compiler you can understand (1/3)" property="og:title"/>
		<meta content="article" property="og:type"/>
		<meta content="blog/cucu-part1.html" property="og:url"/>
		<meta content="http://web.archive.org/web/20160518154155im_/http://zserge.com/logo.png" property="og:image"/>
		<meta content="Compilers is fun. Want to write your own one?" property="og:description"/>
		<meta content="en_US" property="og:locale"/>		<!-- Twitter card data -->
		<meta content="summary" name="twitter:card"/>
		<meta content="@zsergo" name="twitter:site"/>		<!-- Fonts -->
		<link href="//web.archive.org/web/20160518154155cs_/http://fonts.googleapis.com/css?family=Merriweather:900,900italic,300,300italic" rel="stylesheet" type="text/css"/>
		<link href="//web.archive.org/web/20160518154155cs_/http://fonts.googleapis.com/css?family=Roboto+Mono:100,200,300,400,400italic,700,700italic&amp;subset=latin,latin-ext" rel="stylesheet" type="text/css"/>		<!-- Styles -->
		<link href="/web/20160518154155cs_/http://zserge.com/styles.css" rel="stylesheet" type="text/css"/>		<!-- Favicons -->
		<link href="/web/20160518154155im_/http://zserge.com/favicon.ico" rel="shortcut icon"/>
		<link href="/web/20160518154155im_/http://zserge.com/favicon.128.png" rel="apple-touch-icon-precomposed"/>
	</head>
	<body>
		<header>
			<nav>
				<a class="logo" href="/web/20160518154155/http://zserge.com/">Z</a>
			</nav>
			<div class="empty"></div>
			<nav>
				<section>
					<a href="/web/20160518154155/http://zserge.com/about.html">about</a>
					<a href="/web/20160518154155/http://zserge.com/blog.html">posts</a>
				</section>
				<section>
					<a href="http://web.archive.org/web/20160518154155/https://twitter.com/zsergo">@me</a>
					<a href="http://web.archive.org/web/20160518154155/https://plus.google.com/u/0/+SergeZaitsev">+me</a>
					<a href="http://web.archive.org/web/20160518154155/https://github.com/zserge">&lt;/&gt;me</a>
				</section>
			</nav>
		</header>
		<h1>cucu: a compiler you can understand (part&nbsp;1)</h1>

<p>Let talk about the compilers. Have you ever thought of writing your own one?</p>

<p>I will try to show you how simple it is. The first part will be pretty much
theoretical, so keep patience.</p>

<h2>what we&rsquo;re going to achieve?</h2>

<p>CUCU is a toy compiler for a toy language. I want it to be as close to ANSI C
as possible, so that every valid CUCU program could be compiled with a C
compiler without any errors. Of course, the support of the whole ANSI C
standard is very difficult, so I picked a very small C language subset.</p>

<p>For example, here&rsquo;s a valid CUCU code snippet:</p>

<pre><code>int cucu_strlen(char *s) {
    int i = 0;
    while (s[i]) {
        i = i + 1;
    }
    return i;
}
</code></pre>

<h2>grammar</h2>

<p>We&rsquo;re about to define a grammar for our language. It&rsquo;s an important step,
because everything in our compiler design depends on it.</p>

<p>So, let&rsquo;s go from top to bottom. Our source file contains a <strong>program</strong>.
What is a program? We know it&rsquo;s a list of <strong>variable declarations</strong>, <strong>function
declarations</strong> and <strong>function definitions</strong>, e.g:</p>

<pre><code>int func(char *s, int len); /* function declaration */
int i;                      /* variable declaration */

int func(char *s, int len) { /* function definition */
    ...
}
</code></pre>

<p>Let&rsquo;s try to write it down in <a href="http://web.archive.org/web/20160518154155/http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form">EBNF</a> form (it&rsquo;s absolutely ok, if you don&rsquo;t
know what EBNF is, it&rsquo;s really intuitive):</p>

<pre><code>&lt;program&gt; ::= { &lt;var-decl&gt; | &lt;func-decl&gt; | &lt;func-def&gt; } ;
</code></pre>

<p>This notation says: &ldquo;a program is a repeating sequence of variable declarations,
function declarations and function definitions. But what is all those
declarations and definitions? Ok, let&rsquo;s go deeper:</p>

<pre><code>&lt;var-decl&gt; ::= &lt;type&gt; &lt;ident&gt; &quot;;&quot;
&lt;func-decl&gt; ::= &lt;type&gt; &lt;ident&gt; &quot;(&quot; &lt;func-args&gt; &quot;)&quot; &quot;;&quot;
&lt;func-def&gt; ::= &lt;type&gt; &lt;ident&gt; &quot;(&quot; &lt;func-args&gt; &quot;)&quot; &lt;func-body&gt;
&lt;func-args&gt; ::= { &lt;type&gt; &lt;ident&gt; &quot;,&quot; }
&lt;type&gt; ::= &quot;int&quot; | &quot;char *&quot;
</code></pre>

<p>So, variable declaration is simple: it&rsquo;s a type name followed by the identifier,
and followed by a semicolon, like we usually do in C, e.g.:</p>

<pre><code>int i;
char *s;
</code></pre>

<p>Function declaration is a bit more complicated. It&rsquo;s a &ldquo;type + identifier&rdquo;,
and an optional list of function arguments <code>&lt;func-args&gt;</code> inside the braces.</p>

<p>Function arguments list, in it&rsquo;s turn, is a sequence of comma-separated
&ldquo;type + identifier&rdquo;, like:</p>

<pre><code>char *s, int from, int to
</code></pre>

<p><em>Actually, trailing comma in CUCU is still allowed, but not required. It will
just simplify our parser code.</em></p>

<p>The supported data types are only <code>int</code> and <code>char *</code>. Identifier is a sequence
of letters, digits and an underscore symbol.</p>

<p>The only thing left is <code>&lt;func-body&gt;</code>. But let&rsquo;s talk about <strong>statements</strong> and
<strong>expressions</strong> first.</p>

<p><em>Statement</em> is a smallest standalone element of the language. Here are valid
statements in out language:</p>

<pre><code>/* These are simple statements */
i = 2 + 3; /* assignment statement */
my_func(i); /* function call statement */
return i; /* return statement */

/* These are compound statements */
if (x &gt; 0) { .. } else { .. }
while (x &gt; 0) { .. }
</code></pre>

<p><em>Expression</em> is a smaller part of the statement. As opposed to statements,
expressions always return a value.  Usually, it&rsquo;s just the arithmetic. For
example in the statement <code>func(x[2], i + j)</code> the expressions are <code>x[2]</code> and
<code>i+j</code>.</p>

<p>So, looking back at <code>&lt;func-body&gt;</code>. It&rsquo;s just a valid statement (which is
usually a block statement).</p>

<p>Here&rsquo;s what we have:</p>

<pre><code>&lt;func-body&gt; ::= &lt;statement&gt;
&lt;statement&gt; ::= &quot;{&quot; { &lt;statement&gt; } &quot;}&quot;                /* block statement */
                | [&lt;type&gt;] &lt;ident&gt; [ &quot;=&quot; &lt;expr&gt; ] &quot;;&quot;  /* assignment */
                | &quot;return&quot; &lt;expr&gt; &quot;;&quot;
                | &quot;if&quot; &quot;(&quot; &lt;expr&gt; &quot;)&quot; &lt;statement&gt; [ &quot;else&quot; &lt;statement&gt; ]
                | &quot;while&quot; &quot;(&quot; &lt;expr&gt; &quot;)&quot; &lt;statement&gt;
                | &lt;expr&gt; &quot;;&quot;
</code></pre>

<p>Here are possible expressions in CUCU language:</p>

<pre><code>&lt;expr&gt; ::= &lt;bitwise-expr&gt; 
           | &lt;bitwise-expr&gt; = &lt;expr&gt;
&lt;bitwise-expr&gt; ::= &lt;eq-expr&gt;
                   | &lt;bitwise-expr&gt; &amp; &lt;eq-expr&gt;
                   | &lt;bitwise-expr&gt; | &lt;eq-expr&gt;
&lt;eq-expr&gt; ::= &lt;rel-expr&gt;
              | &lt;eq-expr&gt; == &lt;rel-expr&gt;
              | &lt;eq-expr&gt; != &lt;rel-expr&gt;
&lt;rel-expr&gt; ::= &lt;shift-expr&gt;
               | &lt;rel-expr&gt; &lt; &lt;shift-expr&gt;
&lt;shift-expr&gt; ::= &lt;add-expr&gt;
                 | &lt;shift-expr&gt; &lt;&lt; &lt;add-expr&gt;
                 | &lt;shift-expr&gt; &gt;&gt; &lt;add-expr&gt;
&lt;add-expr&gt; ::= &lt;postfix-expr&gt;
               | &lt;add-expr&gt; + &lt;postfix-expr&gt;
               | &lt;add-expr&gt; - &lt;postfix-expr&gt;
&lt;postfix-expr&gt; ::= &lt;prim-expr&gt;
                   | &lt;postfix-expr&gt; [ &lt;expr&gt; ]
                   | &lt;postfix-expr&gt; ( &lt;expr&gt; { &quot;,&quot; &lt;expr&gt; } )
&lt;prim-expr&gt; := &lt;number&gt; | &lt;ident&gt; | &lt;string&gt; | &quot;(&quot; &lt;expr&gt; &quot;)&quot;
</code></pre>

<p>That&rsquo;s it. Did you notice the recursion in the expression notation?  Basically,
the notation above shows us the precedence of the operators from bottom to top:
parens and square brackets are evaluated first, and assignment goes the last.</p>

<p>For example, according to this grammar an expression <code>8&gt;&gt;1+1</code>
will be evaluated to 2 (like in <code>8&gt;&gt;(1+1)</code>), not to 5 (like in <code>(8&gt;&gt;1)+1</code>),
because <code>&gt;&gt;</code> has lower precedence than <code>+</code>.</p>

<h2>lexer</h2>

<p>Now we are done with grammar, and are ready to start. The first thing to do is
a lexer. Our compiler takes a stream of bytes as an input, and lexer allows to
split that stream into smaller tokens, that can be processed later. It gives us
some level of abstraction and simplifies out parser.</p>

<p>For example, a sequence of bytes &ldquo;int i = 2+31;&rdquo; will be split into tokens:</p>

<pre><code>int
i
=
2
+
31
;
</code></pre>

<p><em>In normal grown-up lexers every lexeme (token) has a type and a value, so
instead of the list above we will get a list of pairs: <code>&lt;TYPE:int&gt;,&lt;ID:i&gt;,
&lt;ASSIGN:=&gt;,&lt;NUM:2&gt;,&lt;PLUS:+&gt;,&lt;NUM:31&gt;,&lt;SEMI:;&gt;</code>. We are going to detect lexeme
type by its value, which is not academical at all!</em></p>

<p>The major problem with the lexer is that once a byte is read from the stream -
it can not be &ldquo;un-read&rdquo;. And what if we&rsquo;ve read a byte that can not be added to
the current token? Where should we store it until the current token is
processed?</p>

<p>Almost every lexer need to read ahead. Our grammar is simple enough, so all we
need to have is a single byte buffer - <code>nextc</code>. It stores a byte, that was read
from the stream, but has not yet been pushed to the token string.</p>

<p>Also, I need to warn you - I use global variables a lot in CUCU code. I know
it&rsquo;s a bad style, but if I pass all values as function arguments the compiler
would loose it&rsquo;s simplicity.</p>

<p>The whole lexer is just a single function <code>readtok()</code>. The algorithm is simple:</p>

<ul>
<li>skip leading spaces</li>
<li>try to read an identifier (a sequence of letters, digits and <code>_</code>)</li>
<li>if it&rsquo;s not an identifier - try to read a sequence of special operators, like
<code>&amp;, |, &lt;, &gt;, =, !</code>.</li>
<li>if it&rsquo;s not an operator - try to read a string literal &ldquo;&hellip;.&rdquo; or &lsquo;&hellip;.&rsquo;</li>
<li>if failed - maybe it&rsquo;s a comment, like <code>/* ... */</code>?</li>
<li>if failed again - just read a single byte. It might be another single-byte
token, like &ldquo;(&rdquo; or &ldquo;[&ldquo;.</li>
</ul>

<p>Here&rsquo;s the whole CUCU sources we&rsquo;ve got so far:</p>

<pre><code>#include &lt;stdio.h&gt; /* for vpritnf */
#include &lt;stdarg.h&gt; /* for va_list */
#include &lt;stdlib.h&gt; /* for exit() */
#include &lt;ctype.h&gt; /* for isspace, isalpha... */

#define MAXTOKSZ 256
static FILE *f; /* input file */
static char tok[MAXTOKSZ];
static int tokpos;
static int nextc;

void readchr() {
    if (tokpos == MAXTOKSZ - 1) {
        tok[tokpos] = '\0';
        fprintf(stderr, &quot;token too long: %s\n&quot;, tok);
        exit(EXIT_FAILURE);
    }
    tok[tokpos++] = nextc;
    nextc = fgetc(f);
}

void readtok() {
    for (;;) {
        while (isspace(nextc)) {
            nextc = fgetc(f);
        }
        tokpos = 0;
        while(isalnum(nextc) || nextc == '_') {
            readchr();
        }
        if (tokpos == 0) {
            while (nextc == '&lt;' || nextc == '=' || nextc == '&gt;'
                    || nextc == '!' || nextc == '&amp;' || nextc == '|') {
                readchr();
            }
        }
        if (tokpos == 0) {
            if (nextc == '\'' || nextc == '&quot;') {
                char c = nextc;
                readchr();
                while (nextc != c) {
                    readchr();
                }
                readchr();
            } else if (nextc == '/') {
                readchr();
                if (nextc == '*') {
                    nextc = fgetc(f);
                    while (nextc != '/') {
                        while (nextc != '*') {
                            nextc = fgetc(f);
                        }
                        nextc = fgetc(f);
                    }
                    nextc = fgetc(f);
                }
            } else if (nextc != EOF) {
                readchr();
            }
        }
        break;
    }
    tok[tokpos] = '\0';
}

int main() {
    f = stdin;
    nextc = fgetc(f);

    for (;;) {
        readtok();
        printf(&quot;TOKEN: %s\n&quot;, tok);
        if (tok[0] == '\0') break;
    }
    return 0;
}
</code></pre>

<p>If we put a C file as the compiler input - it will print us a list of tokens,
one per line.</p>

<p>Ok, have a cup of coffee, and let&rsquo;s <a href="/web/20160518154155/http://zserge.com/blog/cucu-part2.html">go further &rarr;</a></p>

<p><em>You can check <a href="/web/20160518154155/http://zserge.com/blog/cucu-part3.html">part3</a> to know how the story ended.</em></p>

		<p>Posted on 2012-10-23</p>
		<section class="social">
			<section>
				<a href="http://web.archive.org/web/20160518154155/http://www.facebook.com/share.php?u=http://zserge.com/blog/cucu-part1.html&amp;title=cucu:%20a%20compiler%20you%20can%20understand%20%281/3%29" style="background-color: #3b5997">like</a>
				&nbsp;
				<a href="http://web.archive.org/web/20160518154155/http://twitter.com/intent/tweet?status=cucu:%20a%20compiler%20you%20can%20understand%20%281/3%29+http://zserge.com/blog/cucu-part1.html%20via%20@zsergo" style="background-color: #41b7d8">tweet</a>
				&nbsp;
				<a href="http://web.archive.org/web/20160518154155/https://plus.google.com/share?url=http://zserge.com/blog/cucu-part1.html" style="background-color: #d64937">+1</a>
			</section>
			<section>
				<a href="/web/20160518154155/http://zserge.com/rss.xml" style="background-color: #f26522">rss</a>
				&nbsp;
				<a href="http://web.archive.org/web/20160518154155/https://twitter.com/zsergo" style="background-color: #41b7d8">@me</a>
				&nbsp;
				<a href="http://web.archive.org/web/20160518154155/https://plus.google.com/u/0/+SergeZaitsev" style="background-color: #d64937">+me</a>
				&nbsp;
				<a href="http://web.archive.org/web/20160518154155/https://github.com/zserge" style="background-color: #333333">&lt;/&gt;me</a>
			</section>
		</section>
		<section><div id="disqus_thread"></div><script type="text/javascript">
var disqus_shortname='zserge';
	(function() {
		var dsq = document.createElement('script');
		dsq.type = 'text/javascript';
		dsq.async = true;
		dsq.src = 'http://' + disqus_shortname + '.disqus.com/embed.js';
		(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
		})();
	</script>
<noscript>
</noscript>

</section>
		<footer>
			<p>
				&copy;2012&ndash;2015 &middot;
				<a href="http://web.archive.org/web/20160518154155/http://zserge.com/">Serge Zaitsev</a>
				&middot; 
				<a href="http://web.archive.org/web/20160518154155/mailto:zaitsev.serge@gmail.com">zaitsev.serge@gmail.com</a>
			</p>
		</footer>
		<script>
	(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
		(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
	m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
	})(window,document,'script','//web.archive.org/web/20160518154155/http://www.google-analytics.com/analytics.js','ga');
	ga('create', 'UA-33644825-1', 'zserge.com');
	ga('send', 'pageview');
</script>


	</body>
</html>
<!--
     FILE ARCHIVED ON 15:41:55 May 18, 2016 AND RETRIEVED FROM THE
     INTERNET ARCHIVE ON 17:21:38 Jan 12, 2024.
     JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.

     ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.
     SECTION 108(a)(3)).
-->
<!--
playback timings (ms):
  captures_list: 442.642
  exclusion.robots: 0.127
  exclusion.robots.policy: 0.117
  cdx.remote: 0.059
  esindex: 0.009
  LoadShardBlock: 405.453 (3)
  PetaboxLoader3.datanode: 111.094 (4)
  load_resource: 277.069
  PetaboxLoader3.resolve: 202.183
-->