diff options
Diffstat (limited to 'miniany/doc/zserge.com_blog_cucu-part1.html')
-rw-r--r-- | miniany/doc/zserge.com_blog_cucu-part1.html | 412 |
1 files changed, 412 insertions, 0 deletions
diff --git a/miniany/doc/zserge.com_blog_cucu-part1.html b/miniany/doc/zserge.com_blog_cucu-part1.html new file mode 100644 index 0000000..933db49 --- /dev/null +++ b/miniany/doc/zserge.com_blog_cucu-part1.html @@ -0,0 +1,412 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head><script type="text/javascript" src="https://web-static.archive.org/_static/js/bundle-playback.js?v=t1Bf4PY_" charset="utf-8"></script> +<script type="text/javascript" src="https://web-static.archive.org/_static/js/wombat.js?v=txqj7nKC" charset="utf-8"></script> +<script>window.RufflePlayer=window.RufflePlayer||{};window.RufflePlayer.config={"autoplay":"on","unmuteOverlay":"hidden"};</script> +<script type="text/javascript" src="https://web-static.archive.org/_static/js/ruffle.js"></script> +<script type="text/javascript"> + __wm.init("http://web.archive.org/web"); + __wm.wombat("http://zserge.com:80/blog/cucu-part1.html","20160518154155","http://web.archive.org/","web","https://web-static.archive.org/_static/", + "1463586115"); +</script> +<link rel="stylesheet" type="text/css" href="https://web-static.archive.org/_static/css/banner-styles.css?v=S1zqJCYt" /> +<link rel="stylesheet" type="text/css" href="https://web-static.archive.org/_static/css/iconochive.css?v=qtvMKcIJ" /> +<!-- End Wayback Rewrite JS Include --> + + <meta charset="UTF-8"/> + <title>cucu: a compiler you can understand (1/3)</title> + <meta content="Compilers is fun. Want to write your own one?" name="description"/> + <meta content="Serge Zaitsev" name="author"/> + <meta content="IE=edge" http-equiv="X-UA-Compatible"/> + <meta content="width=device-width" name="viewport"/> + <link href="http://web.archive.org/web/20160518154155/http://zserge.com/rss.xml" rel="alternate" title="RSS" type="application/rss+xml"/> + <link href="blog/cucu-part1.html" rel="canonical"/> <!-- OpenGraph data --> + <meta content="cucu: a compiler you can understand (1/3)" property="og:title"/> + <meta content="article" property="og:type"/> + <meta content="blog/cucu-part1.html" property="og:url"/> + <meta content="http://web.archive.org/web/20160518154155im_/http://zserge.com/logo.png" property="og:image"/> + <meta content="Compilers is fun. Want to write your own one?" property="og:description"/> + <meta content="en_US" property="og:locale"/> <!-- Twitter card data --> + <meta content="summary" name="twitter:card"/> + <meta content="@zsergo" name="twitter:site"/> <!-- Fonts --> + <link href="//web.archive.org/web/20160518154155cs_/http://fonts.googleapis.com/css?family=Merriweather:900,900italic,300,300italic" rel="stylesheet" type="text/css"/> + <link href="//web.archive.org/web/20160518154155cs_/http://fonts.googleapis.com/css?family=Roboto+Mono:100,200,300,400,400italic,700,700italic&subset=latin,latin-ext" rel="stylesheet" type="text/css"/> <!-- Styles --> + <link href="/web/20160518154155cs_/http://zserge.com/styles.css" rel="stylesheet" type="text/css"/> <!-- Favicons --> + <link href="/web/20160518154155im_/http://zserge.com/favicon.ico" rel="shortcut icon"/> + <link href="/web/20160518154155im_/http://zserge.com/favicon.128.png" rel="apple-touch-icon-precomposed"/> + </head> + <body> + <header> + <nav> + <a class="logo" href="/web/20160518154155/http://zserge.com/">Z</a> + </nav> + <div class="empty"></div> + <nav> + <section> + <a href="/web/20160518154155/http://zserge.com/about.html">about</a> + <a href="/web/20160518154155/http://zserge.com/blog.html">posts</a> + </section> + <section> + <a href="http://web.archive.org/web/20160518154155/https://twitter.com/zsergo">@me</a> + <a href="http://web.archive.org/web/20160518154155/https://plus.google.com/u/0/+SergeZaitsev">+me</a> + <a href="http://web.archive.org/web/20160518154155/https://github.com/zserge"></>me</a> + </section> + </nav> + </header> + <h1>cucu: a compiler you can understand (part 1)</h1> + +<p>Let talk about the compilers. Have you ever thought of writing your own one?</p> + +<p>I will try to show you how simple it is. The first part will be pretty much +theoretical, so keep patience.</p> + +<h2>what we’re going to achieve?</h2> + +<p>CUCU is a toy compiler for a toy language. I want it to be as close to ANSI C +as possible, so that every valid CUCU program could be compiled with a C +compiler without any errors. Of course, the support of the whole ANSI C +standard is very difficult, so I picked a very small C language subset.</p> + +<p>For example, here’s a valid CUCU code snippet:</p> + +<pre><code>int cucu_strlen(char *s) { + int i = 0; + while (s[i]) { + i = i + 1; + } + return i; +} +</code></pre> + +<h2>grammar</h2> + +<p>We’re about to define a grammar for our language. It’s an important step, +because everything in our compiler design depends on it.</p> + +<p>So, let’s go from top to bottom. Our source file contains a <strong>program</strong>. +What is a program? We know it’s a list of <strong>variable declarations</strong>, <strong>function +declarations</strong> and <strong>function definitions</strong>, e.g:</p> + +<pre><code>int func(char *s, int len); /* function declaration */ +int i; /* variable declaration */ + +int func(char *s, int len) { /* function definition */ + ... +} +</code></pre> + +<p>Let’s try to write it down in <a href="http://web.archive.org/web/20160518154155/http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form">EBNF</a> form (it’s absolutely ok, if you don’t +know what EBNF is, it’s really intuitive):</p> + +<pre><code><program> ::= { <var-decl> | <func-decl> | <func-def> } ; +</code></pre> + +<p>This notation says: “a program is a repeating sequence of variable declarations, +function declarations and function definitions. But what is all those +declarations and definitions? Ok, let’s go deeper:</p> + +<pre><code><var-decl> ::= <type> <ident> ";" +<func-decl> ::= <type> <ident> "(" <func-args> ")" ";" +<func-def> ::= <type> <ident> "(" <func-args> ")" <func-body> +<func-args> ::= { <type> <ident> "," } +<type> ::= "int" | "char *" +</code></pre> + +<p>So, variable declaration is simple: it’s a type name followed by the identifier, +and followed by a semicolon, like we usually do in C, e.g.:</p> + +<pre><code>int i; +char *s; +</code></pre> + +<p>Function declaration is a bit more complicated. It’s a “type + identifier”, +and an optional list of function arguments <code><func-args></code> inside the braces.</p> + +<p>Function arguments list, in it’s turn, is a sequence of comma-separated +“type + identifier”, like:</p> + +<pre><code>char *s, int from, int to +</code></pre> + +<p><em>Actually, trailing comma in CUCU is still allowed, but not required. It will +just simplify our parser code.</em></p> + +<p>The supported data types are only <code>int</code> and <code>char *</code>. Identifier is a sequence +of letters, digits and an underscore symbol.</p> + +<p>The only thing left is <code><func-body></code>. But let’s talk about <strong>statements</strong> and +<strong>expressions</strong> first.</p> + +<p><em>Statement</em> is a smallest standalone element of the language. Here are valid +statements in out language:</p> + +<pre><code>/* These are simple statements */ +i = 2 + 3; /* assignment statement */ +my_func(i); /* function call statement */ +return i; /* return statement */ + +/* These are compound statements */ +if (x > 0) { .. } else { .. } +while (x > 0) { .. } +</code></pre> + +<p><em>Expression</em> is a smaller part of the statement. As opposed to statements, +expressions always return a value. Usually, it’s just the arithmetic. For +example in the statement <code>func(x[2], i + j)</code> the expressions are <code>x[2]</code> and +<code>i+j</code>.</p> + +<p>So, looking back at <code><func-body></code>. It’s just a valid statement (which is +usually a block statement).</p> + +<p>Here’s what we have:</p> + +<pre><code><func-body> ::= <statement> +<statement> ::= "{" { <statement> } "}" /* block statement */ + | [<type>] <ident> [ "=" <expr> ] ";" /* assignment */ + | "return" <expr> ";" + | "if" "(" <expr> ")" <statement> [ "else" <statement> ] + | "while" "(" <expr> ")" <statement> + | <expr> ";" +</code></pre> + +<p>Here are possible expressions in CUCU language:</p> + +<pre><code><expr> ::= <bitwise-expr> + | <bitwise-expr> = <expr> +<bitwise-expr> ::= <eq-expr> + | <bitwise-expr> & <eq-expr> + | <bitwise-expr> | <eq-expr> +<eq-expr> ::= <rel-expr> + | <eq-expr> == <rel-expr> + | <eq-expr> != <rel-expr> +<rel-expr> ::= <shift-expr> + | <rel-expr> < <shift-expr> +<shift-expr> ::= <add-expr> + | <shift-expr> << <add-expr> + | <shift-expr> >> <add-expr> +<add-expr> ::= <postfix-expr> + | <add-expr> + <postfix-expr> + | <add-expr> - <postfix-expr> +<postfix-expr> ::= <prim-expr> + | <postfix-expr> [ <expr> ] + | <postfix-expr> ( <expr> { "," <expr> } ) +<prim-expr> := <number> | <ident> | <string> | "(" <expr> ")" +</code></pre> + +<p>That’s it. Did you notice the recursion in the expression notation? Basically, +the notation above shows us the precedence of the operators from bottom to top: +parens and square brackets are evaluated first, and assignment goes the last.</p> + +<p>For example, according to this grammar an expression <code>8>>1+1</code> +will be evaluated to 2 (like in <code>8>>(1+1)</code>), not to 5 (like in <code>(8>>1)+1</code>), +because <code>>></code> has lower precedence than <code>+</code>.</p> + +<h2>lexer</h2> + +<p>Now we are done with grammar, and are ready to start. The first thing to do is +a lexer. Our compiler takes a stream of bytes as an input, and lexer allows to +split that stream into smaller tokens, that can be processed later. It gives us +some level of abstraction and simplifies out parser.</p> + +<p>For example, a sequence of bytes “int i = 2+31;” will be split into tokens:</p> + +<pre><code>int +i += +2 ++ +31 +; +</code></pre> + +<p><em>In normal grown-up lexers every lexeme (token) has a type and a value, so +instead of the list above we will get a list of pairs: <code><TYPE:int>,<ID:i>, +<ASSIGN:=>,<NUM:2>,<PLUS:+>,<NUM:31>,<SEMI:;></code>. We are going to detect lexeme +type by its value, which is not academical at all!</em></p> + +<p>The major problem with the lexer is that once a byte is read from the stream - +it can not be “un-read”. And what if we’ve read a byte that can not be added to +the current token? Where should we store it until the current token is +processed?</p> + +<p>Almost every lexer need to read ahead. Our grammar is simple enough, so all we +need to have is a single byte buffer - <code>nextc</code>. It stores a byte, that was read +from the stream, but has not yet been pushed to the token string.</p> + +<p>Also, I need to warn you - I use global variables a lot in CUCU code. I know +it’s a bad style, but if I pass all values as function arguments the compiler +would loose it’s simplicity.</p> + +<p>The whole lexer is just a single function <code>readtok()</code>. The algorithm is simple:</p> + +<ul> +<li>skip leading spaces</li> +<li>try to read an identifier (a sequence of letters, digits and <code>_</code>)</li> +<li>if it’s not an identifier - try to read a sequence of special operators, like +<code>&, |, <, >, =, !</code>.</li> +<li>if it’s not an operator - try to read a string literal “….” or ‘….’</li> +<li>if failed - maybe it’s a comment, like <code>/* ... */</code>?</li> +<li>if failed again - just read a single byte. It might be another single-byte +token, like “(” or “[“.</li> +</ul> + +<p>Here’s the whole CUCU sources we’ve got so far:</p> + +<pre><code>#include <stdio.h> /* for vpritnf */ +#include <stdarg.h> /* for va_list */ +#include <stdlib.h> /* for exit() */ +#include <ctype.h> /* for isspace, isalpha... */ + +#define MAXTOKSZ 256 +static FILE *f; /* input file */ +static char tok[MAXTOKSZ]; +static int tokpos; +static int nextc; + +void readchr() { + if (tokpos == MAXTOKSZ - 1) { + tok[tokpos] = '\0'; + fprintf(stderr, "token too long: %s\n", tok); + exit(EXIT_FAILURE); + } + tok[tokpos++] = nextc; + nextc = fgetc(f); +} + +void readtok() { + for (;;) { + while (isspace(nextc)) { + nextc = fgetc(f); + } + tokpos = 0; + while(isalnum(nextc) || nextc == '_') { + readchr(); + } + if (tokpos == 0) { + while (nextc == '<' || nextc == '=' || nextc == '>' + || nextc == '!' || nextc == '&' || nextc == '|') { + readchr(); + } + } + if (tokpos == 0) { + if (nextc == '\'' || nextc == '"') { + char c = nextc; + readchr(); + while (nextc != c) { + readchr(); + } + readchr(); + } else if (nextc == '/') { + readchr(); + if (nextc == '*') { + nextc = fgetc(f); + while (nextc != '/') { + while (nextc != '*') { + nextc = fgetc(f); + } + nextc = fgetc(f); + } + nextc = fgetc(f); + } + } else if (nextc != EOF) { + readchr(); + } + } + break; + } + tok[tokpos] = '\0'; +} + +int main() { + f = stdin; + nextc = fgetc(f); + + for (;;) { + readtok(); + printf("TOKEN: %s\n", tok); + if (tok[0] == '\0') break; + } + return 0; +} +</code></pre> + +<p>If we put a C file as the compiler input - it will print us a list of tokens, +one per line.</p> + +<p>Ok, have a cup of coffee, and let’s <a href="/web/20160518154155/http://zserge.com/blog/cucu-part2.html">go further →</a></p> + +<p><em>You can check <a href="/web/20160518154155/http://zserge.com/blog/cucu-part3.html">part3</a> to know how the story ended.</em></p> + + <p>Posted on 2012-10-23</p> + <section class="social"> + <section> + <a href="http://web.archive.org/web/20160518154155/http://www.facebook.com/share.php?u=http://zserge.com/blog/cucu-part1.html&title=cucu:%20a%20compiler%20you%20can%20understand%20%281/3%29" style="background-color: #3b5997">like</a> + + <a href="http://web.archive.org/web/20160518154155/http://twitter.com/intent/tweet?status=cucu:%20a%20compiler%20you%20can%20understand%20%281/3%29+http://zserge.com/blog/cucu-part1.html%20via%20@zsergo" style="background-color: #41b7d8">tweet</a> + + <a href="http://web.archive.org/web/20160518154155/https://plus.google.com/share?url=http://zserge.com/blog/cucu-part1.html" style="background-color: #d64937">+1</a> + </section> + <section> + <a href="/web/20160518154155/http://zserge.com/rss.xml" style="background-color: #f26522">rss</a> + + <a href="http://web.archive.org/web/20160518154155/https://twitter.com/zsergo" style="background-color: #41b7d8">@me</a> + + <a href="http://web.archive.org/web/20160518154155/https://plus.google.com/u/0/+SergeZaitsev" style="background-color: #d64937">+me</a> + + <a href="http://web.archive.org/web/20160518154155/https://github.com/zserge" style="background-color: #333333"></>me</a> + </section> + </section> + <section><div id="disqus_thread"></div><script type="text/javascript"> +var disqus_shortname='zserge'; + (function() { + var dsq = document.createElement('script'); + dsq.type = 'text/javascript'; + dsq.async = true; + dsq.src = 'http://' + disqus_shortname + '.disqus.com/embed.js'; + (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq); + })(); + </script> +<noscript> +</noscript> + +</section> + <footer> + <p> + ©2012–2015 · + <a href="http://web.archive.org/web/20160518154155/http://zserge.com/">Serge Zaitsev</a> + · + <a href="http://web.archive.org/web/20160518154155/mailto:zaitsev.serge@gmail.com">zaitsev.serge@gmail.com</a> + </p> + </footer> + <script> + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//web.archive.org/web/20160518154155/http://www.google-analytics.com/analytics.js','ga'); + ga('create', 'UA-33644825-1', 'zserge.com'); + ga('send', 'pageview'); +</script> + + + </body> +</html> +<!-- + FILE ARCHIVED ON 15:41:55 May 18, 2016 AND RETRIEVED FROM THE + INTERNET ARCHIVE ON 17:21:38 Jan 12, 2024. + JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE. + + ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C. + SECTION 108(a)(3)). +--> +<!-- +playback timings (ms): + captures_list: 442.642 + exclusion.robots: 0.127 + exclusion.robots.policy: 0.117 + cdx.remote: 0.059 + esindex: 0.009 + LoadShardBlock: 405.453 (3) + PetaboxLoader3.datanode: 111.094 (4) + load_resource: 277.069 + PetaboxLoader3.resolve: 202.183 +-->
\ No newline at end of file |