Let's build a compiler!

We encourage you to follow Jack Crenshaw's great guide to building a compiler as soon as you feel able. Some of you will want to start now! Much is achieved with WHILE loops, IF and CASE statements, procedures and functions, and the approach is so simple that few routines have more than one parameter. The output to assembler is more difficult to follow in detail but the procedure names and comments make it possible for you to "get the gist" of how you could develop your own language. You will learn a great deal in the process about, for example:

  • bottom-up design;
  • top-down design;
  • Backus-Naur form (BNF);
  • priority of operators;
  • recursion;
  • state machines;
  • syntax diagrams;
  • interpreters;
  • passing parameters by value and by reference;
  • other terminology such as recursive descent parsing, predictive parser, parse tree, lexical scanner, distributed scanner, token, grammar, code generator, run-time library, intermediate language, assembler, condition, branch, label, flag, lazy translation, optimisation, precedence, unary minus, stack, symbol table, literal pool, finite automaton, syntax, weak and strong typing and stub procedure;
  • advantages of using units;
  • the history of programming.

The original programs were designed to be run from the command line. When running from Lazarus or Delphi you need some ReadLn statements to keep the console window on the screen. When running in a Delphi IDE such as Delphi 7, you also need to use the SysUtils unit for input and output. These are accepted by Lazarus, so the code that we supply runs in both Lazarus and Delphi. The output (which we have made no attempt to optimise) is assembly language code for Intel processors. (We use the term Intel processor throughout to cover a wide range of 80x86 Intel processors). We output code for in-line assembly so that it is easy for you to test but we plan to adapt it for input to the MASM assembler. If necessary, you should read our tutorial on in-line assembler to become familiar with Intel registers and syntax. We use safely the registers EAX, EDX and ECX without preserving their former contents. We must preserve the contents of the EBP register by pushing it onto the stack before use and popping it after we have finished using it.

The original tutorial was based on the Motorola 68000 (68k) processor. The 68k processor uses branch mnemonics such as BRA instead of jumps such as JMP, so the tutorial uses the "branch" terminology. Jack wrote the tutorial many years ago when the key presses CTRL+G caused a beep and CTRL+M and CTRL+J gave a carriage return and line feed, respectively. In code Jack represents them by ^G, ^M and ^J instead of the more familiar ASCII values #7, #13 and #10.

We now summarise our changes. For the programs, we have

  • added ReadLn statements when necessary;
  • converted the output to Intel assembly language for in-line assembler blocks;
  • formatted the source code in the style we have used for other tutorials;
  • tested the output using the in-line assembler programs detailed in the appendix;
  • included examples of test results in the appendix;
  • added for Delphi users the compiler directive {$Apptype Console} and put SysUtils in the uses clause.

For the notes, we have:

  • added instructions to save versions with names such as Parse01. We refer to these names in our testing. You can backtrack to these versions if necessary and we encourage you to use them for your own experiments;
  • added some comments (prefixed with PPS:);
  • changed some spellings to UK English for the benefit of most of our readers;
  • appended links to the relevant pages of testing in the Appendix.
  • added some style formatting. (The originals are pure text files).

Jack wrote his tutorial in 16 instalments over several years. We have the benefit of knowing the content in advance and direct you to specific parts in the table below.

Part No Summary Programs
I Introduction to the KISS approach. "I hope to teach you how the things get done, so that you can go off on your own and not only reproduce what I have done, but improve on it". Source code of the cradle. Cradle
II Development of parser for arithmetic expressions. An expression is defined as the right-hand side of an assignment. Expression is first a single digit, then single digits may be added to or subtracted from each other. Multiplication and division come next, followed by brackets then the unary minus. Explanation of why the stack is ideal for handling complex expressions. Outputs assembly language instructions. Parse01 to Parse06
III Parser developed to handle single-letter identifiers for variables and functions. Program Parse10 accepts input of assignments instead of expressions. Parse11 accepts input of multi-digit integers and multi-character variables. Parse12 skips white space. Parse07 to Parse12
IV Introduction to interpreters. Discussion of lazy translation. Interpreter01 evaluates arithmetic expressions. Interpreter02 inputs single-character variables in expressions and inputs and outputs their values. Explanation of why the simple approach is successful in achieving so much. Interpreter01 and Interpreter02
V Good range of constructs such as IF statements, WHILE, REPEAT and FOR loops, with the tricky BREAK added later. Single character/digit input. Condition on which loops depend included only as a dummy procedure. Branch01 and Branch02
VI Discussion of the precedence of Boolean and arithmetic operators. Code developed for Boolean conditions based closely on BNF syntax. Boolean-handling procedures merged with branching code from Part V. Assignments.

Bool01, Branch03 and Branch04

VII Lexical scanner developed. Input converted to tokens. White-space and punctuation skipped. Symbol table. Discussion of the use of state machines in compilers. KISS01 has single-character assignments and if statements. KISS02 has the keywords IF, ELSE, ENDIF and END and allows the assignment of multi-character integer variables to expressions containing multi-digit integers and variables. Scanner01 to Scanner04, KISS01 and KISS02
VIII Discussion of how the KISS approach works so well and constraints that impeded writers of early compilers.

IX Top-down design of a compiler, starting with single-character identifiers and input values of variables. Discussion of approaches suited to Pascal and to C. TopDown01 and TopDown02
X Development of the TINY Compiler - a subset of KISS - from the top down. Starts with single-letter variables. Initialises declared variables. IF statements, WHILE loops (without the BREAK developed in Part V) with the full range of relational operators for conditions. Assignment of variables. Symbol table. CALL statements for input and output. White space skipped.

TINY01 to TINY10

XI Improvements to the lexical scanning in TINY. TINY11 pre-fetches tokens as well as identifiers and is more robust than TINY10.

Integration of TINY Version 1.1 with the MASM assembler to achieve complete compilation.

XII Experiments in handling semicolons and comments for C, Pascal and TINY/KISS. TINY13 has optional semicolons and comments within nestable braces. TINY11C, TINY11P, TINY12, TINY13
XIII Procedure declaration. Symbol table. Discussion of optional parameter lists. Camparison of passing by value with passing by reference and code generation for each. Local variables. Use of frame pointer. Single-character variable names. Assignments restricted to the form a=b, where one variable takes the value of another. Procs01 to Procs08
XIV Symbol table for the signed types 'Byte' (smallint) 'Word' (shortint) and 'Long' (integer). Assignments of constants and variables of the three types. Discussion of weak and strong typing. Types01 and Types02
XV Introduction to the use of units.
XVI Units developed. Input of multi-character variables and multi-digit integers. Expressions translated. Main02 generates code for an assignment. Main03 handles arithmetic and Boolean operators in expressions: +, -, /, *, !, &, |, ~. Note that relational operators are not included. Main01 to Main03

Follow the links below to the 16 instalments of Jack Crenshaw's tutorial, to the integration with the MASM and GNU ARM assemblers and to our appendix.

Programming - a skill for life!

Compilers (including Jack Crenshaw's "Let's Build a Compiler" applied to Intel processors and the Raspberry Pi), assemblers and interpreters