In computer science a language is defined by a formal grammar. In formal language theory, a grammar is a set of production rules for strings that form a language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax.
To explain a language grammar we can use some specific notions:

Expression
An expression is a finite combination of symbols that is well-formed according to rules that depend on the context. Mathematical symbols can designate numbers, elements, operators or tokens: {constants, variables, operations, functions, punctuation, grouping}, and some other aspects of logical syntax.

Well-formed (syntax)
A well-formed expression is a syntactic concept. Syntax is the set of rules, principles, and processes that govern the structure of a given language.
In a correct expression the operators must have the correct number of inputs in the correct places, and the characters that make up these inputs must be valid. Strings of symbols that violate the syntax rules are not well-formed and are not valid mathematical expressions.
For example, in the usual notation of arithmetic, the expression (2 + 3) is well-formed, while the following expression is not:
* 4 ) x+,/y.
Mathematical expressions include multiple forms: {arithmetic, polynomial, algebraic, closed-form, and analytical}. The form of an expression is a semantic concept. Semantics is the study of meaning. Formal semantics is about attaching meaning to expressions.

Expression evaluation
An expression may be used to designate a value, which might depend on variables occurring in the expression. The determination of this value depends on the semantics attached to the symbols of the expression.
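As a small illustration (a minimal sketch in Python, not part of the Level compiler), we can attach a value to an expression by supplying values for its variables:

# A minimal sketch: the value of an expression depends on the semantics
# attached to its symbols (here, Python arithmetic) and on the values
# of its free variables.
def evaluate(expression, variables):
    """Evaluate an arithmetic expression string in a given environment."""
    return eval(expression, {"__builtins__": {}}, variables)

print(evaluate("(2 + 3) * x", {"x": 4}))  # prints 20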
Certain expressions do not designate any value, for instance when they involve division by 0; such expressions are said to have an undefined value, but they are well-formed expressions nonetheless.

Logical expressions
In general the meaning of expressions is not limited to designating values; for instance, an expression might designate a condition, or an equation that is to be solved, or it can be viewed as an object in its own right that can be manipulated according to certain rules.

How a compiler works
To build a compiler for Level, I started my research several weeks ago. So far I have understood some basic concepts and I have started to implement the compiler components. Next I will present the compiler components created for project Level-PY, a Python compiler for Level-3.

Compiler Components
Most compilers use components to split the logic of compilation into steps. Every component has a job to perform and then passes control to the next component. The components work together and collaborate using messages: function calls or data structures.
Source File -> Scanner -> Lexer -> Parser -> Code Generator -> Output File
Scanner
A scanner reads the source file one character at a time. It also keeps track of the line number and the character position within the line. A scanner usually reads forward one character at a time, but sometimes it can read backwards or read one character ahead.
For example:
program test is
 begin
  pass;
end program;
Will be broken by the scanner into characters:
p r o g r a m_t e s t_i s\n_b e g i n\n_ _p a s s ;\n e n d_p r o g r a m ;\n
(I have used "_" to show the spaces and \n for the new-line character. The scanner returns all spaces and end-of-line characters.)
I have implemented the scanner using a closure: a function that returns a generator. The generator uses yield to return characters one at a time.
The Level scanner reads forward two characters: one is the current character and one is the look-ahead character. With this we can recognize digraph symbols like "--", "{*", ">=" and so on.
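Here is a minimal sketch of such a scanner in Python; the function names are illustrative, not taken from the Level-PY sources:

# A closure that returns a generator; the generator yields the current
# character, one look-ahead character, and the position in the source.
def make_scanner(text):
    def scan():
        line, column = 1, 0
        for index, char in enumerate(text):
            # one character of look-ahead lets us recognize digraphs like ">="
            look_ahead = text[index + 1] if index + 1 < len(text) else ""
            column += 1
            yield char, look_ahead, line, column
            if char == "\n":  # start counting a new line
                line, column = line + 1, 0
    return scan

# Usage: read characters one at a time, with position information.
scan = make_scanner("program test is\n begin\n  pass;\nend program;\n")
for char, look_ahead, line, column in scan():
    print(repr(char), repr(look_ahead), line, column)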
Lexer
This component creates distinct words, symbols, or groups of symbols named "tokens". Every token belongs to a token category. The lexer uses the scanner to get characters one by one and decides whether to add a new character to the current token or to finalize the current token and produce a new one.
program test is v integer; begin v:v+1; print(v); end program;
Will be split by the lexer into tokens:
program test is v integer ; begin v : v + 1 ; print ( v ) ; end program ;
The lexer will determine the category of each token: keyword, symbol or identifier. Tokens are recognized because they are separated by spaces or symbols. Sometimes the lexer does not know what a symbol represents until the next character is received. For this reason the lexer can request the next character before deciding the type of the token.
Usually the lexer will eliminate indentation, white spaces and comments. For some languages indentation is significant and must be analyzed by the parser. The Level lexer stores indentation information for every token.
You can implement the lexer using a closure: a function that returns a generator. The generator uses multiple yield statements to return tokens one at a time from a source file.
Sometimes a lexer uses multiple nested loops to create tokens. Instead, you can use a single loop with decisions and continuation points inside. This makes the lexer program shorter and easier to understand.
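The following sketch shows this single-loop idea, reusing make_scanner from above; the keyword and symbol sets are assumptions for illustration:

# A single loop that grows the current token character by character
# and yields (category, token) pairs one at a time.
KEYWORDS = {"program", "is", "begin", "end", "pass", "integer", "print"}
SYMBOLS = set(";:+-*/()=")

def make_lexer(text):
    def tokenize():
        word = ""
        for char, look_ahead, line, column in make_scanner(text)():
            if char.isspace() or char in SYMBOLS:
                if word:  # finalize the current token
                    yield ("keyword" if word in KEYWORDS else "identifier"), word
                    word = ""
                if char in SYMBOLS:
                    yield "symbol", char
            else:
                word += char  # add the character to the current token
        if word:  # finalize the last token, if any
            yield ("keyword" if word in KEYWORDS else "identifier"), word
    return tokenize

# Usage: tokens are produced one at a time, with their category.
for category, token in make_lexer("program test is begin pass; end program;")():
    print(category, token)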
Parser
This component calls the lexer to get tokens one by one and uses them to verify the language syntax. For this, the parser creates in-memory collections that organize the tokens into logical structures, such as a syntax tree and a symbol table.
A compiler can be created using a producer-consumer design pattern. The parser is the consumer while the lexer is the producer. Every time a token is produced by the lexer, the parser consumes this token and requests a new one.
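A minimal sketch of this producer-consumer interaction, building on make_lexer above (the syntax rule checked here is only an illustration):

# The lexer (producer) yields tokens on demand; the parser (consumer)
# pulls one token at a time and verifies the syntax.
def parse(text):
    tokens = make_lexer(text)()        # producer: a generator of tokens
    category, token = next(tokens)     # consumer: request the first token
    if token != "program":
        raise SyntaxError("a program must start with the keyword 'program'")
    for category, token in tokens:     # request the remaining tokens
        pass  # organize tokens into the syntax tree and symbol table here

parse("program test is begin pass; end program;")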
Code Generator
This component uses the logical structures created by the parser to convert your program into another language. Some compilers create an intermediate representation (IR) or convert the code directly into machine code, which is much harder.
Some compilers/interpreters are able to execute code from memory using the AST and the symbol table, without generating any intermediate code. This method can be used for scripting languages.
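As a toy illustration of code generation (the tuple-based AST here is an assumption, not the Level-PY structure), this walker emits Python source from a tiny tree:

# Walk a small AST and emit Python code as text.
def generate(node):
    kind = node[0]
    if kind == "num":        # ("num", 1)
        return str(node[1])
    if kind == "var":        # ("var", "v")
        return node[1]
    if kind == "add":        # ("add", left, right)
        return "(" + generate(node[1]) + " + " + generate(node[2]) + ")"
    if kind == "assign":     # ("assign", "v", expression)
        return node[1] + " = " + generate(node[2])
    raise ValueError("unknown node: " + kind)

# The statement v:v+1 from the example above becomes Python source:
print(generate(("assign", "v", ("add", ("var", "v"), ("num", 1)))))  # v = (v + 1)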
In the past, compiler engineers used to be concerned about computer memory. Compilers used to create intermediate files and make multiple passes over the source code to optimize the code as much as possible. This is not one of my concerns: these days computer memory is plentiful and there should be enough space for most programs.
The parser uses the lexer, which uses the scanner. Therefore the main program can call only the parser. The parser will have a "driver function" that initializes the lexer, and the lexer will have a "driver function" that initializes the scanner.
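In code, the layering could look like this minimal sketch (compile_file and the file name are illustrative):

# The main program talks only to the parser; parse() drives make_lexer(),
# which in turn drives make_scanner(), as sketched above.
def compile_file(file_name):
    with open(file_name) as source:
        parse(source.read())

compile_file("test.lev")  # assumed input file name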
Symbols and categories
It is good design to create resources for your language before you start parsing. These are collections of symbols and tokens organized by category. For every category you create a set. During the parsing process you can check if a token belongs to a particular category and store the token in the symbol table with its category information.
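A minimal sketch of this preparation (the category names and their contents are assumptions for illustration):

# One set per token category, prepared before parsing starts.
CATEGORIES = {
    "keyword":  {"program", "is", "begin", "end", "pass"},
    "operator": {"+", "-", "*", "/", ":", ">="},
}
symbol_table = {}

def classify(token):
    for category, members in CATEGORIES.items():
        if token in members:
            return category
    return "identifier"

def register(token):
    # store the token with its category information
    symbol_table[token] = classify(token)

register("begin")  # symbol_table == {"begin": "keyword"}
register("v")      # symbol_table == {"begin": "keyword", "v": "identifier"}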
Further reading
After you design a new syntax, you should create a compiler or interpreter. There are free books and resources on the internet to learn how to parse a language and generate code. Once you have learned this technique, you are a "compiler engineer".
For this course we are preparing designs for several alternative programming languages. Visit the project page to read details about these projects. If you want to contribute, you can contact us on Discord, where we chat about our work.
Read next: Introduction to Hardware