In computer science a language is defined by a formal grammar. In formal language theory, a grammar is a set of production rules for strings that form a language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax.
To explain a language grammar we can use some specific notions:

Expression
An expression is a finite combination of symbols that is well-formed according to rules that depend on the context. Mathematical symbols can designate numbers, elements, operators or tokens: {constants, variables, operations, functions, punctuation, grouping}, and some other aspects of logical syntax.

Well-formed (syntax)
A well-formed expression is a syntactic concept. Syntax is the set of rules, principles, and processes that govern the structure of a given language.
In a correct expression the operators must have the correct number of inputs in the correct places, and the characters that make up these inputs must be valid. Strings of symbols that violate the syntax rules are not well-formed and are not valid mathematical expressions.
For example, in the usual notation of arithmetic, the expression (2 + 3) is well-formed, while the following expression is not:
* 4 ) x+,/y.
Mathematical expressions include multiple forms: {arithmetic, polynomial, algebraic, closed-form, and analytical}. The form of an expression is a semantic concept. Semantics is the study of meaning. Formal semantics is about attaching meaning to expressions.

Expression evaluation
An expression may be used to designate a value, which might depend on variables occurring in the expression. The determination of this value depends on the semantics attached to the symbols of the expression.
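As a small illustration (a minimal sketch in Python, not part of the Level compiler), we can attach a value to an expression by supplying values for its variables:

# A minimal sketch: the value of an expression depends on the semantics
# attached to its symbols (here, Python arithmetic) and on the values
# of its free variables.
def evaluate(expression, variables):
    """Evaluate an arithmetic expression string in a given environment."""
    return eval(expression, {"__builtins__": {}}, variables)

print(evaluate("(2 + 3) * x", {"x": 4}))  # prints 20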
Certain expressions do not designate any value, for instance when they involve division by 0; such expressions are said to have an undefined value, but they are well-formed expressions nonetheless.

Logical expressions
In general the meaning of expressions is not limited to designating values; for instance, an expression might designate a condition, or an equation that is to be solved, or it can be viewed as an object in its own right that can be manipulated according to certain rules.

How a compiler works
To build a compiler for Level, I started my research several weeks ago. So far I have understood some basic concepts and I have started to implement the compiler components. Next I will present the compiler components created for project Level-PY, a Python compiler for Level-3.

Compiler Components
Most compilers use components to split the logic of compilation into steps. Every component has a job to perform and then passes control to the next component. The components work together and collaborate using messages: function calls or data structures.
Source File -> Scanner -> Lexer -> Parser -> Code Generator -> Output File
Scanner
A scanner reads the source file one character at a time. It also keeps track of the line number and the character position within the line. A scanner usually reads forward one character at a time, but sometimes it can read backwards or read one character ahead.
For example:
program test is
 begin
  pass;
end program;
Will be broken by the scanner into characters:
p r o g r a m_t e s t_i s\n_b e g i n\n_ _p a s s ;\n e n d_p r o g r a m ;\n
(I have used "_" to show the spaces and \n for the new-line character. The scanner returns all spaces and end-of-line characters.)
I have implemented the scanner using a closure: a function that returns a generator. The generator uses yield to return characters one at a time.
The Level scanner reads forward two characters: one is the current character and one is the look-ahead character. With this we can recognize digraph symbols like "--", "{*", ">=" and so on.
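Here is a minimal sketch of such a scanner in Python; the function names are illustrative, not taken from the Level-PY sources:

# A closure that returns a generator; the generator yields the current
# character, one look-ahead character, and the position in the source.
def make_scanner(text):
    def scan():
        line, column = 1, 0
        for index, char in enumerate(text):
            # one character of look-ahead lets us recognize digraphs like ">="
            look_ahead = text[index + 1] if index + 1 < len(text) else ""
            column += 1
            yield char, look_ahead, line, column
            if char == "\n":  # start counting a new line
                line, column = line + 1, 0
    return scan

# Usage: read characters one at a time, with position information.
scan = make_scanner("program test is\n begin\n  pass;\nend program;\n")
for char, look_ahead, line, column in scan():
    print(repr(char), repr(look_ahead), line, column)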
Lexer
This component creates distinct words, symbols, or groups of symbols named "tokens". Every token belongs to a token category. The lexer uses the scanner to get characters one by one and decides whether to add a new character to the current token or to finalize the current token and produce a new one.
program test is v integer; begin v:v+1; print(v); end program;
Will be split by the lexer into tokens:
program test is v integer ; begin v : v + 1 ; print ( v ) ; end program ;
The lexer will determine the category of each token: keyword, symbol or identifier. Tokens are recognized because they are separated by spaces or symbols. Sometimes the lexer does not know what a symbol represents until the next character is received. For this reason the lexer can request the next character before deciding the type of the token.
Usually the lexer will eliminate indentation, white spaces and comments. For some languages indentation is significant and must be analyzed by the parser. The Level lexer stores indentation information for every token.
You can implement the lexer using a closure: a function that returns a generator. The generator uses multiple yield statements to return tokens one at a time from a source file.
Sometimes a lexer uses multiple nested loops to create tokens. Instead, you can use a single loop with decisions and continuation points inside. This makes the lexer program shorter and easier to understand.
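The following sketch shows this single-loop idea, reusing make_scanner from above; the keyword and symbol sets are assumptions for illustration:

# A single loop that grows the current token character by character
# and yields (category, token) pairs one at a time.
KEYWORDS = {"program", "is", "begin", "end", "pass", "integer", "print"}
SYMBOLS = set(";:+-*/()=")

def make_lexer(text):
    def tokenize():
        word = ""
        for char, look_ahead, line, column in make_scanner(text)():
            if char.isspace() or char in SYMBOLS:
                if word:  # finalize the current token
                    yield ("keyword" if word in KEYWORDS else "identifier"), word
                    word = ""
                if char in SYMBOLS:
                    yield "symbol", char
            else:
                word += char  # add the character to the current token
        if word:  # finalize the last token, if any
            yield ("keyword" if word in KEYWORDS else "identifier"), word
    return tokenize

# Usage: tokens are produced one at a time, with their category.
for category, token in make_lexer("program test is begin pass; end program;")():
    print(category, token)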
Parser
This component calls the lexer to get tokens one by one and uses them to verify the language syntax. For this, the parser creates in-memory collections that organize the tokens into logical structures, such as a syntax tree and a symbol table.
A compiler can be created using a producer-consumer design pattern. The parser is the consumer while the lexer is the producer. Every time a token is produced by the lexer, the parser consumes this token and requests a new one.
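A minimal sketch of this producer-consumer interaction, building on make_lexer above (the syntax rule checked here is only an illustration):

# The lexer (producer) yields tokens on demand; the parser (consumer)
# pulls one token at a time and verifies the syntax.
def parse(text):
    tokens = make_lexer(text)()        # producer: a generator of tokens
    category, token = next(tokens)     # consumer: request the first token
    if token != "program":
        raise SyntaxError("a program must start with the keyword 'program'")
    for category, token in tokens:     # request the remaining tokens
        pass  # organize tokens into the syntax tree and symbol table here

parse("program test is begin pass; end program;")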
Code Generator
This component uses the logical structures created by the parser to convert your program into another language. Some compilers create an intermediate representation (IR) or convert the code directly into machine code, which is much harder.
Some compilers/interpreters are able to execute code from memory using the AST and the symbol table, without generating any intermediate code. This method can be used for scripting languages.
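As a toy illustration of code generation (the tuple-based AST here is an assumption, not the Level-PY structure), this walker emits Python source from a tiny tree:

# Walk a small AST and emit Python code as text.
def generate(node):
    kind = node[0]
    if kind == "num":        # ("num", 1)
        return str(node[1])
    if kind == "var":        # ("var", "v")
        return node[1]
    if kind == "add":        # ("add", left, right)
        return "(" + generate(node[1]) + " + " + generate(node[2]) + ")"
    if kind == "assign":     # ("assign", "v", expression)
        return node[1] + " = " + generate(node[2])
    raise ValueError("unknown node: " + kind)

# The statement v:v+1 from the example above becomes Python source:
print(generate(("assign", "v", ("add", ("var", "v"), ("num", 1)))))  # v = (v + 1)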
In the past, compiler engineers used to be concerned about computer memory. Compilers used to create intermediate files and make multiple passes over the source code to optimize the code as much as possible. This is not one of my concerns: these days computer memory is plentiful and there should be enough space for most programs.
The parser uses the lexer, which uses the scanner. Therefore the main program can call only the parser. The parser will have a "driver function" that initializes the lexer, and the lexer will have a "driver function" that initializes the scanner.
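In code, the layering could look like this minimal sketch (compile_file and the file name are illustrative):

# The main program talks only to the parser; parse() drives make_lexer(),
# which in turn drives make_scanner(), as sketched above.
def compile_file(file_name):
    with open(file_name) as source:
        parse(source.read())

compile_file("test.lev")  # assumed input file name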
Symbols and categories
It is good design to create resources for your language before you start parsing. These are collections of symbols and tokens organized by category. For every category you create a set. During the parsing process you can check if a token belongs to a particular category and store the token in the symbol table with its category information.
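A minimal sketch of this preparation (the category names and their contents are assumptions for illustration):

# One set per token category, prepared before parsing starts.
CATEGORIES = {
    "keyword":  {"program", "is", "begin", "end", "pass"},
    "operator": {"+", "-", "*", "/", ":", ">="},
}
symbol_table = {}

def classify(token):
    for category, members in CATEGORIES.items():
        if token in members:
            return category
    return "identifier"

def register(token):
    # store the token with its category information
    symbol_table[token] = classify(token)

register("begin")  # symbol_table == {"begin": "keyword"}
register("v")      # symbol_table == {"begin": "keyword", "v": "identifier"}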
Further reading
After you design a new syntax, you should create a compiler or interpreter. There are free books and resources on the internet to learn how to parse a language and generate code. Once you have learned this technique, you are a "compiler engineer".
For this course we are preparing designs for several alternative programming languages. Visit the project page to read details about these projects. If you want to contribute, you can contact us on Discord, where we chat about our work.
Read next: Introduction to Hardware