This content originally appeared on DEV Community and was authored by Mohamed Achaq
Hello and welcome to a new blog post !
Today we will talk about compilers and how you can write one in Rust.
Programming Languages
There is two types of programming languages, complied and interpreted.
Programs written in compiled languages need to be compiled before they can be run as a binary on a computer and
programs written in interpreted languages are run directly on a computer by using a just in time compilation (JIT).
Examples of compiled languages are C, C++, Java, C#, Go, etc.
Examples of interpreted languages are Python, Ruby, JavaScript, etc.
Compilers
The compiler is just a program that takes a human readable programming language and converts it to machine code to be
understandable by the computer, we all know that the computer can only understand zeros and ones, so how the compiler
actually works ?
LLVM
You may have heard of LLVM, which is a tool that simplifies the process of making compilers, it's written in C++, and provides a collection of modular and reusable compiler and toolchain technologies.
LLVM is a middle layers of a complete compiler system, taking intermediate representation (IR) code from a compiler and emitting an optimized IR. This new IR can then be converted
and linked into machine-dependent assembly language code for a target platform.
In this article I'm not going to talk about LLVM instad we will talk about the fondamental of compilers, and I will provide some examples of how to write some part of a compiler in Rust.
The compiler
A simple compiler can be broken down into 3 main parts :
- A Parser (or Lexer)
- A Transformer
- A Code Generator
The Parser
The parser is the part of the compiler that tokenize the code and convert it to a list of tokens.
The Tokenzation or Lexical Analysis is the process of breaking down a program into a list of tokens, each token is a small single unit of
code that the compiler can understand, for example, here is a simple Rust program :
fn main() {
println!("Hello, world!");
}
This is a simple Rust program, it has a function named main, and inside the function there is a println! macro,
that prints the string "Hello, world!" to the console.
After the tokenzation, the program will be broken down into a list of tokens, fn is a function keyword, main is
the name of the function, and the parentheses are the opening and closing parentheses of the function, the
opening curly brace, the closing curly brace, println is the println macro, and the parentheses are
the opening and closing parentheses of the macro.
After tokenizing the code the parser will convert the list of tokens into a tree of nodes, called an Abstract Syntax Tree (AST),
which is an abstract representation of the code, this step called Syntactic Analysis.
The AST of the program above will look something like this like this :
AST {
kind: "Program",
children: [
AST {
kind: "Function",
children: [
AST {
kind: "Identifier",
value: "main",
children: []
},
AST {
kind: "Block",
children: [
AST {
kind: "Println",
children: [
AST {
kind: "String",
value: "Hello, world!",
children: []
}
]
}
]
}
]
}
]
}
The Transformer
After parsing the code we will have an AST, but the AST is not enough to be understood by the computer, it's also
a deeply nested structure, so we need to transform the AST into a more readable structure, called a Syntax Tree.
The Transformer can be broken down into 2 main parts or functions :
- Traversers
- Visitors
The Traversers are functions that traverse the AST in a depth-first maner.
For example in the above AST, the Traverser will apply the function to the root node which is the Program node, then
it will apply the function to the children of the Program node, and so on, until it reaches the last child.
A Traverser function can be implemented by using the Visitor pattern, which is a pattern that allows you to
implement a function that works on a node and its children, and then apply the function to the node and its children,
The Visitors usually are objects with methods that accapt a node and return a value.
Let's implement a simple Transformer that takes an AST and transform it into a syntax tree by passing it to a Traverser.
// this is a simple example to explain the visitor pattern
use std::collections::HashMap;
struct Visitor {
nodes: HashMap<String, String>
}
struct Traverser {
visitor: Visitor
}
struct Node {
kind: String,
children: Vec<Node>
}
impl Visitor {
fn new() -> Visitor {
Visitor {
nodes: HashMap::new()
}
}
fn visit(&mut self, node: &Node) {
self.nodes.insert(node.kind.clone(), node.kind.clone());
for child in &node.children {
self.visit(child);
}
return self.nodes;
}
}
impl Traverser {
fn new() -> Traverser {
Traverser {
visitor: Visitor::new()
}
}
fn traverse(&mut self, node: &Node) {
self.visitor.visit(node);
}
}
struct SyntaxTree {
kind : String,
body : Vec<Node>
}
fn Transformer (AST : &AST) -> SyntaxTree {
let mut syntax_tree = SyntaxTree {
kind: "Program".to_string(),
body: Vec::new()
};
let mut traverser = Traverser::new();
let ast_context = traverser.visitor.visit(AST);
syntax_tree.body = ast_context.clone();
return syntax_tree;
}
The Code Generator
The Code Generator is the part of the compiler that generate the machine code from the syntax tree,
the code generator call it's self recursively to generate the code from the syntax tree into one single
string, this string is the machine code.
Here is an example of how you can implement the code generator function in Rust :
fn generate_code(syntax_tree: &SyntaxTree) -> String {
let mut code = String::new();
match syntax_tree.kind.as_str() {
"Program" => {
for child in &syntax_tree.body {
code.push_str(&generate_code(child));
}
},
"Function" => {
code.push_str("fn ");
code.push_str(&syntax_tree.body[0].kind);
code.push_str("(");
for child in &syntax_tree.body[1].body {
code.push_str(&child.kind);
code.push_str(", ");
}
code.pop();
code.pop();
code.push_str(") {\n");
for child in &syntax_tree.body[2].body {
code.push_str(&generate_code(child));
}
code.push_str("}\n");
},
"Block" => {
for child in &syntax_tree.body {
code.push_str(&generate_code(child));
}
},
"Println" => {
code.push_str("println!(");
code.push_str(&syntax_tree.body[0].kind);
code.push_str(");\n");
},
"String" => {
code.push_str("\"");
code.push_str(&syntax_tree.body[0].kind);
code.push_str("\"");
},
"Identifier" => {
code.push_str(&syntax_tree.body[0].kind);
},
_ => {
code.push_str("<unknown>");
}
}
return code;
}
After implementing the parser, the transformer and the code generator, you can use composition
to tie everything together, and here we go you made your own compiler.
Conclusion
Compiler are very essential tools in programming, they're been arround forever, you don't need to
know how to write a compiler, but learning how compilers work helps you understand your tools better.
Usful Links
This content originally appeared on DEV Community and was authored by Mohamed Achaq
Mohamed Achaq | Sciencx (2022-06-16T20:23:23+00:00) Compilers, and how to write one !. Retrieved from https://www.scien.cx/2022/06/16/compilers-and-how-to-write-one/
Please log in to upload a file.
There are no updates yet.
Click the Upload button above to add an update.