Parsing a grammar
What is parsing?
Parsing, also called syntax analysis or syntactic analysis, is the process of analyzing a string of symbols, whether in a natural language, a computer language, or a data structure, according to the rules of a formal grammar.
In 42sh, the parser is one of the core components of the project. You cannot have a working 42sh without a working parser, so it is important to understand this notion very well.
When parsing a programming language, it is recommended to generate an Abstract Syntax Tree (AST). An AST is an abstract representation of the input, and makes it much easier to manipulate than plain text.
Throughout this article, we will explain how to structure a recursive descent parser for the 42sh project, and how it interacts with its lexer.
Global overview
The parser depends on a lexer. The goal of the lexer is to perform a lexical analysis on a stream of characters and to generate an abstraction of words or sequences of characters called tokens. These tokens are then consumed by the parser and transformed into an AST, which represents the input in a more abstract and manipulable manner.
Implementing a parser
The peek/pop "pattern"
A frequent problem when writing a parser or a lexer is buffering data (characters or tokens) that cannot be used at a given time. Writing a parser or a lexer often implies looking ahead in the input, that is, accessing the next token or character without consuming it (without moving forward in the input).
In C, when lexing the input 1+2, we want to return the following sequence of tokens: an integer (1), an operator (+) and lastly an integer (2). However, when encountering a delimiter, we do not want to consume it. For instance, the + character delimits the integer token (1) and should not be consumed when building the first token. Instead, we want to be able to read it again on the next call to the lexer.
Your character stream and your lexer should be interacted with through two simple functions: *_peek() and *_pop(). The goal is to be able to get new data (whether a character or a token) using peek and analyze it. If it corresponds to what we expect, then pop will remove it from the buffer and fetch new data.
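Here is a minimal sketch of this interface in C for the token side, assuming a lex_next_token() helper that builds the next token from the character stream (all names here are illustrative, not a required API):

struct token;
struct stream;

/* Assumed helper that builds the next token from the character stream. */
struct token *lex_next_token(struct stream *stream);

struct lexer
{
    struct stream *stream; /* the character stream being lexed */
    struct token *current; /* lookahead token, NULL when not read yet */
};

/* Return the next token without consuming it. */
struct token *lexer_peek(struct lexer *lexer)
{
    if (!lexer->current)
        lexer->current = lex_next_token(lexer->stream);
    return lexer->current;
}

/* Consume the current token; the next peek will fetch a new one. */
struct token *lexer_pop(struct lexer *lexer)
{
    struct token *token = lexer_peek(lexer);
    lexer->current = NULL;
    return token;
}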
This combines particularly well with recursive descent parsers, in the sense that in such parsers we want to have a look at the next token before making a decision and consuming it. When the token we encounter using peek() is not the expected one, we must not consume it directly. An unexpected token can either mark the end of a grammar rule or be an invalid token. In the case of an invalid token, your parser results in a syntax error. In the other case, you can continue your parsing and check whether the token retrieved by peek() is correct in the new context, then use pop() if it is, in order to fetch a new token on the next call to peek().
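A small helper often makes this pattern explicit. Below is a sketch under the assumption that your tokens carry an enum token_type and that the lexer exposes the peek/pop interface described above (the name eat is illustrative):

#include <stdbool.h>

/* Consume the next token only if it has the expected type.
   Returns true if the token was consumed, false otherwise. */
static bool eat(struct lexer *lexer, enum token_type type)
{
    if (lexer_peek(lexer)->type != type)
        return false; /* leave the token for another rule to inspect */
    lexer_pop(lexer);
    return true;
}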
The lexer
Implementing a basic lexer should be one of the first steps of 42sh. It allows you to segment an input text into tokens. These tokens are meant to be an abstraction of the input text and should avoid carrying too much data. Most of the time, a token carrying an identifier (generally a unique integer for each kind of token) along with the string it matched will be enough.
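As a sketch, such a token could look like the following in C (the enum values are illustrative; keep whichever ones your grammar actually needs):

enum token_type
{
    TOKEN_WORD,   /* a word: command name, argument, ... */
    TOKEN_PIPE,   /* '|' */
    TOKEN_AND_IF, /* '&&' */
    TOKEN_OR_IF,  /* '||' */
    TOKEN_EOF,    /* end of input */
};

struct token
{
    enum token_type type; /* unique identifier of the token kind */
    char *value;          /* the string the token matched */
};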
You should NOT have to read the content of that string again until the parsing of the input is over. Doing so would be an anti-pattern, since the whole point of the token is to abstract that string away. If you think you need to read it while parsing, you are generally doing something wrong.
One of the first things you will want to do is to find what delimits a token. Then you will want to identify all the unique identifiers (also known as reserved words).
In order to implement your lexer for 42sh, we strongly recommend that you check out the Token Recognition algorithm from the SCL.
Recursive Descent Parsers: parsing made easy
A recursive descent parser is one way to implement an LL parser. It is also one of the easiest kinds of parser to implement and read. The goal is to use one procedure for each rule of your grammar.
input = and_or EOF ;
and_or = pipe { ( '&&' | '||' ) pipe } ;
pipe = command { '|' command } ;
command = WORD { WORD } ;
Using the grammar above, you would generally implement the following functions:
parse_input()
parse_and_or()
parse_pipe()
parse_command()
As mentioned previously, we are trying to create an AST from an input text. Each of the functions associated with a rule can then return a node corresponding to the rule it is meant to parse: parse_and_or could return a node containing two other nodes, the same goes for parse_pipe, and parse_command could return a node containing an array of strings, the command to execute.
When a rule expects another rule as part of its entries, it simply calls the corresponding procedure. Otherwise, if it expects a token, it simply checks whether the next one matches.
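For instance, here is a sketch of parse_pipe for the pipe rule above, assuming the token and lexer sketches from the previous sections as well as a struct ast type with an ast_new_pipe() constructor (error handling and memory management are omitted for brevity):

/* pipe = command { '|' command } ; */
struct ast *parse_pipe(struct lexer *lexer)
{
    struct ast *left = parse_command(lexer);
    if (!left)
        return NULL; /* syntax error in the sub-rule */

    while (lexer_peek(lexer)->type == TOKEN_PIPE)
    {
        lexer_pop(lexer); /* consume the '|' token */
        struct ast *right = parse_command(lexer);
        if (!right)
            return NULL; /* '|' must be followed by a command */
        left = ast_new_pipe(left, right); /* assumed AST constructor */
    }
    return left;
}

Note how the '|' token is only consumed once peek() has confirmed it is there: any other token is left untouched for the calling rule to inspect.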
Assembling everything
Let's have a look at how a parser could parse the command echo toto | cat using the grammar below:
input = pipe EOF ;
pipe = command { '|' command } ;
command = WORD { WORD } ;
The grammar has 3 rules, so we should have 3 parsing functions: parse_input, parse_pipe and parse_command.
These functions can interact with a lexer, and two functions exist to interact with it: peek and pop. The lexer will return tokens when interacted with.
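As a worked example (the token and function names follow the earlier sketches, which are assumptions), the lexer would turn echo toto | cat into the token stream WORD("echo"), WORD("toto"), PIPE, WORD("cat"), EOF, and the parsing functions would consume it roughly as follows:

parse_input
  parse_pipe
    parse_command   pops WORD("echo") and WORD("toto"), stops at PIPE
    peek sees PIPE, pop consumes it
    parse_command   pops WORD("cat"), stops at EOF
  peek sees EOF: the whole input matches the grammar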
In conclusion, a parser analyzes an input string. To do so, the parser must call the lexer to get tokens and check that each token conforms to the parser's rules. If one does not, your parser results in a syntax error. Otherwise, it generally returns an AST for the parsed input string.
Going further
Crafting Interpreters is a guide on how to implement an interpreter for a language. Most importantly, it also has a guide on how to write a lexer and a parser.
Parser Generators (also known as compiler-compilers) are tools used to generate parsers from an input grammar. They generally offer a faster way to implement a parser, but the generated code tends to be slower or to have poor error handling.
PlantUML is a tool used to generate diagrams. Most importantly, it has a component to help you visualise an input EBNF grammar by generating a graph. It can help demystify the way a grammar (or just a single rule) is meant to be read.