Parsing a Grammar
What is Parsing?
Parsing, syntax analysis, or syntactic analysis, is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
In 42sh, the parser is one of the core components of the project. You cannot have a working 42sh without a working parser. Hence, it is important to understand this notion very well.
When parsing a programming language, it is recommended to build an Abstract Syntax Tree (AST). An AST is an abstract representation of the input, and makes it much easier to manipulate than plain text.
Throughout this article, we will explain how to structure a parser for the 42sh project, using a recursive descent parser, and how it interacts with its lexer.
Global overview
The parser depends on a lexer. The goal of the lexer is to perform a lexical analysis on a stream of characters and generate abstractions of words or sequences of characters called tokens. These tokens are then consumed by the parser and transformed into an AST, which represents the input in a more abstract and manipulable manner.
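As a rough sketch, the boundary between the two components can be as small as the following C declarations (the names are illustrative, not imposed by the project):

struct lexer;   /* owns the character stream                      */
struct token;   /* one lexical unit: a kind plus the matched text */
struct ast;     /* one node of the abstract syntax tree           */

/* The parser only ever sees the input through these two calls,
 * described in the next section. */
struct token *lexer_peek(struct lexer *lexer);
struct token *lexer_pop(struct lexer *lexer);

/* Entry point of the parser: turn the token stream into a tree. */
struct ast *parse(struct lexer *lexer);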
Implementing a Parser
The Peek/Pop "Pattern"
A frequent problem when writing a parser or a lexer is buffering data (a character or a token) when it cannot be used at a given time. Writing a parser or a lexer often implies looking ahead in the input, that is, accessing the next token or character without consuming it (without moving forward in the input).
In C, when lexing the input 1+2, we want to return the following sequence of tokens: an integer (1), an operator (+) and lastly an integer (2). However, when encountering a delimiter, we do not want to consume it. For instance, the + character delimits the integer token (1) and should not be consumed when building the first token. Instead, we want to be able to read it again on the next call to the lexer.
Your character stream and your lexer should interact using two simple functions: *_peek() and *_pop(). The goal is to be able to get new data (whether a character or a token) using peek and analyze it. If it corresponds to what we expect, then pop will remove it from a buffer and fetch new data.
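Here is a minimal sketch of this pattern for a lexer that buffers one token of lookahead. The token kinds, field names and lex_next_token() are assumptions made for the example, not a required design:

#include <stddef.h>

/* Minimal token kinds, just enough for the sketch. */
enum token_type
{
    TOKEN_WORD,
    TOKEN_PIPE,
    TOKEN_EOF,
};

struct token
{
    enum token_type type; /* which kind of token this is  */
    char *value;          /* the string the token matched */
};

struct lexer
{
    const char *input;     /* the characters left to analyze   */
    struct token *current; /* buffered lookahead token, or NULL */
};

/* Hypothetical: reads characters and builds the next token. */
static struct token *lex_next_token(struct lexer *lexer);

/* Return the next token WITHOUT consuming it. */
struct token *lexer_peek(struct lexer *lexer)
{
    if (!lexer->current)
        lexer->current = lex_next_token(lexer); /* real lexing happens here */
    return lexer->current;
}

/* Consume the buffered token: the next peek will fetch a new one. */
struct token *lexer_pop(struct lexer *lexer)
{
    struct token *res = lexer_peek(lexer);
    lexer->current = NULL;
    return res;
}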
This combines particularly well with recursive descent parsers: in such parsers, we want to look at the next token before deciding whether to consume it. When the token returned by peek() is not the expected one, we must not consume it directly. An unexpected token can either mark the end of a grammar rule or be an invalid token. In the case of an invalid token, your parser should report a syntax error. In the other case, you can continue parsing and check whether the token retrieved by peek() is correct in the new context. If it is, use pop() in order to fetch a new token on the next call to peek().
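This peek-then-pop discipline is often wrapped in a small helper. Here is one possible sketch, building on the lexer sketched above (the name eat() is our own):

/* If the next token has the expected type, consume it and return 1.
 * Otherwise, leave it in place for the caller and return 0. */
static int eat(struct lexer *lexer, enum token_type expected)
{
    if (lexer_peek(lexer)->type != expected)
        return 0; /* maybe the end of a rule, maybe a syntax error */
    lexer_pop(lexer);
    return 1;
}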
The Lexer
Implementing a basic lexer should be one of the first steps of 42sh. It allows you to segment an input text into tokens. These tokens are meant to be an abstraction of the input text and should avoid carrying too much data. Most of the time, a token carrying an identifier (generally a unique integer for each kind of token) along with the string it matched is enough.
You should NOT have to read the content of that string again until the parsing of the input is over. Doing so would be an anti-pattern, since the whole point of the token is to abstract that string away. If you think you need to read it while parsing, you are probably doing something wrong.
One of the first things you want to do is find what delimits a token. Then, you want to identify all the unique identifiers (i.e. the reserved words).
In order to implement your lexer for 42sh, we strongly recommend checking out the Token Recognition algorithm from the SCL.
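The SCL algorithm covers many more cases (quoting, operators, and so on); the sketch below only illustrates the idea of stopping at a delimiter without consuming it. is_delimiter(), token_new() and the chosen delimiter set are simplified assumptions, not the SCL's exact rules:

#include <string.h> /* strndup() is POSIX.1-2008 */

/* Characters that end the token currently being built. */
static int is_delimiter(char c)
{
    return c == ' ' || c == '\t' || c == '\n'
        || c == '|' || c == '&' || c == '\0';
}

/* Build one WORD token, assuming lexer->input points at the first
 * character of a word. Note that the delimiter is NOT consumed. */
static struct token *lex_word(struct lexer *lexer)
{
    size_t len = 0;
    while (!is_delimiter(lexer->input[len]))
        len++;
    /* token_new() is a hypothetical constructor for struct token. */
    struct token *tok = token_new(TOKEN_WORD, strndup(lexer->input, len));
    lexer->input += len; /* stay on the delimiter for the next call */
    return tok;
}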
Recursive Descent Parsers: Parsing Made Easy
Recursive Descent Parsing is one way to implement an LL parser. It is also one of the easiest kinds of parsers to implement and read. The goal is to use one procedure for each rule of your grammar.
input = and_or EOF ;
and_or = pipe { ( '&&' | '||' ) pipe } ;
pipe = command { '|' command } ;
command = WORD { WORD } ;
Using the grammar above, you would generally implement the following functions:
parse_input()
parse_and_or()
parse_pipe()
parse_command()
As mentioned previously, we are trying to create an AST from an input text. Each of the functions associated with a rule can then return a node corresponding to the rule it is meant to parse: parse_and_or could return a node containing two other nodes, the same goes for parse_pipe, and parse_command could return a node containing an array of strings representing the command to execute.
When a rule expects another rule as part of its definition, it simply calls the corresponding procedure. Otherwise, if it expects a token, it checks whether the next one matches.
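To make this concrete, here is one possible shape for parse_command and parse_pipe under the grammar above, reusing the peek/pop lexer sketched earlier. The AST layout and the helpers ast_new() and ast_add_word() are illustrative assumptions, not the only valid design:

/* A deliberately simple AST: pipe nodes have two children,
 * command nodes carry the words of the command. */
enum ast_type
{
    AST_COMMAND,
    AST_PIPE,
};

struct ast
{
    enum ast_type type;
    struct ast *left;  /* used by pipe nodes    */
    struct ast *right; /* used by pipe nodes    */
    char **words;      /* used by command nodes */
};

/* command = WORD { WORD } ; */
static struct ast *parse_command(struct lexer *lexer)
{
    if (lexer_peek(lexer)->type != TOKEN_WORD)
        return NULL; /* syntax error: a command needs at least one WORD */
    /* ast_new() and ast_add_word() are hypothetical helpers. */
    struct ast *node = ast_new(AST_COMMAND);
    while (lexer_peek(lexer)->type == TOKEN_WORD)
        ast_add_word(node, lexer_pop(lexer)->value);
    return node;
}

/* pipe = command { '|' command } ; */
static struct ast *parse_pipe(struct lexer *lexer)
{
    struct ast *left = parse_command(lexer);
    while (left && lexer_peek(lexer)->type == TOKEN_PIPE)
    {
        lexer_pop(lexer); /* consume the '|' itself */
        struct ast *node = ast_new(AST_PIPE);
        node->left = left;
        node->right = parse_command(lexer); /* NULL here means syntax error */
        left = node;
    }
    return left;
}

parse_and_or would follow exactly the same pattern as parse_pipe, looking for '&&' and '||' tokens instead of '|'.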
Assembling Everything
Let us have a look at how a parser could parse the command echo toto | cat using the grammar below:
input = pipe EOF ;
pipe = command { '|' command } ;
command = WORD { WORD } ;
The grammar has 3 rules, so we should have 3 parsing functions: parse_input, parse_pipe and parse_command.
These functions can interact with a lexer using two functions: peek and pop. The lexer will return tokens when interacted with.
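With these pieces in place, parsing echo toto | cat could unfold as follows (one possible trace, following the sketches above):

parse_input
  parse_pipe
    parse_command: peek() -> WORD("echo"), pop()
                   peek() -> WORD("toto"), pop()
                   peek() -> PIPE, not a WORD: stop
    peek() -> PIPE, pop(): another command follows
    parse_command: peek() -> WORD("cat"), pop()
                   peek() -> EOF, not a WORD: stop
    peek() -> EOF, not a PIPE: end of rule pipe
  peek() -> EOF: matches "input = pipe EOF", parsing succeeded

The resulting AST is a pipe node whose left child is a command node holding ["echo", "toto"] and whose right child is a command node holding ["cat"].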
In sum, a parser analyses an input string. To do so, it calls the lexer to get tokens and checks that each token conforms to the parser's rules. If a token does not, the parser reports a syntax error. Otherwise, the parser generally returns an AST representing the parsed input string.
Going further
Crafting Interpreters is a guide on how to implement an interpreter for any language. Most importantly, it has a guide on how to write a lexer and a parser.
Parser Generators (also known as compiler-compilers) are tools used to generate parsers from an input grammar. They are generally a faster way to implement a parser, but the generated code tends to be slower or to have poor error handling.
PlantUML is a tool used to generate diagrams. Most importantly, it has a component to help you visualise an input EBNF grammar by generating a diagram of the corresponding automaton. It can help demystify the way a grammar (or just a single rule) is meant to be read.