Skip to main content

Parsing a Grammar

What is Parsing?

Parsing, syntax analysis, or syntactic analysis, is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

In 42sh, the parser is one of the core components of the project. You cannot have a working 42sh without a working parser. Hence, it is important to understand this notion very well.

When parsing a programming language, it is recommended to generate an Abstract Syntax Tree (AST). They are an abstract representation of the input, and help manipulating it more easily than in plain text.

Throughout this article, we will explain how to structure a parser for the 42sh project, using a recursive descent parser, and how it interacts with its lexer.

Global overview

The parser depends on a lexer. The goal of the lexer is to perform a lexical analysis on stream of character and generate an abstraction of words or sequences of character called tokens. These tokens are then consumed by the parser and transformed into an AST to represent the input in a more manipulable and abstract manner.

Implementing a Parser

The Peek/Pop "Pattern"

A frequent problem when writing a parser or a lexer is buffering data (character or token) when it cannot be used at a given time. Writing a parser or a lexer often implies looking ahead in the input; that is, in other words accessing the next token or character without consuming it (without moving forward on the input).

In C, when lexing the input 1+2, we want to return the following sequence of tokens: an integer (1), an operator (+) and lastly an integer (2). However, when encountering a delimiter we do not want to consume it. For instance, the + character delimits the token integer (1) and should not be consumed when building the first token. Instead, we want to be able to read it again on the next call to the lexer.

Your character stream and your lexer should interact using two simple functions: *_peek() and*_pop(). The goal is to be able to get new data (whether a character or token) using peek and analyze it. If it corresponds to what we expect, then pop will remove it from a buffer and get new data.

This combines particularly well with Recursive Descent Parsers in the sense that in such parsers, we want to have a look at the next token before taking a decision and consuming it. When we encounter a token using peek(), and this token is not the expected one, we must not consume it directly. An unexpected token can be the end of a grammar rule or an invalid token. In the case of an invalid token, your parser will result in a syntax error. In the other case, you can continue parsing and check whether the token retrieved by peek() is correct depending on the context. If it is, use pop() in order to fetch a new token on the next call to peek().

The Lexer

Implementing a basic lexer should be one of the first steps of 42sh. It allows you to segment an input text into tokens. The goal of these tokens is to be an abstraction of the input text, and should avoid carrying too much data. Most of the time, a token carrying an identifier (generally a unique integer for each kind of tokens) along with the string it matched is enough.

info

You should NOT have to read the content of that string again until the parsing of the input is over. This would lead to an anti-pattern, as we are trying to make an abstraction of that string.

If you think that you need to read it while parsing, you are probably doing something wrong.

One of the first thing you want to do is find what delimits a token. Then, you want to identify all the unique identifiers (aka. reserved words).

tip

In order to implement your lexer for 42sh, we strongly recommend you to check out the Token Recognition algorithm from the SCL.

Recursive Descent Parsers: Parsing Made Easy

Recursive Descent Parsing is one way to implement a LL Parser. It also is one of the easiest kinds of parsers to implement and read. The goal is to use one procedure for each rule of your grammar.

input = and_or EOF ;

and_or = pipe { ( '&&' | '||' ) pipe } ;
pipe = command { '|' command } ;

command = WORD { WORD } ;

Using the grammar above, you would generally implement the following functions:

  • parse_input()
  • parse_pipe()
  • parse_and_or()
  • parse_command()

As mentioned previously, we are trying to create an AST from a input text. Each of the functions associated to a rule can then return a node corresponding to the rule it is meant to parse: parse_and_or could return a node containing two other nodes, the same goes for parse_pipe, and parse_command could return a node containing an array of strings being the command to execute.

When a rule expects another rule as part of its entries, it will simply call the corresponding procedure. Otherwise, if it expects a token, it simply checks if the next one matches.

Assembling Everything

Let us have a look at how a parser could parse the command echo toto | cat using the grammar below:

input = pipe EOF ;

pipe = command { '|' command } ;
command = WORD { WORD } ;

The grammar has 3 rules, so we should have 3 parsing functions: parse_input, parse_pipe and parse_command.

These functions can interact with a lexer using two functions: peek and pop. The lexer will return tokens when interacted with.

In sum, a parser analyses a input string. To do so, the parser must call the lexer to get tokens and check that each token corresponds to the parsers rules. If it does not, your parser results in a syntax error. Otherwise, it generally returns an AST representing the parsed input string.

Going further

  • Crafting Interpreters is a guide on how to implement an interpreter for any language. Most importantly, it has a guide on how to write a lexer and a parser.

  • Parser Generators (also known as Compiler compilers) are tools used to generate parsers using an input grammar. It generally is a faster way to implement a parser, but generated codes tend to be slower or have poor error handling.

  • PlantUML is a tool used to generate diagrams. Most importantly, it has a component to help you visualise an input ebnf grammar by generating a graph automata. It can help de-mistifying the way a grammar (or just a single rule) is meant to be read.