Skip to main content

The Shell Syntax

danger

You do not have to implement everything the shell grammar encompasses. If your project does not support a feature yet, it makes little sense to parse it. However, keep in mind the big picture when implementing parts of it, as it will lower the chances of having to start again from scratch.

The syntax of shell is defined in a layered fashion:

  • first, characters are grouped together into tokens.
  • then, a grammar establishes the way tokens can be organized to make valid shell programs.

To get a good overview of the steps the input goes through to get to an AST representation, we recommend the excellent Crafting Interpreters online book (at least "Scanning", "Representing Code" and "Parsing Expressions" chapters).

Token Recognition

What are tokens and how to get from characters to tokens is pretty well defined by the SCL. It even defines an algorithm for it.

Grammar

A grammar is a document which describes which inputs, once delimited into tokens, are part of a language. If an input is not part of the language, then it is invalid and must trigger a syntax error.

The SCL defines a grammar in a format that is fairly difficult to read and implement. It was transcribed into another form, which is meant to be easier to work with.

The format we use to write grammars is called EBNF. Here is a quick overview of this notation:

  • the grammar is a set of rules, which all define a set of valid inputs.
  • each rule starts like so: rule_name = and ends with a ;.
  • the input rule is the start of the grammar.
  • | represents alternative options.
  • if something is between [ these brackets ], it is optional.
  • if something is between { these curly brackets }, it may be repeated 0 to infinite times.
  • if something is between ( these parentheses ), it is grouped.
  • 'a' is the literal character a, and 'abc' the literal string abc.
  • EOF is the end of the input.
  • rules can appear inside other rules (and thus may be recursive).

You do not have to implement everything in this grammar, you can stick to what you can execute.

The grammar is designed this way to make the AST (Abstract Syntax Tree) building process easier, but you may add custom rules to factor repetitions of token sequences.

input =
list '\n'
| list EOF
| '\n'
| EOF
;

list = and_or { ( ';' | '&' ) and_or } [ ';' | '&' ] ;

and_or = pipeline { ( '&&' | '||' ) {'\n'} pipeline } ;

pipeline = ['!'] command { '|' {'\n'} command } ;

command =
simple_command
| shell_command { redirection }
| funcdec { redirection }
;

simple_command =
prefix { prefix }
| { prefix } WORD { element }
;

shell_command =
'{' compound_list '}'
| '(' compound_list ')'
| rule_for
| rule_while
| rule_until
| rule_case
| rule_if
;

funcdec = WORD '(' ')' {'\n'} shell_command ;

redirection =
[IONUMBER] ( '>' | '<' | '>>' | '>&' | '<&' | '>|' | '<>' ) WORD
| [IONUMBER] ( '<<' | '<<-' ) HEREDOC
;

prefix =
ASSIGNMENT_WORD
| redirection
;

element =
WORD
| redirection
;

compound_list =
{'\n'} and_or { ( ';' | '&' | '\n' ) {'\n'} and_or } [ ';' | '&' ] {'\n'} ;


rule_for =
'for' WORD ( [';'] | [ {'\n'} 'in' { WORD } ( ';' | '\n' ) ] ) {'\n'} 'do' compound_list 'done' ;

rule_while = 'while' compound_list 'do' compound_list 'done' ;

rule_until = 'until' compound_list 'do' compound_list 'done' ;

rule_case = 'case' WORD {'\n'} 'in' {'\n'} [case_clause] 'esac' ;

rule_if = 'if' compound_list 'then' compound_list [ else_clause ] 'fi' ;

else_clause =
'else' compound_list
| 'elif' compound_list 'then' compound_list [ else_clause ]
;

case_clause = case_item { ';;' {'\n'} case_item } [';;'] {'\n'} ;

case_item = ['('] WORD { '|' WORD } ')' {'\n'} [compound_list] ;

Reserved Words

Since the separator at the end of the compound_list rule is optional, the grammar may appear to accept commands such as:

  if echo foo then echo bar fi

However, you must bear in mind that reserved words, such as if, then and fi are only recognized at the start of a command. As such, the places where reserved words can be detected would be:

  if echo foo then echo bar fi
^ ^

Indeed, every word after the first echo is part of its arguments. Thus, then and fi are not special and do not end the condition.

However, adding a ; ends the current command and starts another one, which creates an opportunity for reserved words to appear:

  if echo foo; then echo bar; fi
^ ^ ^ ^ ^

When To Implement Parts

Every feature mentioned in the assignments part of the subject will refer to parts of this grammar to list every new token and grammar rule involved in the feature. However, keep in mind that you will eventually handle most of this grammar; so, to avoid regression, take into account rules in their full recursive hierarchy, and do not deviate from the grammar to make handling a case simpler.