Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
How hard is parsing a context-free1 language? In this post, I will try to provide an overview of one of the simplest parsing techniques of all – recursive descent parsing by hand.
This type of parsing uses mutually recursive procedures to parse a subset of context-free languages using a top-down approach. Hence, this kind of parsing is also called a top-down recursive descent parser. The grammar is restricted to only those rules that can be expressed without recursion in the left-most term.
Here is a simple grammar for a language that can parse nested expressions, with the restriction that
the expressions elements can only be
1 and only addition is supported for simplicity.
E = T "+" E | T T = "1" | "(" E ")"
This grammar can parse expressions such as
To start parsing, we need a bit of infrastructure. In particular, we need the ability to tell where
we are currently (
cur_position), the ability to backtrack to a previous position, and the ability
to tell when the input is complete. I use global variables for ease of discussion, and to avoid having
to commit too much to Python syntax and semantics. Use the mechanisms available in your language to
modularize your parser.
We also need the ability to extract the
next token (in this case, the next element in the input array).
Another convenience we use is the ability to
match a token to a given symbol.
Once we have all these, the core part of parsing is two procedures. The first tries to match a sequence of terms one by one. If the match succeeds, then we return success. If not, then we signal failure and exit.
The other corresponds to the alternatives for each production. If any alternative succeeds, then the parsing succeeds.
With this, we are now ready to write our parser. Since we are writing a top-down recursive descent parser, we
start with the axiom rule
E which contains two alternatives.
E_2 are simple sequential rules
T is similar
And each alternative in
T gets defined correspondingly.
We also need terminals, which is again simple enough
The only thing that remains is to define the parser
The interesting part is that, our infrastructure can be readily turned to parse much more complex grammars, with almost one-to-one rewriting of each rule. For example, here is a slightly more complex grammar:
term = fact mul_op term | fact fact = digits | "(" expr ")" digits = digit digits | digit digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 add_op = "+" | "-" mul_op = "*" | "/"
Its conversion is almost automatic
Indeed, it is close enough to automatic, that we can make it fully automatic. We first define the
grammar as a data-structure for convenience. I hope I don’t need to convince you that I could have
easily loaded it as a
JSON file, or even parsed the BNF myself if necessary from an external file.
Using the grammar just means that we have to slightly modify our core procedures.
Using it is same as before:
Bringing it all together,
Of course, one usually wants to do something with the parsed output. However, given that the procedures are organized in a top-down fashion, saving the resulting expressions is relatively trivial.
The parser we create is not really interpreting the grammar as a Context-Free Grammar. Rather, it uses the grammar as if it is written using another formalism called Parsing Expression Grammar. However, an important subclass of context-free languages in real world – LL(*) – can be completely represented using PEG. Hence, the title is not completely wrong. ↩