How hard is parsing a context-free language? - recursive descent parsing by hand
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
How hard is parsing a context-free¹ language? In this post, I will try to provide an overview of one of the simplest parsing techniques of all – recursive descent parsing by hand.
This type of parsing uses mutually recursive procedures to parse a subset of context-free languages using a top-down approach. Hence, this kind of parsing is also called a top-down recursive descent parser. The grammar is restricted to only those rules that can be expressed without recursion in the left-most term.
Here is a simple grammar for a language of nested expressions, with the restriction that, for simplicity, the only expression element is "1" and the only operation supported is addition.
```
<E> ::= <T> "+" <E>
      | <T>
<T> ::= "1"
      | "(" <E> ")"
```
This grammar can parse expressions such as `1+1` or `(1+(1+1))`.
To start parsing, we need a bit of infrastructure. In particular, we need the ability to tell where
we are currently (
cur_position), the ability to backtrack to a previous position, and the ability
to tell when the input is complete. I use global variables for ease of discussion, and to avoid having
to commit too much to Python syntax and semantics. Use the mechanisms available in your language to
modularize your parser.
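A minimal sketch of this infrastructure might look as follows (the names `my_input`, `cur_position`, and `remaining` are my assumptions):

```python
my_input = ['1', '+', '1']   # the token array being parsed (example input)
cur_position = 0             # index of the next unread token

def remaining():
    # There is more input to parse while unread tokens remain.
    # Backtracking is simply restoring cur_position to a saved value.
    return cur_position < len(my_input)
```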
We also need the ability to extract the
next token (in this case, the next element in the input array).
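For instance, a sketch under the assumed names from above:

```python
my_input = ['1', '+', '1']   # example input
cur_position = 0

def next_token():
    # Return the token at the current position and advance past it,
    # or None when the input is exhausted.
    global cur_position
    if cur_position >= len(my_input):
        return None
    cur_position += 1
    return my_input[cur_position - 1]
```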
Another convenience we use is the ability to
match a token to a given symbol.
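A sketch of such a helper (the name `match` is an assumption; `next_token` is repeated so the sketch runs on its own):

```python
my_input = ['(', '1', ')']   # example input
cur_position = 0

def next_token():
    global cur_position
    if cur_position >= len(my_input):
        return None
    cur_position += 1
    return my_input[cur_position - 1]

def match(symbol):
    # Consume the next token and report whether it equals the expected symbol.
    # Note: this consumes a token even on a failed match; the caller
    # backtracks by restoring cur_position to a previously saved value.
    return next_token() == symbol
```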
With these, we can now translate our grammar directly. We first define terminals
Next, each alternate expansion is defined as a procedure.
We then hook up these alternate expansions in a single procedure.
Same with T.
Now, we define our parser.
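The steps above can be combined into one self-contained sketch (all names are assumptions; backtracking works by saving and restoring `cur_position` before trying the next alternative):

```python
my_input, cur_position = [], 0

def next_token():
    global cur_position
    if cur_position >= len(my_input):
        return None
    cur_position += 1
    return my_input[cur_position - 1]

def match(symbol):
    return next_token() == symbol

# <E> ::= <T> "+" <E>
def E_1():
    return T() and match('+') and E()

# <E> ::= <T>
def E_2():
    return T()

def E():
    global cur_position
    saved = cur_position      # remember where we started
    if E_1():
        return True
    cur_position = saved      # first alternative failed: backtrack, retry
    return E_2()

# <T> ::= "1"
def T_1():
    return match('1')

# <T> ::= "(" <E> ")"
def T_2():
    return match('(') and E() and match(')')

def T():
    global cur_position
    saved = cur_position
    if T_1():
        return True
    cur_position = saved
    return T_2()

def parse(tokens):
    global my_input, cur_position
    my_input, cur_position = list(tokens), 0
    # succeed only if <E> matches and the input is fully consumed
    return E() and cur_position == len(my_input)
```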
We can also abstract the above sequence using two procedures. The first tries to match a sequence of terms one by one. If every term matches, then we return success. If not, then we signal failure and exit.
The other corresponds to the alternatives for each production. If any alternative succeeds, then the parsing succeeds.
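One possible sketch of these two procedures (the names `do_seq` and `do_alt` are my assumptions):

```python
cur_position = 0

def do_seq(seq):
    # Try to match each term in order; fail as soon as one term fails.
    return all(fn() for fn in seq)

def do_alt(alts):
    # Try each alternative in order, backtracking between attempts.
    global cur_position
    saved = cur_position
    for fn in alts:
        if fn():
            return True
        cur_position = saved
    return False
```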
We will now write our parser as follows.
E_1 and E_2 are fairly simple, T is similar, and each alternative in T gets defined correspondingly.
We define our parser using these procedures.
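A self-contained sketch of this rewritten parser might look as follows (helper names such as `do_seq` and `do_alt` are assumptions):

```python
my_input, cur_position = [], 0

def next_token():
    global cur_position
    if cur_position >= len(my_input):
        return None
    cur_position += 1
    return my_input[cur_position - 1]

def match(symbol):
    return next_token() == symbol

def do_seq(seq):
    # match a sequence of terms one by one
    return all(fn() for fn in seq)

def do_alt(alts):
    # try each alternative, backtracking between attempts
    global cur_position
    saved = cur_position
    for fn in alts:
        if fn():
            return True
        cur_position = saved
    return False

def E_1(): return do_seq([T, lambda: match('+'), E])   # <E> ::= <T> "+" <E>
def E_2(): return T()                                  # <E> ::= <T>
def E():   return do_alt([E_1, E_2])

def T_1(): return match('1')                           # <T> ::= "1"
def T_2(): return do_seq([lambda: match('('), E,       # <T> ::= "(" <E> ")"
                          lambda: match(')')])
def T():   return do_alt([T_1, T_2])

def parse(tokens):
    global my_input, cur_position
    my_input, cur_position = list(tokens), 0
    return E() and cur_position == len(my_input)
```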
The interesting part is that our infrastructure can readily be adapted to parse much more complex grammars, with an almost one-to-one rewriting of each rule. For example, here is a slightly more complex grammar:
```
<expr>   ::= <term> <add_op> <expr>
           | <term>
<term>   ::= <fact> <mul_op> <term>
           | <fact>
<fact>   ::= <digits>
           | "(" <expr> ")"
<digits> ::= <digit> <digits>
           | <digit>
<digit>  ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<add_op> ::= "+" | "-"
<mul_op> ::= "*" | "/"
```
Its conversion is almost automatic.
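For example, reusing the two helper procedures described above, the converted parser might be sketched like this (all procedure names are my assumptions; the `lambda d=d` trick pins down the loop variable for each digit alternative):

```python
my_input, cur_position = [], 0

def next_token():
    global cur_position
    if cur_position >= len(my_input):
        return None
    cur_position += 1
    return my_input[cur_position - 1]

def match(symbol):
    return next_token() == symbol

def do_seq(seq):
    return all(fn() for fn in seq)

def do_alt(alts):
    global cur_position
    saved = cur_position
    for fn in alts:
        if fn():
            return True
        cur_position = saved
    return False

# <expr> ::= <term> <add_op> <expr> | <term>
def expr(): return do_alt([lambda: do_seq([term, add_op, expr]), term])

# <term> ::= <fact> <mul_op> <term> | <fact>
def term(): return do_alt([lambda: do_seq([fact, mul_op, term]), fact])

# <fact> ::= <digits> | "(" <expr> ")"
def fact():
    return do_alt([digits,
                   lambda: do_seq([lambda: match('('), expr,
                                   lambda: match(')')])])

# <digits> ::= <digit> <digits> | <digit>
def digits(): return do_alt([lambda: do_seq([digit, digits]), digit])

# <digit> ::= "0" | ... | "9"
def digit(): return do_alt([(lambda d=d: match(d)) for d in '0123456789'])

def add_op(): return do_alt([lambda: match('+'), lambda: match('-')])
def mul_op(): return do_alt([lambda: match('*'), lambda: match('/')])

def parse(tokens):
    global my_input, cur_position
    my_input, cur_position = list(tokens), 0
    return expr() and cur_position == len(my_input)
```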
Indeed, it is close enough to automatic that we can make it fully automatic. We first define the grammar as a data structure for convenience. I hope I don't need to convince you that I could just as easily have loaded it from a JSON file, or even parsed the BNF from an external file myself if necessary.
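One possible encoding (the exact shape is my assumption) maps each nonterminal to its list of alternative expansions, where each alternative is a list of terms:

```python
grammar = {
    '<expr>':   [['<term>', '<add_op>', '<expr>'], ['<term>']],
    '<term>':   [['<fact>', '<mul_op>', '<term>'], ['<fact>']],
    '<fact>':   [['<digits>'], ['(', '<expr>', ')']],
    '<digits>': [['<digit>', '<digits>'], ['<digit>']],
    '<digit>':  [[d] for d in '0123456789'],   # one alternative per digit
    '<add_op>': [['+'], ['-']],
    '<mul_op>': [['*'], ['/']],
}
```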
Using the grammar just means that we have to slightly modify our core procedures.
Using it is the same as before:
Bringing it all together:
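A grammar-driven version of the core procedures might be sketched like this (a sketch under assumed names; the `unify` helper dispatches on whether a term is a nonterminal, i.e. a key in the grammar, or a terminal):

```python
grammar = {
    '<expr>':   [['<term>', '<add_op>', '<expr>'], ['<term>']],
    '<term>':   [['<fact>', '<mul_op>', '<term>'], ['<fact>']],
    '<fact>':   [['<digits>'], ['(', '<expr>', ')']],
    '<digits>': [['<digit>', '<digits>'], ['<digit>']],
    '<digit>':  [[d] for d in '0123456789'],
    '<add_op>': [['+'], ['-']],
    '<mul_op>': [['*'], ['/']],
}

my_input, cur_position = [], 0

def match(symbol):
    # consume the next token only if it equals the expected terminal
    global cur_position
    if cur_position < len(my_input) and my_input[cur_position] == symbol:
        cur_position += 1
        return True
    return False

def do_seq(seq):
    # every term in the sequence must match in order
    return all(unify(t) for t in seq)

def do_alt(key):
    # try each alternative expansion of a nonterminal, backtracking between tries
    global cur_position
    saved = cur_position
    for alt in grammar[key]:
        if do_seq(alt):
            return True
        cur_position = saved
    return False

def unify(term):
    return do_alt(term) if term in grammar else match(term)

def parse(tokens, start='<expr>'):
    global my_input, cur_position
    my_input, cur_position = list(tokens), 0
    return do_alt(start) and cur_position == len(my_input)
```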
Of course, one usually wants to do something with the parsed output. However, given that the procedures are organized in a top-down fashion, saving the resulting expressions is relatively trivial.
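For instance, a minimal sketch of one way to save the result (all names are assumptions; recast without globals for brevity): each call returns a `(node, position)` pair on success, or `None` on failure, so the successful derivation accumulates into a tree.

```python
grammar = {
    '<E>': [['<T>', '+', '<E>'], ['<T>']],
    '<T>': [['1'], ['(', '<E>', ')']],
}

def unify(term, tokens, pos):
    if term in grammar:                          # nonterminal: try alternatives
        for alt in grammar[term]:
            children, p = [], pos
            for t in alt:
                res = unify(t, tokens, p)
                if res is None:
                    break                        # this alternative failed
                node, p = res
                children.append(node)
            else:
                return (term, children), p       # every term matched
        return None
    if pos < len(tokens) and tokens[pos] == term:   # terminal
        return (term, []), pos + 1
    return None

def parse(tokens):
    res = unify('<E>', list(tokens), 0)
    if res and res[1] == len(tokens):
        return res[0]                            # the parse tree
    return None
```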
The runnable Python source for this notebook is available here.
The parser we create is not really interpreting the grammar as a context-free grammar. Rather, it uses the grammar as if it were written in another formalism called Parsing Expression Grammar (PEG). However, an important subclass of context-free languages in the real world – LL(*) – can be completely represented using PEG. Hence, the title is not completely wrong. ↩