How hard is parsing a context-free language? - recursive descent parsing by hand

Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

How hard is parsing a context-free¹ language? In this post, I will try to provide an overview of one of the simplest parsing techniques of all – recursive descent parsing by hand.

This type of parsing uses mutually recursive procedures to parse a subset of context-free languages using a top-down approach. Hence, this kind of parsing is also called a top-down recursive descent parser. The grammar is restricted to only those rules that can be expressed without recursion in the left-most term.

Here is a simple grammar for a language that can parse nested expressions, with the restriction that the expressions elements can only be 1 and only addition is supported for simplicity.

<E> ::= <T> "+" <E>
  | <T>
<T> = "1"
  | "(" <E> ")"

This grammar can parse expressions such as 1, 1+1, 1+1+1+1 etc.

To start parsing, we need a bit of infrastructure. In particular, we need the ability to tell where we are currently (cur_position), the ability to backtrack to a previous position, and the ability to tell when the input is complete. I use global variables for ease of discussion, and to avoid having to commit too much to Python syntax and semantics. Use the mechanisms available in your language to modularize your parser.

We also need the ability to extract the next token (in this case, the next element in the input array).

Another convenience we use is the ability to match a token to a given symbol.

With these, we can now translate our grammar directly. We first define terminals

Next, Each alternate expansion is defined as a procedure

We then hook up these alternate expansions in a single procedure.

Same with T.

Now, we define our parser

Using it

We can also abstract the above sequence using tow procedures. The first tries to match a sequence of terms one by one. If the match succeeds, then we return success. If not, then we signal failure and exit.

The other corresponds to the alternatives for each production. If any alternative succeeds, then the parsing succeeds.

We will now write our parser as follows.

E_1 and E_2 are fairly simple

Defining T is similar

And each alternative in T gets defined correspondingly.

We define our parser using E_()

Using it:

The interesting part is that, our infrastructure can be readily turned to parse much more complex grammars, with almost one-to-one rewriting of each rule. For example, here is a slightly more complex grammar:

<term> ::= <fact> <mul_op> <term>
     | <fact>

<fact> ::=  <digits>
     | "(" <expr> ")"

<digits> ::= <digit> <digits>
      | <digit>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<add_op> ::= "+" | "-"
<mul_op> ::= "*" | "/"

Its conversion is almost automatic

Using it:

Indeed, it is close enough to automatic, that we can make it fully automatic. We first define the grammar as a data-structure for convenience. I hope I don’t need to convince you that I could have easily loaded it as a JSON file, or even parsed the BNF myself if necessary from an external file.

Using the grammar just means that we have to slightly modify our core procedures.

Using it is same as before:

Bringing it all together,

Using

Of course, one usually wants to do something with the parsed output. However, given that the procedures are organized in a top-down fashion, saving the resulting expressions is relatively trivial.

Artifacts

The runnable Python source for this notebook is available here.

The parser we create is not really interpreting the grammar as a Context-Free Grammar. Rather, it uses the grammar as if it is written using another formalism called Parsing Expression Grammar. However, an important subclass of context-free languages in real world – LL(*) – can be completely represented using PEG. Hence, the title is not completely wrong. ↩

Contents

Artifacts