Simple Combinatory Parsing For Context Free Languages
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
Combinatory parsing (i.e parsing with combinators) was introduced by William H Burge in his seminal work Recursive Programming Techniques – The systems programming series in 1975. (It was called Parsing Relations in that book. See page 174). Unfortunately, it took until 2001 for the arrival of Parsec, and for combinatory programming to be noticed by the wider world.
While Parsec is a pretty good library, it is often hard to understand for programmers who are not familiar with Haskell. So, the question is, can one explain the concept of simple combinatory parsing without delving to Haskell and Monads? Here is my attempt.
The idea of combinatory parsing is really simple. We start with the smallest parsers that do some work — parsing single characters, then figure out how to combine them to produce larger and larger parsers.
Note that since we are dealing with context-free parsers, ambiguity in parsing is a part of life. Hence, we have to keep a list of possible parses at all times. A failure to parse is then naturally represented by an empty list, and a single parse item is a tuple that contains the remaining input as the first element, and the parse result as the second. This particular approach was pioneered by Wadler.
So, here, we start with the basic parser – one that parses a single character
a. The return values are as expected — a single empty list for failure to
parse and a list of parses (a single element here because there is only one way
While this is a good start, we do not want to rewrite our parser each time we
want to parse a new character. So, we define our parser generator for single
Lit(c) captures the character
c passed in, and returns a new function
that parses specifically that literal.
We can use it as follows — note that we need to split a string into characters
list before it can be passed to the parser:
That is, the input (the first part) is completely consumed, leaving an empty
array, and the parsed result is
Can it parse the literal
which prints nothing — that is, the parser was unable to consume any input.
We define a convenience method to get only parsed results.
Using it as follows:
Next, we define how to concatenate two parsers. That is, the result from one
parser becomes the input of the next. The idea is that the first parser would
have generated a list of successful parses until different locations. The second
parser now tries to advance the location of parsing. If successful, the location
is updated. If not successful, the parse is dropped. That is, given a list of
ab and a parser that is
Lit('a') which results in an array with a single element as the successful
[([['b'], ['a']])] The first element of the item is the remaining list
Now, the second parser is applied on it, which takes it forward to
[([, ['a', 'b']])].
If instead, we were parsing
ac, then the result of the first parse will be the same
as before. But the second parse will not succeed. Hence this item will be dropped resulting
in an empty array.
This parser can be used in the following way:
Finally we define how alternatives expansions are handled. Each parser operates on the input string from the beginning independently. So, the implementation is simple. Note that it is here that the main distinguishing feature of a context free parser compared to parsing expressions are found. We try all possible ways to parse a string irrespective of whether previous rules managed to parse it or not. Parsing expressions on the other hand, stop at the first success.
It can be used as follows:
With this, our parser is fairly usable. We only need to retrieve complete parses as below.
Note that the order in which
AndThen is applied is not shown in the results.
This is a consequence of the way we defined
inside the return from the parse. If we wanted to keep the distinction, we
could have simply used
[pr1, pr2] instead.
This is not however, complete. One major issue is that recursion is not
implemented. One way we can implement recursion is to make everything lazy.
The only difference in these implementations is how we unwrap the parsers first
i.e we use
p1() instead of
AndThen, since p1 and p2 are now
lambda calls that need to be evaluated to get the actual function.
We are now ready to test our new parsers.
Simple parenthesis language
We define a simple language to parse
(1). Note that we can define these either using the
lambda syntax or using
def. Since it is easier to debug using
def, and we are giving
these parsers a name anyway, we use
We can now parse the simple expression
Next, we extend our definitions to parse the recursive language with any number of parenthesis.
That is, it should be able to parse
((1)), and others. As you
Paren from being evaluated too soon.
Note that at this point, we do not really have a labeled parse tree. The way to add it is the following. We first define
can be applied at particular parse points.
We can now define the function that will be accepted by
We define the actual parser for the literal
1 as below:
Similarly, we update the
It is used as follows
The fuzzer package contains the tools to display trees.
Available PackagesThese are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
We first define a wrapper for display
It is used as follows
Similarly we update
Now, we are ready to try something adventurous. Let us allow a sequence of parenthesized ones.
We check if our new parser works
That seems to have worked!.
All this is pretty cool. But none of this looks as nice as the Parsec examples we see. Can we apply some syntactic sugar and make it read nice? Do we have to go for the monadic concepts to get the syntactic sugar? Never fear! we have the solution. We simply define a class that incorporates some syntactic sugar on top.
It can be used as follows
Apply also works with this
Used as follows
Note that one has to be careful about the precedence of operators. In
particular, if you mix and match
|, always use parenthesis
We also define a regular expression matcher for completeness. Note that
unlike our other parsers, regular expression matcher is greedy, and matches
as much as it can match, and produces a single match. Consequently if we have
a regular expression
/^a+/ and we are provided with
'aaa', rather than
a, aa, aaa as matches, it produces only
Used as follows
The simple parenthesis language
What is missing at this point? Left recursion!.
However, there is something even more interesting here. If you remember my
previous post about minimal PEG parsers you
might notice the similarity between the
That is, in context-free combinatory parsing (unlike PEG), the
ensures that results of multiple attempts are kept. The
AndThen is a lighter
version of the
unify_rule in that it tries to unify two symbols at a time
rather than any number of symbols. However, this should convince you that we
can translate one to the other easily. That is, how to limit
OrElse so that
it does the ordered choice of PEG parsing or how to modify
unify_rule so that we have a true context free grammar parser rather than a
The runnable Python source for this notebook is available here.
The installable python wheel
combinatoryparser is available here.