Simple Combinatory Parsing For Context Free Languages
Combinatory parsing (i.e., parsing with combinators) was introduced by William H. Burge in his seminal 1975 work Recursive Programming Techniques (The Systems Programming Series). (It was called Parsing Relations in that book; see page 174.) Unfortunately, it took until the arrival of Parsec in 2001 for combinatory parsing to be noticed by the wider world.
While Parsec is a pretty good library, it is often hard to understand for programmers who are not familiar with Haskell. So, the question is: can one explain the concept of simple combinatory parsing without delving into Haskell and monads? Here is my attempt.
The idea of combinatory parsing is really simple. We start with the smallest parsers that do some work, namely parsers for single characters, and then figure out how to combine them to produce larger and larger parsers.
Note that since we are dealing with context-free parsers, ambiguity in parsing is a part of life. Hence, we have to keep a list of possible parses at all times. A failure to parse is then naturally represented by an empty list, and a single parse item is a tuple that contains the remaining input as the first element, and the parse result as the second. This particular approach was pioneered by Wadler.
So, here, we start with the basic parser: one that parses a single character `a`. The return values are as expected: an empty list for a failure to parse, and a list of parses (a single element here because there is only one way to parse `a`).
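A minimal sketch of what such a parser could look like, assuming the conventions above (the function name here is illustrative):

```python
# A sketch of the most basic parser: it accepts a list of characters and
# returns a list of (remaining input, parse result) tuples.
def parse_a(chars):
    if chars and chars[0] == 'a':
        return [(chars[1:], ['a'])]   # a single successful parse
    return []                         # failure: no parses
```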
While this is a good start, we do not want to rewrite our parser each time we want to parse a new character. So, we define our parser generator for single literal parsers, `Lit`. `Lit(c)` captures the character `c` passed in, and returns a new function that parses specifically that literal.
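A possible sketch of such a generator, following the same conventions (the details in the post may differ):

```python
# Lit(c) returns a parser that accepts exactly the character c.
def Lit(c):
    def parse(chars):
        if chars and chars[0] == c:
            return [(chars[1:], [c])]
        return []
    return parse
```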
We can use it as follows. Note that we need to split a string into characters using `list` before it can be passed to the parser.
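For instance, with the `Lit` sketch above, the call could look like this:

```python
# Split the string into a list of characters, then parse a single 'a'.
print(Lit('a')(list('a')))   # [([], ['a'])]
```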
That is, the input (the first part) is completely consumed, leaving an empty array, and the parsed result is `['a']`.
Can it parse the literal `b`?
which prints nothing — that is, the parser was unable to consume any input.
We define a convenience method to get only parsed results.
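One possible shape for such a helper, assuming the parse items described earlier (the name is illustrative):

```python
# Keep only the parse results, discarding the remaining-input component.
def results_of(parses):
    return [parsed for (_remaining, parsed) in parses]
```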
Using it as follows:
AndThen
Next, we define how to concatenate two parsers. That is, the result from one parser becomes the input of the next. The idea is that the first parser would have generated a list of successful parses, each ending at a different location in the input. The second parser now tries to advance the location of parsing. If successful, the location is updated. If not, that parse is dropped.

That is, given a list of chars `ab` and the parser `AndThen(Lit('a'), Lit('b'))`, `AndThen` first applies `Lit('a')`, which results in an array with a single element as the successful parse: `[(['b'], ['a'])]`. The first element of the item is the remaining list `['b']`. Now, the second parser is applied on it, which takes it forward to `[([], ['a', 'b'])]`. If instead we were parsing `ac`, the result of the first parse would be the same as before, but the second parse would not succeed. Hence this item would be dropped, resulting in an empty array.
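A sketch of `AndThen` under these conventions: each remaining input produced by the first parser is handed to the second parser, and continuations that fail simply disappear from the list.

```python
def AndThen(p1, p2):
    def parse(instr):
        # For every parse of p1, try to extend it with p2 (note pr1 + pr2).
        return [(rest2, pr1 + pr2)
                for (rest1, pr1) in p1(instr)
                for (rest2, pr2) in p2(rest1)]
    return parse
```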
This parser can be used in the following way:
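A possible invocation, using the `Lit` and `AndThen` sketches above:

```python
print(AndThen(Lit('a'), Lit('b'))(list('ab')))   # [([], ['a', 'b'])]
print(AndThen(Lit('a'), Lit('b'))(list('ac')))   # []
```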
OrElse
Finally, we define how alternative expansions are handled. Each parser operates on the input string from the beginning, independently. So, the implementation is simple. Note that it is here that the main distinguishing feature of a context-free parser compared to parsing expressions is found: we try all possible ways to parse a string, irrespective of whether previous rules managed to parse it or not. Parsing expressions, on the other hand, stop at the first success.
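A sketch of `OrElse` in the same style: both alternatives are applied to the same input, and every successful parse from either branch is kept.

```python
def OrElse(p1, p2):
    def parse(instr):
        # Results from both branches are concatenated, preserving ambiguity.
        return p1(instr) + p2(instr)
    return parse
```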
It can be used as follows:
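For instance, with the sketches above:

```python
AB = OrElse(Lit('a'), Lit('b'))
print(AB(list('a')))   # [([], ['a'])]
print(AB(list('b')))   # [([], ['b'])]
```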
With this, our parser is fairly usable. We only need to retrieve complete parses as below.
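One way to do this, assuming the representation above, is to keep only those parses whose remaining input is empty:

```python
# A parse is complete when it has consumed the entire input.
def complete_parses(parses):
    return [parsed for (remaining, parsed) in parses if not remaining]
```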
Note that the order in which `AndThen` is applied is not shown in the results. This is a consequence of the way we defined `AndThen` using `pr1 + pr2` deep inside the return from the parse. If we wanted to keep the distinction, we could have simply used `[pr1, pr2]` instead.
Recursion
This is not, however, complete. One major issue is that recursion is not implemented. One way we can implement recursion is to make everything lazy. The only difference in these implementations is how we unwrap the parsers first; i.e., we use `p1()` instead of `p1`.
AndThen
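A sketch of the lazy variant: `p1` and `p2` are now zero-argument callables (thunks) that return parsers, so a recursive definition is not evaluated until parsing actually reaches it.

```python
def AndThen(p1, p2):
    def parse(instr):
        # p1() and p2() unwrap the thunks to obtain the actual parsers.
        return [(rest2, pr1 + pr2)
                for (rest1, pr1) in p1()(instr)
                for (rest2, pr2) in p2()(rest1)]
    return parse
```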
OrElse
Similar to `AndThen`: `p1` and `p2` are now lambdas that need to be called to get the actual parsing function.
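Correspondingly, a sketch of the lazy `OrElse`:

```python
def OrElse(p1, p2):
    def parse(instr):
        # Unwrap the thunks, then try both alternatives on the same input.
        return p1()(instr) + p2()(instr)
    return parse
```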
We are now ready to test our new parsers.
Simple parenthesis language
We define a simple language to parse `(1)`. Note that we can define these parsers either using the `lambda` syntax or using `def`. Since it is easier to debug using `def`, and we are giving these parsers a name anyway, we use `def`.
We can now parse the simple expression `(1)`.
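For instance, with the sketch above:

```python
print(Paren1()(list('(1)')))   # [([], ['(', '1', ')'])]
```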
Next, we extend our definitions to parse the recursive language with any number of parentheses. That is, it should be able to parse `(1)`, `((1))`, and others. As you can see in the sketch below, `lambda` protects `Paren` from being evaluated too soon.
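A sketch of the recursive extension under the same assumptions; the `lambda` wrapper delays the reference to `Paren`, so the definitions can refer to each other without recursing at construction time:

```python
# An expression is either a bare '1' or a parenthesized expression.
def Expr():
    return OrElse(One, lambda: Paren())

def Paren():
    return AndThen(lambda: AndThen(Open, lambda: Expr()), Close)

# Expr()(list('((1))')) => [([], ['(', '(', '1', ')', ')'])]
```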
Using
Note that at this point, we do not really have a labeled parse tree. The way to add it is the following. We first define `Apply`, which can be applied at particular parse points.
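A minimal sketch of what `Apply` could look like (the exact signature in the post may differ): it runs a parser and post-processes each successful result with the supplied function, leaving the remaining input untouched.

```python
def Apply(fn, parser):
    def parse(instr):
        return [(rest, fn(parsed)) for (rest, parsed) in parser(instr)]
    return parse
```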
We can now define the function that will be accepted by `Apply`.
We define the actual parser for the literal `1` as below:
Similarly, we update the `paren1` parser.
It is used as follows
The fuzzer package contains the tools to display trees.
We first define a wrapper for display.
It is used as follows
Similarly, we update `Paren`.
Used thus:
Now, we are ready to try something adventurous. Let us allow a sequence of parenthesized ones.
We check if our new parser works
That seems to have worked!
All this is pretty cool. But none of this looks as nice as the Parsec examples we see. Can we apply some syntactic sugar and make it read nicely? Do we have to go for the monadic concepts to get the syntactic sugar? Never fear! We have the solution. We simply define a class that incorporates some syntactic sugar on top.
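A minimal sketch of such a class, assuming the lazy combinators sketched earlier (the class name and details are illustrative): `>>` sequences two parsers and `|` expresses alternatives.

```python
class P:
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, instr):
        return self.parser(instr)

    def __rshift__(self, other):   # self >> other : sequencing
        return P(AndThen(lambda: self.parser, lambda: other.parser))

    def __or__(self, other):       # self | other : alternatives
        return P(OrElse(lambda: self.parser, lambda: other.parser))
```

With this sketch, the simple expression could be written as `(P(Lit('(')) >> P(Lit('1'))) >> P(Lit(')'))`.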
It can be used as follows
`Apply` also works with this
Used as follows
Note that one has to be careful about the precedence of operators. In particular, if you mix and match `>>` and `|`, always use parentheses to disambiguate.
We also define a regular expression matcher for completeness. Note that unlike our other parsers, the regular expression matcher is greedy: it matches as much as it can, and produces a single match. Consequently, if we have a regular expression `/^a+/` and we are provided with `'aaa'`, rather than producing `a`, `aa`, and `aaa` as matches, it produces only `aaa`.
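A sketch of such a matcher, built on Python's `re` module over the joined character list (the name and details are assumptions):

```python
import re

# Re(pattern) matches the pattern greedily at the start of the input and
# produces at most one parse: the longest matching prefix.
def Re(pattern):
    def parse(chars):
        m = re.match(pattern, ''.join(chars))
        if m is None:
            return []
        matched = m.group(0)
        return [(chars[len(matched):], list(matched))]
    return parse
```

With this sketch, `Re('a+')(list('aaa'))` would produce `[([], ['a', 'a', 'a'])]`.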
Used as follows
The simple parenthesis language
Remaining
What is missing at this point? Left recursion!
However, there is something even more interesting here. If you remember my previous post about minimal PEG parsers, you might notice the similarity between `AndThen()` and `unify_rule()`, and between `OrElse()` and `unify_key()`.
That is, in context-free combinatory parsing (unlike PEG), the `OrElse` ensures that results of multiple attempts are kept. The `AndThen` is a lighter version of `unify_rule` in that it tries to unify two symbols at a time rather than any number of symbols. However, this should convince you that we can translate one into the other easily: that is, we can limit `OrElse` so that it does the ordered choice of PEG parsing, or modify `unify_key` and `unify_rule` so that we have a true context-free grammar parser rather than a PEG parser.
Artifacts
The runnable Python source for this notebook is available here.
The installable Python wheel `combinatoryparser` is available here.