Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Combinatory parsing (i.e parsing with combinators) was introduced by William H Burge in his seminal work Recursive Programming Techniques – The systems programming series in 1975. Unfortunately, it took until 2001 for the arrival of Parsec, and for combinatory programming to be noticed by the wider world.

While Parsec is a pretty good library, it is often hard to understand for programmers who are not familiar with Haskell. So, the question is, can one explain the concept of simple combinatory parsing without delving to Haskell and Monads? Here is my attempt.

The idea of combinatory parsing is really simple. We start with the smallest parsers that do some work — parsing single characters, then figure out how to combine them to produce larger and larger parsers.

Note that since we are dealing with context-free parsers, ambiguity in parsing is a part of life. Hence, we have to keep a list of possible parses at all times. A failure to parse is then naturally represented by an empty list, and a single parse item is a tuple that contains the remaining input as the first element, and the parse result as the second. This particular approach was pioneered by Wadler.

So, here, we start with the basic parser – one that parses a single character a. The return values are as expected — a single empty list for failure to parse and a list of parses (a single element here because there is only one way to parse a).



While this is a good start, we do not want to rewrite our parser each time we want to parse a new character. So, we define our parser generator for single literal parsers Lit.



The Lit(c) captures the character c passed in, and returns a new function that parses specifically that literal.

We can use it as follows — note that we need to split a string into characters using list before it can be passed to the parser:



That is, the input (the first part) is completely consumed, leaving an empty array, and the parsed result is ['a']. Can it parse the literal b?



which prints nothing — that is, the parser was unable to consume any input.

We define a convenience method to get only parsed results.



Using it as follows:



AndThen

Next, we define how to concatenate two parsers. That is, the result from one parser becomes the input of the next. The idea is that the first parser would have generated a list of successful parses until different locations. The second parser now tries to advance the location of parsing. If successful, the location is updated. If not successful, the parse is dropped. That is, given a list of chars ab and a parser that is AndThen(Lit('a'), Lit('b')), AndThen first applies Lit('a') which results in an array with a single element as the succssful parse: [([['b'], ['a']])] The first elment of the item is the remaining list ['b']. Now, the second parser is applied on it, which takes it forward to [([[], ['a', 'b']])]. If instead, we were parsing ac, then the result of the first parse will be the same as before. But the second parse will not succeed. Hence this item will be dropped resulting in an empty array.



This parser can be used in the following way:



OrElse

Finally we define how alternatives expansions are handled. Each parser operates on the input string from the beginning independently. So, the implementation is simple. Note that it is here that the main distinguishing feature of a context free parser compared to parsing expressions are found. We try all possible ways to parse a string irrespective of whether previous rules managed to parse it or not. Parsing expressions on the other hand, stop at the first success.



It can be used as follows:



With this, our parser is fairly usable. We only need to retrieve complete parses as below.



Note that the order in which AndThen is applied is not shown in the results. This is a consequence of the way we defined AndThen using pr1+pr2 deep inside the return from the parse. If we wanted to keep the distinction, we could have simply used [pr1, pr2] instead.

Recursion

This is not however, complete. One major issue is that recursion is not implemented. One way we can implement recursion is to make everything lazy. The only difference in these implementations is how we unwrap the parsers first i.e we use p1() instead of p1.

AndThen



OrElse

Similar to AndThen, since p1 and p2 are now lambda calls that need to be evaluated to get the actual function.



We are now ready to test our new parsers.

Simple parenthesis language

We define a simple language to parse (1). Note that we can define these either using the lambda syntax or using def. Since it is easier to debug using def, and we are giving these parsers a name anyway, we use def.



We can now parse the simple expression (1)



Next, we extend our definitions to parse the recursive language with any number of parenthesis. That is, it should be able to parse (1), ((1)), and others. As you can see, lambda protects Paren from being evaluated too soon.



Note that at this point, we do not really have a labelled parse tree. The way to add it is the following. We first define Apply that can be applied at particular parse points.



We can now define the function that will be accepted by Apply



We define the actual parser for the literal 1 as below:



Similarly, we update the paren1 parser.



It is used as follows



Similarly we update Paren



Used as thus:



Now, we are ready to try something adventurous. Let us allow a sequence of parenthesized ones.



We check if our new parser works



That seems to have worked!.

All this is pretty cool. But none of this looks as nice as the Parsec examples we see. Can we apply some syntactic sugar and make it read nice? Do we have to go for the monadic concepts to get the syntactic sugar? Never fear! we have the solution. We simply define a class that incorporates some syntactic sugar on top.



It can be used as follows





Apply also works with this



Used as follows



Note that one has to be careful about the precedence of operators. In particular, if you mix and match >> and |, always use parenthesis to disambiguate.

All together



The simple parenthesis language



Remaining

What is missing at this point? Left recursion!. However, there is something even more interesting here. If you remember my previous post about minimal PEG parsers you might notice the similarity between the AndThen() and unify_rule() and OrElse() and unify_key(). That is, in context-free combinatory parsing (unlike PEG), the OrElse ensures that results of multiple attempts are kept. The AndThen is a lighter version of the unify_rule in that it tries to unify two symbols at a time rather than any number of symbols. However, this should convince you that we can translate one to the other easily. That is, how to limit OrElse so that it does the ordered choice of PEG parsing or how to modify unify_key and unify_rule so that we have a true context free grammar parser rather than a PEG parser.