TLDR; This tutorial takes you through the steps to write a simple context-free grammar that can parse your custom data format. The Python interpreter is embedded so that you can work through the implementation steps.

Definitions

For this post, we use the following terms as we have defined previously:

  • The alphabet is the set of all symbols in the input language. For example, in this post, we use all ASCII characters as the alphabet.

  • A terminal is a single alphabet symbol. For example, x is a terminal symbol.
  • A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion. For example, <term> is a nonterminal symbol.
  • A rule is a finite sequence of terms (terminals and nonterminals) that describes one expansion of a given nonterminal. For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.
  • A definition is a set of rules that describe the expansion of a given nonterminal. For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>.
  • A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of each nonterminal. The grammar given below is an example context-free grammar.
  • A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives a string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, each term deriving a contiguous substring in order (this process is also called parsing).
  • A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.

  • A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.

Contents

  1. Definitions
  2. Grammar for Regular Expressions
  3. Artifacts


Available Packages These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and they are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
  2. earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".


Say you are designing a custom data format. You use this format to parse data from your clients or customers, who are required to provide the data in the language you have designed. You will come across this requirement whenever you have a system that interacts with human beings. In most cases, you may be able to manage with simple, well-defined formats such as comma-separated values, XML, or JSON. However, there may be cases where the customers want something that is closer to their domain. In such cases, what is the best way forward? This is what this post will explore.

Designing a grammar can be intimidating at first, particularly if you have gone through one of the traditional compiler design and implementation courses at university. In this post, I will attempt to convince you that formal grammars can be much easier to write and work with. Furthermore, parsing with them can be far simpler than writing your own recursive descent parser from scratch, and far more secure.

The idea we explore in this post is that, rather than starting with a parser, a better alternative is to start with a generator (a grammar fuzzer).

Grammar for Regular Expressions

As an example, let us say that you are building a grammar for simple regular expressions. We take the simplest regular expression possible: a single character. Let us build a generator for such simple regular expressions. We will return a derivation tree as the result.
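Here is one way such a generator could look. This is a minimal sketch: the (symbol, children) tuple shape of the derivation tree and the names gen_regex and gen_alpha are my assumptions for illustration.

```python
import random
import string

# A derivation tree node is a (symbol, children) tuple; terminal
# leaves carry an empty list of children.
def gen_alpha():
    return ('<alpha>', [(random.choice(string.ascii_lowercase), [])])

def gen_regex():
    return ('<regex>', [gen_alpha()])
```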



This works as expected.
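A hypothetical run of the sketch above (seeded for reproducibility):

```python
random.seed(1)
print(gen_regex())
# e.g. ('<regex>', [('<alpha>', [('q', [])])])
```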



We define a function to collapse the derivation tree back into its string equivalent.
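A sketch of such a collapse function, under the same tree-shape assumption as before:

```python
def collapse(tree):
    # Concatenate the terminal leaves of the tree, in order.
    symbol, children = tree
    if not children:
        return symbol
    return ''.join(collapse(c) for c in children)
```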



Using it
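```python
t = gen_regex()
print(t)
print(collapse(t))  # a single random character, e.g. 'q'
```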



So let us try to extend our implementation a little bit. A regular expression is not just a single character. It can be any number of such characters, or a sequence of subexpressions. So, let us define that.
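One way to express this, continuing the sketch: a <cex> is one or more <exp>s in sequence, and (for now) an <exp> is just a single character. The nonterminal names mirror the ones used later in this post.

```python
def gen_exp():
    return ('<exp>', [gen_alpha()])

def gen_cex():
    # A <cex> is an <exp>, optionally followed by another <cex>.
    if random.random() < 0.5:
        return ('<cex>', [gen_exp(), gen_cex()])
    return ('<cex>', [gen_exp()])

def gen_regex():
    return ('<regex>', [gen_cex()])
```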



This again works as expected.
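For example:

```python
random.seed(3)
print(collapse(gen_regex()))  # a short random string, e.g. 'fkx'
```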



Regular expressions allow one to specify alternative matching with the pipe symbol. That is, a|b matches either a or b. So, let us define that.
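A possible sketch: with some probability, emit <cex> '|' <regex>; otherwise emit just a <cex>.

```python
def gen_regex():
    # <regex> is either <cex> '|' <regex>, or just <cex>.
    if random.random() < 0.5:
        return ('<regex>', [gen_cex(), ('|', []), gen_regex()])
    return ('<regex>', [gen_cex()])
```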



This again works as expected.
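For example:

```python
print(collapse(gen_regex()))  # e.g. 'ab|c'
```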



Next, we will implement parenthesized expressions. That is, (a|b) is a parenthesized expression that matches either a or b, and it is considered a single expression by any metacharacter that comes after it.
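A sketch: a <uniexp> is either a plain character or a parenthesized <regex>, and <exp> now expands to <uniexp>.

```python
def gen_parenexp():
    return ('<parenexp>', [('(', []), gen_regex(), (')', [])])

def gen_uniexp():
    if random.random() < 0.5:
        return ('<uniexp>', [gen_alpha()])
    return ('<uniexp>', [gen_parenexp()])

def gen_exp():
    return ('<exp>', [gen_uniexp()])
```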



Note that choosing between the recursive and non-recursive alternatives with symmetric probabilities can make the recursion run away and exhaust the stack. So, we use the following trick to avoid that: bias the choice heavily toward the non-recursive alternative.
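A possible version of that trick (the 0.9 bias is an arbitrary choice on my part; any bias that keeps the expected tree size finite works):

```python
def gen_uniexp():
    # Recurse into a parenthesized subexpression only rarely, so the
    # expected size of the generated tree stays finite.
    if random.random() < 0.9:
        return ('<uniexp>', [gen_alpha()])
    return ('<uniexp>', [gen_parenexp()])
```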



Testing it again.
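```python
print(collapse(gen_regex()))  # e.g. 'a(b|cd)e'
```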



All that remains is to define the Kleene star * for zero or more repetitions.
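A sketch, adding a <regexstar> nonterminal and letting <exp> choose between a starred and a plain <uniexp> (the 0.3 probability is again an arbitrary choice):

```python
def gen_regexstar():
    return ('<regexstar>', [gen_uniexp(), ('*', [])])

def gen_exp():
    # Occasionally wrap the expression in a Kleene star.
    if random.random() < 0.3:
        return ('<exp>', [gen_regexstar()])
    return ('<exp>', [gen_uniexp()])
```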



Testing it again.
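```python
for _ in range(5):
    print(collapse(gen_regex()))  # now with occasional '*', e.g. '(ab)*c'
```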



At this point, you have a fully working test generator that is ready to test any parser you write, and it can be improved upon for any extensions you might want. More importantly, you can easily convert the generator you wrote into a grammar for your data format. That is, each subroutine name becomes the name of a nonterminal symbol.

E.g. gen_cex() to <cex>

The branches (the alternatives chosen at random) become rule alternatives:

E.g. <uniexp>: [[<alpha>], [<parenexp>]]

Sequences of procedure calls become sequences of terms in a rule.

E.g. <parenexp>: [['(', <regex>, ')']]

With these steps, we have the following grammar.
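Here is the grammar, written in the dictionary format used by the posts linked above. The exact layout is a reconstruction from the generator we built; treat the <alpha> definition (all lowercase letters) as an assumption.

```python
import string

REGEX_GRAMMAR = {
    '<regex>':     [['<cex>', '|', '<regex>'], ['<cex>']],
    '<cex>':       [['<exp>', '<cex>'], ['<exp>']],
    '<exp>':       [['<regexstar>'], ['<uniexp>']],
    '<regexstar>': [['<uniexp>', '*']],
    '<uniexp>':    [['<alpha>'], ['<parenexp>']],
    '<parenexp>':  [['(', '<regex>', ')']],
    '<alpha>':     [[c] for c in string.ascii_lowercase],
}
REGEX_START = '<regex>'
```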



We can now use it to parse regular expressions. Using our previously defined PEG parser,
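A sketch of the parsing step. I assume here the interface of the EarleyParser from the "Earley Parser" post listed above, i.e. EarleyParser(grammar).parse_on(text, start) yielding derivation trees; if you use the PEG parser from the earlier post, substitute its interface.

```python
import earleyparser  # wheel linked in Available Packages above

ep = earleyparser.EarleyParser(REGEX_GRAMMAR)
for tree in ep.parse_on('(ab|c)*', REGEX_START):
    print(tree)
```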



Using it
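Assuming the parser returns trees in the same (symbol, children) shape we have been using, we can check the round trip described in the definitions above: collapsing a parsed derivation tree should give back the original string.

```python
s = 'ab|cd*'
for tree in ep.parse_on(s, REGEX_START):
    assert collapse(tree) == s
    print(collapse(tree))
```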



However, there are some caveats to the grammars produced this way. The main caveat is that you have to be careful about how you order your rules. In general, build your language such that the first token produced by each of the rules in the same definition is different. This will give you LL(1) grammars, which are the easiest to use.

If the above rule can't be adhered to, and you have two rules such that the match of one will be a prefix of the other, order the rules in the grammar such that the longest match always comes first.

For example,

 <exp> := <cex> '|' <exp>
        | <cex>

or equivalently

'<exp>' : [['<cex>', '|', '<exp>'], ['<cex>']]

Such grammars are called Parsing Expression Grammars (PEGs). The idea is that the first rule is given priority over the later rules; hence, the longest matching rule should come first.

Artifacts

The runnable Python source for this notebook is available here.