Representing a Grammar in Python
In the previous posts, I described how can write a parser. For doing that, I made use of a grammar written as a python data structure, with the assumption that it can be loaded as a JSON file if necessary. The grammar looks like this:
Contents
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
However, this is somewhat unsatisfying. There are too many distracting syntactic elements in the code, making it difficult to see where each elements are. What we want is a better representation. Indeed, one better representation is to lose one level of nesting, and instead, parse the string for terminals and non-terminals. The following representation uses a single string for a single rule, and a list for all alternative rules for a key.
But is there a better way? Ideally, one would like to define the grammar like one defines the class, so that it feels part of the language.
One mechanism we can (ab)use is the type annotations. Specifically in Python 3.7
one can use the postponed evaluation of annotations to accomplish a DSL as below, with grammar keys as attributes of the Grammar class.
class expr(Grammar):
start: '<expr>'
expr: '<term> + <term>' | '<term> - <term>'
term: '<factor} * <term>' | '<factor> / <term>'
factor: '( <expr> )' | '<integer>'
integer: '<digit> <integer}' | '<digit>'
digit: '0' | '1' | '2'
The annotations lets us access the types of each class as a string, that can be evaluated separately. The Grammar
class is defined as follows:
Unfortunately, to be able to use the annotation feature,
we need to place import from __future__ import annotations
at the top of the
file. So, while we can do that in practice, it is difficult to do that in this
blog post form. Hence, I put the contents of my grammar into a string, and
evaluate that string instead.
Given all this, to print the grammar in a readable form is simply:
We need to make our grammar available in the standard format. So, let us define such an accessor.
Using
This works well. Let us extend our grammar with a few convenience methods.
For example, we want to specify separate processing if the nonterminal starts
with a meta character such as +
, *
, $
or @
.
We define any nonterminal that starts with *
to mean zero or more of its
nonterminal without the meta character.
Usage
Artifacts
The runnable Python source for this notebook is available here.