Important: Pyodide takes time to initialize.
Initialization completion is indicated by a red border around the Run all button.
We have previously seen how to parse strings using both context-free
grammars and combinatory parsers.
However, languages such as Python and Haskell cannot be directly parsed by
these parsers. This is because they use indentation levels to indicate
nested statement groups.
For example, given:
```python
if True:
    x = 100
    y = 200
```
Python groups x = 100 and y = 200 together, and the fragment is parsed as
equivalent to
```c
if True: {
    x = 100;
    y = 200;
}
```
in a C-like language. This use of indentation is hard to capture in
context-free grammars.
Interestingly, it turns out that there is an easy solution: we can simply
keep track of indentation and de-indentation to identify groups.
The idea here is to first use a lexical analyzer to translate the source code
into tokens, and then post-process these tokens to insert Indent and
Dedent tokens. Hence, we start by defining our lexical analyzer. It turns out
that our combinatory parser works really well as a lexical analyzer.
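For instance, for the `if True:` example above, the post-processing step would transform the raw token stream roughly as follows (a sketch; the token names here are illustrative):

```
raw:       if True : NL WS(4) x = 100 NL WS(4) y = 200 NL
processed: if True : NL INDENT x = 100 NL y = 200 NL DEDENT
```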
As before, we start by importing our prerequisite packages.
Available Packages
These are packages that refer either to my previous posts or to pure Python
packages that I have compiled, and they are available at the locations below.
As before, install them if you need to run the program directly on your
machine. To install, simply download the wheel file (`pkg.whl`) and install it
using `pip install pkg.whl`.
We start by defining a minimal set of tokens necessary to lex a simple
language (a sketch of one possible encoding follows this list).

- Numeric literals represent numbers.
- Quoted literals represent strings.
- Punctuation represents operators and other punctuation.
- Name represents function and variable names, and other names in the program.
- We also need to represent newlines and whitespace.
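As a concrete illustration, here is one minimal way such tokens could be defined (a sketch; the class and kind names are my own, not necessarily those used in the earlier posts):

```python
import enum

class TokenKind(enum.Enum):
    NUMBER = 'number'    # numeric literals
    STRING = 'string'    # quoted literals
    PUNCT  = 'punct'     # operators and other punctuation
    NAME   = 'name'      # function, variable, and other names
    NL     = 'newline'   # end of a physical line
    WS     = 'space'     # whitespace (significant at line start)
    INDENT = 'indent'    # inserted later by post-processing
    DEDENT = 'dedent'    # inserted later by post-processing

class Tok:
    def __init__(self, kind, value):
        self.kind, self.value = kind, value
    def __repr__(self):
        return '%s(%r)' % (self.kind.name, self.value)
```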
With these, we can define our tokenizer as follows. A lexical token can be
anything that we previously defined.
And the source string can contain any number of such tokens.
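The earlier posts use their own combinator library; as a self-contained stand-in, here is a sketch in which each lexeme parser matches a regular expression at a position, and a token is an alternation over the lexemes (all names here are my own):

```python
import re

def lexeme(kind, pattern):
    # Match `pattern` at position i; return (token, new position) or None.
    regex = re.compile(pattern)
    def parse(s, i):
        m = regex.match(s, i)
        if m is None: return None
        return Tok(kind, m.group(0)), m.end()
    return parse

def any_of(*parsers):
    # Alternation combinator: the first parser that succeeds wins.
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None: return r
        return None
    return parse

number  = lexeme(TokenKind.NUMBER, r'[0-9]+(\.[0-9]+)?')
qstring = lexeme(TokenKind.STRING, r'"[^"]*"')
punct   = lexeme(TokenKind.PUNCT,  r'[-+*/=<>:;,(){}\[\]]')
name    = lexeme(TokenKind.NAME,   r'[A-Za-z_][A-Za-z_0-9]*')
newline = lexeme(TokenKind.NL,     r'\n')
ws      = lexeme(TokenKind.WS,     r'[ \t]+')

# A lexical token is any one of the alternatives defined above.
token = any_of(qstring, number, name, punct, newline, ws)
```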
Tokenize
We can now define our tokenizer as follows.
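Building on the sketch above, the tokenizer simply applies the token parser repeatedly until the input is exhausted:

```python
def tokenize(s):
    i, toks = 0, []
    while i < len(s):
        r = token(s, i)
        if r is None:
            raise SyntaxError('unexpected character %r at %d' % (s[i], i))
        tok, i = r
        toks.append(tok)
    return toks
```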
Indents
Next, we want to insert indentation and de-indentation tokens. We do that by
keeping a stack of seen indentation levels. If the new indentation level is
greater than the current indentation level, we push the new level onto the
stack and emit an Indent token. If the indentation level is smaller, then we
pop the stack, emitting a Dedent token for each level popped, until we reach
the correct indentation level.
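A sketch of this stack-based post-processing, built on the token definitions above (whitespace-only lines are not special-cased here, though a real lexer would skip them):

```python
def insert_indents(tokens):
    stack = [0]                  # indentation levels seen so far; 0 is base
    out, line_start = [], True
    for t in tokens:
        if line_start and t.kind != TokenKind.NL:
            width = len(t.value) if t.kind == TokenKind.WS else 0
            if width > stack[-1]:               # deeper: open a new block
                stack.append(width)
                out.append(Tok(TokenKind.INDENT, width))
            while width < stack[-1]:            # shallower: close blocks
                stack.pop()
                out.append(Tok(TokenKind.DEDENT, width))
            if width != stack[-1]:
                raise SyntaxError('inconsistent dedent')
            line_start = False
        if t.kind == TokenKind.WS:
            continue                             # whitespace itself is dropped
        if t.kind == TokenKind.NL:
            line_start = True
        out.append(t)
    for _ in range(len(stack) - 1):              # close blocks open at EOF
        out.append(Tok(TokenKind.DEDENT, 0))
    return out
```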
We can now extract the indentation-based blocks as follows.
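For example, one way to do this is to recursively group the tokens between matching Indent and Dedent markers into nested lists (a sketch):

```python
def extract_blocks(tokens, i=0):
    # Returns (block, next index); each Indent..Dedent span becomes a
    # nested list inside its parent block.
    # Usage: extract_blocks(insert_indents(tokenize(src)))[0]
    block = []
    while i < len(tokens):
        t = tokens[i]
        if t.kind == TokenKind.INDENT:
            inner, i = extract_blocks(tokens, i + 1)
            block.append(inner)
        elif t.kind == TokenKind.DEDENT:
            return block, i + 1
        else:
            block.append(t)
            i += 1
    return block, i
```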
At this point, we can apply a standard context-free parser to the produced
token stream. We use a simple combinatory parser for that.
Combinatory parser
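Again as a stand-in for the combinator library from the earlier posts, here is a sketch of token-level combinators together with a toy grammar in which an if-statement's body is delimited by Indent and Dedent; the grammar is illustrative and covers just the running example:

```python
def expect(pred, label):
    # Match a single token satisfying `pred`.
    def parse(toks, i):
        if i < len(toks) and pred(toks[i]):
            return (label, toks[i].value), i + 1
        return None
    return parse

def kind(k):    return expect(lambda t: t.kind == k, k.name)
def literal(v): return expect(lambda t: t.value == v, v)

def seq(*ps):
    # Sequencing combinator: every parser must succeed in order.
    def parse(toks, i):
        trees = []
        for p in ps:
            r = p(toks, i)
            if r is None: return None
            tree, i = r
            trees.append(tree)
        return trees, i
    return parse

def alt(*ps):
    # Alternation combinator over token parsers.
    def parse(toks, i):
        for p in ps:
            r = p(toks, i)
            if r is not None: return r
        return None
    return parse

def many(p):
    # Zero or more repetitions of p.
    def parse(toks, i):
        trees = []
        while True:
            r = p(toks, i)
            if r is None: return trees, i
            tree, i = r
            trees.append(tree)
    return parse

def stmt(toks, i): return alt(if_stmt, assign)(toks, i)

assign = seq(kind(TokenKind.NAME), literal('='),
             alt(kind(TokenKind.NUMBER), kind(TokenKind.NAME)),
             kind(TokenKind.NL))

# The block after the colon is bracketed by Indent/Dedent tokens,
# exactly like braces in a C-like language.
if_stmt = seq(literal('if'), kind(TokenKind.NAME), literal(':'),
              kind(TokenKind.NL), kind(TokenKind.INDENT),
              many(stmt), kind(TokenKind.DEDENT))
```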
For display
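A small helper (again a sketch) to pretty-print the nested lists and (label, value) pairs produced by the parser above:

```python
def display_tree(node, depth=0):
    if isinstance(node, list):
        for child in node:
            display_tree(child, depth + 1)
    else:
        label, value = node
        print('  ' * depth + '%s: %r' % (label, value))
```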
Tokenizing
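Using the sketches above, tokenizing the running example might look like this:

```python
src = '''\
if True:
    x = 100
    y = 200
'''
toks = insert_indents(tokenize(src))
print(toks)
```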
Parsing
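And parsing the resulting token stream with the toy grammar:

```python
result = stmt(toks, 0)
assert result is not None
tree, remaining = result
display_tree(tree)
```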
Artifacts
The runnable Python source for this notebook is available here.