Contents

  1. Tokens
  2. Tokenize
  3. Indents
  4. Combinatory parser
  5. Artifacts

We have previously seen how to parse strings using both context-free grammars and combinatory parsers. However, languages such as Python and Haskell cannot be directly parsed by these parsers because they use indentation levels to indicate nested statement groups. For example, given:

if True:
   x = 100
   y = 200

Python groups x = 100 and y = 200 together, and the fragment is parsed as equivalent to

if True: {
   x = 100;
   y = 200;
}

in a C-like language. This use of indentation is hard to capture in context-free grammars. Interestingly, it turns out that there is an easy solution: we can simply keep track of indentation and de-indentation to identify groups. The idea is to first use a lexical analyzer to translate the source code into tokens, and then post-process these tokens to insert Indent and Dedent tokens. Hence, we start by defining our lexical analyzer. It turns out that our combinatory parser works quite well as a lexical analyzer. As before, we start by importing our prerequisite packages.

Available Packages These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
  1. combinatoryparser-0.0.1-py2.py3-none-any.whl from "Simple Combinatory Parsing For Context Free Languages".
  2. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
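The block below is a rough sketch of this setup. In the notebook, this is where the packages above would be imported; since I am not reproducing the exact API of the combinatoryparser wheel here, the sketch instead defines a few stand-in combinators in the same list-of-successes style. Every name in it (`any_of`, `lit`, `seq`, `alt`, `many1`) is a placeholder of my own, not necessarily what the package exports.

```python
# A parser here is a function from a string to a list of
# (parsed_value, remaining_string) pairs -- the "list of successes" style.
# These definitions merely stand in for the combinatoryparser package.

def any_of(chars):
    # match any single character that appears in chars
    return lambda s: [(s[0], s[1:])] if s and s[0] in chars else []

def lit(c):
    # match exactly the single character c
    return any_of(c)

def seq(*ps):
    # match every parser in order, collecting their values into a list
    def parse(s):
        results = [([], s)]
        for p in ps:
            results = [(vs + [v], rest_)
                       for vs, rest in results
                       for v, rest_ in p(rest)]
        return results
    return parse

def alt(*ps):
    # try each alternative parser and gather all of their results
    return lambda s: [r for p in ps for r in p(s)]

def many1(p):
    # one or more repetitions of p, longest matches first
    def parse(s):
        out = []
        for v, rest in p(s):
            for vs, rest_ in parse(rest):
                out.append(([v] + vs, rest_))
            out.append(([v], rest))
        return out
    return parse
```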


Tokens

We start by defining a minimal set of tokens necessary to lex a simple language.
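One minimal representation, used by the sketches below, is a (kind, lexeme) pair. The helpers `tokenized` and `flatten` are my own naming, not from the original notebook: they wrap a character-level parser so that the characters it matches are joined into a single tagged token.

```python
def flatten(v):
    # yield the individual characters from the nested lists built by seq/many1
    if isinstance(v, list):
        for x in v:
            yield from flatten(x)
    else:
        yield v

def tokenized(kind, p):
    # wrap a character-level parser so its match becomes one (kind, lexeme) token
    def parse(s):
        return [((kind, ''.join(flatten(v))), rest) for v, rest in p(s)]
    return parse
```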



Numeric literals represent numbers.
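A sketch using the stand-in combinators from above; neither signs nor floating point are handled here.

```python
import string

# A numeric literal is one or more digits.
digit = any_of(string.digits)
numeric = tokenized('Numeric', many1(digit))

assert numeric('123 + 4')[0] == (('Numeric', '123'), ' + 4')
```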





Quoted literals represent strings.
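A sketch of a quoted literal, without escape sequence handling.

```python
# A quoted literal is a double quote, a run of non-quote characters,
# and a closing double quote.
not_quote = lambda s: [(s[0], s[1:])] if s and s[0] != '"' else []
quoted = tokenized('Quoted', seq(lit('"'), many1(not_quote), lit('"')))

assert quoted('"hi" x')[0] == (('Quoted', '"hi"'), ' x')
```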





Punctuation tokens represent operators and other punctuation marks.
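A sketch in which each punctuation token is a single character; the exact character set is illustrative.

```python
# Punctuation covers operators and delimiters, each a single character here.
punctuation = tokenized('Punct', any_of(set('+-*/%=<>()[]{}:,;.')))

assert punctuation('= 1')[0] == (('Punct', '='), ' 1')
```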





Name represents function and variable names, and other names in the program.
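A sketch of names as runs of identifier characters.

```python
import string

# A name is a run of letters, digits, and underscores. The tokenizer tries
# numeric literals first, so a pure run of digits never becomes a name.
name = tokenized('Name', many1(any_of(string.ascii_letters + string.digits + '_')))

assert name('x = 1')[0] == (('Name', 'x'), ' = 1')
```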



We also need to represent new lines and whitespace.
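A sketch of both: newlines end logical lines, and runs of spaces carry the indentation width that the post-processing step later turns into Indent and Dedent tokens.

```python
newline = tokenized('NL', lit('\n'))
whitespace = tokenized('WS', many1(lit(' ')))
```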



With these, we can define our tokenizer as follows. A lexical token can be any of the tokens that we previously defined.
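A sketch, with numeric tried before name so that digit runs are classified as numbers.

```python
lexical_token = alt(numeric, quoted, punctuation, name, newline, whitespace)
```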



And the source string can contain any number of such tokens.
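In the same spirit, the token stream of a whole source string could be described as a repetition of lexical_token; the tokenizer sketched in the next section instead uses a greedy longest-match loop rather than enumerating every possible tokenization.

```python
# All tokenizations of a source string; the tokenizer below picks the greedy one.
tokens = many1(lexical_token)
```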



Tokenize

We can now define our tokenizer as follows.
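A sketch of a greedy tokenizer over the definitions above: at each position it takes the longest match produced by lexical_token.

```python
def tokenize(src):
    # Greedy lexer: repeatedly take the longest lexical_token match.
    result = []
    while src:
        matches = lexical_token(src)
        if not matches:
            raise SyntaxError('cannot tokenize: %r' % src[:10])
        tok, src = max(matches, key=lambda m: len(m[0][1]))
        result.append(tok)
    return result

print(tokenize('if True:\n   x = 100\n'))
```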





Indents

Next, we want to insert the Indent and Dedent tokens. We do that by keeping a stack of seen indentation levels. If the new indentation level is greater than the current indentation level, we push the new indentation level onto the stack. If the indentation level is smaller, then we pop the stack until we reach the matching indentation level.
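A sketch of this stack-based insertion; the function name `insert_indents` is my own.

```python
def insert_indents(toks):
    # Keep a stack of seen indentation widths. At the start of each line,
    # compare the leading whitespace with the top of the stack: push and emit
    # an Indent if it grew, pop and emit Dedents until it matches if it shrank.
    # (Blank lines are not treated specially in this sketch.)
    stack, result, at_line_start = [0], [], True
    for kind, lexeme in toks:
        if at_line_start:
            width = len(lexeme) if kind == 'WS' else 0
            if width > stack[-1]:
                stack.append(width)
                result.append(('Indent', ''))
            while width < stack[-1]:
                stack.pop()
                result.append(('Dedent', ''))
            at_line_start = False
        if kind == 'WS':
            continue                    # whitespace is dropped once indents are recorded
        if kind == 'NL':
            at_line_start = True
        result.append((kind, lexeme))
    while stack[-1] > 0:                # close any blocks still open at end of input
        stack.pop()
        result.append(('Dedent', ''))
    return result
```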



We can now extract the indentation-based blocks as follows.
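One way to sketch this is to treat Indent and Dedent as brackets and group each block into its own nested list.

```python
def extract_blocks(toks, i=0):
    # Treat Indent/Dedent as opening and closing brackets, and group the
    # tokens of each indented block into a nested list.
    block = []
    while i < len(toks):
        kind, lexeme = toks[i]
        if kind == 'Indent':
            inner, i = extract_blocks(toks, i + 1)
            block.append(inner)
        elif kind == 'Dedent':
            return block, i + 1
        else:
            block.append((kind, lexeme))
            i += 1
    return block, i

blocks, _ = extract_blocks(insert_indents(tokenize('if True:\n   x = 100\n   y = 200\n')))
print(blocks)
```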





At this point, we can apply a standard context-free parser to the produced tokens. We use a simple combinatory parser for that.

Combinatory parser
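The sketch below lifts the same combinator style from characters to tokens, and then defines a tiny grammar of assignments and if statements over the Indent/Dedent-annotated token stream. The grammar and all names here are illustrative, not the grammar from the original notebook.

```python
# Token-level combinators: a parser now consumes a list of (kind, lexeme)
# tuples instead of a string, but the shape is otherwise the same.
def t(kind, lexeme=None):
    # match one token of the given kind (and lexeme, if specified)
    def parse(toks):
        if toks and toks[0][0] == kind and (lexeme is None or toks[0][1] == lexeme):
            return [(toks[0], toks[1:])]
        return []
    return parse

def t_seq(*ps):
    # match every parser in order, collecting their values into a list
    def parse(toks):
        results = [([], toks)]
        for p in ps:
            results = [(vs + [v], rest_)
                       for vs, rest in results
                       for v, rest_ in p(rest)]
        return results
    return parse

def t_alt(*ps):
    # try each alternative parser
    return lambda toks: [r for p in ps for r in p(toks)]

def t_many(p):
    # zero or more repetitions of p
    def parse(toks):
        out = []
        for v, rest in p(toks):
            for vs, rest_ in parse(rest):
                out.append(([v] + vs, rest_))
        out.append(([], toks))
        return out
    return parse

# A tiny illustrative grammar over the token stream.
def expr(toks):
    return t_alt(t('Numeric'), t('Quoted'), t('Name'))(toks)

def assignment(toks):
    return t_seq(t('Name'), t('Punct', '='), expr, t('NL'))(toks)

def block(toks):
    return t_seq(t('Indent'), t_many(statement), t('Dedent'))(toks)

def if_stmt(toks):
    return t_seq(t('Name', 'if'), expr, t('Punct', ':'), t('NL'), block)(toks)

def statement(toks):
    return t_alt(if_stmt, assignment)(toks)

def program(toks):
    return t_many(statement)(toks)
```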







For display
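A small helper (my own naming) to show the nested parse result.

```python
def display(tree, depth=0):
    # Print the nested parse structure, one indentation level per nesting.
    if isinstance(tree, list):
        for node in tree:
            display(node, depth + 1)
    else:
        print('  ' * depth + repr(tree))
```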



Tokenizing
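As an illustration, running the sketched tokenizer and the Indent/Dedent insertion on the example from the beginning:

```python
src = '''\
if True:
   x = 100
   y = 200
'''
toks = insert_indents(tokenize(src))
print(toks)
```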



Parsing
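And the sketched grammar can then parse the resulting token stream; only parses that consume every token are kept.

```python
trees = [tree for tree, rest in program(toks) if not rest]
display(trees[0])
```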



Artifacts

The runnable Python source for this notebook is available here.