Incorporating Indentation Parsing in Standard Parsers -- PEG

Contents
1. Delimited Parser
   1. Text
   2. PEG
2. Indentation Based Parser
   1. IText
3. Artifacts

We previously saw how to incorporate indentation sensitive parsing into combinatory parsers. Two things made that solution somewhat unsatisfactory. The first is that we had to run a lexer first and generate lexical tokens before we could actually parse. This forces us to deal with two different kinds of grammars: the lexical grammar of tokens and the parsing grammar. Given that we have reasonably complete grammar parsers such as the PEG parser and the Earley parser, it would be nicer if we could reuse these parsers somehow. The second problem is that combinatory parsers can be difficult to debug.

So, can we incorporate indentation sensitive parsing into more standard parsers? It turns out that it is fairly simple to retrofit Python-like parsing onto standard grammar parsers. In this post we will see how to do that for PEG parsers. (The PEG parser post contains the background information on PEG parsers.) That is, given

if True:
   if False:
      x = 100
      y = 200
z = 300

we want to parse it similarly to

if True: {
   if False: {
      x = 100;
      y = 200;
   }
}
z = 300;

in a C-like language. As before, we start by importing our prerequisite packages.

Available Packages

These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.

simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".

Delimited Parser

We first define our grammar.
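The grammar listing itself is not reproduced here. As a rough illustration only, a delimited (brace and semicolon) grammar could be written as a Python dictionary in the shape that `unify_key` in this post consumes: each key is a nonterminal, each value is a list of alternative rules, and each rule is a list of tokens. The names `delimited_grammar` and `undefined_nonterminals` are hypothetical and exist only for this sketch.

```python
import string

# Hypothetical delimited grammar (an assumption, not the post's own listing).
# A PEG tries alternatives in order, so the longer alternative comes first.
delimited_grammar = {
    '<start>': [['<stmts>']],
    '<stmts>': [['<stmt>', '<stmts>'], ['<stmt>']],
    '<stmt>': [['<if>'], ['<assign>']],
    '<if>': [['if ', '<name>', ':', '{', '<stmts>', '}']],
    '<assign>': [['<name>', '=', '<num>', ';']],
    '<name>': [['<letter>', '<name>'], ['<letter>']],
    '<num>': [['<digit>', '<num>'], ['<digit>']],
    '<letter>': [[c] for c in string.ascii_letters],
    '<digit>': [[c] for c in string.digits],
}

def undefined_nonterminals(grammar):
    """Return tokens that look like <nonterminals> but have no definition."""
    used = {tok for rules in grammar.values() for rule in rules for tok in rule}
    return {t for t in used
            if t.startswith('<') and t.endswith('>') and t not in grammar}

# A quick sanity check that every nonterminal used in a rule is defined.
print(undefined_nonterminals(delimited_grammar))
```

Any token that is not a key of the dictionary is treated as a terminal by the parser, which is what later lets special tokens be passed straight to the text stream.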

Text

We want a stream of text that we can manipulate where needed. This stream will allow us to control our parsing.
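The stream class is not shown above, but the later code calls `Text(text)` and `text.advance(key)`, and expects either a new stream or `None` back. A minimal sketch consistent with those calls might look as follows; the field names `text` and `at` are borrowed from the `IText` class later in this post, and the rest is an assumption.

```python
class Text:
    """An immutable cursor over a string (a sketch, not the original class)."""
    def __init__(self, text, at=0):
        self.text, self.at = text, at

    def advance(self, t):
        # If `t` occurs at the cursor, return a new stream positioned after
        # it. Otherwise return None and leave this stream untouched.
        if self.text[self.at:self.at + len(t)] == t:
            return Text(self.text, self.at + len(t))
        return None

    def __repr__(self):
        return repr(self.text[:self.at]) + '|' + repr(self.text[self.at:])

s = Text('if x')
t = s.advance('if ')
print(t)               # 'if '|'x'
print(s.advance('x'))  # None
```

Because `advance` never mutates the stream, the parser can backtrack simply by going back to an earlier stream object.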

Next, we modify our PEG parser so that we can use the text stream instead of the array.

PEG
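As a hedged sketch of what a stream-based PEG parser could look like, the code below mirrors the `unify_key` and `unify_rule` methods of the visualizer in this post with the logging removed. It does no memoization, the toy grammar is hypothetical, and `Text` is the assumed stream class described above.

```python
class Text:
    def __init__(self, text, at=0):
        self.text, self.at = text, at

    def advance(self, t):
        if self.text[self.at:self.at + len(t)] == t:
            return Text(self.text, self.at + len(t))
        return None

class peg_parser:
    def __init__(self, grammar):
        self.grammar = grammar

    def unify_key(self, key, text):
        if key not in self.grammar:
            # Terminal: let the stream decide whether it can be consumed.
            nxt = text.advance(key)
            return (nxt, (key, [])) if nxt is not None else (text, None)
        for rule in self.grammar[key]:
            # Ordered choice: the first rule that matches wins.
            nxt, res = self.unify_rule(rule, text)
            if res is not None:
                return nxt, (key, res)
        return text, None

    def unify_rule(self, parts, text):
        results, cur = [], text
        for part in parts:
            cur, res = self.unify_key(part, cur)
            if res is None:
                return text, None  # backtrack to where the rule started
            results.append(res)
        return cur, results

    def parse(self, key, text):
        return self.unify_key(key, Text(text))

# Hypothetical toy grammar for the demonstration.
grammar = {
    '<start>': [['<stmts>']],
    '<stmts>': [['<stmt>', '<stmts>'], ['<stmt>']],
    '<stmt>': [['<name>', '=', '<digit>', ';']],
    '<name>': [['x'], ['y']],
    '<digit>': [[c] for c in '0123456789'],
}
cursor, tree = peg_parser(grammar).parse('<start>', 'x=1;y=2;')
print(cursor.at, tree[0])  # 8 <start>
```

The parser never indexes into the input itself. It only calls `advance`, so a different stream class can change what counts as a token without touching the parser.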

It is often useful to understand the parser actions. Hence, we also define a parse visualizer as follows.

class peg_parser_visual(peg_parser):
    def __init__(self, grammar):
        self.grammar = grammar

    def log(self, depth, *args):
        # Indent each trace line by the current recursion depth.
        print(' '*depth, *args)

    def unify_key(self, key, text, _stackdepth=0):
        if key not in self.grammar:
            # Terminal symbol: ask the text stream to consume it.
            v = text.advance(key)
            if v is not None: return (v, (key, []))
            else: return (text, None)
        rules = self.grammar[key]
        for rule in rules:
            l, res = self.unify_rule(rule, text, _stackdepth+1)
            if res is not None: return l, (key, res)
        return (text, None)

    def unify_rule(self, parts, text, _stackdepth):
        results = []
        text_ = text
        for part in parts:
            self.log(_stackdepth,' ', part, '=>', repr(text))
            text_, res = self.unify_key(part, text_, _stackdepth)
            # '#' marks a part that matched, '_' one that did not.
            char = '#' if res is not None else '_'
            self.log(_stackdepth, part, char, '=>', repr(text_))
            if res is None: return text, None
            results.append(res)
        return text_, results

    def parse(self, key, text):
        return self.unify_key(key, Text(text), 0)

While the visualizer is helpful, its output is very verbose. Hence, use it only when you have difficulty understanding why a parse did or did not succeed.
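To get a feel for the volume of output, the sketch below runs a standalone copy of the visualizer on a two-character input and counts the trace lines by capturing standard output. The three-digit grammar and the `Text` class are assumptions made for this illustration.

```python
import io
from contextlib import redirect_stdout

class Text:
    def __init__(self, text, at=0):
        self.text, self.at = text, at

    def advance(self, t):
        if self.text[self.at:self.at + len(t)] == t:
            return Text(self.text, self.at + len(t))
        return None

    def __repr__(self):
        return repr(self.text[:self.at]) + '|' + repr(self.text[self.at:])

class peg_parser_visual:
    # Standalone copy of the visualizer above, so the sketch runs by itself.
    def __init__(self, grammar):
        self.grammar = grammar

    def log(self, depth, *args):
        print(' ' * depth, *args)

    def unify_key(self, key, text, _stackdepth=0):
        if key not in self.grammar:
            v = text.advance(key)
            return (v, (key, [])) if v is not None else (text, None)
        for rule in self.grammar[key]:
            l, res = self.unify_rule(rule, text, _stackdepth + 1)
            if res is not None: return l, (key, res)
        return (text, None)

    def unify_rule(self, parts, text, _stackdepth):
        results, text_ = [], text
        for part in parts:
            self.log(_stackdepth, ' ', part, '=>', repr(text))
            text_, res = self.unify_key(part, text_, _stackdepth)
            char = '#' if res is not None else '_'
            self.log(_stackdepth, part, char, '=>', repr(text_))
            if res is None: return text, None
            results.append(res)
        return text_, results

    def parse(self, key, text):
        return self.unify_key(key, Text(text), 0)

# Hypothetical grammar: one or more digits drawn from 0, 1, 2.
grammar = {
    '<start>': [['<digit>', '<start>'], ['<digit>']],
    '<digit>': [['0'], ['1'], ['2']],
}

buf = io.StringIO()
with redirect_stdout(buf):
    cursor, tree = peg_parser_visual(grammar).parse('<start>', '12')
trace = buf.getvalue().splitlines()
print(len(trace), 'trace lines for a 2-character input')
print('\n'.join(trace[:4]))
```

Every attempted part produces two lines, including the attempts that fail and are retried after backtracking, which is why the trace grows so quickly.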

Indentation Based Parser

For indentation based parsing, we modify our string stream slightly. When the parser expects a new line that corresponds to a new block (indentation) or a delimiter, it specifically asks the text stream for a <$nl> token. The text stream first tries to satisfy the new line request. If it can, it also determines the indentation level of the line that follows. If the new indentation level is greater than the current level, it inserts a new <$indent> token into the text stream. If, on the other hand, the new level is less than the current level, it generates as many <$dedent> tokens as are required to match the new indentation level.

IText

class IText(Text):
    def __init__(self, text, at=0, buf=None, indent=None):
        self.text, self.at = text, at
        # Pending <$indent>/<$dedent> tokens that must be consumed first.
        self.buffer = [] if buf is None else buf
        # Stack of the indentation levels that are currently open.
        self._indent = [0] if indent is None else indent

    def advance(self, t):
        if t == '<$nl>': return self._advance_nl()
        else: return self._advance(t)

    def _advance(self, t):
        if self.buffer:
            # While tokens are pending, only the next pending token matches.
            if self.buffer[0] != t: return None
            return IText(self.text, self.at, self.buffer[1:], self._indent)
        elif self.text[self.at:self.at+len(t)] != t:
            return None
        return IText(self.text, self.at + len(t), self.buffer, self._indent)

    def _read_indent(self, at):
        indent = 0
        while self.text[at+indent:at+indent+1] == ' ':
            indent += 1
        return indent, at+indent

    def _advance_nl(self):
        if self.buffer: return None
        # Slice rather than index so that end of input returns None
        # instead of raising IndexError.
        if self.text[self.at:self.at+1] != '\n': return None
        my_indent, my_buf = self._indent, self.buffer
        i, at = self._read_indent(self.at+1)
        if i > my_indent[-1]:
            my_indent, my_buf = my_indent + [i], ['<$indent>'] + my_buf
        else:
            while i < my_indent[-1]:
                my_indent, my_buf = my_indent[:-1], ['<$dedent>'] + my_buf
        return IText(self.text, at, my_buf, my_indent)

    def __repr__(self):
        return (repr(self.text[:self.at])+ '|' + ''.join(self.buffer) + '|'  +
                repr(self.text[self.at:]))
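To see the stream in action without a parser, the sketch below steps through a three-line input by hand. The class is a standalone copy of `IText` (no base class, and with a slice in `_advance_nl` so that end of input returns `None`); the input string is made up for the illustration.

```python
class IText:
    def __init__(self, text, at=0, buf=None, indent=None):
        self.text, self.at = text, at
        self.buffer = [] if buf is None else buf
        self._indent = [0] if indent is None else indent

    def advance(self, t):
        if t == '<$nl>': return self._advance_nl()
        else: return self._advance(t)

    def _advance(self, t):
        if self.buffer:
            if self.buffer[0] != t: return None
            return IText(self.text, self.at, self.buffer[1:], self._indent)
        elif self.text[self.at:self.at+len(t)] != t:
            return None
        return IText(self.text, self.at + len(t), self.buffer, self._indent)

    def _read_indent(self, at):
        indent = 0
        while self.text[at+indent:at+indent+1] == ' ':
            indent += 1
        return indent, at+indent

    def _advance_nl(self):
        if self.buffer: return None
        if self.text[self.at:self.at+1] != '\n': return None
        my_indent, my_buf = self._indent, self.buffer
        i, at = self._read_indent(self.at+1)
        if i > my_indent[-1]:
            my_indent, my_buf = my_indent + [i], ['<$indent>'] + my_buf
        else:
            while i < my_indent[-1]:
                my_indent, my_buf = my_indent[:-1], ['<$dedent>'] + my_buf
        return IText(self.text, at, my_buf, my_indent)

src = 'if a:\n    b\nc'

# The first newline is followed by a deeper line: one <$indent> is queued.
s1 = IText(src).advance('if a:').advance('<$nl>')
print(s1.buffer, s1._indent)  # ['<$indent>'] [0, 4]

# The second newline returns to column 0: one <$dedent> is queued.
s2 = s1.advance('<$indent>').advance('b').advance('<$nl>')
print(s2.buffer, s2._indent)  # ['<$dedent>'] [0]

# Ordinary text cannot be read until the queued token is consumed.
print(s2.advance('c'))                         # None
print(s2.advance('<$dedent>').advance('c').at)  # 13
```

The queued tokens act as a gate: a grammar rule has to mention <$indent> or <$dedent> right after <$nl>, or the parse cannot proceed past the change in indentation.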

We will first define a small grammar to test it out.

Here is our text that corresponds to the g1 grammar.

We can now use the same parser with the text stream.
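The grammar and text used here are not reproduced, so the following is only a sketch under assumptions: `g1` and `my_text` are hypothetical stand-ins, `IText` is a compact copy of the stream class from this post, and `unify_key` and `unify_rule` are function versions of the parser methods. It shows a single indented block being parsed with <$nl>, <$indent> and <$dedent> as ordinary terminals of the grammar.

```python
class IText:
    def __init__(self, text, at=0, buf=None, indent=None):
        self.text, self.at = text, at
        self.buffer = [] if buf is None else buf
        self._indent = [0] if indent is None else indent

    def advance(self, t):
        if t == '<$nl>': return self._advance_nl()
        if self.buffer:
            if self.buffer[0] != t: return None
            return IText(self.text, self.at, self.buffer[1:], self._indent)
        if self.text[self.at:self.at + len(t)] != t: return None
        return IText(self.text, self.at + len(t), self.buffer, self._indent)

    def _advance_nl(self):
        if self.buffer or self.text[self.at:self.at + 1] != '\n': return None
        at = self.at + 1
        while self.text[at:at + 1] == ' ': at += 1
        i = at - (self.at + 1)  # indentation of the next line
        indent, buf = self._indent, []
        if i > indent[-1]:
            indent, buf = indent + [i], ['<$indent>']
        while i < indent[-1]:
            indent, buf = indent[:-1], buf + ['<$dedent>']
        return IText(self.text, at, buf, indent)

def unify_key(grammar, key, text):
    if key not in grammar:  # terminal, including <$nl>, <$indent>, <$dedent>
        nxt = text.advance(key)
        return (nxt, (key, [])) if nxt is not None else (text, None)
    for rule in grammar[key]:
        nxt, res = unify_rule(grammar, rule, text)
        if res is not None: return nxt, (key, res)
    return text, None

def unify_rule(grammar, parts, text):
    results, cur = [], text
    for part in parts:
        cur, res = unify_key(grammar, part, cur)
        if res is None: return text, None
        results.append(res)
    return cur, results

# Hypothetical small grammar: one `if` with an indented block of assignments.
g1 = {
    '<start>': [['<ifstmt>']],
    '<ifstmt>': [['if ', '<var>', ':', '<block>']],
    '<block>': [['<$nl>', '<$indent>', '<stmts>', '<$nl>', '<$dedent>']],
    '<stmts>': [['<stmt>', '<$nl>', '<stmts>'], ['<stmt>']],
    '<stmt>': [['<var>', '=', '<var>']],
    '<var>': [[c] for c in 'abcd'],
}
my_text = 'if a:\n    b=c\n    c=d\n'

cursor, tree = unify_key(g1, '<start>', IText(my_text))
print(tree is not None, cursor.at == len(my_text))  # True True

# The same statements without indentation do not form a block.
_, bad = unify_key(g1, '<start>', IText('if a:\nb=c\n'))
print(bad)  # None
```

Note that the special tokens never appear as keys of the grammar, so the parser hands them to the stream like any other terminal.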

Here is a slightly more complex grammar and the corresponding text.

Checking if the text is parsable.

Let us make a much larger grammar
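As a stand-in for the larger grammar, which is not reproduced here, the sketch below parses the nested example from the start of this post. Everything in it is an assumption made for illustration: the grammar, the helper `leaves`, and the compact `IText` and parser functions. Each assignment consumes its own trailing <$nl>, and each `if` closes with a <$dedent>, so leaving two blocks at once yields two queued <$dedent> tokens that are consumed by the two enclosing `if` rules.

```python
import string

class IText:
    def __init__(self, text, at=0, buf=None, indent=None):
        self.text, self.at = text, at
        self.buffer = [] if buf is None else buf
        self._indent = [0] if indent is None else indent

    def advance(self, t):
        if t == '<$nl>': return self._advance_nl()
        if self.buffer:
            if self.buffer[0] != t: return None
            return IText(self.text, self.at, self.buffer[1:], self._indent)
        if self.text[self.at:self.at + len(t)] != t: return None
        return IText(self.text, self.at + len(t), self.buffer, self._indent)

    def _advance_nl(self):
        if self.buffer or self.text[self.at:self.at + 1] != '\n': return None
        at = self.at + 1
        while self.text[at:at + 1] == ' ': at += 1
        i = at - (self.at + 1)  # indentation of the next line
        indent, buf = self._indent, []
        if i > indent[-1]:
            indent, buf = indent + [i], ['<$indent>']
        while i < indent[-1]:
            indent, buf = indent[:-1], buf + ['<$dedent>']
        return IText(self.text, at, buf, indent)

def unify_key(grammar, key, text):
    if key not in grammar:
        nxt = text.advance(key)
        return (nxt, (key, [])) if nxt is not None else (text, None)
    for rule in grammar[key]:
        nxt, res = unify_rule(grammar, rule, text)
        if res is not None: return nxt, (key, res)
    return text, None

def unify_rule(grammar, parts, text):
    results, cur = [], text
    for part in parts:
        cur, res = unify_key(grammar, part, cur)
        if res is None: return text, None
        results.append(res)
    return cur, results

def leaves(tree):
    """Collect the terminal tokens of a parse tree, left to right."""
    key, children = tree
    if not children: return [key]
    return [tok for child in children for tok in leaves(child)]

# Hypothetical grammar with nested blocks.
grammar = {
    '<start>': [['<stmts>']],
    '<stmts>': [['<stmt>', '<stmts>'], ['<stmt>']],
    '<stmt>': [['<if>'], ['<assign>']],
    '<if>': [['if ', '<name>', ':', '<$nl>', '<$indent>', '<stmts>', '<$dedent>']],
    '<assign>': [['<name>', ' = ', '<num>', '<$nl>']],
    '<name>': [['<letter>', '<name>'], ['<letter>']],
    '<num>': [['<digit>', '<num>'], ['<digit>']],
    '<letter>': [[c] for c in string.ascii_letters],
    '<digit>': [[c] for c in string.digits],
}
src = ('if True:\n'
       '   if False:\n'
       '      x = 100\n'
       '      y = 200\n'
       'z = 300\n')

cursor, tree = unify_key(grammar, '<start>', IText(src))
toks = leaves(tree)
print(cursor.at == len(src), toks.count('<$indent>'), toks.count('<$dedent>'))
# True 2 2
```

The matched <$indent> and <$dedent> leaves play the role of the opening and closing braces in the C-like rendering shown in the introduction.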

As can be seen, once the PEG parser reads its input through a text stream, incorporating indentation sensitive (layout sensitive) parsing requires no further changes to the parser. The situation is the same for other parsers such as the Earley parser.

Artifacts

The runnable Python source for this notebook is available here.
