Contents

  1. Tokens
  2. Tokenize
  3. Indents
  4. Combinatory parser
  5. Artifacts

We have previously seen how to parse strings using both context-free grammars and combinatory parsers. However, languages such as Python and Haskell cannot be directly parsed by these parsers because they use indentation levels to indicate nested statement groups. For example, given:

if True:
   x = 100
   y = 200

Python groups x = 100 and y = 200 together, and the fragment is parsed as equivalent to

if True: {
   x = 100;
   y = 200;
}

in a C-like language. This use of indentation is hard to capture in context-free grammars. Interestingly, it turns out that there is an easy solution: we can simply keep track of indentation and de-indentation to identify groups. The idea is to first use a lexical analyzer to translate the source code into tokens, and then post-process these tokens to insert Indent and Dedent tokens. Hence, we start by defining our lexical analyzer. It turns out that our combinatory parser works quite well as a lexical analyzer. As before, we start by importing our prerequisite packages.

Available Packages These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
  1. combinatoryparser-0.0.1-py2.py3-none-any.whl from "Simple Combinatory Parsing For Context Free Languages".
  2. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
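The block below is a rough sketch of this setup. In the notebook, this is where the packages above would be imported; since I am not reproducing the exact API of the combinatoryparser wheel here, the sketch instead defines a few stand-in combinators in the same list-of-successes style. Every name in it (`any_of`, `lit`, `seq`, `alt`, `many1`) is a placeholder of my own, not necessarily what the package exports.

```python
# A parser here is a function from a string to a list of
# (parsed_value, remaining_string) pairs -- the "list of successes" style.
# These definitions merely stand in for the combinatoryparser package.

def any_of(chars):
    # match any single character that appears in chars
    return lambda s: [(s[0], s[1:])] if s and s[0] in chars else []

def lit(c):
    # match exactly the single character c
    return any_of(c)

def seq(*ps):
    # match every parser in order, collecting their values into a list
    def parse(s):
        results = [([], s)]
        for p in ps:
            results = [(vs + [v], rest_)
                       for vs, rest in results
                       for v, rest_ in p(rest)]
        return results
    return parse

def alt(*ps):
    # try each alternative parser and gather all of their results
    return lambda s: [r for p in ps for r in p(s)]

def many1(p):
    # one or more repetitions of p, longest matches first
    def parse(s):
        out = []
        for v, rest in p(s):
            for vs, rest_ in parse(rest):
                out.append(([v] + vs, rest_))
            out.append(([v], rest))
        return out
    return parse
```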


Tokens

We start by defining a minimal set of tokens necessary to lex a simple language.
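One minimal representation, used by the sketches below, is a (kind, lexeme) pair. The helpers `tokenized` and `flatten` are my own naming, not from the original notebook: they wrap a character-level parser so that the characters it matches are joined into a single tagged token.

```python
def flatten(v):
    # yield the individual characters from the nested lists built by seq/many1
    if isinstance(v, list):
        for x in v:
            yield from flatten(x)
    else:
        yield v

def tokenized(kind, p):
    # wrap a character-level parser so its match becomes one (kind, lexeme) token
    def parse(s):
        return [((kind, ''.join(flatten(v))), rest) for v, rest in p(s)]
    return parse
```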



Numeric literals represent numbers.
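A sketch using the stand-in combinators from above; neither signs nor floating point are handled here.

```python
import string

# A numeric literal is one or more digits.
digit = any_of(string.digits)
numeric = tokenized('Numeric', many1(digit))

assert numeric('123 + 4')[0] == (('Numeric', '123'), ' + 4')
```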





Quoted literals represent strings.
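A sketch of a quoted literal, without escape sequence handling.

```python
# A quoted literal is a double quote, a run of non-quote characters,
# and a closing double quote.
not_quote = lambda s: [(s[0], s[1:])] if s and s[0] != '"' else []
quoted = tokenized('Quoted', seq(lit('"'), many1(not_quote), lit('"')))

assert quoted('"hi" x')[0] == (('Quoted', '"hi"'), ' x')
```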





Punctuation tokens represent operators and other punctuation marks.
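A sketch in which each punctuation token is a single character; the exact character set is illustrative.

```python
# Punctuation covers operators and delimiters, each a single character here.
punctuation = tokenized('Punct', any_of(set('+-*/%=<>()[]{}:,;.')))

assert punctuation('= 1')[0] == (('Punct', '='), ' 1')
```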





Name represents function and variable names, and other names in the program.
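A sketch of names as runs of identifier characters.

```python
import string

# A name is a run of letters, digits, and underscores. The tokenizer tries
# numeric literals first, so a pure run of digits never becomes a name.
name = tokenized('Name', many1(any_of(string.ascii_letters + string.digits + '_')))

assert name('x = 1')[0] == (('Name', 'x'), ' = 1')
```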



We also need to represent new lines and whitespace.
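A sketch of both: newlines end logical lines, and runs of spaces carry the indentation width that the post-processing step later turns into Indent and Dedent tokens.

```python
newline = tokenized('NL', lit('\n'))
whitespace = tokenized('WS', many1(lit(' ')))
```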



With these, we can define our tokenizer as follows. A lexical token can be any of the tokens that we previously defined.
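A sketch, with numeric tried before name so that digit runs are classified as numbers.

```python
lexical_token = alt(numeric, quoted, punctuation, name, newline, whitespace)
```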



And the source string can contain any number of such tokens.
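In the same spirit, the token stream of a whole source string could be described as a repetition of lexical_token; the tokenizer sketched in the next section instead uses a greedy longest-match loop rather than enumerating every possible tokenization.

```python
# All tokenizations of a source string; the tokenizer below picks the greedy one.
tokens = many1(lexical_token)
```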



Tokenize

We can now define our tokenizer as follows.
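A sketch of a greedy tokenizer over the definitions above: at each position it takes the longest match produced by lexical_token.

```python
def tokenize(src):
    # Greedy lexer: repeatedly take the longest lexical_token match.
    result = []
    while src:
        matches = lexical_token(src)
        if not matches:
            raise SyntaxError('cannot tokenize: %r' % src[:10])
        tok, src = max(matches, key=lambda m: len(m[0][1]))
        result.append(tok)
    return result

print(tokenize('if True:\n   x = 100\n'))
```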





Indents

Next, we want to insert the Indent and Dedent tokens. We do that by keeping a stack of seen indentation levels. If the new indentation level is greater than the current indentation level, we push the new indentation level onto the stack. If the indentation level is smaller, then we pop the stack until we reach the matching indentation level.
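A sketch of this stack-based insertion; the function name `insert_indents` is my own.

```python
def insert_indents(toks):
    # Keep a stack of seen indentation widths. At the start of each line,
    # compare the leading whitespace with the top of the stack: push and emit
    # an Indent if it grew, pop and emit Dedents until it matches if it shrank.
    # (Blank lines are not treated specially in this sketch.)
    stack, result, at_line_start = [0], [], True
    for kind, lexeme in toks:
        if at_line_start:
            width = len(lexeme) if kind == 'WS' else 0
            if width > stack[-1]:
                stack.append(width)
                result.append(('Indent', ''))
            while width < stack[-1]:
                stack.pop()
                result.append(('Dedent', ''))
            at_line_start = False
        if kind == 'WS':
            continue                    # whitespace is dropped once indents are recorded
        if kind == 'NL':
            at_line_start = True
        result.append((kind, lexeme))
    while stack[-1] > 0:                # close any blocks still open at end of input
        stack.pop()
        result.append(('Dedent', ''))
    return result
```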



We can now extract the indentation-based blocks as follows.
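One way to sketch this is to treat Indent and Dedent as brackets and group each block into its own nested list.

```python
def extract_blocks(toks, i=0):
    # Treat Indent/Dedent as opening and closing brackets, and group the
    # tokens of each indented block into a nested list.
    block = []
    while i < len(toks):
        kind, lexeme = toks[i]
        if kind == 'Indent':
            inner, i = extract_blocks(toks, i + 1)
            block.append(inner)
        elif kind == 'Dedent':
            return block, i + 1
        else:
            block.append((kind, lexeme))
            i += 1
    return block, i

blocks, _ = extract_blocks(insert_indents(tokenize('if True:\n   x = 100\n   y = 200\n')))
print(blocks)
```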





At this point, we can apply a standard context-free parser to the produced tokens. We use a simple combinatory parser for that.

Combinatory parser
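The sketch below lifts the same combinator style from characters to tokens, and then defines a tiny grammar of assignments and if statements over the Indent/Dedent-annotated token stream. The grammar and all names here are illustrative, not the grammar from the original notebook.

```python
# Token-level combinators: a parser now consumes a list of (kind, lexeme)
# tuples instead of a string, but the shape is otherwise the same.
def t(kind, lexeme=None):
    # match one token of the given kind (and lexeme, if specified)
    def parse(toks):
        if toks and toks[0][0] == kind and (lexeme is None or toks[0][1] == lexeme):
            return [(toks[0], toks[1:])]
        return []
    return parse

def t_seq(*ps):
    # match every parser in order, collecting their values into a list
    def parse(toks):
        results = [([], toks)]
        for p in ps:
            results = [(vs + [v], rest_)
                       for vs, rest in results
                       for v, rest_ in p(rest)]
        return results
    return parse

def t_alt(*ps):
    # try each alternative parser
    return lambda toks: [r for p in ps for r in p(toks)]

def t_many(p):
    # zero or more repetitions of p
    def parse(toks):
        out = []
        for v, rest in p(toks):
            for vs, rest_ in parse(rest):
                out.append(([v] + vs, rest_))
        out.append(([], toks))
        return out
    return parse

# A tiny illustrative grammar over the token stream.
def expr(toks):
    return t_alt(t('Numeric'), t('Quoted'), t('Name'))(toks)

def assignment(toks):
    return t_seq(t('Name'), t('Punct', '='), expr, t('NL'))(toks)

def block(toks):
    return t_seq(t('Indent'), t_many(statement), t('Dedent'))(toks)

def if_stmt(toks):
    return t_seq(t('Name', 'if'), expr, t('Punct', ':'), t('NL'), block)(toks)

def statement(toks):
    return t_alt(if_stmt, assignment)(toks)

def program(toks):
    return t_many(statement)(toks)
```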







For display
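A small helper (my own naming) to show the nested parse result.

```python
def display(tree, depth=0):
    # Print the nested parse structure, one indentation level per nesting.
    if isinstance(tree, list):
        for node in tree:
            display(node, depth + 1)
    else:
        print('  ' * depth + repr(tree))
```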



Tokenizing
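As an illustration, running the sketched tokenizer and the Indent/Dedent insertion on the example from the beginning:

```python
src = '''\
if True:
   x = 100
   y = 200
'''
toks = insert_indents(tokenize(src))
print(toks)
```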



Parsing
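And the sketched grammar can then parse the resulting token stream; only parses that consume every token are kept.

```python
trees = [tree for tree, rest in program(toks) if not rest]
display(trees[0])
```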



Artifacts

The runnable Python source for this notebook is available here.