The Python Control Flow Graph

Prerequisites
A few helper functions for visualization
The CFGNode
Extracting the control flow
1. Pass
2. Module(stmt* body)
Expressions
1. Primitives
Arithmetic expressions
1. Assign(expr* targets, expr value)
2. Name
Control structures
1. If
2. While
3. Break
4. Continue
5. For
6. FunctionDef
7. Return
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

We previously discussed how one can write an interpreter for Python. We hinted at that time that the machinery could be used for a variety of other applications, including exctracting the call and control flow graph. In this post, we will show how one can extract the control flow graph using such an interpteter. Note that a much more complete implementation can be found here.

A control flow graph is a directed graph data structure that encodes all paths that may be traversed through a program. That is, in some sense, it is an abstract view of the interpreter as a whole.

This implementation is based on the fuzzingbook CFG appendix However, the fuzzingbook implementation is focused on Python statements as it is used primarily for visualization, while this is based on basic blocks with the intension of using it for code generation.

Control flow graphs are useful for a variety of tasks. They are one of the most frequently used tools for visualization. But more imporatntly it is the starting point for further analysis of the program including code generation, optimizations, and other static analysis techniques.

Prerequisites

As before, we start with the prerequisite imports.

System Imports

These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.

matplotlib
networkx

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

A few helper functions for visualization

Use CLIGraphics if you are running it from the command line.

If you want to run it in command line, change the WebGraphics here to CLIGraphics.

The color is used to determine whether true or false branch was taken.

The peripheries determines the number of border lines. Start and stop gets double borders.

The shape determines the kind of node. A diamond is a conditional, and start and stop are ovals.

The to_graph() function produces a graph from the nodes in the registry.

The CFGNode

The control flow graph is a graph, and hence we need a data structue for the node. We need to store the parents of this node, the children of this node, and register itself in the registery.

Given that it is a directed graph node, we need the ability to add parents and children.

A few convenience methods to make our life simpler.

class CFGNode(CFGNode):
    def __eq__(self, other):
        return self.rid == other.rid

def __neq__(self, other):
        return self.rid != other.rid

def lineno(self):
        if hasattr(self.ast_node, 'lineno'):
            if (isinstance(self.ast_node, ast.AnnAssign)):
                if self.ast_node.target.id == 'exit':
                    return -self.ast_node.lineno
            return self.ast_node.lineno
        # should we return the parent line numbers instead?
        return 0

def name(self):
        return str(self.rid)

def expr(self):
        return self.source()

def __str__(self):
        return "id:%d line[%d] parents: %s : %s" % \
           (self.rid, self.lineno(),
            str([p.rid for p in self.parents]), self.source())

def __repr__(self):
        return str(self)

def source(self):
        if self.ast_node is None: return ''
        return ast.unparse(self.ast_node).strip()

def annotation(self):
        if self.annot is not None:
            return self.annot
        if self.source() in {'start', 'stop'}:
            return "<%s>" % self.source()
        return self.source()

def to_json(self):
        return {'id':self.rid, 'parents': [p.rid for p in self.parents],
               'children': [c.rid for c in self.children],
               'calls': self.calls, 'at': self.lineno(), 'ast':self.source()}

def get_gparent_id(self):
        p = self.state.registry[self.rid]
        while not p.annotation():
            p = p.parents[0]
        return str(p.rid)

The usage is as below:

Extracting the control flow

The control flow graph is essentially a source code walker, and shares the basic structure with our interpreter.

As before, we define walk() that walks a given AST node. A major difference from MCI is in the functions that handle each node. Since it is a directed graph traversal, our walk() accepts a list of parent nodes that point to this node, and also invokes the various on_*() functions with the same list. These functions in turn return a list of nodes that exit them.

While expressions, and single statements only have one node that comes out of them, control flow structures and function calls can have multiple nodes that come out of them going into the next node. For example, an If statement will have a node from both the if.body and if.orelse going into the next one.

parse is just ast.parse

defining eval.

Pass

The pass statement is trivial. It simply adds one more node.

Here is the CFG from a single pass statement.

Module(stmt* body)

We next define the Module. A python module is composed of a sequence of statements, and the graph is a linear path through these statements. That is, each time a statement is executed, we make a link from it to the next statement.

Here is the CFG from the following which is wrapped in a module

Expressions

Primitives

How should we handle primitives? Since they are simply interpreted as is, they can be embedded right in.

They however, are simple expressions

Generating the following CFG

Arithmetic expressions

The following implements the arithmetic expressions. The unaryop() simply walks the arguments and adds the current node to the chain. The binop() has to walk the left argument, then walk the right argument, and finally insert the current node in the chain. compare() is again similar to binop(). expr(), again has only one argument to walk, and one node out of it.

CFG for this expression

Assign(expr* targets, expr value)

Unlike MCI, assignment is simple as it has only a single node coming out of it.

The following are not yet implemented:

AugAssign(expr target, operator op, expr value)
AnnAssign(expr target, expr annotation, expr? value, int simple)

Further, we do not yet implement parallel assignments.

Example

Name

Control structures

If

We now come to the control structures. For the if statement, we have two parallel paths. We first evaluate the test expression, then add a new node corresponding to the if statement, and provide the paths through the if.body and if.orelse.

class PyCFGExtractor(PyCFGExtractor):
    def on_if(self, node, myparents):
        p = self.walk(node.test, myparents)
        test_node = [CFGNode(parents=myparents, ast=ast.parse(
                            '_if: %s' % ast.unparse(node.test).strip()).body[0],
                            annot="if: %s" % ast.unparse(node.test).strip(),
                            state=self.gstate)]
        ast.copy_location(test_node[0].ast_node, node.test)
        g1 = test_node
        g_true = [CFGNode(parents=g1,
                          ast=ast.parse('_if: True').body[0],
                          label="if:True",
                          annot='',
                          state=self.gstate)]
        ast.copy_location(g_true[0].ast_node, node.test)
        g1 = g_true
        for n in node.body:
            g1 = self.walk(n, g1)
        g2 = test_node
        g_false = [CFGNode(parents=g2,
                           ast=ast.parse('_if: False').body[0],
                           label="if:False",
                           annot='',
                           state=self.gstate)]
        ast.copy_location(g_false[0].ast_node, node.test)
        g2 = g_false
        for n in node.orelse:
            g2 = self.walk(n, g2)
        return g1 + g2

Example

While

The while statement is more complex than the if statement. For one, we need to provide a way to evaluate the condition at the beginning of each iteration.

Essentially, given something like this:

while x > 0:
    statement1
    if x:
       continue;
    if y:
       break
    statement2

We need to expand this into:

lbl1: v = x > 0
lbl2: if not v: goto lbl2
      statement1
      if x: goto lbl1
      if y: goto lbl3
      statement2
      goto lbl1
lbl3: ...

The basic idea is that when we walk the node.body, if there is a break statement, it will start searching up the parent chain, until it finds a node with loop_entry label. Then it will attach itself to the exit_nodes as one of the exits.

class PyCFGExtractor(PyCFGExtractor):
    def on_while(self, node, myparents):
        loop_id = self.gstate.counter
        lbl1_node = CFGNode(parents=myparents,
                            ast=node,
                            label='loop_entry',
                            annot='%s: while' % loop_id,
                            state=self.gstate)
        p = self.walk(node.test, [lbl1_node])

lbl2_node = CFGNode(parents=p, ast=node.test, label='while:test',
               annot='if: %s' % ast.unparse(node.test).strip(), state=self.gstate)
        g_false = CFGNode(parents=[lbl2_node],
                          ast=None,
                          label="if:False",
                          annot='',
                          state=self.gstate)
        g_true = CFGNode(parents=[lbl2_node],
                         ast=None,
                         label="if:True",
                         annot='',
                         state=self.gstate)
        lbl1_node.exit_nodes = [g_false]

p = [g_true]

for n in node.body:
            p = self.walk(n, p)

# the last node is the parent for the lb1 node.
        lbl1_node.add_parents(p)

return lbl1_node.exit_nodes

Example

Break

As we explained before, the break when it is encountred, looks up the parent chain. Once it finds a parent that has the loop_entry label, it attaches itself to that parent. The statements following the break are not its immediate children. Hence, we return an empty list.

Example

Continue

Continue is similar to break, except that it has to restart the loop. Hence, it adds itself as a parent to the node with loop_entry attribute. As like break, execution does not proceed to the lexically next statement after continue. Hence, we return an empty set.

Example

For

The For statement in Python is rather complex. Given a for loop as below

for i in my_expr:
    statement1
    statement2

This has to be extracted to the following:

lbl1: 
      __iv = iter(my_expr)
lbl2: if __iv.length_hint() > 0: goto lbl3
      i = next(__iv)
      statement1
      statement2
lbl3: ...

We need on_call() for implementing on_for(). Essentially, we walk through the arguments, then add a node corresponding to the call to the parents.

class PyCFGExtractor(PyCFGExtractor):
    def on_call(self, node, myparents):
        p = myparents
        for a in node.args:
            p = self.walk(a, p)
        myparents[0].add_calls(node.func)
        p = [CFGNode(parents=p,
                     ast=node,
                     label='call',
                     annot='',
                     state=self.gstate)]
        return p

class PyCFGExtractor(PyCFGExtractor):
    def on_for(self, node, myparents):
        #node.target in node.iter: node.body
        loop_id = self.gstate.counter

for_pre = CFGNode(parents=myparents,
                          ast=None,
                          label='for_pre',
                          annot='',
                          state=self.gstate)

init_node = ast.parse('__iv_%d = iter(%s)' % (
            loop_id, ast.unparse(node.iter).strip())).body[0]
        p = self.walk(init_node, [for_pre])

lbl1_node = CFGNode(parents=p,
                            ast=node,
                            label='loop_entry',
                            annot='%s: for' % loop_id,
                            state=self.gstate)
        _test_node = ast.parse(
                '__iv_%d.__length__hint__() > 0' % loop_id).body[0].value
        p = self.walk(_test_node, [lbl1_node])

lbl2_node = CFGNode(parents=p,
                            ast=_test_node,
                            label='for:test',
                            annot='for: %s' % ast.unparse(_test_node).strip(),
                            state=self.gstate)
        g_false = CFGNode(parents=[lbl2_node],
                          ast=None,
                          label="if:False",
                          annot='', state=self.gstate)
        g_true = CFGNode(parents=[lbl2_node],
                         ast=None,
                         label="if:True",
                         annot='',
                         state=self.gstate)
        lbl1_node.exit_nodes = [g_false]

p = [g_true]

# now we evaluate the body, one at a time.
        for n in node.body:
            p = self.walk(n, p)

# the test node is looped back at the end of processing.
        lbl1_node.add_parents(p)

return lbl1_node.exit_nodes

Example

FunctionDef

When defining a function, we should define the return_nodes for the return statement to hook into. Further, we should also register our functions.

Next, we have to decide: Do we want the call graph of the function definition to be attached to the previous statements? In Python, the function definition itself is independent of the previous statements. Hence, here, we choose not to have parents for the definition.

class PyCFGExtractor(PyCFGExtractor):
    def on_functiondef(self, node, myparents):
        # name, args, body, decorator_list, returns
        fname = node.name
        args = node.args
        returns = node.returns
        enter_node = CFGNode(
                parents=[], ast=ast.parse('enter: %s(%s)' % (
                node.name, ', '.join([a.arg for a in node.args.args]))).body[0],
                annot='<define>: %s' % node.name, state=self.gstate
                )  # sentinel
        enter_node.calleelink = True
        ast.copy_location(enter_node.ast_node, node)

exit_node = CFGNode(parents=[], ast=ast.parse('exit: %s(%s)' % (
               node.name, ', '.join([a.arg for a in node.args.args]))).body[0],
               annot='<>: %s' % node.name, state=self.gstate
               ) #  sentinel
        enter_node.return_nodes = [] # sentinel

p = [enter_node]
        for n in node.body:
            p = self.walk(n, p)

for n in p:
            if n not in enter_node.return_nodes:
                enter_node.return_nodes.append(n)

for n in enter_node.return_nodes:
            exit_node.add_parent(n)

self.functions[fname] = [enter_node, exit_node]
        self.functions_node[enter_node.lineno()] = fname

return myparents

Return

For return, we need to look up which function we have to return from.

We just need few more functions to ensure that our arrows are linked up

First, we make sure that all the child nodes are linked to from the parents.

Next, we make sure that for any node (marked by its line number), we know where it is defined.

Finally, we link functions call sites.

Example

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel pycfg is available here.

Contents

Prerequisites

A few helper functions for visualization

The CFGNode

Extracting the control flow

Pass

Module(stmt* body)

Expressions

Primitives

Arithmetic expressions

Assign(expr* targets, expr value)

Name

Control structures

If

While

Break

Continue

For

FunctionDef

Return

Artifacts