Contents

  1. Prerequisites
  2. A few helper functions for visualization
  3. The CFGNode
  4. Extracting the control flow
    1. Pass
    2. Module(stmt* body)
  5. Expressions
    1. Primitives
  6. Arithmetic expressions
    1. Assign(expr* targets, expr value)
    2. Name
  7. Control structures
    1. If
    2. While
    3. Break
    4. Continue
    5. For
    6. FunctionDef
    7. Return
  8. Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

We previously discussed how one can write an interpreter for Python. We hinted at that time that the machinery could be used for a variety of other applications, including exctracting the call and control flow graph. In this post, we will show how one can extract the control flow graph using such an interpteter. Note that a much more complete implementation can be found here.

A control flow graph is a directed graph data structure that encodes all paths that may be traversed through a program. That is, in some sense, it is an abstract view of the interpreter as a whole.

This implementation is based on the fuzzingbook CFG appendix However, the fuzzingbook implementation is focused on Python statements as it is used primarily for visualization, while this is based on basic blocks with the intension of using it for code generation.

Control flow graphs are useful for a variety of tasks. They are one of the most frequently used tools for visualization. But more imporatntly it is the starting point for further analysis of the program including code generation, optimizations, and other static analysis techniques.

Prerequisites

As before, we start with the prerequisite imports.

System Imports These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.
  1. matplotlib
  2. networkx
Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
  1. pydot-1.4.1-py2.py3-none-any.whl
  2. metacircularinterpreter-0.0.1-py2.py3-none-any.whl from "Python Meta Circular Interpreter".


A few helper functions for visualization



The color is used to determine whether true or false branch was taken.



The peripheries determines the number of border lines. Start and stop gets double borders.



The shape determines the kind of node. A diamond is a conditional, and start and stop are ovals.



The to_graph() function produces a graph from the nodes in the registry.



The CFGNode

The control flow graph is a graph, and hence we need a data structue for the node. We need to store the parents of this node, the children of this node, and register itself in the registery.



Given that it is a directed graph node, we need the ability to add parents and children.



A few convenience methods to make our life simpler.



The usage is as below:



Extracting the control flow

The control flow graph is essentially a source code walker, and shares the basic structure with our interpreter.



As before, we define walk() that walks a given AST node. A major difference from MCI is in the functions that handle each node. Since it is a directed graph traversal, our walk() accepts a list of parent nodes that point to this node, and also invokes the various on_*() functions with the same list. These functions in turn return a list of nodes that exit them.

While expressions, and single statements only have one node that comes out of them, control flow structures and function calls can have multiple nodes that come out of them going into the next node. For example, an If statement will have a node from both the if.body and if.orelse going into the next one.



Pass

The pass statement is trivial. It simply adds one more node.



Here is the CFG from a single pass statement.



Module(stmt* body)

We next define the Module. A python module is composed of a sequence of statements, and the graph is a linear path through these statements. That is, each time a statement is executed, we make a link from it to the next statement.



Here is the CFG from the following which is wrapped in a module



Expressions

Primitives

How should we handle primitives? Since they are simply interpreted as is, they can be embedded right in.



They however, are simple expressions



Generating the following CFG



Arithmetic expressions

The following implements the arithmetic expressions. The unaryop() simply walks the arguments and adds the current node to the chain. The binop() has to walk the left argument, then walk the right argument, and finally insert the current node in the chain. compare() is again similar to binop(). expr(), again has only one argument to walk, and one node out of it.



CFG for this expression



Assign(expr* targets, expr value)

Unlike MCI, assignment is simple as it has only a single node coming out of it.

The following are not yet implemented:

  • AugAssign(expr target, operator op, expr value)
  • AnnAssign(expr target, expr annotation, expr? value, int simple)

Further, we do not yet implement parallel assignments.



Example



Name



Control structures

If

We now come to the control structures. For the if statement, we have two parallel paths. We first evaluate the test expression, then add a new node corresponding to the if statement, and provide the paths through the if.body and if.orelse.



Example



While

The while statement is more complex than the if statement. For one, we need to provide a way to evaluate the condition at the beginning of each iteration.

Essentially, given something like this:

while x > 0:
    statement1
    if x:
       continue;
    if y:
       break
    statement2

We need to expand this into:

lbl1: v = x > 0
lbl2: if not v: goto lbl2
      statement1
      if x: goto lbl1
      if y: goto lbl3
      statement2
      goto lbl1
lbl3: ...

The basic idea is that when we walk the node.body, if there is a break statement, it will start searching up the parent chain, until it finds a node with loop_entry label. Then it will attach itself to the exit_nodes as one of the exits.



Example



Break

As we explained before, the break when it is encountred, looks up the parent chain. Once it finds a parent that has the loop_entry label, it attaches itself to that parent. The statements following the break are not its immediate children. Hence, we return an empty list.



Example



Continue

Continue is similar to break, except that it has to restart the loop. Hence, it adds itself as a parent to the node with loop_entry attribute. As like break, execution does not proceed to the lexically next statement after continue. Hence, we return an empty set.



Example



For

The For statement in Python is rather complex. Given a for loop as below

for i in my_expr:
    statement1
    statement2

This has to be extracted to the following:

lbl1: 
      __iv = iter(my_expr)
lbl2: if __iv.length_hint() > 0: goto lbl3
      i = next(__iv)
      statement1
      statement2
lbl3: ...

We need on_call() for implementing on_for(). Essentially, we walk through the arguments, then add a node corresponding to the call to the parents.



Example



more



more



FunctionDef

When defining a function, we should define the return_nodes for the return statement to hook into. Further, we should also register our functions.

Next, we have to decide: Do we want the call graph of the function definition to be attached to the previous statements? In Python, the function definition itself is independent of the previous statements. Hence, here, we choose not to have parents for the definition.



Return

For return, we need to look up which function we have to return from.



Example



Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel pycfg is available here.