- A few helper functions for visualization
- The CFGNode
- Extracting the control flow
- Arithmetic expressions
- Control structures
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
We previously discussed how one can write an interpreter for Python. We hinted at that time that the machinery could be used for a variety of other applications, including exctracting the call and control flow graph. In this post, we will show how one can extract the control flow graph using such an interpteter. Note that a much more complete implementation can be found here.
A control flow graph is a directed graph data structure that encodes all paths that may be traversed through a program. That is, in some sense, it is an abstract view of the interpreter as a whole.
This implementation is based on the fuzzingbook CFG appendix However, the fuzzingbook implementation is focused on Python statements as it is used primarily for visualization, while this is based on basic blocks with the intension of using it for code generation.
Control flow graphs are useful for a variety of tasks. They are one of the most frequently used tools for visualization. But more imporatntly it is the starting point for further analysis of the program including code generation, optimizations, and other static analysis techniques.
As before, we start with the prerequisite imports.
System ImportsThese are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.
Available PackagesThese are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
A few helper functions for visualization
The color is used to determine whether true or false branch was taken.
The peripheries determines the number of border lines. Start and stop gets double borders.
The shape determines the kind of node. A diamond is a conditional, and start and stop are ovals.
to_graph() function produces a graph from the nodes in the registry.
The control flow graph is a graph, and hence we need a data structue for the node. We need to store the parents of this node, the children of this node, and register itself in the registery.
Given that it is a directed graph node, we need the ability to add parents and children.
A few convenience methods to make our life simpler.
The usage is as below:
Extracting the control flow
The control flow graph is essentially a source code walker, and shares the basic structure with our interpreter.
As before, we define
walk() that walks a given AST node.
A major difference from MCI is in the functions that handle each node. Since it is a directed
graph traversal, our
walk() accepts a list of parent nodes that point to this node, and also
invokes the various
on_*() functions with the same list. These functions in turn return a list
of nodes that exit them.
While expressions, and single statements only have one node that comes out of them, control flow
structures and function calls can have multiple nodes that come out of them going into the next
node. For example, an
If statement will have a node from both the
going into the next one.
The pass statement is trivial. It simply adds one more node.
Here is the CFG from a single pass statement.
We next define the
Module. A python module is composed of a sequence of statements,
and the graph is a linear path through these statements. That is, each time a statement
is executed, we make a link from it to the next statement.
Here is the CFG from the following which is wrapped in a module
How should we handle primitives? Since they are simply interpreted as is, they can be embedded right in.
They however, are simple expressions
Generating the following CFG
The following implements the arithmetic expressions. The
unaryop() simply walks
the arguments and adds the current node to the chain. The
binop() has to walk
the left argument, then walk the right argument, and finally insert the current
node in the chain.
compare() is again similar to
expr(), again has
only one argument to walk, and one node out of it.
CFG for this expression
Assign(expr* targets, expr value)
Unlike MCI, assignment is simple as it has only a single node coming out of it.
The following are not yet implemented:
- AugAssign(expr target, operator op, expr value)
- AnnAssign(expr target, expr annotation, expr? value, int simple)
Further, we do not yet implement parallel assignments.
We now come to the control structures. For the
if statement, we have two
parallel paths. We first evaluate the test expression, then add a new node
corresponding to the if statement, and provide the paths through the
while statement is more complex than the
if statement. For one,
we need to provide a way to evaluate the condition at the beginning of
Essentially, given something like this:
while x > 0: statement1 if x: continue; if y: break statement2
We need to expand this into:
lbl1: v = x > 0 lbl2: if not v: goto lbl2 statement1 if x: goto lbl1 if y: goto lbl3 statement2 goto lbl1 lbl3: ...
The basic idea is that when we walk the
node.body, if there is a break
statement, it will start searching up the parent chain, until it finds a node with
loop_entry label. Then it will attach itself to the
exit_nodes as one
of the exits.
As we explained before, the
break when it is encountred, looks up
the parent chain. Once it finds a parent that has the
it attaches itself to that parent. The statements following the
break are not
its immediate children. Hence, we return an empty list.
Continue is similar to
break, except that it has to restart the loop. Hence,
it adds itself as a parent to the node with
loop_entry attribute. As like
execution does not proceed to the lexically next statement after
we return an empty set.
For statement in Python is rather complex. Given a for loop as below
for i in my_expr: statement1 statement2
This has to be extracted to the following:
lbl1: __iv = iter(my_expr) lbl2: if __iv.length_hint() > 0: goto lbl3 i = next(__iv) statement1 statement2 lbl3: ...
on_call() for implementing
on_for(). Essentially, we walk through
the arguments, then add a node corresponding to the call to the parents.
When defining a function, we should define the
return_nodes for the
return statement to hook into. Further, we should also register our
Next, we have to decide: Do we want the call graph of the function definition to be attached to the previous statements? In Python, the function definition itself is independent of the previous statements. Hence, here, we choose not to have parents for the definition.
return, we need to look up which function we have to return from.
The runnable Python source for this notebook is available here.
The installable python wheel
pycfg is available here.