Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
We previously discussed how one can write an interpreter for
Python. We hinted at that time that the machinery could be used for a variety of
other applications, including exctracting the call and control flow graph. In this
post, we will show how one can extract the control flow graph using such an interpteter.
Note that a much more complete implementation can be found here.
A control flow graph is a directed graph
data structure that encodes all paths that may be traversed through a program. That is, in some
sense, it is an abstract view of the interpreter as a whole.
This implementation is based on the fuzzingbook CFG appendix
However, the fuzzingbook implementation is focused on Python statements as it is used primarily for
visualization, while this is based on basic blocks with the intension of using it for code
generation.
Control flow graphs are useful for a variety of tasks. They are one of the most frequently
used tools for visualization. But more imporatntly it is the starting point for further
analysis of the program including code generation, optimizations, and other static analysis
techniques.
Prerequisites
As before, we start with the prerequisite imports.
System Imports
These are available from Pyodide, but you may wish to make sure that they are
installed if you are attempting to run the program directly on the machine.
matplotlib
networkx
Available Packages
These are packages that refer either to my previous posts or to pure python
packages that I have compiled, and is available in the below locations. As
before, install them if you need to run the program directly on the machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
Use CLIGraphics if you are running it from the
command line.
If you want to run it in command line, change the WebGraphics here to CLIGraphics.
The color is used to determine whether true or false branch was taken.
The peripheries determines the number of border lines. Start and stop gets double borders.
The shape determines the kind of node. A diamond is a conditional, and start and stop are ovals.
The to_graph() function produces a graph from the nodes in the registry.
The CFGNode
The control flow graph is a graph, and hence we need a data structue for the node. We need to store the parents
of this node, the children of this node, and register itself in the registery.
Given that it is a directed graph node, we need the ability to add parents and children.
A few convenience methods to make our life simpler.
The usage is as below:
Extracting the control flow
The control flow graph is essentially a source code walker, and shares the basic
structure with our interpreter.
As before, we define walk() that walks a given AST node.
A major difference from MCI is in the functions that handle each node. Since it is a directed
graph traversal, our walk() accepts a list of parent nodes that point to this node, and also
invokes the various on_*() functions with the same list. These functions in turn return a list
of nodes that exit them.
While expressions, and single statements only have one node that comes out of them, control flow
structures and function calls can have multiple nodes that come out of them going into the next
node. For example, an If statement will have a node from both the if.body and if.orelse
going into the next one.
parse is just ast.parse
defining eval.
Pass
The pass statement is trivial. It simply adds one more node.
Here is the CFG from a single pass statement.
Module(stmt* body)
We next define the Module. A python module is composed of a sequence of statements,
and the graph is a linear path through these statements. That is, each time a statement
is executed, we make a link from it to the next statement.
Here is the CFG from the following which is wrapped in a module
Expressions
Primitives
How should we handle primitives? Since they are simply interpreted as is, they can be embedded right in.
They however, are simple expressions
Generating the following CFG
Arithmetic expressions
The following implements the arithmetic expressions. The unaryop() simply walks
the arguments and adds the current node to the chain. The binop() has to walk
the left argument, then walk the right argument, and finally insert the current
node in the chain. compare() is again similar to binop(). expr(), again has
only one argument to walk, and one node out of it.
CFG for this expression
Assign(expr* targets, expr value)
Unlike MCI, assignment is simple as it has only a single node coming out of it.
The following are not yet implemented:
AugAssign(expr target, operator op, expr value)
AnnAssign(expr target, expr annotation, expr? value, int simple)
Further, we do not yet implement parallel assignments.
Example
Name
Control structures
If
We now come to the control structures. For the if statement, we have two
parallel paths. We first evaluate the test expression, then add a new node
corresponding to the if statement, and provide the paths through the if.body
and if.orelse.
Example
While
The while statement is more complex than the if statement. For one,
we need to provide a way to evaluate the condition at the beginning of
each iteration.
lbl1: v = x > 0
lbl2: if not v: goto lbl2
statement1
if x: goto lbl1
if y: goto lbl3
statement2
goto lbl1
lbl3: ...
The basic idea is that when we walk the node.body, if there is a break
statement, it will start searching up the parent chain, until it finds a node with
loop_entry label. Then it will attach itself to the exit_nodes as one
of the exits.
Example
Break
As we explained before, the break when it is encountred, looks up
the parent chain. Once it finds a parent that has the loop_entry label,
it attaches itself to that parent. The statements following the break are not
its immediate children. Hence, we return an empty list.
Example
Continue
Continue is similar to break, except that it has to restart the loop. Hence,
it adds itself as a parent to the node with loop_entry attribute. As like break,
execution does not proceed to the lexically next statement after continue. Hence,
we return an empty set.
Example
For
The For statement in Python is rather complex. Given a for loop as below
We need on_call() for implementing on_for(). Essentially, we walk through
the arguments, then add a node corresponding to the call to the parents.
Example
more
more
FunctionDef
When defining a function, we should define the return_nodes for the
return statement to hook into. Further, we should also register our
functions.
Next, we have to decide: Do we want the call graph of the function definition to
be attached to the previous statements? In Python, the function definition itself
is independent of the previous statements. Hence, here, we choose not to have
parents for the definition.
Return
For return, we need to look up which function we have to return from.
We just need few more functions to ensure that our arrows are linked up
First, we make sure that all the child nodes are linked to from the parents.
Next, we make sure that for any node (marked by its line number), we know
where it is defined.
Finally, we link functions call sites.
Example
Artifacts
The runnable Python source for this notebook is available here.
The installable python wheel pycfg is available here.