A Minimal Static Call Graph for Python Programs

Prerequisites
The graphical routines.
The GraphState
Extracting the graph
1. Module(stmt* body)
2. FunctionDef
3. Pass
4. Name
5. Return
6. Call()
7. Primitives
Arithmetic expressions
1. Assign(expr* targets, expr value)
Control structures
What remains?
Related
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

We previously discussed how one can write an interpreter Python, and we made use of this machinery in generating a control flow graph. In this post, we will show how one can extract the static call-graph using the same machinery. A call-graph is a directed graph data structure that encodes the structure of function calls in a program. One can extract the call-graph either dynamically or statically. A dynamic call-grah can be constructed by following the execution path of the program on multiple inputs. Such dynamic call-graphs under-approximate the actual structure of the program to the functions that was covered by the set of input. The dual of a dynamic call-graph is a static call-graph (the subject of this post) which is constructed by statically analyzing the program. Such call-graphs overapproximate the actual call-graph of the program (the exceptions are when there are higher order functions and othe polymorphism). As in control-flow graph, a static call-graph is another abstract view of the interpreter. Static call-graphs complement control flow graphs in static analysis. They allow one to identify which functions may be impacted by a change in one function (dependencies), evaluate reachability, and also can be used to implement various traversals of the code. Note that this is a limited proof of concept. It does not implement the entire traversal other than what is required for us to get our examples working.

Prerequisites

As before, we start with the prerequisite imports. Click on the bullets for more information on how to obtain them.

System Imports

These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.

matplotlib
networkx

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

The graphical routines.

The default class for Pyodide.

Use CLIGraphics if you are running it from the command line.

We use Pyodide here.

The to_graph() function produces a graph from the nodes in the registry. This is more simplified than in control-flow because there is only one kind of node in this graph – a function.

The GraphState

The call-graph is a graph, and hence we need a data structure for the graph.

We also need a node class.

Given that it is a directed graph node, we need the ability to add calls and callers.

The context is really the lexical context (scope) in which a function was defined. For example, functions may be defined in the context of modules or classes, and classes may be defined in the context of modules.

We next define two convenience methods for printing any node.

The usage of our graph structure is as below:

Extracting the graph

The call-graph is essentially a source code walker, and shares the basic structure with our interpreter.

As in previous posts, we define walk() that walks a given AST node. A major difference from the control-flow graph is that we want to keep track of the lexical context in which a function is defined, but not necessarily the sequence. So, our walk() accepts the node and the lexical context. It then invokes the various on_*() functions with the same list. Since we ignore the sequence of statements that resulted in a call, The return value is not used.

We also define parse which parses a given fragment and eval which given a fragment, parses and walks the AST.

Module(stmt* body)

We first define the Module. The module is a basic context from where calls to functions can be made.

FunctionDef

Similar to modules, we traverse function definitions looking for calls to other functions. Hence, we only have a basic traversal in place.

Pass

The pass statement is trivial. It does nothing but exist.

Example

Name

The name is just a skeleton. It has no function other than to exist.

Return

The return statement can also have embedded function calls. Hence, we traverse the value.

Example

Call()

A call creates a link between two functions.

Example

Expressions can have function calls embedded. Hence, we traverse the value.

Example

Primitives

Similar to name, primitives also have no function other than to exist.

Arithmetic expressions

For arithmetic expressions, we need to walk on the operands.

Assign(expr* targets, expr value)

For assignment, we walk the value.

Control structures

If

We now come to the control structures. For the if statement, we have two parallel paths. We first traverse the test expression, then the body of the if branch, and the body of the else branch.

Example

While

The while statement is more complex than the if statement, but the essential idea is the same.

Example

Break

Break and continue are simply skeletons.

For(expr target, expr iter, stmt* body, stmt* orelse, string? type_comment)

The For statement is again quite simple.

Example

What remains?

At this point, what we have is a very basic call-graph algorithm that understands a restricted subset of Python. However, a lot more is left.

The main problem is that if you have higher order functions, it is very hard to identify which functions can get called when the variable containing the function is invoked. For example, the call-graph of

def f(x):
    x(1)

is almost impossible to compute because x() can be any function including f() itself. Any polymorphic behavior (such as from objects) induce similar ambiguity.

For example, many objects may override various operators such as arithmetic or boolean operators, providing their own implementations. Given the dynamic nature of Python, we can’t identify such methods. That is, without full type inference, accurate call-graphs are impossible. Even given full type inference, we need accurate abstract interpretation of the program to determine possible assignments (which we do not do here).

That is, while basic call-graph construction is easier than control-flow construction, a complete call-graph is much harder than complete control-flow graph.

We note that this post is a proof-of-concept, more intended for understanding how to construct call-graphs than as a library for use in production. If you need such libraries, I recommend Pyan, which seems to work reasonably well, as well as PyCG which is more rigorous.

For a much more indepth treatment of this subject, see the paper by Salis et al.¹ at ICSE 2021 (the result of which is PyCG), by Yu² and by Abadi et al.³.

Artifacts

The runnable Python source for this notebook is available here.

V Salis, T Sotiropoulos, P Louridas, D Spinellis, and D Mitropoulos. Pycg: Practical call graph generation in python. ICSE 2021. ↩
L Yu. Empirical study of Python call graph ASE 2019. ↩
A Abadi, B Makovitzki, R Shemer, and S Tyszberowicz. NoCFG: A Lightweight Approach for Sound Call Graph Approximation. arXiv:2105.03099. 2021. ↩

Contents

Prerequisites

The graphical routines.

The GraphState

Extracting the graph

Module(stmt* body)

FunctionDef

Pass

Name

Return

Call()

Primitives

Arithmetic expressions

Assign(expr* targets, expr value)

Control structures

If

While

Break

For(expr target, expr iter, stmt* body, stmt* orelse, string? type_comment)

What remains?

Related

Artifacts