A Minimal Static Call Graph for Python Programs
Contents
- Prerequisites
- The graphical routines.
- The GraphState
- Extracting the graph
- Arithmetic expressions
- Control structures
- What remains?
- Related
- Artifacts
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
We previously discussed how one can write an interpreter Python, and we made use of this machinery in generating a control flow graph. In this post, we will show how one can extract the static call-graph using the same machinery. A call-graph is a directed graph data structure that encodes the structure of function calls in a program. One can extract the call-graph either dynamically or statically. A dynamic call-grah can be constructed by following the execution path of the program on multiple inputs. Such dynamic call-graphs under-approximate the actual structure of the program to the functions that was covered by the set of input. The dual of a dynamic call-graph is a static call-graph (the subject of this post) which is constructed by statically analyzing the program. Such call-graphs overapproximate the actual call-graph of the program (the exceptions are when there are higher order functions and othe polymorphism). As in control-flow graph, a static call-graph is another abstract view of the interpreter. Static call-graphs complement control flow graphs in static analysis. They allow one to identify which functions may be impacted by a change in one function (dependencies), evaluate reachability, and also can be used to implement various traversals of the code. Note that this is a limited proof of concept. It does not implement the entire traversal other than what is required for us to get our examples working.
Prerequisites
As before, we start with the prerequisite imports. Click on the bullets for more information on how to obtain them.
System Imports
These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.- matplotlib
- networkx
Available Packages
These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.The graphical routines.
The default class for Pyodide.
Use CLIGraphics
if you are running it from the
command line.
We use Pyodide here.
The to_graph()
function produces a graph from the nodes in the registry.
This is more simplified than in control-flow because there is only one kind
of node in this graph – a function.
The GraphState
The call-graph is a graph, and hence we need a data structure for the graph.
We also need a node class.
Given that it is a directed graph node, we need the ability to add calls and callers.
The context is really the lexical context (scope) in which a function was defined. For example, functions may be defined in the context of modules or classes, and classes may be defined in the context of modules.
We next define two convenience methods for printing any node.
The usage of our graph structure is as below:
Extracting the graph
The call-graph is essentially a source code walker, and shares the basic structure with our interpreter.
As in previous posts, we define walk()
that walks a given AST node.
A major difference from the control-flow graph is that we want to keep track
of the lexical
context in which a function is defined, but not necessarily the sequence.
So, our walk()
accepts the node and the lexical context. It then
invokes the various on_*()
functions with the same list. Since we ignore
the sequence of statements that resulted in a call, The return value is not
used.
We also define parse
which parses a given fragment and eval
which
given a fragment, parses and walks the AST.
Module(stmt* body)
We first define the Module
. The module is a basic context from where calls
to functions can be made.
FunctionDef
Similar to modules, we traverse function definitions looking for calls to other functions. Hence, we only have a basic traversal in place.
Pass
The pass statement is trivial. It does nothing but exist.
Example
Name
The name is just a skeleton. It has no function other than to exist.
Return
The return statement can also have embedded function calls. Hence, we traverse the value.
Example
Call()
A call creates a link between two functions.
Example
Expressions can have function calls embedded. Hence, we traverse the value.
Example
Primitives
Similar to name, primitives also have no function other than to exist.
Arithmetic expressions
For arithmetic expressions, we need to walk on the operands.
Assign(expr* targets, expr value)
For assignment, we walk the value.
Control structures
If
We now come to the control structures. For the if
statement, we have two
parallel paths. We first traverse the test expression, then the body of the
if branch, and the body of the else branch.
Example
While
The while
statement is more complex than the if
statement, but the
essential idea is the same.
Example
Break
Break and continue are simply skeletons.
For(expr target, expr iter, stmt* body, stmt* orelse, string? type_comment)
The For
statement is again quite simple.
Example
What remains?
At this point, what we have is a very basic call-graph algorithm that understands a restricted subset of Python. However, a lot more is left.
The main problem is that if you have higher order functions, it is very hard to identify which functions can get called when the variable containing the function is invoked. For example, the call-graph of
def f(x):
x(1)
is almost impossible to compute because x()
can be any function including
f()
itself. Any polymorphic behavior (such as from objects) induce similar
ambiguity.
For example, many objects may override various operators such as arithmetic or boolean operators, providing their own implementations. Given the dynamic nature of Python, we can’t identify such methods. That is, without full type inference, accurate call-graphs are impossible. Even given full type inference, we need accurate abstract interpretation of the program to determine possible assignments (which we do not do here).
That is, while basic call-graph construction is easier than control-flow construction, a complete call-graph is much harder than complete control-flow graph.
Related
We note that this post is a proof-of-concept, more intended for understanding how to construct call-graphs than as a library for use in production. If you need such libraries, I recommend Pyan, which seems to work reasonably well, as well as PyCG which is more rigorous.
For a much more indepth treatment of this subject, see the paper by Salis et al.1 at ICSE 2021 (the result of which is PyCG), by Yu2 and by Abadi et al.3.
Artifacts
The runnable Python source for this notebook is available here.
-
V Salis, T Sotiropoulos, P Louridas, D Spinellis, and D Mitropoulos. Pycg: Practical call graph generation in python. ICSE 2021. ↩
-
L Yu. Empirical study of Python call graph ASE 2019. ↩
-
A Abadi, B Makovitzki, R Shemer, and S Tyszberowicz. NoCFG: A Lightweight Approach for Sound Call Graph Approximation. arXiv:2105.03099. 2021. ↩