A simple runtime invariant miner
TL;DR: This post is a complete implementation of a Daikon-style runtime invariant miner in Python, including instrumentation, trace collection, candidate invariant checking, and implication-based suppression.
The Oracle Problem and “Likely” Invariants
A central challenge in software testing is the “Oracle Problem”: how do we programmatically determine if the output of a test is correct? Manually writing assertions (the “oracle”) is tedious and often incomplete.
Daikon-style mining provides an Approximate Oracle. By observing “gold” runs of a program, we can treat the resulting invariants as a specification of correct behavior.
There are two critical nuances to keep in mind:
- Likely, not Guaranteed: Because these invariants are derived inductively from observations, they are hypotheses. If your inputs only contain positive integers, the miner might report `x > 0` as an invariant, even if the function is meant to handle negatives.
- Descriptive vs. Prescriptive: A mined invariant describes what the code currently does, not necessarily what the programmer intended. If the “gold” version of the code has a bug, the miner will faithfully promote that bug into a requirement.
Despite being approximate, these oracles are incredibly effective for Regression Testing. If a refactor causes a previously stable invariant to fail, your “automated oracle” has just flagged a breaking change that a human might have missed.
The concept of runtime invariant mining is as follows. Given a program
and a set of inputs, we run the program under a tracer that records the
values of all variables at key points in the execution — function entry,
function exit, and so on. We then generate a large set of candidate
invariants (e.g. x >= 0, x == y, len(a) == n) and check each one
against every recorded state. A candidate that survives all observations
is reported as a likely invariant.
This idea was introduced by Ernst et al. in the Daikon system [1]. Daikon has been used to find real bugs, generate test oracles, and document program behaviour automatically.
Like our grammar miner, this post
hooks into sys.settrace and collects observations as the program runs.
The difference is that instead of recovering the structure of input
strings we recover logical properties of the program’s variables.
Synopsis
```python
results = mine_invariants(triangle, triangle_inputs)
for point, invs in results.items():
    print(point)
    for inv in invs:
        print(' ', inv)
```
Definitions
- A program point is a named location in a program’s execution at which variable values are observed. Typical program points are function entry (`ENTER`) and function exit (`EXIT`). By convention the name is `function_name:::POINT_TYPE`.
- A trace is a sequence of (program point, variable state) pairs collected during a single run of the program on one input.
- A variable state is a snapshot of all in-scope variable bindings at a given program point. For an `EXIT` point this also includes the return value bound to the key `'return'`.
- A candidate invariant is a predicate over a variable state that we wish to check. For example, `x >= 0` or `x == y`.
- An invariant survives a state if the predicate holds for that state. An invariant is falsified the first time it fails on any observed state, and is discarded thereafter.
- A likely invariant is a candidate that survived every observed state at its program point. It is called “likely” because we have only checked a finite set of inputs.
- Suppression is the process of removing redundant invariants. If `x == y` holds, then `x <= y` and `x >= y` are both implied by it and need not be reported separately.
Contents
- The Oracle Problem and “Likely” Invariants
- Synopsis
- Definitions
- Prerequisites
- Program Points
- TraceStore
- Instrumentation
- Candidate Invariants
- Invariant Engine
- Suppression
- Full Pipeline
- Scoped Program Points
- Pre/Post Relations
- Extended Example: Newton’s Method
- Utilities
- Performance
- Conclusion: The Value of “Good Enough” Oracles
- References
- Artifacts
Prerequisites
We only need the standard library for this post.
We will use the following simple functions as running examples throughout the post. They are small enough that we can predict their invariants by inspection, which lets us verify that the miner is working correctly.
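The running examples might look like the following sketch. The exact bodies in the original notebook may differ; the classification logic and the explicit `total` accumulator are assumptions chosen to match the invariants discussed later in the post.

```python
def triangle(a, b, c):
    """Classify a triangle by the lengths of its sides."""
    if a == b and b == c:
        return 'equilateral'
    if a == b or b == c or a == c:
        return 'isosceles'
    return 'scalene'

def sum_list(lst):
    """Sum a list with an explicit running total."""
    total = 0
    for x in lst:
        total = total + x
    return total
```

An input set consistent with the examples later in the post would be something like `[(3, 3, 3), (5, 5, 5), (3, 4, 5), (6, 6, 7), (2, 3, 4)]` for `triangle`.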
Program Points
Daikon’s most important design decision is that invariants are attached
to specific points in the program called program points. This means
that x >= 0 at foo:::ENTER and x >= 0 at foo:::EXIT are two
separate invariants. The distinction matters: a variable may satisfy a
property on entry but not on exit, or vice versa.
We represent a program point as a small named object. By convention
the name is function_name:::POINT_TYPE. We give it __eq__ and
__hash__ so it can be used as a dict key.
Two ProgramPoint objects with the same name must compare equal and
hash identically so that the same point created in different places
refers to the same dict entry.
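A minimal sketch of such a class (field names are our choice, not necessarily the notebook’s verbatim code):

```python
class ProgramPoint:
    """A named observation location, e.g. 'triangle:::ENTER'."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, ProgramPoint) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return self.name

# Two points with the same name resolve to the same dict entry.
d = {ProgramPoint('foo:::ENTER'): 1}
d[ProgramPoint('foo:::ENTER')] = 2
```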
TraceStore
The TraceStore collects observed variable states keyed by program
point name. Each entry is a list of state dicts, one per observation,
accumulated across all inputs.
Verify that observations accumulate correctly and that get returns
them in insertion order.
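A plausible sketch of the store (the `None`-point guard anticipates the root sentinel used in the scoped section later; exact method names are assumptions):

```python
from collections import defaultdict

class TraceStore:
    """Accumulates state dicts per program point, in insertion order."""
    def __init__(self):
        self.observations = defaultdict(list)

    def add(self, point, state):
        if point is None:
            return  # sentinel points are silently ignored
        self.observations[point].append(dict(state))

    def get(self, point):
        return self.observations[point]

    def points(self):
        return list(self.observations)

store = TraceStore()
store.add('foo:::ENTER', {'x': 1})
store.add('foo:::ENTER', {'x': 2})
```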
Instrumentation
To collect traces we hook into Python’s sys.settrace mechanism.
We need two pieces of infrastructure: a Context class that unpacks
the information we need from a raw Python frame, and an Instrumentor
class that owns the sys.settrace lifecycle and accumulates events.
Context
A Python trace callback receives a raw frame object. Context unpacks
the fields we care about: the method name, its declared parameter names,
the source file, and the current line number.
extract_vars returns all local bindings visible in the frame.
parameters filters those down to just the declared parameters — useful
at call time when we want only the arguments the caller passed in, not
any locals that the function body may have already created.
We verify Context directly here, before moving on to Instrumentor,
so a reader can see exactly what a frame looks like when unpacked.
We install a minimal one-shot callback as a closure that captures the
frame on the first call event and immediately removes itself. This
keeps all the machinery local to the test block.
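A sketch of `Context` together with the one-shot verification just described. The probe function and the exact attribute names are assumptions; the frame-unpacking fields (`f_code.co_name`, `co_varnames`, `f_locals`) are standard CPython frame attributes.

```python
import sys

class Context:
    """Unpacks the fields we need from a raw Python frame."""
    def __init__(self, frame):
        code = frame.f_code
        self.method = code.co_name
        self.file = code.co_filename
        self.line = frame.f_lineno
        self.parameter_names = list(code.co_varnames[:code.co_argcount])
        self.frame = frame

    def extract_vars(self):
        # all local bindings visible in the frame
        return dict(self.frame.f_locals)

    def parameters(self):
        # only the declared parameters, not locals created later
        return {k: v for k, v in self.frame.f_locals.items()
                if k in self.parameter_names}

# One-shot callback: capture the frame on the first 'call' event,
# then immediately remove the hook.
captured = {}
def one_shot(frame, event, arg):
    if event == 'call':
        captured['frame'] = frame
        sys.settrace(None)
    return None

def probe(a, b):
    c = a + b
    return c

sys.settrace(one_shot)
probe(3, 4)
sys.settrace(None)
ctx = Context(captured['frame'])
```

Because we hold a reference to the captured frame, its locals remain readable even after `probe` has returned.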
Instrumentor
Instrumentor owns the sys.settrace callback and drives the whole
observation process.
Which variable values are worth recording is a policy decision, not a
fixed part of the instrumentation. The default, default_interesting,
takes a variable name and its value and returns True if both are
worth recording. By default it excludes names starting with _ (Python
internals) and values of types our invariant templates cannot reason
about. A library user who wants to include private variables, track
additional types, or exclude specific names can pass their own
interesting=my_predicate to any instrumentor or collection function
without subclassing anything.
_tracer is the raw callback registered with sys.settrace. Python
calls it on every call, return, line, and exception event; we
respond only to call and return to keep overhead low. On call we
build a Context, check that the method name does not start with _
(to exclude Python internals), collect the variables accepted by
interesting, and store them as an ENTER state. On return we
collect all accepted locals plus the return value and store them as an
EXIT state. The callback always returns itself so Python keeps calling
it for nested frames.
run installs _tracer, calls the function, and always tears the
hook down in a finally block so we never leave a stale callback
behind even if the function raises.
Verify that a single run call produces both an ENTER and an EXIT
observation, that ENTER contains only the declared parameters, and
that EXIT contains the return value.
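The machinery above can be sketched as follows. For self-containment this version stands in a plain `defaultdict` for the `TraceStore` and uses flat `function:::TAG` point names; the recorded-type whitelist in `default_interesting` is an assumption.

```python
import sys
from collections import defaultdict

def default_interesting(name, value):
    """Record a binding only if both name and value are worth keeping."""
    if name.startswith('_'):
        return False  # exclude Python internals
    return isinstance(value, (bool, int, float, str, list, tuple, dict))

class Instrumentor:
    def __init__(self, interesting=default_interesting):
        self.store = defaultdict(list)  # point name -> list of state dicts
        self.interesting = interesting

    def _tracer(self, frame, event, arg):
        method = frame.f_code.co_name
        if method.startswith('_'):
            return self._tracer
        if event == 'call':
            # only the declared parameters, as passed by the caller
            params = frame.f_code.co_varnames[:frame.f_code.co_argcount]
            state = {k: v for k, v in frame.f_locals.items()
                     if k in params and self.interesting(k, v)}
            self.store['%s:::ENTER' % method].append(state)
        elif event == 'return':
            # all accepted locals plus the return value
            state = {k: v for k, v in frame.f_locals.items()
                     if self.interesting(k, v)}
            if self.interesting('return', arg):
                state['return'] = arg
            self.store['%s:::EXIT' % method].append(state)
        return self._tracer

    def run(self, fn, *args):
        sys.settrace(self._tracer)
        try:
            return fn(*args)
        finally:
            sys.settrace(None)  # never leave a stale hook behind

def add2(x, y):
    s = x + y
    return s

ins = Instrumentor()
ins.run(add2, 3, 4)
```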
A library user can restrict or extend what gets recorded by passing a
custom interesting predicate. Here we pass one that accepts only
integers, which means string return values and non-integer locals are
excluded. Note the predicate receives both name and value — a user
could, for example, include private variables by ignoring the name
filter entirely.
Collecting traces over multiple inputs
We run the function once per input and accumulate all observations in
the same store. Each element of inputs is a tuple of arguments —
a single-argument function wraps its argument in a one-tuple:
([1, 2, 3],). The interesting predicate is forwarded to
Instrumentor so callers can control what gets recorded without
touching the collection loop.
Collect traces for a range of triangle inputs.
All five inputs share the same two program points triangle:::ENTER and
triangle:::EXIT. Their observations are pooled, so the surviving
invariants will be those that hold across every input.
Candidate Invariants
An Invariant object wraps a predicate and tracks whether it has been
falsified. Once falsified it stays dead — we never resurrect it.
The predicate receives a state dict and should return a bool. Any
exception during evaluation counts as a falsification, which handles
type mismatches (e.g. comparing a string to a number) without special
casing.
Unary templates
Unary invariants involve a single variable. We generate a small fixed
set of templates for every variable we observe. Templates that do not
apply to the variable’s type (e.g. >= 0 on a string) will be
falsified on the first observation and quietly drop out.
Verify that a numeric state kills the string template but preserves
the numeric ones, and that a negative value kills >= 0.
Binary templates
Binary invariants involve two variables. We generate templates for every pair of variables that appear together in at least one state.
Verify that equal states keep == alive and a differing state kills it
while leaving <= alive.
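A sketch of the `Invariant` wrapper and both template generators, exercised on the scenario just described. The template set here is a small assumed subset of what the notebook generates:

```python
import operator

class Invariant:
    """A named predicate over a state dict; once falsified it stays dead."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.alive = True

    def check(self, state):
        if not self.alive:
            return
        try:
            if not self.predicate(state):
                self.alive = False
        except Exception:
            self.alive = False  # type mismatches count as falsification

    def __repr__(self):
        return self.name

def unary_invariants(var):
    return [
        Invariant(f'{var} >= 0', lambda s, v=var: s[v] >= 0),
        Invariant(f'{var} > 0', lambda s, v=var: s[v] > 0),
        Invariant(f'type({var}) is int', lambda s, v=var: type(s[v]) is int),
        Invariant(f'type({var}) is str', lambda s, v=var: type(s[v]) is str),
    ]

def binary_invariants(a, b):
    ops = {'==': operator.eq, '<=': operator.le, '>=': operator.ge}
    return [Invariant(f'{a} {op} {b}',
                      lambda s, x=a, y=b, f=fn: f(s[x], s[y]))
            for op, fn in ops.items()]

invs = unary_invariants('x') + binary_invariants('x', 'y')
for state in [{'x': 3, 'y': 3}, {'x': 5, 'y': 7}]:
    for inv in invs:
        inv.check(state)
alive = [inv.name for inv in invs if inv.alive]
```

The first state kills the string template, the second kills `x == y` and `x >= y`; only the numeric unary templates and `x <= y` survive.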
Invariant Engine
The engine takes a TraceStore, generates candidate invariants for each
program point, and checks them against all observations at that point.
A candidate that survives every observation is a likely invariant.
Generating candidates for a point
We collect all variable names that appear in any state at the point, then generate all unary and binary candidates over those names. The number of candidates grows quadratically with the number of variables, but for typical program points the variable count is small.
Verify candidate generation produces the templates we expect.
Checking candidates against observations
We test every candidate against every observed state. Candidates that remain alive after all observations are the likely invariants. The order of iteration does not matter because falsification is monotone: once dead, always dead.
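The checking loop can be sketched functionally, using bare `{name: predicate}` dicts instead of the `Invariant` objects described above; evaluation errors are treated as falsifications, as in the object version:

```python
def check_invariants(candidates, states):
    """candidates: {name: predicate}; returns names surviving all states."""
    alive = dict(candidates)
    for state in states:
        dead = []
        for name, pred in alive.items():
            try:
                ok = pred(state)
            except Exception:
                ok = False  # evaluation errors count as falsification
            if not ok:
                dead.append(name)
        for name in dead:
            del alive[name]  # once dead, always dead
    return sorted(alive)

states = [{'x': 3, 'y': 3}, {'x': 5, 'y': 7}]
candidates = {
    'x == y': lambda s: s['x'] == s['y'],
    'x <= y': lambda s: s['x'] <= s['y'],
    'x > 0':  lambda s: s['x'] > 0,
}
survivors = check_invariants(candidates, states)
```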
Running the full analysis
analyze iterates over every program point in the store, generates
fresh candidates for each one, checks them against that point’s
observations, and collects the survivors into a results dict keyed by
point name. Candidates are generated fresh for each point so that
falsification at one point does not affect another.
Run the engine on our triangle traces.
At triangle:::ENTER we see a >= 0, b >= 0, c >= 0, and
type(a) is int because all our inputs used positive integers.
At triangle:::EXIT we see type(return) is str. Notice that
pooling all five inputs means a == b does not survive — it holds
for (3,3,3) and (5,5,5) but not for (3,4,5). We will address
this in the scoped program points section below.
Suppression
The engine produces many redundant invariants. If x == y holds then
both x <= y and x >= y add no information. The suppression step
removes weaker invariants that are implied by stronger ones.
We maintain a simple implication table: if an invariant matching the
strong pattern is alive, we remove any co-surviving invariant
matching the weak pattern that involves the same variables.
_vars extracts variable names from an invariant’s name string by
tokenising and keeping only identifiers that are not keywords used by
our templates.
suppress walks the implication table and removes any invariant that
is dominated by a stronger co-surviving invariant over the same
variables.
Verify that == suppresses the redundant <= and >=.
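A compact sketch of the implication table and the suppression walk. Parsing invariant names by splitting on the operator is our simplification of the tokenising `_vars` helper described above:

```python
IMPLICATIONS = {'==': {'<=', '>='}, '<': {'<='}, '>': {'>='}}

def split_inv(name):
    """Split 'a OP b' into (frozenset of operands, OP); (None, None) if no match."""
    for op in ('==', '<=', '>=', '<', '>'):
        sep = ' %s ' % op
        if sep in name:
            lhs, rhs = name.split(sep, 1)
            return frozenset((lhs.strip(), rhs.strip())), op
    return None, None

def suppress(names):
    parsed = {n: split_inv(n) for n in names}
    out = set(names)
    for strong in names:
        svars, sop = parsed[strong]
        for weak in names:
            wvars, wop = parsed[weak]
            # drop a weaker invariant over the same variables
            if (weak != strong and svars is not None and svars == wvars
                    and wop in IMPLICATIONS.get(sop, set())):
                out.discard(weak)
    return sorted(out)

kept = suppress(['x == y', 'x <= y', 'x >= y', 'x > 0'])
```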
Full Pipeline
We now combine all the pieces into a single mine_invariants function.
Run the full pipeline on triangle.
Run the full pipeline on sum_list.
At sum_list:::EXIT the miner correctly discovers return == total,
which captures the loop invariant: the return value equals the running
total at the point the function exits.
Scoped Program Points
The flat naming above pools all calls to triangle into one program
point. This means the invariants we report are those that hold across
every input — an equilateral triangle and a scalene triangle share one
observation bucket, and properties specific to either are lost.
The fix is to give each invocation a unique identity. We introduce a
CallStack that assigns a monotonically increasing integer to each
call. The program point name becomes function[call_id]:::TAG, so
triangle[1]:::ENTER and triangle[2]:::ENTER are distinct points
whose observations never mix.
This is the same scoping idea used in our grammar miner to distinguish
separate calls to the same parsing helper. It also turns out to be the
key we need to link an ENTER observation to its corresponding EXIT
observation — something we will use in the next section.
The root sentinel (seq == 0) marks the bottom of the stack before any function has been entered.
Assign the next id, push onto the stack, and return the new method_id so the caller can use it to name the ENTER point.
Pop the current frame and restore the caller’s method_id.
Verify that enter and leave maintain the stack correctly and that
ids are unique and monotonically increasing across calls.
make_scoped_point produces the function[id]:::TAG name. The root
sentinel (seq == 0) is not a real call so we return None, which
TraceStore.add will silently ignore.
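The two pieces just described can be sketched together (method names follow the prose; the internal list representation is an assumption):

```python
class CallStack:
    """Assigns a unique, increasing id to every call; seq 0 is the root."""
    def __init__(self):
        self.counter = 0
        self.stack = [0]  # root sentinel

    def enter(self):
        self.counter += 1
        self.stack.append(self.counter)
        return self.counter

    def leave(self):
        self.stack.pop()
        return self.stack[-1]  # the caller's method_id

    def current(self):
        return self.stack[-1]

def make_scoped_point(method, seq, tag):
    if seq == 0:
        return None  # the root sentinel is not a real call
    return '%s[%d]:::%s' % (method, seq, tag)

cs = CallStack()
first = cs.enter()    # outermost call gets id 1
second = cs.enter()   # a nested call gets id 2
cs.leave()            # back in call 1
point = make_scoped_point('triangle', first, 'ENTER')
```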
ScopedInstrumentor extends Instrumentor by maintaining a
CallStack and using make_scoped_point instead of bare
function:::TAG names. On call it pushes the method onto the stack
and tags the ENTER point with the resulting id. On return it reads
the current id for the EXIT point, then pops the stack. The run
method is inherited unchanged from Instrumentor.
Verify that two calls to the same function produce distinct program points and that their observations do not mix.
collect_traces_scoped mirrors collect_traces but uses
ScopedInstrumentor. The interesting predicate is forwarded so
callers have the same control over what gets recorded as they do with
the flat version.
With scoped points each call to triangle gets its own pair, so the
five inputs produce triangle[1] through triangle[5].
The rest of the pipeline — InvariantEngine, SuppressionLattice,
mine_invariants — is unchanged. We just pass collect_traces_scoped
instead of collect_traces.
Run the scoped pipeline on triangle.
Now triangle[1]:::ENTER (the equilateral case (3,3,3)) reports
a == b, a == c, and b == c, while triangle[3]:::ENTER (the
scalene case (3,4,5)) reports only a <= b, a <= c, and
b <= c. The scoped points give us per-call precision that the flat
version cannot.
Pre/Post Relations
The invariants we have mined so far are single-point invariants:
each one says something about the program state at either ENTER or
EXIT, but never both at once. This misses the most interesting class
of invariant: relational invariants that relate the state a function
was called with to the state it left behind. For example:
- `a_exit == a_entry` — the argument `a` was not modified.
- `return_exit >= a_entry` — the return value is at least as large as the first argument on entry.
These are pre/post invariants in the sense of Hoare logic: they
relate the precondition (values at ENTER) to the postcondition
(values at EXIT).
To express them we need to pair each ENTER observation with its
corresponding EXIT observation from the same call. The call id
produced by CallStack is the join key: the ENTER and EXIT
observations for triangle[3] came from the same invocation and can
be safely merged into one combined state where every entry variable
carries an _entry suffix and every exit variable carries an _exit
suffix.
PairedTraceStore
PairedTraceStore accumulates ENTER and EXIT observations
separately, keyed by call id, then joins them on demand. Variables
that appear at both points get two entries — x_entry and x_exit
— which lets us ask directly whether the value changed.
Verify pairing with a simple example.
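A sketch of the paired store and its join, verified on a simple single-call example (the `(method, state)` tuple layout is our choice):

```python
class PairedTraceStore:
    """ENTER and EXIT states keyed by call id, joined on demand."""
    def __init__(self):
        self.enters = {}
        self.exits = {}

    def add_enter(self, call_id, method, state):
        self.enters[call_id] = (method, dict(state))

    def add_exit(self, call_id, method, state):
        self.exits[call_id] = (method, dict(state))

    def pairs(self):
        result = []
        for call_id, (method, enter) in sorted(self.enters.items()):
            if call_id not in self.exits:
                continue  # e.g. the call raised and never reached EXIT
            _, exit_state = self.exits[call_id]
            merged = {k + '_entry': v for k, v in enter.items()}
            merged.update((k + '_exit', v) for k, v in exit_state.items())
            result.append((method, call_id, merged))
        return result

ps = PairedTraceStore()
ps.add_enter(1, 'square', {'x': 3})
ps.add_exit(1, 'square', {'x': 3, 'return': 9})
pairs = ps.pairs()
```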
PairedInstrumentor
PairedInstrumentor extends ScopedInstrumentor. Instead of writing
to a flat TraceStore it routes ENTER observations to
paired_store.add_enter and EXIT observations to
paired_store.add_exit, keyed by the integer call id from CallStack.
The run method is inherited unchanged from Instrumentor.
collect_pairs
collect_pairs mirrors collect_traces but populates a
PairedTraceStore using PairedInstrumentor. The interesting
predicate is forwarded as with the other collection functions.
Verify collection on triangle.
Relational invariant templates
With entry and exit variables in the same state dict we can write invariant templates that span both. We generate two kinds:
- Change invariants — `x_exit OP x_entry` for every variable that appears at both points. These detect whether a variable was preserved, increased, or decreased.
- Return invariants — `return_exit OP x_entry` for every entry variable. These detect postconditions on the return value.
We reuse binary_invariants directly: the renamed variables
a_entry and a_exit are just variable names as far as it is
concerned.
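The two kinds of relational template can be sketched as one generator over a paired state, again using bare `{name: predicate}` candidates for self-containment:

```python
import operator

def relational_candidates(paired_state):
    """Generate {name: predicate} spanning entry and exit variables."""
    ops = {'==': operator.eq, '<=': operator.le, '>=': operator.ge}
    entry_vars = [k[:-len('_entry')] for k in paired_state
                  if k.endswith('_entry')]
    cands = {}
    for op, fn in ops.items():
        for v in entry_vars:
            # change invariants: x_exit OP x_entry
            if v + '_exit' in paired_state:
                name = '%s_exit %s %s_entry' % (v, op, v)
                cands[name] = (lambda s, a=v + '_exit', b=v + '_entry',
                               f=fn: f(s[a], s[b]))
            # return invariants: return_exit OP x_entry
            if 'return_exit' in paired_state:
                name = 'return_exit %s %s_entry' % (op, v)
                cands[name] = (lambda s, b=v + '_entry',
                               f=fn: f(s['return_exit'], s[b]))
    return cands

state = {'x_entry': 3, 'x_exit': 3, 'return_exit': 9}
cands = relational_candidates(state)
```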
RelationalEngine
RelationalEngine operates on the paired states from
PairedTraceStore.pairs(). It infers entry and exit variable names
from the suffix of each key in the first pair, generates relational
candidates, checks them against every pair, and returns the survivors.
Run the relational engine on triangle.
a_exit == a_entry, b_exit == b_entry, and c_exit == c_entry all
survive — triangle does not modify its arguments, which is correct.
return_exit relations to the numeric entry variables do not survive
because return is a string, so the numeric templates are falsified
immediately.
Run the relational engine on sum_list.
lst_exit == lst_entry survives — sum_list does not modify its
input list. The return-value relations to lst_entry do not survive
because lst is a list and our numeric templates do not apply to it.
The return_exit == total_exit relation found earlier by the
single-point engine is a same-point binary invariant at EXIT, not a
pre/post relation, so it does not appear here. The two engines are
complementary.
Full relational pipeline
Run the full relational pipeline on triangle.
Run the full relational pipeline on sum_list.
Extended Example: Newton’s Method
Newton’s method for finding square roots is a good stress test for the miner because it has a non-trivial loop, a convergence condition, and a clear mathematical postcondition: the return value squared should equal the input (to within floating-point tolerance).
The algorithm maintains a running estimate x and refines it by
averaging x and n/x until consecutive estimates differ by less
than a small epsilon.
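A plausible sketch of the two functions (the starting estimate and epsilon are assumptions; the notebook’s exact bodies may differ):

```python
def newton_step(x, n):
    """One refinement: average x and n/x."""
    return (x + n / x) / 2.0

def newton_sqrt(n, epsilon=1e-10):
    x = n if n >= 1 else 1.0  # any positive starting estimate works
    while True:
        nxt = newton_step(x, n)
        if abs(nxt - x) < epsilon:
            return nxt
        x = nxt
```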
We run the miner on a range of positive inputs. Newton’s method
converges for any positive n, so all inputs should produce a valid
return value.
At newton_sqrt:::ENTER we expect n >= 0 and n > 0 to survive
since all our inputs were positive. At newton_sqrt:::EXIT we expect
return >= 0 and return > 0.
Now the relational analysis, which is where the interesting postcondition lives.
return_exit >= n_entry should not survive in general — sqrt(0.25)
returns 0.5 which is less than 0.25 — and indeed it does not.
return_exit == n_entry also does not survive. What survives is
return_exit > 0 — the square root of a positive number is positive —
and the variable preservation invariants for intermediate locals.
The miner cannot discover return_exit ** 2 == n_entry because our
template set does not include quadratic relations. That is an
intentional limitation: Daikon itself uses a richer template language
including polynomial invariants. Adding a template is straightforward:
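Such a template might look like the following. The name format and the floating-point tolerance are our choices, and the predicate is written to compose with the rest of the candidate machinery:

```python
def quadratic_template(var):
    """Candidate: return_exit ** 2 == var_entry, within tolerance."""
    name = 'return_exit * return_exit == %s_entry' % var

    def predicate(state, v=var):
        # exact equality is hopeless for floats; compare within 1e-6
        return abs(state['return_exit'] ** 2 - state['%s_entry' % v]) < 1e-6

    return name, predicate

name, pred = quadratic_template('n')
state = {'n_entry': 9.0, 'return_exit': 3.0000000001}
```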
Re-run the relational engine with the quadratic template added.
return_exit * return_exit == n_entry now survives, which is the
mathematical specification of the square root function. This shows how
the template set determines what the miner can discover: the
infrastructure is general, and extending coverage is a matter of
adding more Invariant objects to the candidate list.
Also run the step function to observe its internal invariants.
newton_step:::ENTER shows x > 0 and n > 0 (all our inputs were
positive). newton_step:::EXIT shows return > 0 — the refined
estimate is always positive when both x and n are positive, which
is a genuine loop invariant of Newton’s method.
Utilities
Format and print the scoped triangle results.
Performance
The performance of this miner is dominated by the candidate-checking phase, which scales as approximately \(O(T \times V^2)\), where \(T\) is the number of observed traces and \(V\) is the number of variables in scope. This is cheap for typical functions, but the quadratic growth in \(V\) makes program points with many variables expensive.
Conclusion: The Value of “Good Enough” Oracles
We have built a system that moves from raw execution traces to structured logical properties. While our implementation is a simplified version of the full Daikon system, it illustrates the core power of dynamic invariant mining:
- Bridging the Specification Gap: Most software has no formal specification. By mining invariants, we create a “de facto” spec that describes how the code actually behaves.
- Templates are the Limit: As seen with `newton_sqrt`, the miner can only “see” what we give it templates for. The trade-off is between computational cost (more templates = more checks) and expressiveness.
- The Automated Guardrail: These mined invariants can be automatically turned into `assert` statements. This creates an approximate oracle—a suite of tests that ensures future changes don’t violate the established “truth” of the program’s execution.
Dynamic analysis doesn’t replace formal proof or manual testing, but it provides a remarkably low-effort way to document behavior and catch regressions in complex systems.
References

1. Michael D. Ernst, Jake Cockrell, William G. Griswold, David Notkin. “Dynamically Discovering Likely Program Invariants to Support Program Evolution.” IEEE Transactions on Software Engineering, 27(2):99–123, 2001.

Artifacts

The runnable Python source for this notebook is available here.