A Simple Runtime Invariant Miner

TLDR; This post is a complete implementation of a Daikon-style runtime invariant miner in Python, including instrumentation, trace collection, candidate invariant checking, and implication-based suppression.

The Oracle Problem and Likely Invariants

A central challenge in software testing is the Oracle Problem: how do we programmatically determine if the output of a test is correct? Manually writing assertions (the “oracle”) is tedious and often incomplete.

Daikon-style mining provides an Approximate Oracle. By observing gold runs of a program, we can treat the resulting invariants as a specification of correct behavior.

There are two critical nuances to keep in mind:

Likely, not Guaranteed: Because these invariants are derived inductively from observations, they are hypotheses. If your inputs only contain positive integers, the miner might report x > 0 as an invariant, even if the function is meant to handle negatives.
Descriptive vs. Prescriptive: A mined invariant describes what the code currently does, not necessarily what the programmer intended. If the “gold” version of the code has a bug, the miner will faithfully promote that bug into a requirement.

Despite being approximate, these oracles are incredibly effective for Regression Testing. If a refactor causes a previously stable invariant to fail, your “automated oracle” has just flagged a breaking change that a human might have missed.

The concept of runtime invariant mining is as follows. Given a program and a set of inputs, we run the program under a tracer that records the values of all variables at key points in the execution — function entry, function exit, and so on. We then generate a large set of candidate invariants (e.g. x >= 0, x == y, len(a) == n) and check each one against every recorded state. A candidate that survives all observations is reported as a likely invariant. This idea was introduced by Ernst et al. in the Daikon system ¹. Daikon has been used to find real bugs, generate test oracles, and document program behaviour automatically. Like our grammar miner, this post hooks into sys.settrace and collects observations as the program runs. The difference is that instead of recovering the structure of input strings we recover logical properties of the program’s variables.

Synopsis

results = mine_invariants(triangle, triangle_inputs)
for point, invs in results.items():
    print(point)
    for inv in invs:
        print('  ', inv)

Definitions

A program point is a named location in a program’s execution at which variable values are observed. Typical program points are function entry (ENTER) and function exit (EXIT). By convention the name is function_name:::POINT_TYPE.
A trace is a sequence of (program point, variable state) pairs collected during a single run of the program on one input.
A variable state is a snapshot of all in-scope variable bindings at a given program point. For an EXIT point this also includes the return value bound to the key 'return'.
A candidate invariant is a predicate over a variable state that we wish to check. For example, x >= 0 or x == y.
An invariant survives a state if the predicate holds for that state. An invariant is falsified the first time it fails on any observed state, and is discarded thereafter.
A likely invariant is a candidate that survived every observed state at its program point. It is called “likely” because we have only checked a finite set of inputs.
Suppression is the process of removing redundant invariants. If x == y holds, then x <= y and x >= y are both implied by it and need not be reported separately.

The Oracle Problem and Likely Invariants
Synopsis
Definitions
Prerequisites
Program Points
TraceStore
Instrumentation
Candidate Invariants
1. Unary templates
2. Binary templates
Invariant Engine
Suppression
Full Pipeline
Scoped Program Points
Pre/Post Relations
Extended Example: Newton’s Method
Utilities
Performance
Conclusion: The Value of Good Enough Oracles
References
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Prerequisites

We only need the standard library for this post.

We will use the following simple functions as running examples throughout the post. They are small enough that we can predict their invariants by inspection, which lets us verify that the miner is working correctly.

Program Points

Daikon’s most important design decision is that invariants are attached to specific points in the program called program points. This means that x >= 0 at foo:::ENTER and x >= 0 at foo:::EXIT are two separate invariants. The distinction matters: a variable may satisfy a property on entry but not on exit, or vice versa.

We represent a program point as a small named object. By convention the name is function_name:::POINT_TYPE. We give it __eq__ and __hash__ so it can be used as a dict key.

Two ProgramPoint objects with the same name must compare equal and hash identically so that the same point created in different places refers to the same dict entry.

TraceStore

The TraceStore collects observed variable states keyed by program point name. Each entry is a list of state dicts, one per observation, accumulated across all inputs.

Verify that observations accumulate correctly and that get returns them in insertion order.

Instrumentation

To collect traces we hook into Python’s sys.settrace mechanism. We need two pieces of infrastructure: a Context class that unpacks the information we need from a raw Python frame, and an Instrumentor class that owns the sys.settrace lifecycle and accumulates events.

Context

A Python trace callback receives a raw frame object. Context unpacks the fields we care about: the method name, its declared parameter names, the source file, and the current line number.

extract_vars returns all local bindings visible in the frame. parameters filters those down to just the declared parameters — useful at call time when we want only the arguments the caller passed in, not any locals that the function body may have already created.

We verify Context directly here, before moving on to Instrumentor, so a reader can see exactly what a frame looks like when unpacked. We install a minimal one-shot callback as a closure that captures the frame on the first call event and immediately removes itself. This keeps all the machinery local to the test block.

Instrumentor

Instrumentor owns the sys.settrace callback and drives the whole observation process.

Which variable values are worth recording is a policy decision, not a fixed part of the instrumentation. The default, default_interesting, takes a variable name and its value and returns True if both are worth recording. By default it excludes names starting with _ (Python internals) and values of types our invariant templates cannot reason about. A library user who wants to include private variables, track additional types, or exclude specific names can pass their own interesting=my_predicate to any instrumentor or collection function without subclassing anything.

_tracer is the raw callback registered with sys.settrace. Python calls it on every call, return, line, and exception event; we respond only to call and return to keep overhead low. On call we build a Context, check that the method name does not start with _ (to exclude Python internals), collect the variables accepted by interesting, and store them as an ENTER state. On return we collect all accepted locals plus the return value and store them as an EXIT state. The callback always returns itself so Python keeps calling it for nested frames.

run installs _tracer, calls the function, and always tears the hook down in a finally block so we never leave a stale callback behind even if the function raises.

def default_interesting(name, val):
    return (not name.startswith('_') and
            isinstance(val, (int, float, str, bool, list, tuple)))

class Instrumentor:
    def __init__(self, store, interesting=default_interesting):
        self.store       = store
        self.interesting = interesting

def _tracer(self, frame, event, arg):
        if event not in ('call', 'return'):
            return self._tracer
        cxt = Context(frame)
        if cxt.method.startswith('_'):
            return self._tracer
        all_vars = {k: v for k, v in cxt.extract_vars(frame).items()
                    if self.interesting(k, v)}
        if event == 'call':
            point = ProgramPoint('%s:::ENTER' % cxt.method)
            self.store.add(point, cxt.parameters(all_vars))
        else:
            point = ProgramPoint('%s:::EXIT' % cxt.method)
            state = dict(all_vars)
            if self.interesting('return', arg):
                state['return'] = arg
            self.store.add(point, state)
        return self._tracer

def run(self, fn, *args, **kwargs):
        sys.settrace(self._tracer)
        try:
            fn(*args, **kwargs)
        finally:
            sys.settrace(None)

Verify that a single run call produces both an ENTER and an EXIT observation, that ENTER contains only the declared parameters, and that EXIT contains the return value.

A library user can restrict or extend what gets recorded by passing a custom interesting predicate. Here we pass one that accepts only integers, which means string return values and non-integer locals are excluded. Note the predicate receives both name and value. A user could, for example, include private variables by ignoring the name filter entirely.

Collecting traces over multiple inputs

We run the function once per input and accumulate all observations in the same store. Each element of inputs is a tuple of arguments. A single-argument function wraps its argument in a one-tuple: ([1, 2, 3],). The interesting predicate is forwarded to Instrumentor so callers can control what gets recorded without touching the collection loop.

Collect traces for a range of triangle inputs.

All five inputs share the same two program points triangle:::ENTER and triangle:::EXIT. Their observations are pooled, so the surviving invariants will be those that hold across every input.

Candidate Invariants

An Invariant object wraps a predicate and tracks whether it has been falsified. Once falsified it stays dead — we never resurrect it. The predicate receives a state dict and should return a bool. Any exception during evaluation counts as a falsification, which handles type mismatches (e.g. comparing a string to a number) without special casing.

Unary templates

Unary invariants involve a single variable. We generate a small fixed set of templates for every variable we observe. Templates that do not apply to the variable’s type (e.g. >= 0 on a string) will be falsified on the first observation and quietly drop out.

Verify that a numeric state kills the string template but preserves the numeric ones, and that a negative value kills >= 0.

Binary templates

Binary invariants involve two variables. We generate templates for every pair of variables that appear together in at least one state.

Verify that equal states keep == alive and a differing state kills it while leaving <= alive.

Invariant Engine

The engine takes a TraceStore, generates candidate invariants for each program point, and checks them against all observations at that point. A candidate that survives every observation is a likely invariant.

Generating candidates for a point

We collect all variable names that appear in any state at the point, then generate all unary and binary candidates over those names. The number of candidates grows quadratically with the number of variables, but for typical program points the variable count is small.

Verify candidate generation produces the templates we expect.

Checking candidates against observations

We test every candidate against every observed state. Candidates that remain alive after all observations are the likely invariants. The order of iteration does not matter because falsification is monotone: once dead, always dead.

Running the full analysis

analyze iterates over every program point in the store, generates fresh candidates for each one, checks them against that point’s observations, and collects the survivors into a results dict keyed by point name. Candidates are generated fresh for each point so that falsification at one point does not affect another.

Run the engine on our triangle traces.

At triangle:::ENTER we see a >= 0, b >= 0, c >= 0, and type(a) is int because all our inputs used positive integers. At triangle:::EXIT we see type(return) is str. Notice that pooling all five inputs means a == b does not survive — it holds for (3,3,3) and (5,5,5) but not for (3,4,5). We will address this in the scoped program points section below.

Suppression

The engine produces many redundant invariants. If x == y holds then both x <= y and x >= y add no information. The suppression step removes weaker invariants that are implied by stronger ones.

We maintain a simple implication table: if an invariant matching the strong pattern is alive, we remove any co-surviving invariant matching the weak pattern that involves the same variables. _vars extracts variable names from an invariant’s name string by tokenising and keeping only identifiers that are not keywords used by our templates.

suppress walks the implication table and removes any invariant that is dominated by a stronger co-surviving invariant over the same variables.

Verify that == suppresses the redundant <= and >=.

Full Pipeline

We now combine all the pieces into a single mine_invariants function.

Run the full pipeline on triangle.

Run the full pipeline on sum_list.

At sum_list:::EXIT the miner correctly discovers return == total, which captures the loop invariant: the return value equals the running total at the point the function exits.

Scoped Program Points

The flat naming above pools all calls to triangle into one program point. This means the invariants we report are those that hold across every input — an equilateral triangle and a scalene triangle share one observation bucket, and properties specific to either are lost.

The fix is to give each invocation a unique identity. We introduce a CallStack that assigns a monotonically increasing integer to each call. The program point name becomes function[call_id]:::TAG, so triangle[1]:::ENTER and triangle[2]:::ENTER are distinct points whose observations never mix.

This is the same scoping idea used in our grammar miner to distinguish separate calls to the same parsing helper. It also turns out to be the key we need to link an ENTER observation to its corresponding EXIT observation — something we will use in the next section.

The root sentinel (seq == 0) marks the bottom of the stack before any function has been entered.

Assign the next id, push onto the stack, and return the new method_id so the caller can use it to name the ENTER point.

Pop the current frame and restore the caller’s method_id.

Verify that enter and leave maintain the stack correctly and that ids are unique and monotonically increasing across calls.

make_scoped_point produces the function[id]:::TAG name. The root sentinel (seq == 0) is not a real call so we return None, which TraceStore.add will silently ignore.

ScopedInstrumentor extends Instrumentor by maintaining a CallStack and using make_scoped_point instead of bare function:::TAG names. On call it pushes the method onto the stack and tags the ENTER point with the resulting id. On return it reads the current id for the EXIT point, then pops the stack. The run method is inherited unchanged from Instrumentor.

class ScopedInstrumentor(Instrumentor):
    def __init__(self, store, interesting=default_interesting):
        super().__init__(store, interesting)
        self.call_stack = CallStack()

def _tracer(self, frame, event, arg):
        if event not in ('call', 'return'):
            return self._tracer
        cxt = Context(frame)
        if cxt.method.startswith('_'):
            return self._tracer
        all_vars = {k: v for k, v in cxt.extract_vars(frame).items()
                    if self.interesting(k, v)}
        if event == 'call':
            mid   = self.call_stack.enter(cxt.method)
            point = make_scoped_point(mid, 'ENTER')
            self.store.add(point, cxt.parameters(all_vars))
        else:
            mid   = self.call_stack.current()
            point = make_scoped_point(mid, 'EXIT')
            state = dict(all_vars)
            if self.interesting('return', arg):
                state['return'] = arg
            self.store.add(point, state)
            self.call_stack.leave()
        return self._tracer

Verify that two calls to the same function produce distinct program points and that their observations do not mix.

collect_traces_scoped mirrors collect_traces but uses ScopedInstrumentor. The interesting predicate is forwarded so callers have the same control over what gets recorded as they do with the flat version.

With scoped points each call to triangle gets its own pair, so the five inputs produce triangle[1] through triangle[5].

The rest of the pipeline — InvariantEngine, SuppressionLattice, mine_invariants — is unchanged. We just pass collect_traces_scoped instead of collect_traces.

Run the scoped pipeline on triangle.

Now triangle[1]:::ENTER (the equilateral case (3,3,3)) reports a == b, a == c, and b == c, while triangle[3]:::ENTER (the scalene case (3,4,5)) reports only a <= b, a <= c, and b <= c. The scoped points give us per-call precision that the flat version cannot.

Pre/Post Relations

The invariants we have mined so far are single-point invariants: each one says something about the program state at either ENTER or EXIT, but never both at once. This misses the most interesting class of invariant: relational invariants that relate the state a function was called with to the state it left behind. For example:

a_exit == a_entry — the argument a was not modified.
return_exit >= a_entry — the return value is at least as large as the first argument on entry.

These are pre/post invariants in the sense of Hoare logic: they relate the precondition (values at ENTER) to the postcondition (values at EXIT).

To express them we need to pair each ENTER observation with its corresponding EXIT observation from the same call. The call id produced by CallStack is the join key: the ENTER and EXIT observations for triangle[3] came from the same invocation and can be safely merged into one combined state where every entry variable carries an _entry suffix and every exit variable carries an _exit suffix.

PairedTraceStore

PairedTraceStore accumulates ENTER and EXIT observations separately, keyed by call id, then joins them on demand. Variables that appear at both points get two entries — x_entry and x_exit — which lets us ask directly whether the value changed.

Verify pairing with a simple example.

PairedInstrumentor

PairedInstrumentor extends ScopedInstrumentor. Instead of writing to a flat TraceStore it routes ENTER observations to paired_store.add_enter and EXIT observations to paired_store.add_exit, keyed by the integer call id from CallStack. The run method is inherited unchanged from Instrumentor.

class PairedInstrumentor(ScopedInstrumentor):
    def __init__(self, paired_store, interesting=default_interesting):
        self.paired_store = paired_store
        self.store        = TraceStore()  # unused but satisfies super().__init__
        self.interesting  = interesting
        self.call_stack   = CallStack()

def _tracer(self, frame, event, arg):
        if event not in ('call', 'return'):
            return self._tracer
        cxt = Context(frame)
        if cxt.method.startswith('_'):
            return self._tracer
        all_vars = {k: v for k, v in cxt.extract_vars(frame).items()
                    if self.interesting(k, v)}
        if event == 'call':
            mid = self.call_stack.enter(cxt.method)
            _, call_id = mid
            self.paired_store.add_enter(call_id, cxt.parameters(all_vars))
        else:
            mid = self.call_stack.current()
            _, call_id = mid
            state = dict(all_vars)
            if self.interesting('return', arg):
                state['return'] = arg
            self.paired_store.add_exit(call_id, state)
            self.call_stack.leave()
        return self._tracer

collect_pairs

collect_pairs mirrors collect_traces but populates a PairedTraceStore using PairedInstrumentor. The interesting predicate is forwarded as with the other collection functions.

Verify collection on triangle.

Relational invariant templates

With entry and exit variables in the same state dict we can write invariant templates that span both. We generate two kinds:

Change invariants — x_exit OP x_entry for every variable that appears at both points. These detect whether a variable was preserved, increased, or decreased.
Return invariants — return_exit OP x_entry for every entry variable. These detect postconditions on the return value.

We reuse binary_invariants directly: the renamed variables a_entry and a_exit are just variable names as far as it is concerned.

RelationalEngine

RelationalEngine operates on the paired states from PairedTraceStore.pairs(). It infers entry and exit variable names from the suffix of each key in the first pair, generates relational candidates, checks them against every pair, and returns the survivors.

Run the relational engine on triangle.

a_exit == a_entry, b_exit == b_entry, and c_exit == c_entry all survive — triangle does not modify its arguments, which is correct. return_exit relations to the numeric entry variables do not survive because return is a string, so the numeric templates are falsified immediately.

Run the relational engine on sum_list.

lst_exit == lst_entry survives — sum_list does not modify its input list. The return-value relations to lst_entry do not survive because lst is a list and our numeric templates do not apply to it. The return_exit == total_exit relation found earlier by the single-point engine is a same-point binary invariant at EXIT, not a pre/post relation, so it does not appear here. The two engines are complementary.

Full relational pipeline

Run the full relational pipeline on triangle.

Run the full relational pipeline on sum_list.

Extended Example: Newton’s Method

Newton’s method for finding square roots is a good stress test for the miner because it has a non-trivial loop, a convergence condition, and a clear mathematical postcondition: the return value squared should equal the input (to within floating-point tolerance).

The algorithm maintains a running estimate x and refines it by averaging x and n/x until consecutive estimates differ by less than a small epsilon.

We run the miner on a range of positive inputs. Newton’s method converges for any positive n, so all inputs should produce a valid return value.

At newton_sqrt:::ENTER we expect n >= 0 and n > 0 to survive since all our inputs were positive. At newton_sqrt:::EXIT we expect return >= 0 and return > 0.

Now the relational analysis, which is where the interesting postcondition lives.

return_exit >= n_entry should not survive in general — sqrt(0.25) returns 0.5 which is less than 0.25 — and indeed it does not. return_exit == n_entry also does not survive. What survives is return_exit > 0 — the square root of a positive number is positive — and the variable preservation invariants for intermediate locals.

The miner cannot discover return_exit ** 2 == n_entry because our template set does not include quadratic relations. That is an intentional limitation: Daikon itself uses a richer template language including polynomial invariants. Adding a template is straightforward:

Re-run the relational engine with the quadratic template added.

return_exit * return_exit == n_entry now survives, which is the mathematical specification of the square root function. This shows how the template set determines what the miner can discover: the infrastructure is general, and extending coverage is a matter of adding more Invariant objects to the candidate list.

Also run the step function to observe its internal invariants.

newton_step:::ENTER shows x > 0 and n > 0 (all our inputs were positive). newton_step:::EXIT shows return > 0 — the refined estimate is always positive when both x and n are positive, which is a genuine loop invariant of Newton’s method.

Utilities

Format and print the scoped triangle results.

Performance

The performance of this miner is dominated by the candidate checking phase, which scales at approximately \(O(T \times V^2)\), where \(T\) is the number of observed traces and \(V\) is the number of variables in scope, making it highly efficient for standard functions but computationally expensive as the variable count grows.

Conclusion: The Value of Good Enough Oracles

We have built a system that moves from raw execution traces to structured logical properties. While our implementation is a simplified version of the full Daikon system, it illustrates the core power of dynamic invariant mining:

Bridging the Specification Gap: Most software has no formal specification. By mining invariants, we create a “de facto” spec that describes how the code actually behaves.
Templates are the Limit: As seen with newton_sqrt, the miner can only “see” what we give it templates for. The trade-off is between computational cost (more templates = more checks) and expressiveness.
The Automated Guardrail: These mined invariants can be automatically turned into assert statements. This creates an approximate oracle—a suite of tests that ensures future changes don’t violate the established “truth” of the program’s execution.

Dynamic analysis doesn’t replace formal proof or manual testing, but it provides a remarkably low-effort way to document behavior and catch regressions in complex systems.

Note: Dynamic invariant mining is the dual of Static analysis for invariant (we can also call these oracles in testing parlance) inference. The main difference is that the invariants being inferred is approximate but precise, while static analysis invariants are sound but imprecise. Briding the gap is impossible if arbitrary programs are the subjects.

Dynamic invariant mining is also a whitebox dual of blackbox specification inference similar to how (code) path coverage is a whitebox dual of input coverage using k-paths.

References

Artifacts

The runnable Python source for this notebook is available here.

Michael D. Ernst, Jake Cockrell, William G. Griswold, David Notkin. “Dynamically Discovering Likely Program Invariants to Support Program Evolution.” IEEE Transactions on Software Engineering, 27(2):99–123, 2001. ↩