Learning Regular Languages with RPNI Algorithm
TLDR; This tutorial is a complete implementation of the RPNI algorithm, which infers the input specification from positive and negative samples. Such grammars can be useful for identifying the specification of blackbox programs when it is impossible to query the program directly, and only its behaviour on a limited set of samples is known. The Python interpreter is embedded so that you can work through the implementation steps.
In my previous posts, I have discussed how to parse with, fuzz with, and manipulate regular and context-free grammars. However, the availability of the input grammars may not be guaranteed in many cases. Often, the available documentation may not fully capture the input specification. For example, an input field advertising the ISO 8601 date format as input may accept slight variations, or may not implement the format fully. In such cases, one has to rely on the available behaviour to infer the true specification.
Regular Positive and Negative Inference, RPNI for short, is a classical algorithm that allows us to determine the input specification from a set of positive and negative samples. It was introduced by Oncina and García in 1992 1. The core idea of RPNI is to construct a Prefix Tree Acceptor (PTA, also called a Trie), then iteratively try to merge each pair of states, checking the resulting DFA against the negative samples to verify that it is not overgeneralizing. Because the new DFA is produced by merging states within the DFA, it will continue to accept all existing positive samples. The negative samples provide the defense against overgeneralization: if any negative sample is accepted, the merge of that state pair is rejected. Continuing in this fashion, RPNI computes a DFA that accepts all positive samples and rejects all negative samples.
Two things to note. First, every DFA has a characteristic set of positive and negative samples which, when given to the RPNI algorithm, results in the exact DFA of the blackbox; that is, RPNI infers the precise DFA in the limit. Secondly, L* is called an active grammar inference algorithm because it is able to query the blackbox, while RPNI is called a passive grammar inference algorithm because it cannot query the blackbox.
Definitions
- Input symbol: A single symbol that is consumed by the machine which can move it from one state to another. The set of such symbols is called an alphabet, and is represented by \(A\).
- DFA: A Deterministic Finite-state Automaton is a machine which is a model of an input processor. It is represented as a set of discrete states, and transitions between them. The DFA is initialized to the start state.
- State: A state is one of the components of the DFA, and represents a particular phase of the machine. A state transitions to another when it is fed an input symbol. Some states are marked as accepting.
- Accepting State: If, starting from the start state and after consuming all the symbols in the input, the state reached is one of the accepting states, then we say that the input was accepted by the machine.
- Starting State: The machine starts at the start state.
- Transition: A transition is a link between two states. A DFA transitions from one state to the next on consuming an input symbol.
- Terminal Symbol: A terminal symbol is a symbol in the alphabet of the language in a regular grammar. It is the grammar equivalent of a DFA transition.
- Nonterminal Symbol: A nonterminal symbol defines the rules of expansion in the grammar. A nonterminal symbol is associated with a definition in the grammar, which contains its expansion rules. It corresponds to a state in the DFA.
- Definition: A set of rules which can be applied to a nonterminal symbol for expansion.
- Rule: A single way to expand a nonterminal symbol. In the DFA context, a rule contains the transition symbol as the token, and the target state as the trailing nonterminal symbol.
Contents
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around the Run all button.
Available Packages
These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
- simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
- rxcanonical-0.0.1-py2.py3-none-any.whl from "Converting a Regular Expression to DFA using Regular Grammar".
- gatleastsinglefault-0.0.1-py2.py3-none-any.whl from "Specializing Context-Free Grammars for Inducing Faults".
- earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
- hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
- ddset-0.0.1-py2.py3-none-any.whl from "Simple DDSet".
- rxfuzzer-0.0.1-py2.py3-none-any.whl from "Fuzzing With Regular Expressions".
- rxregular-0.0.1-py2.py3-none-any.whl from "Regular Expression to Regular Grammar".
- rgtorx-0.0.1-py2.py3-none-any.whl from "Regular Grammars to Regular Expressions".
- minimizeregulargrammar-0.0.1-py2.py3-none-any.whl from "Minimize Regular Grammar".
Prerequisites
We need the fuzzer to generate inputs to parse, and also to provide some utilities such as conversion of regular expressions to grammars, random sampling from grammars, etc. Hence, we import all that.
Grammar Inference
As before, let us assume that the specification to be inferred is a regular language.
So, given a list of inputs which contains strings that were accepted (positive samples) and strings that were rejected (negative samples), and without the ability to query the blackbox program or to look inside it, how do you find out what the program accepts? We start by constructing a PTA, which is then generalized into a DFA. The intuition here is that the PTA is already an over-specialized DFA, and all we need to do is to identify which of its states are actually the same. So we keep merging pairs of states, checking each time that the constraint, i.e. that all negative strings are rejected, still holds. This is the key intuition for all algorithms that work with positive and negative samples only.
The RPNI Algorithm
RPNI is a classic algorithm for learning regular languages from positive and negative strings. As we mentioned before, the algorithm has two phases.
- Building a Prefix Tree Acceptor: Create a PTA accepting all positive samples.
- State Merging: Iteratively merge states to generalize the DFA while maintaining the constraint that all negative samples are rejected.
The Reg Parse
Before we go further, let us define a few helper functions. First, we need to represent a DFA. A Deterministic Finite Automaton can be adequately represented by a right linear regular grammar. That is, each nonterminal is equivalent to a state in the DFA. The start symbol of the grammar is the entry point of the DFA. A transition from a state A to another state B is indicated by a rule <A> := b <B>. That is, state A transitions to state B on receiving the token b. This is represented as a Python dictionary.
{
    '<A>' : [['b', '<B>'], ['c', '<C>']],
    '<B>' : [['b', '<B>'], []],
    '<C>' : [[]]
}
As you can notice in the above representation, the accepting states are indicated by adding [] to the state. In the above, both <B> and <C> are accepting states. To make them easier to recognize, we insist that [] only be added as the last rule.
With this in mind, it becomes easy to define a matcher for this DFA. It is
close to our original peg_parse, but with additional handling for the
accepting states.
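To make the idea concrete, here is a minimal sketch of such a matcher, assuming the dictionary format above. The name reg_parse and its signature are illustrative choices rather than a definitive implementation.

```python
def reg_parse(s, grammar, start):
    """Check whether the string s is accepted by the DFA encoded as a
    right linear regular grammar with the given start symbol."""
    state = start
    for c in s:
        # In a DFA there is at most one transition on c from this state.
        targets = [r[1] for r in grammar[state] if r and r[0] == c]
        if not targets:
            return False  # no transition on this symbol: reject
        state = targets[0]
    # Accept only if the state we end in is accepting, i.e. it has [].
    return [] in grammar[state]
```

For example, with the grammar above, reg_parse('bbb', g, '<A>') and reg_parse('c', g, '<A>') succeed, while reg_parse('cb', g, '<A>') fails.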
Reg parse
Let us test the reg parse.
RPNI Algorithm
We next detail the RPNI algorithm. The first step is to build the PTA.
Building the Prefix Tree Acceptor (PTA)
We construct a PTA from all positive examples.
The PTA data structure
A PTA is nothing but a simple DFA in the form of a tree. We wrap our grammar in a simple class.
We build a Prefix Tree Acceptor from the positive examples. Given an example, we take it apart into characters, then take the first character and identify the transition (i.e. the rule) from the start symbol that contains that character. Since this is a new DFA, such a transition will not be found. Hence, we create a new rule with a new nonterminal. Then we proceed with the next character in the input in the same way, extending the transitions step by step in the grammar. We mark the final nonterminal created as accepting.
From the second example onward, we can start by tracing the existing transitions. But as before, if a transition cannot be found, we create a new nonterminal representing the new state, and add the corresponding transition to the grammar.
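The following is a minimal sketch of this construction under the conventions above; the function name build_pta and the <0>, <1>, ... state naming scheme are assumptions made for illustration.

```python
def build_pta(positive_samples, start='<0>'):
    """Sketch: build a prefix tree acceptor as a right linear grammar.
    Each example is traced from the start symbol; whenever a transition
    on the next character does not yet exist, a fresh nonterminal is
    created. The state reached at the end of an example is marked
    accepting by adding the empty rule []."""
    grammar, counter = {start: []}, [0]

    def new_state():
        counter[0] += 1
        return '<%d>' % counter[0]

    for sample in positive_samples:
        state = start
        for c in sample:
            targets = [r[1] for r in grammar[state] if r and r[0] == c]
            if targets:
                state = targets[0]
            else:
                s_ = new_state()
                grammar[state].append([c, s_])
                grammar[s_] = []
                state = s_
        if [] not in grammar[state]:
            grammar[state].append([])
    # keep the accepting marker [] as the last rule, per our convention
    for k in grammar:
        if [] in grammar[k]:
            grammar[k] = [r for r in grammar[k] if r != []] + [[]]
    return grammar, start
```

For instance, build_pta(['ab', 'abb', 'b']) yields a tree-shaped grammar in which reg_parse('abb', g, s) succeeds but reg_parse('a', g, s) fails.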
Let us test it out!
State Merging
We now check whether two states can be merged without accepting any negative example. Two states are compatible if we can represent both of them by a new common state. The idea is to create a new DFA where, first, we replace both states by a new state, which gives us a regular grammar; then we convert it to a canonical regular grammar; and finally we check that it still rejects all the negative samples.
First, we define a helper function to merge two states. We start by creating the name of the new state, which we call <or(state1,state2)>. Then, we replace the nonterminals representing state1 and state2 in all rules with the nonterminal <or(state1,state2)>. Once we do this, the regular grammar we obtain is no longer a simple DFA, but an NFA, i.e., a Nondeterministic Finite Automaton. We will need to reduce this back to a DFA later.
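Here is a sketch of such a helper; the exact naming scheme for the combined nonterminal is an assumption on my part.

```python
def merge_states(grammar, start, s1, s2):
    """Sketch: replace every occurrence of the nonterminals s1 and s2
    with a single combined nonterminal. The result is, in general, an
    NFA, since the merged state may now carry more than one transition
    on the same symbol."""
    merged = '<or(%s,%s)>' % (s1[1:-1], s2[1:-1])
    rename = lambda k: merged if k in (s1, s2) else k
    new_grammar = {}
    for k in grammar:
        nk = rename(k)
        new_grammar.setdefault(nk, [])
        for r in grammar[k]:
            nr = [r[0], rename(r[1])] if r else []
            if nr not in new_grammar[nk]:
                new_grammar[nk].append(nr)
    # keep the accepting marker [] as the last rule, per our convention
    for k in new_grammar:
        if [] in new_grammar[k]:
            new_grammar[k] = [r for r in new_grammar[k] if r != []] + [[]]
    return new_grammar, rename(start)
```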
Let us test it. We can convert an NFA to a DFA by calling canonical_regular_grammar() from the earlier posts. We need a DFA to parse the input using dfa_parse. So we do that.
When merging two states, we also need to verify that the end result is consistent with the negative examples. So, we define this function.
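A sketch of this check, built on the sketches above: we tentatively merge the two states, determinize the result, and accept the merge only if every negative sample is still rejected. I am assuming here that canonical_regular_grammar() takes a grammar and a start symbol and returns a determinized (grammar, start) pair in the same format, and I reuse the reg_parse sketch in place of dfa_parse; check the earlier posts for the actual interfaces.

```python
def can_merge(grammar, start, s1, s2, negative_samples):
    """Sketch: tentatively merge s1 and s2 and check consistency with
    the negative samples. Returns the merged, determinized grammar and
    its start symbol, or None if the merge overgeneralizes."""
    g, s = merge_states(grammar, start, s1, s2)
    # NFA -> DFA; the signature of canonical_regular_grammar is assumed.
    g, s = canonical_regular_grammar(g, s)
    if any(reg_parse(n, g, s) for n in negative_samples):
        return None  # some negative sample is now accepted: reject merge
    return g, s
```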
Let us test it.
RPNI algorithm implementation
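Putting the pieces together, the overall loop might look like the sketch below. Note that the classical presentation of RPNI considers the PTA states in lexicographic order and tries to merge each state with earlier states in that order; the simplified loop here just retries all pairs until no merge succeeds, so treat it as an illustration of the structure built on the sketches above rather than a faithful implementation.

```python
def rpni(positive_samples, negative_samples):
    """Sketch of the overall RPNI loop: build the PTA from the positive
    samples, then repeatedly merge any pair of states whose merge still
    rejects all negative samples, until no further merge is possible."""
    g, s = build_pta(positive_samples)
    merged = True
    while merged:
        merged = False
        states = list(g.keys())
        for i, s1 in enumerate(states):
            for s2 in states[i+1:]:
                res = can_merge(g, s, s1, s2, negative_samples)
                if res is not None:
                    g, s = res  # keep the merge and restart the scan
                    merged = True
                    break
            if merged:
                break
    return g, s
```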
A Simple Example
Let us use RPNI to learn a simple pattern: strings over {a, b} that end with ‘b’.
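The experiment might be set up as in the sketch below; the particular sample strings are illustrative choices, and the DFA that is inferred depends on exactly which samples are supplied.

```python
# positive samples: strings over {a, b} that end with 'b'
positives = ['b', 'ab', 'bb', 'aab', 'abb', 'bab', 'bbb']
# negative samples: strings that do not end with 'b'
negatives = ['', 'a', 'aa', 'ba', 'aba', 'bba']
g, s = rpni(positives, negatives)
for w in ['aaab', 'abab', 'aaba']:
    print(repr(w), reg_parse(w, g, s))
```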
The result illustrates a limitation of both RPNI and the DFA minimization algorithm: we cannot factor out complex loops from the DFA, because DFA minimization is based purely on state equivalence. When there are complex loops, simple pairwise state equivalence fails to identify them.
Next, let us attempt a larger example. Let us keep the negative strings, but use a fuzzer to generate positive strings.
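One way to obtain such positive strings, without depending on any particular fuzzer interface, is a small stand-in generator like the one sketched below (in the runnable version, a fuzzer such as the simplefuzzer package listed above plays this role). It reuses the negatives list and the rpni sketch from before.

```python
import random

def gen_positive(n, max_len=10):
    """Stand-in for the fuzzer: generate n distinct random strings over
    {a, b} that end with 'b', i.e. positive samples for our language."""
    samples = set()
    while len(samples) < n:
        k = random.randint(0, max_len - 1)
        samples.add(''.join(random.choice('ab') for _ in range(k)) + 'b')
    return sorted(samples)

positives = gen_positive(20)
g, s = rpni(positives, negatives)
```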
Complexity and Limitations
Time Complexity: The RPNI algorithm is \(O(p^3 n)\) where p is the size (sum of the lengths of all strings) of positive data, and n is the size of negative data.
Extensions and Improvements
Some of the important extensions to RPNI include EDSM 2 and RPNI2 3.
Artifacts
The runnable Python source for this notebook is available here.
The installable python wheel rpni is available here.
1. J. Oncina and P. García, Inferring regular languages in polynomial update time, 1992 ↩
2. Kevin J. Lang, Barak A. Pearlmutter, and Rodney A. Price, Results of the Abbadingo One DFA Learning Competition and a New Evidence-Driven State Merging Algorithm, 1998 ↩
3. Pierre Dupont, Incremental Regular Inference, 2005 ↩