Learning Regular Languages with RPNI Algorithm
TLDR; This tutorial is a complete implementation of the RPNI algorithm, which infers the input specification from positive and negative samples. Such grammars can be useful for identifying the specification of blackbox programs when it is impossible to query the program directly, and only its behaviour on a limited set of samples is known. The Python interpreter is embedded so that you can work through the implementation steps.
In my previous posts, I have discussed how to parse with, fuzz with, and manipulate regular and context-free grammars. However, the availability of the input grammars may not be guaranteed in many cases. Often, the available documentation may not fully capture the input specification. For example, an input field advertising the ISO 8601 date format as input may accept slight variations, or may not implement the format fully. In such cases, one has to rely on the available behaviour to infer the true specification.
Regular Positive and Negative Inference, RPNI for short, is a classical algorithm that allows us to determine the input specification from a set of positive and negative samples. It was introduced by Oncina and García in 1992 1. The core idea of RPNI is to construct a Prefix Tree Acceptor (PTA, also called a Trie), then iteratively try to merge each pair of states, checking the resulting DFA against the negative samples to verify that it is not overgeneralizing. Because the new DFA is produced by merging states within the DFA, it will continue to accept all existing positive samples. The negative samples provide the defense against overgeneralization: if any negative sample is accepted, the merge of that state pair is rejected. Continuing in this fashion, RPNI computes a DFA that accepts all positive samples and rejects all negative samples.
Two things to note. First, every DFA has a characteristic set of positive and negative samples which, when given to the RPNI algorithm, results in the exact DFA of the blackbox; that is, RPNI infers the precise DFA in the limit. Secondly, L* is called an active grammar inference algorithm because it is able to query the blackbox, while RPNI is called a passive grammar inference algorithm because it cannot query the blackbox.
Definitions
- Input symbol: A single symbol that is consumed by the machine which can move it from one state to another. The set of such symbols is called an alphabet, and is represented by \(A\).
- DFA: A Deterministic Finite-state Automaton is a machine which is a model of an input processor. It is represented as a set of discrete states, and transitions between them. The DFA is initialized to the start state.
- State: A state is one of the components of the DFA, and represents a particular phase of the machine. A state transitions to another when it is fed an input symbol. Some states are marked as accepting.
- Accepting State: If, starting from the start state and after consuming all the symbols in the input, the state reached is one of the accepting states, then we say that the input was accepted by the machine.
- Starting State: The machine starts at the start state.
- Transition: A transition is a link between two states. A DFA transitions from one state to the next on consuming an input symbol.
- Terminal Symbol: A terminal symbol is a symbol in the alphabet of the language in a regular grammar. It is the grammar equivalent of a DFA transition.
- Nonterminal Symbol: A nonterminal symbol defines the rules of expansion in the grammar. A nonterminal symbol is associated with a definition in the grammar, which contains its expansion rules. It corresponds to a state in the DFA.
- Definition: A set of rules which can be applied to a nonterminal symbol for expansion.
- Rule: A single way to expand a nonterminal symbol. In the DFA context, a rule contains the transition symbol as the token, and the target state as the trailing nonterminal symbol.
Contents
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around the Run all button.
Available Packages
These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
- simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
- rxcanonical-0.0.1-py2.py3-none-any.whl from "Converting a Regular Expression to DFA using Regular Grammar".
- gatleastsinglefault-0.0.1-py2.py3-none-any.whl from "Specializing Context-Free Grammars for Inducing Faults".
- earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
- hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
- ddset-0.0.1-py2.py3-none-any.whl from "Simple DDSet".
- rxfuzzer-0.0.1-py2.py3-none-any.whl from "Fuzzing With Regular Expressions".
- rxregular-0.0.1-py2.py3-none-any.whl from "Regular Expression to Regular Grammar".
- rgtorx-0.0.1-py2.py3-none-any.whl from "Regular Grammars to Regular Expressions".
- minimizeregulargrammar-0.0.1-py2.py3-none-any.whl from "Minimize Regular Grammar".
Prerequisites
We need the fuzzer to generate inputs to parse, and also to provide some utilities such as conversion of regular expressions to grammars, random sampling from grammars, etc. Hence, we import all that.
Grammar Inference
As before, let us assume that the specification to be inferred is a regular language.
So, given a list of inputs which contains strings that were accepted (positive samples) and strings that were rejected (negative samples), and without the ability to query the blackbox program or to look inside it, how do you find out what the program accepts? We start by constructing a PTA, which is then generalized into a DFA. The intuition here is that the PTA is already an over-specialized DFA, and all we need to do is to identify which of its states are actually the same. So we keep merging pairs of states, checking each time that the constraint, i.e. that all negative strings are rejected, still holds. This is the key intuition for all algorithms that work with positive and negative samples only.
The RPNI Algorithm
RPNI is a classic algorithm for learning regular languages from positive and negative strings. As we mentioned before, the algorithm has two phases.
- Building a Prefix Tree Acceptor: Create a PTA accepting all positive samples.
- State Merging: Iteratively merge states to generalize the DFA while maintaining the constraint that all negative samples are rejected.
The Reg Parse
Before we go further, let us define a few helper functions. First, we need to represent a DFA. A Deterministic Finite Automaton can be adequately represented by a right linear regular grammar. That is, each nonterminal is equivalent to a state in the DFA. The start symbol of the grammar is the entry point of the DFA. A transition from a state A to another state B is indicated by a rule <A> := b <B>. That is, state A transitions to state B on receiving the token b. This is represented as a Python dictionary.
{
    '<A>' : [['b', '<B>'], ['c', '<C>']],
    '<B>' : [['b', '<B>'], []],
    '<C>' : [[]]
}
As you can notice in the above representation, the accepting states are indicated by adding [] to the state. In the above, both <B> and <C> are accepting states. To make them easier to recognize, we insist that [] only be added as the last rule.
With this in mind, it becomes easy to define a matcher for this DFA. It is
close to our original peg_parse, but with additional handling for the
accepting states.
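To make the idea concrete, here is a minimal sketch of such a matcher, assuming the dictionary format above. The name reg_parse and its signature are illustrative choices rather than a definitive implementation.

```python
def reg_parse(s, grammar, start):
    """Check whether the string s is accepted by the DFA encoded as a
    right linear regular grammar with the given start symbol."""
    state = start
    for c in s:
        # In a DFA there is at most one transition on c from this state.
        targets = [r[1] for r in grammar[state] if r and r[0] == c]
        if not targets:
            return False  # no transition on this symbol: reject
        state = targets[0]
    # Accept only if the state we end in is accepting, i.e. it has [].
    return [] in grammar[state]
```

For example, with the grammar above, reg_parse('bbb', g, '<A>') and reg_parse('c', g, '<A>') succeed, while reg_parse('cb', g, '<A>') fails.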
Reg parse
Let us test the reg parse.
RPNI Algorithm
We next detail the RPNI algorithm. The first step is to build the PTA.
Building the Prefix Tree Acceptor (PTA)
We construct a PTA from all positive examples.
The PTA data structure
A PTA is nothing but a simple DFA in the form of a tree. We wrap our grammar in a simple class.
We build a Prefix Tree Acceptor from the positive examples. Given an example, we take it apart into characters, then take the first character and identify the transition (i.e. the rule) from the start symbol that contains that character. Since this is a new DFA, such a transition will not be found. Hence, we create a new rule with a new nonterminal. Then we proceed with the next character in the input in the same way, extending the transitions step by step in the grammar. We mark the final nonterminal created as accepting.
From the second example onward, we can start by tracing the existing transitions. But as before, if a transition cannot be found, we create a new nonterminal representing the new state, and add the corresponding transition to the grammar.
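The following is a minimal sketch of this construction under the conventions above; the function name build_pta and the <0>, <1>, ... state naming scheme are assumptions made for illustration.

```python
def build_pta(positive_samples, start='<0>'):
    """Sketch: build a prefix tree acceptor as a right linear grammar.
    Each example is traced from the start symbol; whenever a transition
    on the next character does not yet exist, a fresh nonterminal is
    created. The state reached at the end of an example is marked
    accepting by adding the empty rule []."""
    grammar, counter = {start: []}, [0]

    def new_state():
        counter[0] += 1
        return '<%d>' % counter[0]

    for sample in positive_samples:
        state = start
        for c in sample:
            targets = [r[1] for r in grammar[state] if r and r[0] == c]
            if targets:
                state = targets[0]
            else:
                s_ = new_state()
                grammar[state].append([c, s_])
                grammar[s_] = []
                state = s_
        if [] not in grammar[state]:
            grammar[state].append([])
    # keep the accepting marker [] as the last rule, per our convention
    for k in grammar:
        if [] in grammar[k]:
            grammar[k] = [r for r in grammar[k] if r != []] + [[]]
    return grammar, start
```

For instance, build_pta(['ab', 'abb', 'b']) yields a tree-shaped grammar in which reg_parse('abb', g, s) succeeds but reg_parse('a', g, s) fails.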
Let us test it out!
State Merging
We now check whether two states can be merged without accepting any negative example. Two states are compatible if we can represent both of them by a new common state. The idea is to create a new DFA where, first, we replace both states by a new state, which gives us a regular grammar; then we convert it to a canonical regular grammar; and finally we check that it still rejects all the negative samples.
First, we define a helper function to merge two states. We start by creating the name of the new state, which we call <or(state1,state2)>. Then, we replace the nonterminals representing state1 and state2 in all rules with the nonterminal <or(state1,state2)>. Once we do this, the regular grammar we obtain is no longer a simple DFA, but an NFA, i.e., a Nondeterministic Finite Automaton. We will need to reduce this back to a DFA later.
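Here is a sketch of such a helper; the exact naming scheme for the combined nonterminal is an assumption on my part.

```python
def merge_states(grammar, start, s1, s2):
    """Sketch: replace every occurrence of the nonterminals s1 and s2
    with a single combined nonterminal. The result is, in general, an
    NFA, since the merged state may now carry more than one transition
    on the same symbol."""
    merged = '<or(%s,%s)>' % (s1[1:-1], s2[1:-1])
    rename = lambda k: merged if k in (s1, s2) else k
    new_grammar = {}
    for k in grammar:
        nk = rename(k)
        new_grammar.setdefault(nk, [])
        for r in grammar[k]:
            nr = [r[0], rename(r[1])] if r else []
            if nr not in new_grammar[nk]:
                new_grammar[nk].append(nr)
    # keep the accepting marker [] as the last rule, per our convention
    for k in new_grammar:
        if [] in new_grammar[k]:
            new_grammar[k] = [r for r in new_grammar[k] if r != []] + [[]]
    return new_grammar, rename(start)
```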
Let us test it. We can convert an NFA to a DFA by calling canonical_regular_grammar() from the earlier posts. We need a DFA to parse the input using dfa_parse. So we do that.
When merging two states, we also need to verify that the end result is consistent with the negative examples. So, we define this function.
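A sketch of this check, built on the sketches above: we tentatively merge the two states, determinize the result, and accept the merge only if every negative sample is still rejected. I am assuming here that canonical_regular_grammar() takes a grammar and a start symbol and returns a determinized (grammar, start) pair in the same format, and I reuse the reg_parse sketch in place of dfa_parse; check the earlier posts for the actual interfaces.

```python
def can_merge(grammar, start, s1, s2, negative_samples):
    """Sketch: tentatively merge s1 and s2 and check consistency with
    the negative samples. Returns the merged, determinized grammar and
    its start symbol, or None if the merge overgeneralizes."""
    g, s = merge_states(grammar, start, s1, s2)
    # NFA -> DFA; the signature of canonical_regular_grammar is assumed.
    g, s = canonical_regular_grammar(g, s)
    if any(reg_parse(n, g, s) for n in negative_samples):
        return None  # some negative sample is now accepted: reject merge
    return g, s
```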
Let us test it.
RPNI algorithm implementation
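Putting the pieces together, the overall loop might look like the sketch below. Note that the classical presentation of RPNI considers the PTA states in lexicographic order and tries to merge each state with earlier states in that order; the simplified loop here just retries all pairs until no merge succeeds, so treat it as an illustration of the structure built on the sketches above rather than a faithful implementation.

```python
def rpni(positive_samples, negative_samples):
    """Sketch of the overall RPNI loop: build the PTA from the positive
    samples, then repeatedly merge any pair of states whose merge still
    rejects all negative samples, until no further merge is possible."""
    g, s = build_pta(positive_samples)
    merged = True
    while merged:
        merged = False
        states = list(g.keys())
        for i, s1 in enumerate(states):
            for s2 in states[i+1:]:
                res = can_merge(g, s, s1, s2, negative_samples)
                if res is not None:
                    g, s = res  # keep the merge and restart the scan
                    merged = True
                    break
            if merged:
                break
    return g, s
```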
A Simple Example
Let us use RPNI to learn a simple pattern: strings over {a, b} that end with ‘b’.
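The experiment might be set up as in the sketch below; the particular sample strings are illustrative choices, and the DFA that is inferred depends on exactly which samples are supplied.

```python
# positive samples: strings over {a, b} that end with 'b'
positives = ['b', 'ab', 'bb', 'aab', 'abb', 'bab', 'bbb']
# negative samples: strings that do not end with 'b'
negatives = ['', 'a', 'aa', 'ba', 'aba', 'bba']
g, s = rpni(positives, negatives)
for w in ['aaab', 'abab', 'aaba']:
    print(repr(w), reg_parse(w, g, s))
```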
The result illustrates a limitation of both RPNI and the DFA minimization algorithm: we cannot factor out complex loops from the DFA, because DFA minimization is based purely on state equivalence. When there are complex loops, simple pairwise state equivalence fails to identify them.
Next, let us attempt a larger example. Let us keep the negative strings, but use a fuzzer to generate positive strings.
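One way to obtain such positive strings, without depending on any particular fuzzer interface, is a small stand-in generator like the one sketched below (in the runnable version, a fuzzer such as the simplefuzzer package listed above plays this role). It reuses the negatives list and the rpni sketch from before.

```python
import random

def gen_positive(n, max_len=10):
    """Stand-in for the fuzzer: generate n distinct random strings over
    {a, b} that end with 'b', i.e. positive samples for our language."""
    samples = set()
    while len(samples) < n:
        k = random.randint(0, max_len - 1)
        samples.add(''.join(random.choice('ab') for _ in range(k)) + 'b')
    return sorted(samples)

positives = gen_positive(20)
g, s = rpni(positives, negatives)
```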
Complexity and Limitations
Time Complexity: The RPNI algorithm is \(O(p^3 n)\) where p is the size (sum of the lengths of all strings) of positive data, and n is the size of negative data.
Extensions and Improvements
Some of the important extensions to RPNI include EDSM 2 and RPNI2 3.
Artifacts
The runnable Python source for this notebook is available here.
The installable python wheel rpni is available here.
1. J. Oncina and P. García, Inferring regular languages in polynomial update time, 1992 ↩
2. Kevin J. Lang, Barak A. Pearlmutter, and Rodney A. Price, Results of the Abbadingo One DFA Learning Competition and a New Evidence-Driven State Merging Algorithm, 1998 ↩
3. Pierre Dupont, Incremental Regular Inference, 2005 ↩