TLDR; This tutorial is a complete implementation of Angluin’s L-star algorithm with PAC learning for inferring input grammars of blackbox programs in Python (i.e. without using equivalence queries). Such grammars are typically useful for fuzzing such programs. The Python interpreter is embedded so that you can work through the implementation steps.

In many previous posts, I have discussed how to parse with, fuzz with, and manipulate regular and context-free grammars. However, in many cases, such grammars may be unavailable. If you are given a blackbox program, where the program indicates in some way that the input was accepted or not, what can we do to learn the actual input specification of the blackbox? In such cases, the best option is to try and learn the input specification.

This particular research field which investigates how to learn the input specification of blackbox programs is called blackbox grammar inference or grammatical inference (see the Note at the end for a discussion on other names). In this post, I will discuss one of the classic algorithms for learning the input specification called L*. The L* algorithm was invented by Dana Angluin in 1987 ¹. While the initial algorithm used what is called an equivalence query, which assumes that you can check the correctness of the learned grammar separate from yes/no oracle, Angluin in the same paper also talks about how to update this algorithm to make use of the PAC (Probably Approximately Correct) framework from Valiant ². Angluin expands on this further in 1988 ³.

Definitions

Input symbol: A single symbol that is consumed by the machine which can move it from one state to another. The set of such symbols is called an alphabet, and is represented by \(A\).
Membership query: A string that is passed to the blackbox. The blackbox answers yes or no.
Equivalence query: A grammar that is passed to the teacher as a hypothesis of what the target language is. The teacher answers yes or a counter example that behaves differently on the blackbox and the hypothesis grammar.
Prefix closed: a set is prefix closed if all prefixes of any of its elements are also in the same set.
Suffix closed: a set is suffix closed if all suffixes of any of its elements are also in the same set.
Observation table: A table whose rows correspond to the candidate states. The rows are made up of prefix strings that can reach given states — commonly represented as \(S\), but here we will denote these by \(P\) for prefixes — and the columns are made up of suffix strings that serves to distinguish these states — commonly expressed as \(E\) for extensions, but we will use \(S\) to denote suffixes here. The table contains auxiliary rows that extends each item \(p \in P\) with each alphabet \(a \in A\) as we discuss later in closedness. This table defines the language inferred by the algorithm. The contents of the table are the answers from the oracle on a string composed of the row and column labels — prefix + suffix. That is \(T[s,e] = O(s.e)\). The table has two properties: closedness and consistency. If these are not met at any time, we take to resolve it.
The state: A state in the DFA is represented by a prefix in the observation table, and is named by the pattern of 1s and 0s in the cell contents. We represent a state corresponding the prefix \(p\) as \([p]\).
Closedness of the observation table means that for each \(p \in P\) and each \(a \in A\), the state represented by the auxiliary row \([p.a]\) (i.e., its contents) exists in \(P\). That is, there is some \(p' \in P\) such that \([p.a] == [p']\). The idea is that, the state corresponding to \([p]\) accepts alphabet \(a\) and transitions to the state \([p']\), and \(p'\) must be in the main set of rows \(P\).
Consistency of the observation table means that if two prefixes represents the same state (i.e. the contents of two rows are equal), that is \([p1] = [p2]\) then \([p1 . a] = [p2 . a]\) for all alphabets. The idea is that if two prefixes reach the state, then when fed any alphabet, both prefixes should transition to the same next state (represented by the pattern produced by the suffixes).
The candidate states P is prefix closed, while the set of suffixes S is suffix closed.

Definitions
1. Prerequisites
Grammar Inference
Notes
1. Context Free Languages
2. Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

Prerequisites

We need the fuzzer to generate inputs to parse and also to provide some utilities such as conversion of regular expression to grammars, random sampling from grammars etc. Hence, we import all that.

Grammar Inference

Let us start with the assumption that the blackbox program accepts a regular language. By accept I mean that the program does some processing with input given rather than error out. For example, if the blackbox actually contained a URL parser, it will accept a string that look like a URL, and reject strings that are not in the URL format.

So, given such a program, and you are not allowed to peek inside the program source code, how do you find what the program accepts? assuming that the program accepts a regular language, we can start by constructing a DFA (A deterministic finite state machine).

Finite state machines are of course the bread and butter of computer science. The idea is that the given program can be represented as a set of discrete states, and transitions between them. The DFA is initialized to the start state. A state transitions to another when it is fed an input symbol. Some states are marked as accepting. That is, starting from the start state, and after consuming all the symbols in the input, the state reached is one of the accept states, then we say that the input was accepted by the machine.

Given this information, how would we go about reconstructing the machine? An intuitive approach is to recognize that a state is represented by exactly two sets of strings. The first set of strings (the prefixes) is how the state can be reached from the start state. The second set of strings are continuations of input from the current state that distinguishes this state from every other state. That is, two states can be distinguished by the DFA if and only if there is at least one suffix string, which when fed into the pair of states, produces different answers – i.e. for one, the machine accepts (or reaches one of the accept states), while for the other is rejected (or the end state is not an accept).

Given this information, a data structure for keeping track of our experiments presents itself – the observation table where we keep our prefix strings as rows, and suffix strings as columns. The cell content simply marks whether program accepted the prefix + suffix string or not. So, here is our data structure.

ObservationTable

We initialize the observation table with the alphabet. We keep the table itself as an internal dict _T. We also keep the prefixes in P and suffixes in S. We initialize the set of prefixes P to be \({\epsilon}\) and the set of suffixes S also to be \({\epsilon}\). We also add a few utility functions.

Using the observation table with some pre-cooked data.

Convert Table to Grammar

Given the observation table, we can recover the grammar from this table (corresponding to the DFA). The unique cell contents of rows are states. In many cases, multiple rows may correspond to the same state (as the cell contents are the same). The start state is given by the state that correspond to the \(\epsilon\) row. A state is accepting if it on query of \(\epsilon\) i.e. '', it returns 1.

The formal notations are as follows. The notation \([p]\) means the state corresponding to the prefix \(p\). The notation \([[p,s]]\) means the result of oracle for the prefix \(p\) and the suffix \(s\). The notation \([p](a)\) means the state obtained by feeding the input symbol \(a\) to the state \([p]\). We take the first prefix that resulted in a particular state as its access prefix, and we denote the access prefix of a state \(s\) by \(\lfloor{}s\rfloor\) (this is not used in this post). The following is the DFA from our table.

states: \(Q = {[p] : p \in P}\)
start state: \(q0 = [\epsilon]\)
transition function: \([p](a) \rightarrow [p.a]\)
accepting state: \(F = {[p] : p \in P : [[p,\epsilon]] = 1}\)

For constructing the grammar from the table, we first identify all distinguished states. Next, we identify the start state, followed by accepting states. Finally, we connect states together with transitions between them.

class ObservationTable(ObservationTable):
    def table_to_grammar(self):
        # Step 1: identify all distinguished states.
        prefix_to_state = {}  # Mapping from row string to state ID
        states = {}
        grammar = {}
        for p in self.P:
            stateid = self.state(p)
            if stateid not in states: states[stateid] = []
            states[stateid].append(p)
            prefix_to_state[p] = stateid

for stateid in states: grammar[stateid] = []

# Step 2: Identify the start state, which corresponds to epsilon row
        start_nt = prefix_to_state['']

# Step 3: Identify the accepting states
        accepting = [prefix_to_state[p] for p in self.P if self.cell(p,'') == 1]
        if not accepting: return {'<start>': []}, '<start>'
        for s in accepting: grammar[s] = [['<_>']]
        grammar['<_>'] = [[]]

# Step 4: Create the transition function
        for sid1 in states:
            first_such_row = states[sid1][0]
            for a in self.A:
                sid2 = self.state(first_such_row + a)
                grammar[sid1].append([a, sid2])

return grammar, start_nt

Let us try the observation to grammar conversion for an observation table that corresponds to recognition of the string a. We will use the alphabet a, b.

Cleanup Grammar

This gets us a grammar that can accept the string a, but it also has a problem. The issue is that the key <00> has no rule that does not include <00> in its expansion. That is, <00> is an infinite loop that once the machine goes in, is impossible to exit. We need to remove such rules. We do that using the compute_cost() function of LimitFuzzer. The idea is that if all rules of a nonterminal have inf as the cost, then that nonterminal produces an infinite loop and hence both the nonterminal, as well as any rule that references that nonterminal have to be removed recursively.

We can wrap up everything in one method.

once again

Now that we are convinced that we can produce a DFA or a grammar out of the table let us proceed to examining how to produce this table.

We start with the start state in the table, because we know for sure that it exists, and is represented by the empty string in row and column, which together (prefix + suffix) is the empty string '' or \(\epsilon\). We ask the program if it accepts the empty string, and if it accepts, we mark the corresponding cell in the table as accept (or 1).

For any given state in the DFA, we should be able to say what happens when an input symbol is fed into the machine in that state. So, we can extend the table with what happens when each input symbol is fed into the start state. This means that we extend the table with rows corresponding to each symbol in the input alphabet.

So, we can initialize the table as follows. First, we check whether the empty string is in the language. Then, we extend the table T to (P u P.A).S using membership queries. This is given in update_table()

The update table has two parts. First, it takes the current set of prefixes (rows) and determines the auxiliary rows to compute based on extensions of the current rows with the symbols in the alphabet (auxrows). This gives the complete set of rows for the table. Then, for each suffix in S, ensure that the table has a cell, and it is updated with the oracle result.

Using init_table and update_table

Since we want to know what state we reached when we fed the input symbol to the start state, we add a set of cleverly chosen suffixes (columns) to the table, determine the machine response to these suffixes (by feeding the machine prefix+suffix for each combination), and check whether any new state other than the start state was identified. A new state reached by a prefix can be distinguished from the start state using some suffix, if, after consuming that particular prefix, followed by the particular suffix, the machine moved to say accept, but when the machine at the start state was fed the same suffix, the end state was not accept. (i.e. the machine accepted prefix + suffix but not suffix on its own). Symmetrically, if the machine did not accept the string prefix + suffix but did accept the string suffix, that also distinguishes the state from the start state. Once we have identified a new state, we can then extend the DFA with transitions from this new state, and check whether more states can be identified.

While doing this, there is one requirement we need to ensure. The result of transition from every state for every alphabet needs to be defined. The property that ensures this for the observation table is called closedness or equivalently, the observation table is closed if the table has the following property.

Closed

The idea is that for every prefix we have, in set \(P\), we need to find the state that is reached for every \(a \in A\). Then, we need to make sure that the state represented by that prefix exists in \(P\). (If such a state does not exist in P, then it means that we have found a new state).

Formally: An observation table \(P \times S\) is closed if for each \(t \in P·A\) there exists a \(p \in P\) such that \([t] = [p]\)

Using closed.

Add prefix

Using add_prefix

This is essentially the intuition behind most of the grammar inference algorithms, and the cleverness lies in how the suffixes are chosen. In the case of L*, when we find that one of the transitions from the current states result in a new state, we add the alphabet that caused the transition from the current state and the suffix that distinguished the new state to the suffixes (i.e, a + suffix is added to the columns).

This particular aspect is governed by the consistence property of the observation table.

Consistent

An observation table \(P \times S\) is consistent if, whenever \(p1\) and \(p2\) are elements of P such that \([p1] = [p2]\), for each \(a \in A\), \([p1.a] = [p2.a]\). If there are two rows in the top part of the table repeated, then the corresponding suffix results should be the same. If not, we have found a counter example. So we report the alphabet and the suffix that distinguished the rows. We will then add the new string (a + suffix) as a new suffix to the table.

Add suffix

Using add_suffix

(Of course readers will quickly note that the table is not the best data structure here, and just because a suffix distinguished two particular states does not mean that it is a good idea to evaluate the same suffix on all other states. These are ideas that will be explored in later algorithms).

Finally, L* also relies on a Teacher for it to suggest new suffixes that can distinguish unrecognized states from current ones.

Teacher

We now construct our teacher. We have two requirements for the teacher. The first is that it should fulfil the requirement for Oracle. That is, it should answer is_member() queries. Secondly, it should also answer is_equivalent() queries. First, we define the oracle interface.

As I promised, we will be using the PAC framework rather than the equivalence oracles.

We define a simple teacher based on regular expressions. That is, if you give it a regular expression, will convert it to an acceptor based on a parser and a generator based on a random sampler, and will then use it for verification of hypothesis grammars.

PAC Learning

PAC learning was introduced by Valiant in 1984 ² as a way to think about inferred models in computational linguistics and machine learning. The basic idea is that given a blackbox model, we need to be able to produce samples that can then be tested against the model to construct an inferred model (i.e, to train the model). For sampling during training, we have to assume some sampling procedure, and hence a distribution for training. Per PAC learning, we can only guarantee the performance of the learned model when tested using samples from the same distribution. Given that we are sampling from a distribution, there is a possibility that due to non-determinism, the data is not as spread out as we may like, and hence the training data is not optimal by a certain probability. This reflects on the quality of the model learned. This is indicated by the concept of confidence intervals, and indicated by the \(\delta\) parameter. That is, \(1 - \delta\) quantifies the confidence we have in our model. Next, given any training data, due to the fact that the training data is finite, our grammar learned is an approximation of the real grammar, and there will always be an error term. This error is quantified by the \(\epsilon\) parameter. Given the desired \(\delta\) and \(\epsilon\) Angluin provides a formula to compute the number of calls to make to the membership oracle at the \(i^{th}\) equivalence query.

\[n=\lceil\frac{1}{\epsilon}\times log(\frac{1}{\delta}+i\times log(2))\rceil\]

In essence the PAC framework says that there is \(1 - \delta\) probability that the model learned will be approximately correct. That is, it will classify samples with an error rate less than \(\epsilon\).

We input the PAC parameters delta for confidence and epsilon for accuracy

We can define the membership query is_member() as follows:

Given a grammar, check whether it is equivalent to the given grammar. The PAC guarantee is that we only need num_calls for the ith equivalence query. For equivalence check here, we check for strings of length 1, then length 2 etc, whose sum should be num_calls. We take the easy way out here, and just use num_calls as the number of calls for each string length. We have what is called a cooperative teacher, that tries to respond with a shortest possible counter example. We # also take the easy way out and only check for a maximum length of 10. (I will revisit this if there is interest on expanding this).

Due to the limitations of our utilities for random sampling, we need to remove epsilon tokens from places other than the start rule.

Next, we have a helper for producing the random sampler, and the parser for easy comparison.

Check Grammar Equivalence

Checking if two grammars are equivalent to a length of string for n count.

Let us test this out.

L star main loop

Given the observation table and the teacher, the algorithm itself is simple. The L* algorithm loops, doing the following operations in sequence. (1) keep the table closed, (2) keep the table consistent, and if it is closed and consistent (3) ask the teacher if the corresponding hypothesis grammar is correct.

Using it

we define a match function for converting syntax error to boolean

Evaluating the model accuracy.

In the PAC model, we infer a grammar with parameters \(\epsilon\) and \(\delta\). Here, from a software engineering perspective, the interpretation is that if we use the same input distribution for inference as the sampling distribution, then we can expect an accuracy of \(1 - \epsilon\) in classification. That is, we can expect \(n \times \epsilon\) misclassified strings out of \(n\) strings, and we have a confidence level of \(1 - \delta\) in this statement.

The F1 score.

There is of course an additional question here. From the perspective of language learning for software engineering, the error rate is too coarse. We need more detail as to the rate of false positives and false negatives. This is because most of our languages are very sparse compared to the universe of all strings, and it is easy to get a very small error rate when almost any input generated randomly is unlikely to be in the language. That is, the precision and recall of the model that we learned is important. I have discussed how to compute the precision and recall, and the F1 score previously. So, we can compute the precision and recall as follows.

Notes

While there is no strict specifications as to what grammar induction, inference, identification, and learning is, according to Higuera, Grammar inference is about learning the grammar (i.e. the representation) when given information about a language, and focuses on the target, the grammar. That is, you start with the assumption that a target grammar exists. Then, try to guess that grammar based on your observations. If on the other hand, you do not believe that a particular target grammar exists, but want to do the best to learn the underlying principles, or find a grammar that explain the data, then it is grammar induction. (This is also mentioned in the preface of the book “Grammatical Inference” by Higuera) That is, it focuses on the best possible grammar for the given data. Closely related fields are grammar mining, grammar recovery, and grammar extraction which are all whitebox approaches based on program or related artifact analysis. Language acquisition is another related term. Another is Model Learning ⁴.

Context Free Languages

Here, we discussed how to infer a regular grammar. In many cases the blackbox may accept a language that is beyond regular, for example, it could be context-free. The nice thing about L* is that it can provide us a close approximation of such context-free language in terms of a regular grammar. One can then take the regular grammar thus produced, and try and identify context-free structure in that DFA based on recognition of repeating structures. It is still an open question on how to recover the context-free structure from such a DFA.

Artifacts

The runnable Python source for this notebook is available here.

Learning Regular Sets from Queries and Counterexamples, 1987 ↩
A theory of the learnable, 1984 ↩ ↩²
Queries and Concept Learning, 1988 ↩
Model learning, Frits Vaandrager, 2017 ↩

Learning Regular Languages with L* Algorithm