# Learning Regular Languages with L* Algorithm

TL;DR: This tutorial is a complete implementation of Angluin's L* algorithm with PAC learning for inferring the input grammars of blackbox programs in Python (i.e. without using equivalence queries). Such grammars are typically useful for fuzzing those programs. The Python interpreter is embedded so that you can work through the implementation steps.

In many previous posts, I have discussed how to parse with, fuzz with, and manipulate regular and context-free grammars. However, in many cases, such grammars may be unavailable. If you are given a blackbox program, where the program indicates in some way that the input was accepted or not, what can we do to learn the actual input specification of the blackbox? In such cases, the best option is to try and learn the input specification.

This particular research field which investigates how to learn the input
specification of blackbox programs is called blackbox *grammar inference* or
*grammatical inference* (see the **Note** at the end for a discussion on other
names). In this post, I will discuss one of the classic algorithms for
learning the input specification called L*. The L* algorithm was invented by
Dana Angluin in 1987 ^{1}. While the initial algorithm used what is
called an equivalence query, which assumes that you can check the correctness
of the learned grammar separately from the yes/no oracle, Angluin, in the same
paper, also discusses how to update this algorithm to make use of the PAC
(*Probably Approximately Correct*) framework from Valiant ^{2}.
Angluin expands on this further in 1988 ^{3}.

## Contents

**Important:** Pyodide takes time to initialize.
Initialization completion is indicated by a red border around the *Run all* button.

## Available Packages

These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and which are available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

- simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
- rxfuzzer-0.0.1-py2.py3-none-any.whl from "Fuzzing With Regular Expressions".
- earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
- cfgrandomsample-0.0.1-py2.py3-none-any.whl from "Uniform Random Sampling of Strings from Context-Free Grammar".
- cfgremoveepsilon-0.0.1-py2.py3-none-any.whl from "Remove Empty (Epsilon) Rules From a Context-Free Grammar.".
- gatleastsinglefault-0.0.1-py2.py3-none-any.whl from "Specializing Context-Free Grammars for Inducing Faults".
- hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
- ddset-0.0.1-py2.py3-none-any.whl from "Simple DDSet".

#### Prerequisites

We need the fuzzer to generate inputs to parse, and also to provide some utilities such as conversion of regular expressions to grammars, random sampling from grammars, etc. Hence, we import all that.

# Grammar Inference

Let us start with the assumption that the blackbox program
accepts a regular language.
By *accept* I mean that the program does some processing with the given input
rather than erroring out. For example, if the blackbox actually contained a URL
parser, it will *accept* strings that look like URLs, and *reject* strings
that are not in the URL format.

So, given such a program, where you are not allowed to peek inside the program source code, how do you find what the program accepts? Assuming that the program accepts a regular language, we can start by constructing a DFA (a deterministic finite automaton).

Finite state machines are of course the bread and butter of
computer science. The idea is that the given program can be represented as a
set of discrete states, and transitions between them. The DFA is
initialized to the start state. A state transitions to another when it is
fed an input symbol. Some states are marked as *accepting*. If, starting
from the start state, after consuming all the symbols in the input, the
state reached is one of the accept states, then we say that the input was
*accepted* by the machine.
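The acceptance procedure just described can be sketched in a few lines of Python. The dict-of-dicts transition encoding and the state names below are illustrative assumptions, not part of the post's implementation:

```python
def dfa_accepts(transitions, start, accepting, inp):
    """Run a DFA (given as a dict of dicts) on the input string."""
    state = start
    for symbol in inp:
        if symbol not in transitions.get(state, {}):
            return False  # no transition defined: reject
        state = transitions[state][symbol]
    return state in accepting

# A toy DFA over {a, b} that accepts exactly the string 'a'.
T = {'q0': {'a': 'q1', 'b': 'qdead'},
     'q1': {'a': 'qdead', 'b': 'qdead'},
     'qdead': {'a': 'qdead', 'b': 'qdead'}}
print(dfa_accepts(T, 'q0', {'q1'}, 'a'))   # True
print(dfa_accepts(T, 'q0', {'q1'}, 'ab'))  # False
```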

Given this information, how would we go about reconstructing the machine? An intuitive approach is to recognize that a state is represented by exactly two sets of strings. The first set of strings (the prefixes) is how the state can be reached from the start state. The second set of strings are continuations of input from the current state that distinguish this state from every other state. That is, two states can be distinguished by the DFA if and only if there is at least one suffix string which, when fed into the pair of states, produces different answers: for one, the machine accepts (reaches one of the accept states), while for the other it rejects (the state reached is not an accept state).

Given this information, a data structure for keeping track of our experiments
presents itself: the *observation table*, where we keep our prefix strings
as rows, and suffix strings as columns. The cell content simply marks
whether the program accepted the prefix + suffix string or not. So, here
is our data structure.

## ObservationTable

We initialize the observation table with the alphabet. We keep the table
itself as an internal dict `_T`. We also keep the prefixes in `P` and
suffixes in `S`. We initialize the set of prefixes `P` to be \({\epsilon}\),
and the set of suffixes `S` also to be \({\epsilon}\). We also add
a few utility functions.
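Before the pre-cooked data example, here is a minimal sketch of what such a table might look like. The names `_T`, `P`, and `S` follow the post; the `state_id` helper and the dict-of-dicts layout are assumptions for illustration:

```python
class ObservationTable:
    # A sketch of the data structure; the actual post also carries
    # the membership oracle and more utility methods.
    def __init__(self, alphabet):
        self._T = {}          # table: prefix -> {suffix: 0/1}
        self.P = ['']         # prefixes, initialized to epsilon
        self.S = ['']         # suffixes, initialized to epsilon
        self.A = alphabet

    def cell(self, p, s):
        return self._T[p][s]

    def state_id(self, p):
        # A state is named by the pattern of oracle answers in its row.
        return '<%s>' % ''.join(str(self.cell(p, s)) for s in self.S)

o = ObservationTable(list('ab'))
o._T = {'': {'': 0}, 'a': {'': 1}, 'b': {'': 0}}  # pre-cooked data
print(o.state_id('a'))  # '<1>'
```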

Using the observation table with some pre-cooked data.

### Convert Table to Grammar

Given the observation table, we can recover the grammar from this table
(corresponding to the DFA). The unique cell contents of rows are states.
In many cases, multiple rows may correspond to the same state (as the cell
contents are the same). The *start state* is given by the state that
corresponds to the \(\epsilon\) row. A state is accepting if, on query of
\(\epsilon\), i.e. `''`, it returns 1.

The formal
notations are as follows. The notation \([p]\) means the state
corresponding to the prefix \(p\). The notation \([[p,s]]\) means the
result of oracle for the prefix \(p\) and the suffix \(s\).
The notation \([p](a)\) means the state obtained by feeding the input
symbol \(a\) to the state \([p]\). We take
the first prefix that resulted in a particular state as its *access prefix*,
and we denote the access prefix of a state \(s\) by
\(\lfloor{}s\rfloor\) (this is not used in this post). The following
is the DFA from our table.

- states: \(Q = \{[p] : p \in P\}\)
- start state: \(q_0 = [\epsilon]\)
- transition function: \([p](a) \rightarrow [p.a]\)
- accepting states: \(F = \{[p] : p \in P : [[p,\epsilon]] = 1\}\)

For constructing the grammar from the table, we first identify all distinguished states. Next, we identify the start state, followed by accepting states. Finally, we connect states together with transitions between them.
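These steps can be sketched roughly as follows. The representation (a dict-of-dicts table, states named by their row patterns, and grammar rules as lists) is an assumption for illustration, not the post's exact code:

```python
def table_to_grammar(T, P, S, A):
    # States are named by their row patterns; the first prefix reaching
    # a state serves as its access prefix.
    state = lambda p: '<%s>' % ''.join(str(T[p][s]) for s in S)
    access = {}
    for p in P:
        access.setdefault(state(p), p)
    grammar = {}
    for sid, p in access.items():
        rules = [[a, state(p + a)] for a in A]  # transitions
        if T[p]['']:
            rules.append([])                    # accepting: epsilon rule
        grammar[sid] = rules
    return grammar, state('')

# Table for the language {'a'} over {a, b}, single suffix epsilon.
T = {'': {'': 0}, 'a': {'': 1}, 'b': {'': 0},
     'aa': {'': 0}, 'ab': {'': 0}, 'ba': {'': 0}, 'bb': {'': 0}}
g, start = table_to_grammar(T, ['', 'a', 'b'], [''], list('ab'))
```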

Let us try the observation table to grammar conversion for an observation
table that corresponds to recognition of the string `a`. We will use the
alphabet `a`, `b`.

### Cleanup Grammar

This gets us a grammar that can accept the string `a`, but it also has a
problem. The issue is that the key `<00>` has no rule that does not include
`<00>` in its expansion. That is, `<00>` is an infinite loop that, once
entered, the machine can never exit. We need to remove such rules. We do
that using the `compute_cost()` function of LimitFuzzer. The idea is that
if all rules of a nonterminal have `inf` as the cost, then that nonterminal
produces an infinite loop, and hence both the nonterminal, as well as any
rule that references that nonterminal, have to be removed recursively.
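A simplified version of this cleanup, written without LimitFuzzer's `compute_cost()`, might look as follows; the assumption that nonterminals are bracketed like `<0>` is for illustration only:

```python
def remove_unproductive(grammar):
    # A nonterminal is productive if at least one of its rules uses only
    # terminals and already-productive nonterminals (i.e. cost != inf).
    productive = set()
    changed = True
    while changed:
        changed = False
        for key, rules in grammar.items():
            if key in productive:
                continue
            for rule in rules:
                if all(t in productive for t in rule if t.startswith('<')):
                    productive.add(key)
                    changed = True
                    break
    # Drop unproductive nonterminals and any rule referencing them.
    return {k: [r for r in rules
                if all(t in productive for t in r if t.startswith('<'))]
            for k, rules in grammar.items() if k in productive}

g = {'<0>': [['a', '<1>'], ['b', '<00>']],
     '<1>': [[]],
     '<00>': [['a', '<00>'], ['b', '<00>']]}
print(remove_unproductive(g))  # {'<0>': [['a', '<1>']], '<1>': [[]]}
```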

We can wrap up everything in one method.

Using it once again.

Now that we are convinced that we can produce a DFA or a grammar out of the table, let us proceed to examine how to produce this table.

We start with the start state in the table, because we know for sure
that it exists, and it is represented by the empty string in row and column,
which together (prefix + suffix) form the empty string `''` or \(\epsilon\).
We ask the program if it accepts the empty string, and if it accepts, we mark
the corresponding cell in the table as *accept* (or `1`).

For any given state in the DFA, we should be able to say what happens when an input symbol is fed into the machine in that state. So, we can extend the table with what happens when each input symbol is fed into the start state. This means that we extend the table with rows corresponding to each symbol in the input alphabet.

So, we can initialize the table as follows. First, we check whether the
empty string is in the language. Then, we extend the table `T` to
\((P \cup P.A).S\) using membership queries. This is given in `update_table()`.

The update table has two parts. First, it takes the current set of prefixes
(`rows`) and determines the auxiliary rows to compute based on extensions of
the current rows with the symbols in the alphabet (`auxrows`). This gives the
complete set of rows for the table. Then, for each suffix in `S`, it ensures
that the table has a cell, and that it is updated with the oracle result.
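A rough sketch of this update, under the same dict-of-dicts table assumption as before (the `is_member` oracle is passed in as a plain function here):

```python
def update_table(T, P, S, A, is_member):
    # Rows are the current prefixes plus their one-symbol extensions
    # (the auxiliary rows).
    rows = list(P)
    auxrows = [p + a for p in P for a in A]
    for p in rows + auxrows:
        T.setdefault(p, {})
        for s in S:
            if s not in T[p]:
                T[p][s] = is_member(p + s)  # membership query
    return T

# Oracle for the language {'a'}
is_member = lambda q: 1 if q == 'a' else 0
T = update_table({}, [''], [''], list('ab'), is_member)
```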

Using init_table and update_table

Since we want to know what state we reached when we
fed the input symbol to the start state, we add a set of cleverly chosen
suffixes (columns) to the table, determine the machine response to these
suffixes (by feeding the machine prefix+suffix for each combination), and
check whether any new state other than the start state was identified. A
new state reached by a prefix can be distinguished from the start state using
some suffix, if, after consuming that particular prefix, followed by the
particular suffix, the machine moved to, say, *accept*, but when the machine
at the start state was fed the same suffix, the end state was not *accept*.
(i.e. the machine accepted prefix + suffix but not suffix on its own).
Symmetrically, if the machine did not accept the string prefix + suffix
but did accept the string suffix, that also distinguishes the state from
the start state. Once we have identified a new state, we can then extend
the DFA with transitions from this new state, and check whether more
states can be identified.

While doing this, there is one requirement we need to ensure: the result
of the transition from every state, for every alphabet symbol, needs to be
defined. The property that ensures this for the observation table is called
*closedness*; equivalently, the observation table is *closed* if the
table has the following property.

### Closed

The idea is that for every prefix we have, in set \(P\), we need to find
the state that is reached for every \(a \in A\). Then, we need to make sure
that the *state* represented by that prefix exists in \(P\). (If such a
state does not exist in P, then it means that we have found a new state).

Formally: An observation table \(P \times S\) is closed if for each \(t \in P.A\), there exists a \(p \in P\) such that \([t] = [p]\).
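The closedness check might be sketched as follows, again assuming a dict-of-dicts table; on failure it returns the offending extension so that it can be promoted into \(P\):

```python
def closed(T, P, S, A):
    # For every one-symbol extension of a prefix in P, the state (row
    # pattern) reached must already be represented by some prefix in P.
    state = lambda p: tuple(T[p][s] for s in S)
    known = {state(p) for p in P}
    for p in P:
        for a in A:
            if state(p + a) not in known:
                return False, p + a  # a new prefix to promote into P
    return True, None

T = {'': {'': 0}, 'a': {'': 1}, 'b': {'': 0}}
print(closed(T, [''], [''], 'ab'))  # (False, 'a'): 'a' reaches a new state
```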

Using closed.

### Add prefix

Using add_prefix

This is essentially the intuition behind most of the grammar inference algorithms, and the cleverness lies in how the suffixes are chosen. In the case of L*, when we find that one of the transitions from the current states results in a new state, we add the alphabet symbol that caused the transition and the suffix that distinguished the new state to the suffixes (i.e., a + suffix is added to the columns).

This particular aspect is governed by the *consistency* property of the
observation table.

### Consistent

An observation table \(P \times S\) is consistent if, whenever \(p_1\)
and \(p_2\) are elements of \(P\) such that \([p_1] = [p_2]\), for each
\(a \in A\), \([p_1.a] = [p_2.a]\).
That is, if two rows in the top part of the table are identical, then the
rows for their one-symbol extensions should also be identical.
If not, we have found a counter example, so we report the alphabet symbol
and the suffix that distinguished the rows. We will then add the new
string (a + suffix) as a new suffix to the table.
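A sketch of the consistency check under the same assumptions; on failure it returns the distinguishing string (a + suffix) to be added as a new column:

```python
def consistent(T, P, S, A):
    # If two prefixes have identical rows, their one-symbol extensions
    # must also have identical rows; otherwise a + s distinguishes them.
    state = lambda p: tuple(T[p][s] for s in S)
    for p1 in P:
        for p2 in P:
            if p1 == p2 or state(p1) != state(p2):
                continue
            for a in A:
                for s in S:
                    if T[p1 + a][s] != T[p2 + a][s]:
                        return False, a + s  # new distinguishing suffix
    return True, None

# '' and 'b' look identical under suffix '', but their 'a'-extensions
# ('a' vs 'ba') behave differently, so 'a' becomes a new suffix.
T = {'': {'': 0}, 'a': {'': 1}, 'b': {'': 0}, 'ba': {'': 0}, 'bb': {'': 0}}
print(consistent(T, ['', 'b'], [''], 'ab'))  # (False, 'a')
```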

### Add suffix

Using add_suffix

(Of course, readers will quickly note that the table is not the best data structure here, and just because a suffix distinguished two particular states does not mean that it is a good idea to evaluate the same suffix on all other states. These are ideas that will be explored in later algorithms.)

Finally, L* also relies on a *Teacher* for it to suggest new suffixes that
can distinguish unrecognized states from current ones.

## Teacher

We now construct our teacher. We have two requirements for the teacher.
The first is that it should fulfil the requirements of an Oracle. That is,
it should answer `is_member()` queries. Secondly, it should also answer
`is_equivalent()` queries. First, we define the oracle interface.

As I promised, we will be using the PAC framework rather than the equivalence oracles.

We define a simple teacher based on regular expressions. That is, if you give it a regular expression, it will convert it to an acceptor based on a parser, and a generator based on a random sampler, and will then use these for verification of hypothesis grammars.

### PAC Learning

PAC learning was introduced by Valiant in 1984 ^{2} as a way to
think about inferred models in computational linguistics and machine learning.
The basic idea is that given a blackbox model, we need to be able to produce
samples that can then be tested against the model to construct an inferred
model (i.e, to train the model). For sampling during training, we have to
assume some sampling procedure, and hence a distribution for training.
Per PAC learning, we can only guarantee the performance of the
learned model when tested using samples from the same distribution. Given
that we are sampling from a distribution, there is a possibility that due
to non-determinism, the data is not as spread out as we may like, and hence
the training data is not optimal by a certain probability. This reflects on
the quality of the model learned, and is captured by the notion of
confidence, indicated by the \(\delta\) parameter. That is,
\(1 - \delta\) quantifies the confidence we have in our model. Next,
given any training data, due to the fact that the training data is finite,
our learned grammar is an approximation of the real grammar, and there will
always be an error term. This error is quantified by the \(\epsilon\)
parameter. Given the desired \(\delta\) and \(\epsilon\), Angluin provides
a formula to compute the number of calls to make to the membership oracle
at the \(i^{th}\) equivalence query.

In essence the PAC framework says that there is \(1 - \delta\) probability that the model learned will be approximately correct. That is, it will classify samples with an error rate less than \(\epsilon\).
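The number of membership queries for the \(i^{th}\) equivalence query can be computed directly from \(\delta\) and \(\epsilon\); a sketch using the form \(\lceil \frac{1}{\epsilon}(\ln\frac{1}{\delta} + i \ln 2)\rceil\) from Angluin's paper:

```python
import math

def num_calls(i, delta, epsilon):
    # ceil((1/epsilon) * (ln(1/delta) + i * ln 2)) random membership
    # queries substitute for the i-th equivalence query.
    return math.ceil((1.0 / epsilon) * (math.log(1.0 / delta) + i * math.log(2)))

print(num_calls(1, 0.1, 0.1))  # 30 samples for the first equivalence check
```

Note how the count grows with each successive equivalence query, compensating for the increased chance of having accepted a wrong hypothesis earlier.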

We input the PAC parameters: \(\delta\) for confidence and \(\epsilon\) for accuracy.

We can define the membership query `is_member()` as follows:

Given a grammar, check whether it is equivalent to the given grammar.
The PAC guarantee is that we only need `num_calls` membership queries for
the `i`th equivalence query. For the equivalence check here, we test
strings of length 1, then length 2, and so on, whose sum should be
`num_calls`. We take the easy way out here, and just use `num_calls`
as the number of calls for each string length.
We have what is called a *cooperative teacher*, that tries to respond with
the shortest possible counter example. We also take the easy way out and only
check for a maximum length of 10.
(I will revisit this if there is interest in expanding this.)

Due to the limitations of our utilities for random sampling, we need to remove epsilon tokens from places other than the start rule.

Next, we have a helper for producing the random sampler, and the parser for easy comparison.

## Check Grammar Equivalence

Checking whether two grammars are equivalent for strings up to a given length, with n samples per length.
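An exhaustive variant of this comparison (rather than the random sampling from grammars used in the post) can be sketched for small alphabets; `accepts1` and `accepts2` are assumed membership predicates:

```python
from itertools import product

def equivalent_upto(accepts1, accepts2, alphabet, max_len):
    # Exhaustively compare two acceptors on every string up to max_len.
    for l in range(max_len + 1):
        for t in product(alphabet, repeat=l):
            s = ''.join(t)
            if accepts1(s) != accepts2(s):
                return False, s  # the shortest counter example found
    return True, None

a1 = lambda s: s == 'a'
a2 = lambda s: s in ('a', 'b')
print(equivalent_upto(a1, a2, 'ab', 2))  # (False, 'b')
```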

Let us test this out.

## L* main loop

Given the observation table and the teacher, the algorithm itself is simple. The L* algorithm loops, doing the following operations in sequence. (1) keep the table closed, (2) keep the table consistent, and if it is closed and consistent (3) ask the teacher if the corresponding hypothesis grammar is correct.
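The loop can be condensed into a self-contained sketch. This is not the post's implementation: the teacher's equivalence query is stood in by an exhaustive check up to `max_len`, counter examples and their prefixes are added to \(P\), and the hypothesis is returned as a predicate:

```python
from itertools import product

def l_star(alphabet, is_member, max_len=5):
    P, S, T = [''], [''], {}
    def fill():  # membership queries for all rows and suffixes
        for q in P + [p + a for p in P for a in alphabet]:
            T.setdefault(q, {})
            for s in S:
                if s not in T[q]:
                    T[q][s] = is_member(q + s)
    def row(p):
        return tuple(T[p][s] for s in S)
    while True:
        fill()
        # (1) closedness: promote an unrepresented extension into P
        ext = [p + a for p in P for a in alphabet]
        known = {row(p) for p in P}
        nxt = next((t for t in ext if row(t) not in known), None)
        if nxt is not None:
            P.append(nxt)
            continue
        # (2) consistency: add a distinguishing suffix to S
        suffix = None
        for p1, p2 in product(P, P):
            if p1 != p2 and row(p1) == row(p2):
                for a, s in product(alphabet, S):
                    if T[p1 + a][s] != T[p2 + a][s]:
                        suffix = a + s
                        break
                if suffix: break
        if suffix is not None:
            S.append(suffix)
            continue
        # (3) hypothesis: accept iff the row reached is accepting
        def accepts(w):
            p = ''
            for c in w:
                p = next(q for q in P if row(q) == row(p + c))
            return T[p][''] == 1
        # equivalence query, here by exhaustive check up to max_len
        cex = None
        for l in range(max_len + 1):
            for t in product(alphabet, repeat=l):
                w = ''.join(t)
                if accepts(w) != bool(is_member(w)):
                    cex = w
                    break
            if cex is not None: break
        if cex is None:
            return accepts
        for i in range(len(cex) + 1):  # add cex and all its prefixes
            if cex[:i] not in P:
                P.append(cex[:i])
```

For the language consisting of the single string `a`, this sketch converges after one round of closing the table.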

Using it

We define a `match` function for converting a syntax error to a boolean.

## The F1 score

There is of course an additional question here. From the perspective of
language learning for software engineering *how we learned* is less important
than *what we learned*. That is, the precision and recall of the model that
we learned is important. I have discussed how to compute the precision and
recall, and the F1 score previously.
So, we can compute the precision and recall as follows.
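A sketch of such a computation, assuming we can sample strings from the learned grammar and from the target, and test each sample with the other side's membership predicate (the function name and signature are illustrative):

```python
def precision_recall_f1(learned, target, samples_from_learned, samples_from_target):
    # precision: how many strings produced by the learned grammar
    # does the target blackbox accept?
    precision = sum(1 for s in samples_from_learned if target(s)) / len(samples_from_learned)
    # recall: how many strings from the target language does the
    # learned grammar accept?
    recall = sum(1 for s in samples_from_target if learned(s)) / len(samples_from_target)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

learned = lambda s: s == 'a'          # learned language: {'a'}
target = lambda s: s in ('a', 'b')    # target language: {'a', 'b'}
print(precision_recall_f1(learned, target, ['a', 'a'], ['a', 'b']))
# precision 1.0, recall 0.5, F1 ~0.67
```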

# Definitions

- Input symbol: A single symbol that is consumed by the machine which can move it from one state to another. The set of such symbols is called an alphabet, and is represented by \(A\).
- Membership query: A string that is passed to the blackbox. The blackbox answers yes or no.
- Equivalence query: A grammar that is passed to the teacher as a hypothesis of what the target language is. The teacher answers yes or a counter example that behaves differently on the blackbox and the hypothesis grammar.
- Prefix closed: a set is prefix closed if all prefixes of any of its elements are also in the same set.
- Suffix closed: a set is suffix closed if all suffixes of any of its elements are also in the same set.
- Observation table: A table whose rows correspond to the *candidate states*. The rows are made up of prefix strings that can reach given states (commonly represented as \(S\), but here we will denote these by \(P\) for prefixes) and the columns are made up of suffix strings that serve to distinguish these states (commonly expressed as \(E\) for extensions, but we will use \(S\) to denote suffixes here). The table contains auxiliary rows that extend each item \(p \in P\) with each alphabet symbol \(a \in A\), as we discuss later in *closedness*. This table defines the language inferred by the algorithm. The contents of the table are the answers from the oracle on a string composed of the row and column labels, i.e. prefix + suffix. That is, \(T[s,e] = O(s.e)\). The table has two properties: *closedness* and *consistency*. If these are not met at any time, we take steps to resolve it.
- The state: A state in the DFA is represented by a prefix in the observation table, and is named by the pattern of 1s and 0s in the cell contents. We represent the state corresponding to the prefix \(p\) as \([p]\).
- Closedness of the observation table means that for each \(p \in P\) and each \(a \in A\), the state represented by the auxiliary row \([p.a]\) (i.e., its contents) exists in \(P\). That is, there is some \(p' \in P\) such that \([p.a] = [p']\). The idea is that the state corresponding to \([p]\) accepts the symbol \(a\) and transitions to the state \([p']\), and \(p'\) must be in the main set of rows \(P\).
- Consistency of the observation table means that if two prefixes represent the same state (i.e. the contents of two rows are equal), that is, \([p_1] = [p_2]\), then \([p_1.a] = [p_2.a]\) for every alphabet symbol. The idea is that if two prefixes reach the same state, then, when fed any alphabet symbol, both prefixes should transition to the same next state (represented by the pattern produced by the suffixes).
- The set of candidate state prefixes `P` is prefix closed, while the set of suffixes `S` is suffix closed.

# Notes

While there are no strict specifications as to what grammar induction,
inference, identification, and learning are, according to de la Higuera,
grammar *inference* is about learning **the** grammar (i.e. the representation)
when given information about a language, and focuses on the target, the
grammar. That is, you start with the assumption that a target grammar exists,
and try to guess that grammar based on your observations.
If, on the other hand, you do not believe that a particular target grammar
exists, but want to do your best to learn the underlying principles, or find
*a* grammar that explains the data, then it is grammar *induction*.
(This is also mentioned in the preface of the book "Grammatical Inference"
by de la Higuera.)
That is, it focuses on the best possible grammar for the
given data. Closely related fields are grammar mining, grammar recovery,
and grammar extraction, which are all whitebox approaches based on program
or related artifact analysis. Language acquisition is another related term.

## Context Free Languages

Here, we discussed how to infer a regular grammar. In many cases the blackbox may accept a language that is beyond regular; for example, it could be context-free. The nice thing about L* is that it can provide us with a close approximation of such a context-free language in terms of a regular grammar. One can then take the regular grammar thus produced, and try to identify context-free structure in that DFA based on recognition of repeating structures. It is still an open question how to recover the context-free structure from such a DFA.

## Artifacts

The runnable Python source for this notebook is available here.