Blackbox Inference of Context Free Grammars -- Verification
The problem of inferring the input specification of a program has recently seen some interest12 from researchers. This is of huge interest in fuzzing as recovery of the input specification can improve the efficiency, effectiveness, and generality of a fuzzer by leaps and bounds3.
The best way to go about it is to look for a general solution with proofs. However, a fully general solution to the problem is at least as hard as RSA inversion and, under standard cryptographic assumptions, no efficient learning algorithm exists4. However, this has not stopped researchers from attempting to look for heuristics which are applicable to context-free grammars that are found in real-world software. The idea being that, the theoretical limitation could be about pathological grammars. Even if one could recover the grammar of a reasonable subset of context-free grammars, it is a win5. Indeed, Clark et al.6 takes this approach. Another reasonable approach is to look for approximations. The research in this field was summarized by Higuera7.
We need three definitions before we proceed.
-
Precision (P): The percentage of inputs that were generated by the synthesized grammar that were accepted by the blackbox. That is, the synthesized grammar acts as a producer.
-
Recall (R): The percentage of inputs that were generated by the blackbox that were accepted by the synthesized grammar. That is, the synthesized grammar acts as an acceptor.
-
F1 score: The F1 score is the harmonic mean of precision and recall, and is given by \(2 \times \frac{PR}{P + R}\) .
Now, how does one verify their approach? One approach that recent research has taken is to try and recover the context free grammar from one of the blackbox programs, then use the recovered grammar to generate inputs, and check how many of these inputs are accepted by the program under grammar mining12 (P). Now, for completeness, one needs to turn around, and use the grammar as an acceptor for the input produced by the program (R). The problem here is that going the other way is tricky. That is, the program under grammar mining is an acceptor. How do you turn it into a producer? The route that is taken by Bastani et al. is to write a golden grammar by hand. Then, use this grammar to generate inputs, which are parsed by the mined grammar 8 However, this is unsatisfying and error prone. Is there a better way?
Turns out, there is indeed a better way. Unlike in whitebox grammar recovery
for blackbox grammar inference, there is no particular reason to start with a program.
Instead, start with a context-free grammar of your choice, and use a standard
context-free parser such as GLL, GLR, CYK or Earley parser to turn it into an acceptor. Then, use
your blackbox grammar inference algorithm to recover the grammar. Now, you have the
original grammar, and the recovered grammar. You can now use a number of tools9 to
check how close they are to each other. Indeed, the most basic idea is to make
one grammar the producer, and see how many of the produced is accepted by the second.
Then, turn it around, make the second the producer, and see how many of the produced
is accepted by the first. Make sure to use random sampling
for producing inputs from any grammar.
We note that doing both is extremely important to have
confidence in the results. What if we only do one side? That is, only verify that
the inputs produced by the first are accepted by the second? In that case, nothing
prevents us from inferring an extremely permissive grammar for the second that never
rejects any input – say /.*/. This grammar would have 100% accuracy on this metric even though
it is a very poor inferred grammar. Such problems can only be detected if we turn
around and use the inferred grammar as producer. Now, imagine that the inferred grammar
doesn’t generalize at all, and produces only a small set of inputs. In that case, again
the original grammar will accept all generated inputs, resulting in 100% accuracy even though
the inferred grammar was bad. Hence, both tests are equally important.
The TLDR; is that if you are doing blackbox grammar inference, please start with a grammar rather than a program. Use a parser to turn the grammar into an acceptor, and infer the grammar of that acceptor. Then verify both grammars against each other . Extracting the grammar from a program is not a useful proof for the effectiveness of your technique unless you are doing whitebox grammar mining.
A secondary issue is that it is not immediately clear what F1 actually means in this setting. Note that F1 score is actually used to evaluate the model accuracy in machine learning. In machine learning setting, we have some set of samples, and we have a process that classifies them as accepted or rejected. We then learn a model that tries to approximate this process. The model accuracy is then understood in terms of the F1 score.
That is, given this set of samples, and the learned model, we have:
- TP: Number of samples correctly predicted as positive.
- FP: Number of samples wrongly predicted as positive.
- TN: Number of samples correctly predicted as negative.
- FN: Number of samples wrongly predicted as negative.
Precision = \(\frac{TP}{TP + FP}\)
Recall = \(\frac{TP}{TP + FN}\)
Where the universe is given by:
In grammar induction, we do not actually have a set of samples to classify. In particular, the process for generation of strings from original and learned grammars may differ, and may have different density even if one randomly samples from the grammars.
The intuition here is as follows:
Say we have the original grammar Gb. The language represented by Gb when seen through the lens of learned grammar Gl would have the following structure:
\[L(Gb) = \left[ (FN (TP)) \right]\]Where FN is the set of strings generated by Gb that is not recognized by Gl. That is, it was falsely labeled as negative.
The language represented by the grammar Gl when seen through the lens of original grammar Gb will have the following structure:
\[L(Gl) = \left[ ((TP) FP) \right]\]Where FP is the set of strings generated by Gl that is not recognized by Gb. That is, it was falsely labeled as positive.
The universe is given as before by:
\[\left[ \left(FN (TP) FP \right) TN \right]\]We can now see that:
Gl_samples = generate(Gl) # Generate strings from Gl
precision = sum(recognize(Gb, s) for s in Gl_samples) / len(Gl_samples)
Gb_samples = generate(Gb) # Generate strings from Gb
recall = sum(recognize(Gl, s) for s in Gb_samples) / len(Gb_samples)
That is, to estimate each measure we sample exclusively from one grammar and use the other as an acceptor.
Interpreting F1
Note: The cardinality-based expressions below should therefore be viewed as intuition. For infinite languages we replace them with probabilities induced by sampling.
Using the language-theoretic definitions above,
\[P = \frac{|L(Gl) \cap L(Gb)|}{|L(Gl)|}\]and
\[R = \frac{|L(Gb) \cap L(Gl)|}{|L(Gb)|},\]the F1 score can be rewritten directly in terms of the two languages:
\[F_1 = \frac{2PR}{P + R}\]Note: Resist the temptation to simplify this further. For example, given the commutativity of intersection, it is tempting to simplify it as:
\[F_1 = \frac{2|L(Gl) \cap L(Gb)|} {|L(Gl)| + |L(Gb)|}.\]But this simplification breaks down once we switch from counting strings to sampling strings. There is no indication of which grammar is the generator and which is the recognizer. In particular, under the probabilistic interpretation used here, the intersection is measured under two different distributions. Since these measures are not symmetric, the cardinality-based Dice formulation does not apply.
Hence, compute the F1 score from precision and recall.
That is, in the finite language case, the F1 score is simply the Dice coefficient between the language represented by the original grammar and the language represented by the learned grammar. An F1 score of 1 indicates that the two languages are identical, while an F1 score of 0 indicates that they are disjoint.
There is one caveat. Most context-free grammars represent infinite languages, which means that quantities such as \(|L(Gl)|\) and \(|L(Gb)|\) are often infinite. In practice, what we actually estimate is
\[P = \Pr_{s \sim L(Gl)}[s \in L(Gb)]\]and
\[R = \Pr_{s \sim L(Gb)}[s \in L(Gl)],\]where the probability distribution is induced by the sampling procedure used to generate strings from the grammar. Precision and recall are then approximated by Monte Carlo sampling. This is why random sampling is not merely an implementation detail; it defines the measure over the language. Different sampling procedures may induce different distributions over the same language and therefore produce different empirical estimates of precision, recall, and F1.
This framing connects naturally to the PAC (Probably Approximately Correct) framework for language learning10. In PAC learning, a hypothesis Gl is considered \(\varepsilon\)-accurate if the probability of disagreement with the target Gb is at most \(\varepsilon\) under some fixed distribution. Here, \(1 - P\) and \(1 - R\) are exactly the disagreement rates under Gl’s and Gb’s own sampling distributions respectively. Requiring high F1 therefore resembles a two-sided PAC condition: \(Gl\) must be \(\varepsilon\) -close to\(Gb\)under both distributions simultaneously, which is strictly stronger than the one-sided version.
This also explains why starting with a known grammar Gb rather than a blackbox program is not merely a convenience: a PAC guarantee requires a well-defined sampling distribution, and a blackbox program provides none.
Finally, the Monte Carlo estimation of precision and recall has a direct sample-complexity interpretation:
We treat precision and recall separately, one at a time.
Precision. Draw \(n\) strings independently from \(Gl\), calling them \(s_1, s_2, \ldots, s_n\), and run \(Gb\) on each one. Define the outcome of the \(i\)-th string as:
\[X_i = \begin{cases} 1 & \text{if } Gb \text{ accepts } s_i \\ 0 & \text{otherwise} \end{cases}\]The empirical precision is then the fraction of accepted strings, \(\hat{P} = \frac{1}{n}\sum_i X_i\), and the true precision \(P\) is the expected value \(\mathbb{E}[X_i]\) — that is, the probability that a randomly drawn string from \(Gl\) is accepted by \(Gb\). Since each \(X_i \in \{0,1\} \subseteq [0,1]\), the variables are bounded, and Hoeffding’s inequality applies. It says: for a chosen tolerance \(\varepsilon\) (how close we want \(\hat{P}\) to be to \(P\)) and confidence parameter \(\delta\) (the acceptable probability of the estimate being further than \(\varepsilon\) away from the truth):
\[\Pr\!\left[|\hat{P} - P| \geq \varepsilon\right] \leq 2\exp(-2n\varepsilon^2)\]Setting this at most \(\delta\) and solving for \(n\):
\[2\exp(-2n\varepsilon^2) \leq \delta \;\Longrightarrow\; n \geq \frac{\ln(2/\delta)}{2\varepsilon^2}\]Recall. The argument is symmetric: draw \(n\) strings independently from \(Gb\), run \(Gl\) on each one, and define \(Y_i = 1\) if \(Gl\) accepts the \(i\)-th string and \(0\) otherwise. The empirical recall is \(\hat{R} = \frac{1}{n}\sum_i Y_i\), and the same Hoeffding bound gives the same sample requirement. Applying the union bound so that both estimates hold simultaneously (each allowed failure probability \(\delta/2\)) gives:
\[n \geq \frac{\ln(4/\delta)}{2\varepsilon^2}\]samples from each grammar, for a total of \(N = O(\log(1/\delta)/\varepsilon^2)\).
A word of caution. This bound answers the question: given a fixed learned grammar \(Gl\), how many test strings do we need to measure its precision and recall accurately? It says nothing about how many oracle queries the inference algorithm itself needed to produce \(Gl\) in the first place. Those are two separate questions — this is the evaluation cost, not the learning cost. The analogy: Hoeffding tells you how many test examples you need to estimate a classifier’s accuracy; it does not tell you how many training examples were needed to learn the classifier.
PAC Learning of Regular Grammars
To make the comparison between learning cost and evaluation cost precise, it helps to look at the PAC model for regular grammars.
In this model, the learning algorithm interacts with the blackbox through two types of queries:
-
Membership query: does the blackbox accept this string \(s\)? The blackbox already acts as a membership oracle — this is exactly how we use it when estimating precision and recall.
-
Equivalence query: is my current hypothesis grammar \(Gl\) exactly equivalent to \(Gb\)? If yes, the algorithm is done. If no, the oracle returns a counterexample, a string that is in one language but not the other, which the algorithm uses to refine \(Gl\).
In the software engineering setting, equivalence queries are impossible. Instead, equivalence queries must be approximated by random sampling. We draw strings and check whether any of them exposes a disagreement between \(Gl\) and \(Gb\).
How many samples does an approximate equivalence query need? If \(Gl\) is \(\varepsilon\)-wrong — that is, it disagrees with \(Gb\) on at least a fraction \(\varepsilon\) of strings under the sampling distribution — then each drawn string has probability \(\geq \varepsilon\) of being a counterexample. The probability of missing a counterexample in \(n\) draws is:
\[(1 - \varepsilon)^n \leq e^{-\varepsilon n}\]Setting this at most \(\delta\) gives:
\[n \geq \frac{\ln(1/\delta)}{\varepsilon}\]per approximate equivalence query.
Compare this to the Hoeffding bound for evaluating precision or recall:
\[n \geq \frac{\ln(2/\delta)}{2\varepsilon^2}\]Per query, evaluation requires more samples than a single equivalence query by a factor of \(1/\varepsilon\). However, the learning algorithm may need many equivalence queries. For a DFA with \(m\) states, up to \(m\) queries are needed, giving a total equivalence query cost of \(O(m \cdot \ln(1/\delta)/\varepsilon)\).
The reason the per-query costs differ is fundamental: an equivalence query only needs a binary answer — found a counterexample or not — and a single bad string suffices. Precision and recall estimation needs a quantitative answer — what fraction of strings are accepted — which requires many more samples to pin down accurately.
A nice result that I should mention here is that even though comparison of context-free grammars in general is undecidable11, comparison of deterministic context-free grammars is decidable!12. Géraud Sénizergues was awarded the Gödel Prize in 2002 for this discovery. What this means is that if the grammars are deterministic (these are the LR(k) grammars), you can even compare them directly.
The main communities working on grammar inference are
- Language and Automata Theory and Applications (LATA)
- International Conference on Grammatical Inference (ICGI)
- JMLR
-
Bastani, O., Sharma, R., Aiken, A., & Liang, P. (2017). Synthesizing program input grammars. ACM SIGPLAN Notices, 52(6), 95-110. ↩ ↩2
-
Wu, Z., Johnson, E., Yang, W., Bastani, O., Song, D., Peng, J., & Xie, T. (2019, August). REINAM: reinforcement learning for input-grammar inference. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 488-498). ↩ ↩2
-
Stevenson, A., & Cordy, J. R. (2014). A survey of grammatical inference in software engineering. Science of Computer Programming, 96, 444-459. ↩
-
Angluin, D., & Kharitonov, M. (1995). When Won′ t Membership Queries Help?. Journal of Computer and System Sciences, 50(2), 336-355. ↩
-
Curley, S. S., & Harang, R. E. (2016, May). Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec. In 2016 IEEE Security and Privacy Workshops (SPW) (pp. 171-178). IEEE. ↩
-
Clark, A., Eyraud, R., & Habrard, A. (2008, September). A polynomial algorithm for the inference of context free languages. In International Colloquium on Grammatical Inference (pp. 29-42). Springer, Berlin, Heidelberg. ↩
-
C. de la Higuera,Grammatical Inference: Learning Automata andGrammars. Cambridge University Press, 2010. ↩
-
We note here that the grammar derived by GLADE is not in the usual format, and hence, we could not verify that their parser is correct. Unfortunately, general context-free parsers are notoriously difficult to get right as shown by the history of the Earley parser. ↩
-
Madhavan, R., Mayer, M., Gulwani, S., & Kuncak, V. (2015, October). Automating grammar comparison. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (pp. 183-200). ↩
-
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. ↩
-
S. Ginsburg,The Mathematical Theory of Context Free Languages.McGraw-Hill Book Company, 1966. ↩
-
Sénizergues, G. (2001). L (A)= L (B)? decidability results from complete formal systems. Theoretical Computer Science, 251(1-2), 1-166. ↩