Blackbox Inference of Context Free Grammars -- Verification

The problem of inferring the input specification of a program has recently seen some interest¹² from researchers. This is of huge interest in fuzzing as recovery of the input specification can improve the efficiency, effectiveness, and generality of a fuzzer by leaps and bounds³.

The best way to go about it is to look for a general solution with proofs. However, a fully general solution to the problem is impossible, and is as hard as reversing the RSA encryption⁴. However, this has not stopped researchers from attempting to look for heuristics which are applicable to context-free grammars that are found in real-world software. The idea being that, the theoretical limitation could be about pathological grammars. Even if one could recover the grammar of a reasonable subset of context-free grammars, it is a win⁵. Indeed, Clark et al.⁶ takes this approach. Another reasonable approach is to look for approximations. The research in this field was summarized by Higuera⁷.

We need three definitions before we proceed.

Precision (P): The percentage of inputs that were generated by the synthesized grammar that were accepted by the blackbox. That is, synthesized grammar acts as a producer.
Recall (R): The percentage of inputs that were generated by the blackbox that were accepted by the synthesized grammar. That is, synthesized grammar acts as a acceptor.
F1 score: The F1 score is the harmonic mean of precision and recall, and is given by \(2 \times \frac{P R}{P + R}\).

Now, how does one verify their approach? One approach that recent research has taken is to try and recover the context free grammar from one of the blackbox programs, then use the recovered grammar to generate inputs, and check how many of these inputs are accepted by the program under grammar mining¹² (P). Now, for completeness, one needs to turn around, and use the grammar as an acceptor for the input produced by the program (R). The problem here is that going the other way is tricky. That is, the program under grammar mining is an acceptor. How do you turn it into a producer? The route that is taken by Bastani et al. is to write a golden grammar by hand. Then, use this grammar to generate inputs, which are parsed by the mined grammar ⁸ However, this is unsatisfying and error prone. Is there a better way?

Turns out, there is indeed a better way. Unlike in whitebox grammar recovery for blackbox grammar inference, there is no particular reason to start with a program. Instead, start with a context-free grammar of your choice, and use a standard context-free parser such as GLL, GLR, CYK or Earley parser to turn it into an acceptor. Then, use your blackbox grammar inference algorithm to recover the grammar. Now, you have the original grammar, and the recovered grammar. You can now use a number of tools⁹ to check how close they are to each other. Indeed, the most basic idea is to make one grammar the producer, and see how many of the produced is accepted by the second. Then, turn it around, make the second the producer, and see how many of the produced is accepted by the first. Make sure to use random sampling for producing inputs from any grammar. We note that doing both is extremely important to have confidence in the results. What if we only do one side? That is, only verify that the inputs produced by the first are accepted by the second? In that case, nothing prevents us from inferring an extremely permissive grammar for the second that never rejects any input – say /.*/. This grammar would have 100% accuracy in this testing even though it is a very poor inferred grammar. Such problems can only be detected if we turn around and use the inferred grammar as producer. Now, imagine that the inferred grammar doesn’t generalize at all, and produces only a small set of inputs. In that case, again the original grammar will accept all generated inputs, resulting in 100% accuracy even though the inferred grammar was bad. Hence, both tests are equally important.

The TLDR; is that if you are doing blackbox grammar inference, please start with a grammar rather than a program. Use a parser to turn the grammar into an acceptor, and infer the grammar of that acceptor. Then verify both grammars against each other . Extracting the grammar from a program is not a useful proof for the effectiveness of your technique unless you are doing whitebox grammar mining.

A secondary issue is that it is not immediately clear what F1 actually means in this setting. Note that F1 score is actually used to evaluate the model accuracy in machine learning. In machine learning setting, we have some set of samples, and we have a process that classifies them as accepted or rejected. We then learn a model that tries to approximate this process. The model accuracy is then understood in terms of the F1 score.

That is, given this set of samples, and the learned model, we have:

TP: Number of samples correctly predicted as positive.
FP: Number of samples wrongly predicted as positive.
TN: Number of samples correctly predicted as negative.
FN: Number of samples wrongly predicted as negative.

Precision = \(\frac{TP}{TP + FP}\)

Recall = \(\frac{TP}{TP + FN}\)

In grammar induction, we do not actually have a set of samples to classify. In particular, the process for generation of strings from original and learned grammars may differ, and may have different density even if one randomly samples from the grammars.

The intuition here is as follows:

Say we have the original grammar G. The language represented by the grammar Gb when seen through the lens of learned grammar Gl would have the following structure (the below is a crude representation of a Venn Diagram):

\[L(Gb) = \left[ (FN (TP)) \right]\]

Where FN is the set of strings generated by Gb that is not recognized by Gl. That is, it was falsely labeled as negative.

The language represented by the grammar Gl when seen through the lens of original grammar Gb will have the following structure:

\[L(Gl) = \left[ ((TP) FP) \right]\]

Where FP is the set of strings generated by GL that is not recognized by Gb. That is, it was falsely labeled as positive.

The universe is then given by

\[\left[ \left(FN (TP) FP \right) TN \right]\]

We can now see that precision is \(\frac{|L(Gl) \cap L(Gb)|}{|L(Gl)|}\) and recall is \(\frac{|L(Gb) \cap L(Gl)|}{|L(Gb)|}\). That is, we can now sample from the particular grammars exclusively to approximate precision and recall using the second grammar as an acceptor. To accomplish \(|L(Gl) \cap L(Gb)|\) we use \(L(Gl)\) as the generator, and count all the successfully recognized strings by \(L(Gb)\). Since \(L(Gl)\) is the generator in the numerator, the denominator of \(L(Gl)\) is OK. To accomplish \(|L(Gb) \cap L(Gl)|\) we use \(L(Gb)\) as the generator, and count all the successfully recognized strings by \(L(Gl)\). Again, because \(L(Gb)\) is the generator in the numerator, we can use \(L(Gb)\) in the denominator. This allows us to ignore the fact that precision and recall are generated using two different processes.

A nice result that I should mention here is that even though comparison of context-free grammars in general is undecidable¹⁰, comparison of deterministic context-free grammars is decidable!¹¹. Géraud Sénizerguese was awarded the Gödel Prize in 2002 for this discovery. What this means is that if the grammars are deterministic (these are the LR(k) grammars), you can even compare them directly.

The main communities working on grammar inference are

Language and Automata Theory and Applications (LATA)
International Conference on Grammatical Inference (ICGI)
JMLR

Bastani, O., Sharma, R., Aiken, A., & Liang, P. (2017). Synthesizing program input grammars. ACM SIGPLAN Notices, 52(6), 95-110. ↩ ↩²
Wu, Z., Johnson, E., Yang, W., Bastani, O., Song, D., Peng, J., & Xie, T. (2019, August). REINAM: reinforcement learning for input-grammar inference. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 488-498). ↩ ↩²
Stevenson, A., & Cordy, J. R. (2014). A survey of grammatical inference in software engineering. Science of Computer Programming, 96, 444-459. ↩
Angluin, D., & Kharitonov, M. (1995). When Won′ t Membership Queries Help?. Journal of Computer and System Sciences, 50(2), 336-355. ↩
Curley, S. S., & Harang, R. E. (2016, May). Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec. In 2016 IEEE Security and Privacy Workshops (SPW) (pp. 171-178). IEEE. ↩
Clark, A., Eyraud, R., & Habrard, A. (2008, September). A polynomial algorithm for the inference of context free languages. In International Colloquium on Grammatical Inference (pp. 29-42). Springer, Berlin, Heidelberg. ↩
C. de la Higuera,Grammatical Inference: Learning Automata andGrammars. Cambridge University Press, 2010. ↩
We note here that the grammar derived by GLADE is not in the usual format, and hence, we could not verify that their parser is correct. Unfortunately, general context-free parsers are notoriously difficult to get right as shown by the history of the Earley parser. ↩
Madhavan, R., Mayer, M., Gulwani, S., & Kuncak, V. (2015, October). Automating grammar comparison. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (pp. 183-200). ↩
S. Ginsburg,The Mathematical Theory of Context Free Languages.McGraw-Hill Book Company, 1966. ↩
Sénizergues, G. (2001). L (A)= L (B)? decidability results from complete formal systems. Theoretical Computer Science, 251(1-2), 1-166. ↩