The problem of inferring the input specification of a program has recently seen some interest12 from researchers. This is of huge interest in fuzzing as recovery of the input specification can improve the efficiency, effectiveness, and generality of a fuzzer by leaps and bounds3.
The best way to go about it is to look for a general solution with proofs. However, a fully general solution to the problem is impossible, and is as hard as reversing the RSA encryption4. However, this has not stopped researchers from attempting to look for heuristics which are applicable to context-free grammars that are found in real-world software. The idea being that, the theoretical limitation could be about pathological grammars. Even if one could recover the grammar of a reasonable subset of context-free grammars, it is a win5. Indeed, Clark et al.6 takes this approach. Another reasonable approach is to look for approximations. The research in this field was summarized by Higuera7.
Now, how does one verify their approach? One approach that recent research has taken is to try and recover the context free grammar from one of the blackbox programs, then use the recovered grammar to generate inputs, and check how many of these inputs are accepted by the program under grammar mining12. Now, for completeness, one needs to turn around, and use the grammar as an acceptor for the input produced by the program. The problem here is that going the other way is tricky. That is, the program under grammar mining is an acceptor. How do you turn it into a producer? The route that is taken by Bastani et al. is to write a grammar by hand. Then, use this grammar to generate inputs, which are parsed by the mined grammar 8 However, this is unsatisfying and error prone. Is there a better way?
Turns out, there is indeed a better way. Unlike in whitebox grammar recovery
for blackbox grammar inference, there is no particular reason to start with a program.
Instead, start with a context-free grammar of your choice, and use a standard
context-free parser such as GLL, GLR, CYK or Earley parser to turn it into an acceptor. Then, use
your blackbox grammar inference algorithm to recover the grammar. Now, you have the
original grammar, and the recovered grammar. You can now use a number of tools9 to
check how close they are to each other. Indeed, the most basic idea is to make
one grammar the producer, and see how many of the produced is accepted by the second.
Then, turn it around, make the second the producer, and see how many of the produced
is accepted by the first. We note that doing both is extremely important to have
confidence in the results. What if we only do one side? That is, only verify that
the inputs produced by the first are accepted by the second? In that case, nothing
prevents us from inferring an extremely permissive grammar for the second that never
rejects any input – say
/.*/. This grammar would have 100% accuracy in this testing even though
it is a very poor inferred grammar. Such problems can only be detected if we turn
around and use the inferred grammar as producer. Now, imagine that the inferred grammar
doesn’t generalize at all, and produces only a small set of inputs. In that case, again
the original grammar will accept all generated inputs, resulting in 100% accuracy even though
the inferred grammar was bad. Hence, both tests are equally important.
The TLDR; is that if you are doing blackbox grammar inference, please start with a grammar rather than a program. Use a parser to turn the grammar into an acceptor, and infer the grammar of that acceptor. Then verify both grammars against each other . Extracting the grammar from a program is not a useful proof for the effectiveness of your technique unless you are doing whitebox grammar mining.
A nice result that I should mention here is that even though comparison of context-free grammars in general is undecidable10, comparison of deterministic context-free grammars is decidable!11. Géraud Sénizerguese was awarded the Gödel Prize in 2002 for this discovery. What this means is that if the grammars are deterministic (these are the LR(k) grammars), you can even compare them directly.
The main communities working on grammar inference are
- Language and Automata Theory and Applications (LATA)
- International Conference on Grammatical Inference (ICGI)
Wu, Z., Johnson, E., Yang, W., Bastani, O., Song, D., Peng, J., & Xie, T. (2019, August). REINAM: reinforcement learning for input-grammar inference. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 488-498). ↩ ↩2
Stevenson, A., & Cordy, J. R. (2014). A survey of grammatical inference in software engineering. Science of Computer Programming, 96, 444-459. ↩
Angluin, D., & Kharitonov, M. (1995). When Won′ t Membership Queries Help?. Journal of Computer and System Sciences, 50(2), 336-355. ↩
Curley, S. S., & Harang, R. E. (2016, May). Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec. In 2016 IEEE Security and Privacy Workshops (SPW) (pp. 171-178). IEEE. ↩
Clark, A., Eyraud, R., & Habrard, A. (2008, September). A polynomial algorithm for the inference of context free languages. In International Colloquium on Grammatical Inference (pp. 29-42). Springer, Berlin, Heidelberg. ↩
C. de la Higuera,Grammatical Inference: Learning Automata andGrammars. Cambridge University Press, 2010. ↩
We note here that the grammar derived by GLADE is not in the usual format, and hence, we could not verify that their parser is correct. Unfortunately, general context-free parsers are notoriously difficult to get right as shown by the history of the Earley parser. ↩
Madhavan, R., Mayer, M., Gulwani, S., & Kuncak, V. (2015, October). Automating grammar comparison. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (pp. 183-200). ↩
S. Ginsburg,The Mathematical Theory of Context Free Languages.McGraw-Hill Book Company, 1966. ↩
Sénizergues, G. (2001). L (A)= L (B)? decidability results from complete formal systems. Theoretical Computer Science, 251(1-2), 1-166. ↩