Note: The algorithm discussed in this post is only useful for unambiguous context-free grammars without epsilon rules (nonterminals with empty expansions).

In the previous post I talked about how to generate input strings from any given context-free grammar. While that algorithm is quite useful for fuzzing, one of its problems is that the strings it produces are skewed toward shallow strings.

For example, consider this grammar:

Contents

  1. A Naive Implementation.
    1. Counting the number of strings
    2. Count strings generated by a nonterminal
    3. Count strings generated by a rule
    4. Generation of strings
  2. A Memoized Implementation.
    1. Counting the number of strings
    2. Populating the linked data structure.
    3. Count
      1. Key Count
      2. Rule Count
    4. Strings
      1. Key Strings
      2. Rule Strings
    5. Random Access
      1. At Keys
      2. At Rules
    6. Random Sampling
  3. References
    1. Artifacts
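The grammar cell is not reproduced in this text; below is a plausible reconstruction (the name `G` and the exact rules are my assumptions): a grammar that generates nonempty digit sequences, i.e. integers.

```python
# A plausible reconstruction of the grammar (names and rules assumed):
# it generates nonempty strings of decimal digits, i.e. random integers.
G = {
    '<start>': [['<digits>']],
    '<digits>': [['<digit>', '<digits>'],
                 ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}
```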




To generate inputs, let us load the limit fuzzer from the previous post.

Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
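A sketch of the load step, assuming the wheel above is installed and the module imports as `simplefuzzer`:

```python
# assumes the simplefuzzer wheel listed above is installed
import simplefuzzer as fuzzer
```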


The strings generated from this grammar (random integers) are as follows:
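The original output cell is not reproduced here; a sketch of the generation step, assuming the `LimitFuzzer` interface from the earlier post:

```python
# LimitFuzzer and its fuzz() method are assumed from the earlier post
gf = fuzzer.LimitFuzzer(G)
for i in range(10):
    print(gf.fuzz('<start>'))
```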



As you can see, there are more single digits in the output than longer integers. Almost half of the generated strings are single characters. If we modify our grammar as below
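The post's actual modified grammar is not shown in this text; here is a hypothetical version consistent with the discussion that follows: `<digits>` now has three expansions, one of which directly yields three digits, and the digits are binary.

```python
# A hypothetical modification (the original cell is not shown):
# <digits> now has three expansions, one of which directly yields
# three digits, and the digits are binary.
G2 = {
    '<start>': [['<digits>']],
    '<digits>': [['<digit>', '<digit>', '<digit>'],
                 ['<digit>', '<digits>'],
                 ['<digit>']],
    '<digit>': [['0'], ['1']],
}
```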



and run it again
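A sketch of rerunning the fuzzer on the modified grammar:

```python
# same generation step, with the modified grammar
gf2 = fuzzer.LimitFuzzer(G2)
for i in range(10):
    print(gf2.fuzz('<start>'))
```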



you will notice that a lot more three-digit binary numbers are produced. In fact, now more than one third of the generated strings are likely to be three digits wide. This is because of the way the grammar is written: there are three possible expansions of the <digits> nonterminal, and our fuzzer chooses each expansion with equal probability. This leads to the skew in the inputs we see above. This is interesting, but why is it a problem for practitioners? Consider these two grammars.
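The original grammar cells are not shown in this text; here is a hypothetical pair along the same lines. Both describe exactly the same language (digit sequences joined by +, -, * and /), but `E1` places * and / in the more frequently expanded nonterminal, while `E2` does the same for + and -.

```python
# Two hypothetical grammars for the same arithmetic expression language.
# In E1, every <E> expansion routes through <T>, so * and / appear more
# often under uniform rule choice; E2 swaps the operator levels.
E1 = {
    '<start>': [['<E>']],
    '<E>': [['<T>'], ['<T>', '+', '<E>'], ['<T>', '-', '<E>']],
    '<T>': [['<F>'], ['<F>', '*', '<T>'], ['<F>', '/', '<T>']],
    '<F>': [['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}

E2 = {
    '<start>': [['<E>']],
    '<E>': [['<T>'], ['<T>', '*', '<E>'], ['<T>', '/', '<E>']],
    '<T>': [['<F>'], ['<F>', '+', '<T>'], ['<F>', '-', '<T>']],
    '<F>': [['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}
```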



Now, let us look at the generated strings. The first:
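A sketch of the run for the first grammar:

```python
f1 = fuzzer.LimitFuzzer(E1)
for i in range(5):
    print(f1.fuzz('<start>'))
```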



And the second:
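And for the second:

```python
f2 = fuzzer.LimitFuzzer(E2)
for i in range(5):
    print(f2.fuzz('<start>'))
```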



Notice that both grammars describe exactly the same language, but the first one has a higher proportion of multiplications and divisions while the second one has a higher proportion of additions and subtractions. This means that the effectiveness of our fuzzers is determined to a large extent by the way the context-free grammar is written.

Another case is when one wants to compare the agreement between two grammars. I talked about this before. As you can see from the above cases, the same language can be described by different grammars, and it is undecidable in general whether two context-free grammars describe the same language 1. So, we have to resort to statistical means. To start with, we need to be able to randomly sample from the strings that can be produced from the grammar. The minimal requirement is as follows:

  • We need to be able to randomly sample a string that can be generated from the grammar. To make this happen, let us split this into two simpler requirements:
    • We can find the number of strings of a given size that can be produced from the grammar.
    • We can enumerate the strings that can be produced from the grammar, and pick a specific string given its index in the enumeration.

Once we have both of these abilities, we can combine them to provide random sampling of derived strings. So, let us see how to achieve that.

A Naive Implementation.

Counting the number of strings

Let us first implement a way to count the number of strings that can possibly be generated.

Count strings generated by a nonterminal

We first implement key_get_num_strings() to count the number of strings of a given size l_str generated by a token. To find the number, we first check if the given token is a terminal symbol. If it is, then there is only one choice: that symbol is the string, and the constraint is that its length should match l_str. If it is instead a nonterminal, the total number of strings that can be generated from it is the sum of the number of strings that can be generated from each of its expansions.
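Here is a sketch of such a function. I assume the grammar is a dict mapping each nonterminal to a list of rules, where each rule is a list of tokens, and any token that is not a key of the grammar is a terminal:

```python
def key_get_num_strings(key, grammar, l_str):
    # A terminal symbol generates exactly one string: itself. It counts
    # only if its length matches the requested size l_str.
    if key not in grammar:
        return 1 if l_str == len(key) else 0
    # A nonterminal: sum the counts over each of its expansion rules.
    # (rule_get_num_strings is defined in the next section.)
    return sum(rule_get_num_strings(rule, grammar, l_str)
               for rule in grammar[key])
```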



Let us test this out, but only with a token:
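A sketch, using the grammar `G` defined above:

```python
# '1' is a terminal: exactly one string, of length 1
print(key_get_num_strings('1', G, 1))   # 1
print(key_get_num_strings('1', G, 2))   # 0
```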



Count strings generated by a rule

Next, we implement rule_get_num_strings(), which counts the number of strings of a given size l_str that can be generated by a rule (an expansion of a nonterminal). Here, we treat each rule as a head followed by a tail, where the head is the first token in the rule. The idea is that the total number of strings that can be generated from a rule is the number of strings that can be generated from the head multiplied by the number of strings that can be generated by the tail.

We need to handle the base case when there is no tail. In that case, we return the number of strings for the head.

The complication is that we want to generate a string of a specific size. So, we split that size (l_str) between the head and the tail, and count the strings generated for each possible split.
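A sketch follows. Since the grammar has no epsilon expansions, every symbol needs at least one character, which bounds the possible splits:

```python
def rule_get_num_strings(rule, grammar, l_str):
    # Rules are nonempty since the grammar has no epsilon expansions.
    head, *tail = rule
    if not tail:
        # Base case: the whole size budget goes to the head.
        return key_get_num_strings(head, grammar, l_str)
    count = 0
    # Split l_str between the head and the tail. Every symbol must
    # consume at least one character, so the head gets between
    # 1 and l_str - 1 characters.
    for l_str_x in range(1, l_str):
        c_head = key_get_num_strings(head, grammar, l_str_x)
        if c_head == 0:
            continue
        count += c_head * rule_get_num_strings(tail, grammar, l_str - l_str_x)
    return count
```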



Using it.
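A possible invocation (a sketch; the values assume the grammar `G` above):

```python
print(rule_get_num_strings(['<digit>'], G, 1))   # 10
print(key_get_num_strings('<start>', G, 3))      # 1000 three-digit strings
```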



Generation of strings

Let us next implement a way to generate all strings of a given size. Here, in key_get_strings(), we pass in the key, the grammar, and the expected length of the string. Note the symmetry to key_get_num_strings().

For generating a string from a key, we first check if it is a terminal symbol. If it is, then there is only one choice: that symbol is the string, and the constraint is that its length should match l_str. If not, we find all the expansion rules of the corresponding definition and generate strings of the given size l_str from each expansion; the concatenation of these lists is the required result.
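A sketch, mirroring key_get_num_strings():

```python
def key_get_strings(key, grammar, l_str):
    # A terminal symbol: the only string is the symbol itself,
    # provided its length matches l_str.
    if key not in grammar:
        return [key] if l_str == len(key) else []
    # A nonterminal: concatenate the lists from each expansion rule.
    strings = []
    for rule in grammar[key]:
        strings.extend(rule_get_strings(rule, grammar, l_str))
    return strings
```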



Next, we come to the rule implementation, given by rule_get_strings(). Here, we treat each rule as a head followed by a tail, where the head is the first token in the rule. The idea is that the strings generated from this rule consist of one of the strings generated from the head followed by one of the strings generated from the rest of the rule. This also provides us with the base case: when the tail is empty, the strings of the head are the strings of the rule. If it is not the base case, then we first extract the strings from the head, then extract the strings from the tail, and concatenate them pairwise.

Note the symmetry to rule_get_num_strings().

The complication here is the number of characters expected in the string. We can divide the number of characters, l_str, between the head and the tail. That is, if the string from the head takes up x characters, then we can only have l_str - x characters in the tail. To handle this, we loop over all possible splits between head and tail. Of course, not all possible splits may be satisfiable. Whenever we detect an impossible split (that is, when s_ is empty), we skip that iteration.
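A sketch, mirroring rule_get_num_strings():

```python
def rule_get_strings(rule, grammar, l_str):
    head, *tail = rule
    if not tail:
        # Base case: the whole rule is a single token.
        return key_get_strings(head, grammar, l_str)
    strings = []
    for l_str_x in range(1, l_str):         # all possible head/tail splits
        s_ = key_get_strings(head, grammar, l_str_x)
        if not s_:
            continue                        # impossible split; skip
        for s_h in s_:
            for s_t in rule_get_strings(tail, grammar, l_str - l_str_x):
                strings.append(s_h + s_t)
    return strings
```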



Using it.
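A possible invocation, again using the grammar `G` above:

```python
strings = key_get_strings('<start>', G, 2)
print(len(strings))    # 100
print(strings[:5])
```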



The problem with these implementations is that they are horribly naive. Each call recomputes the whole set of strings or the whole count from scratch. However, many nonterminals are reused repeatedly, which means that we should be sharing the results. Let us see how we can memoize the results of these calls.

A Memoized Implementation.

Counting the number of strings

We first define a data structure to hold the key nodes. Such a node helps us to identify the rules of a given key that generate strings of size l_str.
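A sketch of such a node (the attribute names are my assumptions):

```python
class KeyNode:
    # A key node: a token, the string size l_str it is specialized to,
    # the number of strings it can generate at that size, and the
    # rule nodes of its expansions (empty for terminals).
    def __init__(self, token, l_str, count, rules):
        self.token = token
        self.l_str = l_str
        self.count = count
        self.rules = rules

    def __repr__(self):
        return 'key: %s <%d> count: %d' % (repr(self.token), self.l_str, self.count)
```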



We also define a data structure to hold the rule nodes. Such a node contains the head key node and the tail, as well as the l_str and the count.
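A sketch, with the same caveat about names:

```python
class RuleNode:
    # A rule node: the key node of the head (at a specific size), the
    # rule nodes of the tail (at the remaining size), the size l_str
    # allotted to the head, and the total count for this partition.
    def __init__(self, key, tail, l_str, count):
        self.key = key
        self.tail = tail
        self.l_str = l_str
        self.count = count

    def __repr__(self):
        return 'rule: %s <%d> count: %d' % (repr(self.key.token), self.l_str, self.count)
```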



Globals
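A sketch of the memo tables (the names key_strs and rule_strs are my assumptions):

```python
# memo tables: (token, l_str) -> KeyNode and (rule, l_str) -> [RuleNode]
key_strs = {}
rule_strs = {}
```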



Populating the linked data structure.

This follows the same skeleton as our previous functions. First, the keys:
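A sketch of the population function for keys (it relies on rule_get_def, defined next; the function names are my assumptions):

```python
def key_get_def(key, grammar, l_str):
    if (key, l_str) in key_strs:
        return key_strs[(key, l_str)]
    if key not in grammar:                  # terminal symbol
        count = 1 if l_str == len(key) else 0
        key_strs[(key, l_str)] = KeyNode(key, l_str, count, [])
        return key_strs[(key, l_str)]
    # a nonterminal: collect the rule nodes of every viable expansion
    rules, count = [], 0
    for rule in grammar[key]:
        for rn in rule_get_def(tuple(rule), grammar, l_str):
            count += rn.count
            rules.append(rn)
    key_strs[(key, l_str)] = KeyNode(key, l_str, count, rules)
    return key_strs[(key, l_str)]
```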



Now, the rules. The complication from before is that, if the count is zero, we do not return an array containing a zero-count rule node; instead, we return an empty array.
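A sketch of the population function for rules:

```python
def rule_get_def(rule, grammar, l_str):
    if (rule, l_str) in rule_strs:
        return rule_strs[(rule, l_str)]
    head, *tail = rule
    if not tail:
        kn = key_get_def(head, grammar, l_str)
        # if the count is zero, return an empty array rather than an
        # array containing a zero-count rule node.
        res = [RuleNode(kn, [], l_str, kn.count)] if kn.count else []
        rule_strs[(rule, l_str)] = res
        return res
    res = []
    for l_str_x in range(1, l_str):         # one node per head/tail partition
        kn = key_get_def(head, grammar, l_str_x)
        if not kn.count:
            continue
        rem = rule_get_def(tuple(tail), grammar, l_str - l_str_x)
        count = kn.count * sum(r.count for r in rem)
        if count:
            res.append(RuleNode(kn, rem, l_str_x, count))
    rule_strs[(rule, l_str)] = res
    return res
```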



Using it.
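A possible invocation:

```python
root = key_get_def('<start>', G, 3)
print(root.count)   # 1000, matching the naive count
```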



We can of course extract the same things from this data structure.

Count

For example, if we wanted to recompute the counts without using the count attribute:

Key Count

For the keys
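A sketch:

```python
def key_get_count(key_node):
    # Assumes the node was built with a nonzero count; zero-count
    # nodes are never linked into the structure.
    if not key_node.rules:      # terminal: exactly one string
        return 1
    return sum(rule_get_count(rn) for rn in key_node.rules)
```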



Rule Count

For the rules
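A sketch:

```python
def rule_get_count(rule_node):
    head_count = key_get_count(rule_node.key)
    if not rule_node.tail:
        return head_count
    # each head string combines with each tail string
    return head_count * sum(rule_get_count(rn) for rn in rule_node.tail)
```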



Using it.
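A possible invocation:

```python
print(key_get_count(root))   # matches root.count
```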



Strings

For example, if we wanted to extract the strings from the data structure:

Key Strings
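A sketch; I name these key_extract_strings() and rule_extract_strings() to avoid clashing with the naive versions (the original may use different names):

```python
def key_extract_strings(key_node):
    if not key_node.rules:              # terminal: the token itself
        return [key_node.token]
    strings = []
    for rn in key_node.rules:
        strings.extend(rule_extract_strings(rn))
    return strings
```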



Rule Strings
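And for the rules:

```python
def rule_extract_strings(rule_node):
    s_h = key_extract_strings(rule_node.key)
    if not rule_node.tail:
        return s_h
    # pairwise concatenation of head strings and tail strings
    strings = []
    for h in s_h:
        for rn in rule_node.tail:
            for t in rule_extract_strings(rn):
                strings.append(h + t)
    return strings
```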



Using it.
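A possible invocation:

```python
strings = key_extract_strings(root)
print(len(strings))     # 1000
print(strings[:5])
```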



Random Access

But more usefully, we can now use it to randomly access any particular string. The idea is the same as before: we step through the expansions of a node until we find the one whose range of indexes contains the requested index. Any given key node either corresponds to a terminal symbol or is expanded by one or more rules.

In the case of a terminal symbol (no rules), we have no choice but to return the token (we should probably assert that at == 0). In the case of a nonterminal symbol, we pass the request on to the specific rule that contains the requested index.

At Keys
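A sketch:

```python
def key_get_string_at(key_node, at):
    assert at < key_node.count
    if not key_node.rules:          # terminal symbol
        assert at == 0
        return key_node.token
    at_ = 0
    for rn in key_node.rules:       # find the rule containing index `at`
        if at < at_ + rn.count:
            return rule_get_string_at(rn, at - at_)
        at_ += rn.count
    assert False
```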



At Rules

In the case of rules, the idea is mostly the same as before. If there is no tail, we have the base case.

In case there is a tail, we split the rule into a head and a tail. Note that a single rule node corresponds to a specific partition between the head and the tail; that is, the head and the tail in a rule node are already compatible with each other in terms of length, so we do not have to worry about partitions here.

The total number of strings is num(strings in head) x num(strings in tail). That is, for each string that corresponds to the head, there is a set of tail strings. So, to find the string at a particular index, we step through the head strings (there are len_s_h of them, each indexed by head_idx), adding the number of strings in the rule tail at each step. Once the running count reaches the requested index, we have identified the corresponding head via head_idx, and we extract the corresponding string from the tail.
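A sketch:

```python
def rule_get_string_at(rule_node, at):
    assert at < rule_node.count
    if not rule_node.tail:          # base case: single-token rule
        return key_get_string_at(rule_node.key, at)
    at_ = 0
    len_s_h = rule_node.key.count
    for head_idx in range(len_s_h):
        for rn in rule_node.tail:
            if at < at_ + rn.count:
                s_h = key_get_string_at(rule_node.key, head_idx)
                return s_h + rule_get_string_at(rn, at - at_)
            at_ += rn.count
    assert False
```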



Using it.
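A possible invocation:

```python
for i in [0, 1, 999]:
    print(i, key_get_string_at(root, i))
```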



Random Sampling

Once we have random access to any given string, we can turn it into random sampling.
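A sketch: pick a uniformly random index among the count of derivable strings, and extract the string at that index.

```python
import random

def key_get_random(key_node):
    # a uniformly random index into the strings of this node
    at = random.randrange(key_node.count)
    return key_get_string_at(key_node, at)

for _ in range(5):
    print(key_get_random(root))
```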





This is random sampling from a restricted set: the set of derivation strings of a given length. How do we extend this to lengths up to a given length n? The idea is that, for generating strings of length up to n, we produce and use nonterminals that generate strings of length up to n-x, where x is the length of the first terminals in the expansions. This means that we can build the key_node data structures recursively from 1 to n, and most of the parts will be shared between the key_node data structures of different lengths.

Another issue this algorithm has is that it fails when there is left recursion. However, this is fairly easy to solve, as I showed in a previous post. The idea is that there is a maximum limit to the number of useful recursions. Frost et al.2 suggest a limit of m * (1 + |s|), where m is the number of nonterminals in the grammar and |s| is the length of the input. So, we use that here for limiting the recursion.
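A sketch of the bound (the helper name is my assumption):

```python
# the recursion bound suggested by Frost et al.: m * (1 + |s|)
def recursion_limit(grammar, l_str):
    return len(grammar) * (1 + l_str)
```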



Populating the linked data structure.
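A sketch of a depth-limited population routine. Unlike the earlier version, this sketch omits memoization, since a cache keyed only on (token, l_str) is no longer sound once results depend on the remaining depth:

```python
def key_get_def_lr(key, grammar, l_str, depth, limit):
    # cut off unproductive (left-)recursion at the Frost et al. bound
    if depth > limit:
        return KeyNode(key, l_str, 0, [])
    if key not in grammar:
        count = 1 if l_str == len(key) else 0
        return KeyNode(key, l_str, count, [])
    rules, count = [], 0
    for rule in grammar[key]:
        for rn in rule_get_def_lr(tuple(rule), grammar, l_str, depth + 1, limit):
            count += rn.count
            rules.append(rn)
    return KeyNode(key, l_str, count, rules)

def rule_get_def_lr(rule, grammar, l_str, depth, limit):
    head, *tail = rule
    if not tail:
        kn = key_get_def_lr(head, grammar, l_str, depth, limit)
        return [RuleNode(kn, [], l_str, kn.count)] if kn.count else []
    res = []
    for l_str_x in range(1, l_str):
        kn = key_get_def_lr(head, grammar, l_str_x, depth, limit)
        if not kn.count:
            continue
        rem = rule_get_def_lr(tuple(tail), grammar, l_str - l_str_x, depth, limit)
        count = kn.count * sum(r.count for r in rem)
        if count:
            res.append(RuleNode(kn, rem, l_str_x, count))
    return res
```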



Using it.
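A possible invocation, first on the non-left-recursive grammar `G`:

```python
limit = recursion_limit(G, 3)
root = key_get_def_lr('<start>', G, 3, 0, limit)
print(root.count)   # 1000 again
```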



What about a left-recursive grammar? Here is an example:
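The original grammar cell is not shown; here is a hypothetical left-recursive grammar along the same lines:

```python
# a hypothetical left-recursive grammar: expressions such as 1+2+3
LR_G = {
    '<start>': [['<E>']],
    '<E>': [['<E>', '+', '<digit>'], ['<digit>']],
    '<digit>': [[str(i) for i in range(10)][i:i+1] for i in range(10)],
}
```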



Using it.
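A possible invocation:

```python
limit = recursion_limit(LR_G, 5)
lr_root = key_get_def_lr('<start>', LR_G, 5, 0, limit)
print(lr_root.count)          # 1000 strings of the form d+d+d
for _ in range(3):
    print(key_get_random(lr_root))
```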



There are a few limitations to this algorithm. The first is that it does not take into account epsilons, that is, empty derivations. It can be argued that this is not a big concern, since any context-free grammar can be made epsilon free. However, if you are not too worried about exactness and only want an approximate random sample, I recommend that you replace empty rules with a rule containing a special symbol. Then, produce your random sample, and remove the special symbol from the generated string. This way, you can keep the structure of the original grammar.

The next limitation is bigger: this implementation does not take into account ambiguity in the grammar, where multiple derivation trees can result in the same string. This means that such strings will be more likely to appear than other strings. While there are a number of papers 3 4 5 6 7 that tackle the issue of statistical sampling with better runtime and space characteristics, we are not aware of any that fixes both issues (Gore et al.8 is notable for an almost uniform random sampling result). Bertoni et al.9 show that for some inherently ambiguous languages, the problem becomes intractable.

The code for this post is available here.

References

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel cfgrandomsample is available here.

  1. Bar-Hillel, Yehoshua, Micha Perles, and Eli Shamir. “On formal properties of simple phrase structure grammars.” STUF-Language Typology and Universals 14.1-4 (1961): 143-172. 

  2. Frost, Richard A., Rahmatullah Hafiz, and Paul C. Callaghan. “Modular and efficient top-down parsing for ambiguous left recursive grammars.” IWPT 2007. 

  3. Madhavan, Ravichandhran, et al. “Automating grammar comparison.” Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications. 2015. 

  4. McKenzie, Bruce. “The Generation of Strings from a CFG using a Functional Language.” (1997). 

  5. McKenzie, Bruce. “Generating strings at random from a context free grammar.” (1997). 

  6. Mairson, Harry G. “Generating words in a context-free language uniformly at random.” Information Processing Letters 49.2 (1994): 95-99. 

  7. Hickey, Timothy, and Jacques Cohen. “Uniform random generation of strings in a context-free language.” SIAM Journal on Computing 12.4 (1983): 645-655. 

  8. Gore, Vivek, et al. “A quasi-polynomial-time algorithm for sampling words from a context-free language.” Information and Computation 134.1 (1997): 59-74. 

  9. Bertoni, Alberto, Massimiliano Goldwurm, and Nicoletta Sabadini. “The complexity of computing the number of strings of given length in context-free languages.” Theoretical Computer Science 86.2 (1991): 325-342.