Trung Ma
Honours Thesis
Language Model For Structured Input Generation
By Trung Ma
Supervised by: Rahul Gopinath
Abstract
Fuzz testing has long been a vital technique in software security, uncovering vulnerabilities by observing how a program behaves under random or semi-structured generated inputs. This thesis explores how machine learning, specifically pre-trained language models, can serve as a structured input generator for fuzz testing. It examines the capacity of language models to internalize the structure of complex inputs and to generalize to new inputs that still adhere to the underlying grammar.
This study evaluates machine learning-based fuzzing techniques on three key aspects: input validity, code coverage, and structural diversity. Through experiments on context-free grammars (CFGs) and structured data formats such as JSON, it assesses the ability of models like GPT-2 to capture syntactic rules and produce structured inputs.