Honours Thesis

Language Model For Structured Input Generation

By Trung Ma

Supervised by: Rahul Gopinath

Abstract

Fuzz testing has long been a vital technique in software security, typically used to uncover vulnerabilities by observing how a program behaves on randomly generated or semi-structured inputs. This thesis explores how machine learning, specifically pre-trained language models, can be used to generate structured inputs for fuzz testing. It examines the capacity of these models to internalize the structure of complex inputs and to generalize to new inputs that still adhere to the established grammar.

This study evaluates machine learning-based fuzzing techniques on three key aspects: input validity, code coverage, and structural diversity. Through experiments on context-free grammars (CFGs) and structured data formats such as JSON, it assesses the ability of models like GPT-2 to learn syntactic rules and produce structured inputs.