Automatic Data Repair without Format Specifications

In data processing, datasets are expected to adhere to specific formats. However, inconsistencies due to human error, data corruption, or partial transmission can render these datasets nonconforming, hindering automated processing. This necessitates manual data repair, a time-consuming and error-prone task, especially when formal specifications are unavailable.

To address this challenge, we introduce ϵREPAIR, a novel format-free approach to automating data repair. ϵREPAIR leverages parser feedback to detect and correct data inconsistencies, making it a versatile solution for handling diverse and inconsistent datasets.
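
To make the idea of parser-guided repair concrete, the following minimal sketch uses Python's built-in JSON parser as the only source of feedback. It is an illustration under stated assumptions, not the ϵREPAIR algorithm itself: the greedy search strategy, the names `failure_position` and `repair`, and the small `INSERT_CANDIDATES` alphabet are all hypothetical choices made for this example. The sketch asks the parser where it failed, tries single-character deletions and insertions at that offset, and keeps the candidate that lets the parser get furthest.

```python
import json

# Hypothetical alphabet of characters to try inserting (an assumption of this
# sketch; a format-free tool would not hard-code JSON punctuation like this).
INSERT_CANDIDATES = '"}],:{[0'


def failure_position(text):
    """Return None if the parser accepts text, else the offset where it failed."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as err:
        return err.pos


def progress(text):
    """Score a candidate: infinity if it parses, else how far the parser got."""
    pos = failure_position(text)
    return float("inf") if pos is None else pos


def repair(text, max_edits=10):
    """Greedily apply single-character edits at the failure offset.

    Returns the repaired string, or None if no progress can be made
    within max_edits edits.
    """
    for _ in range(max_edits):
        pos = failure_position(text)
        if pos is None:
            return text  # the parser accepts the input; nothing left to fix

        candidates = []
        if pos < len(text):
            # Candidate 1: delete the character the parser rejected.
            candidates.append(text[:pos] + text[pos + 1:])
        for ch in INSERT_CANDIDATES:
            # Candidates 2..n: insert a character just before the fault.
            candidates.append(text[:pos] + ch + text[pos:])

        best = max(candidates, key=progress)
        if progress(best) <= pos:
            return None  # no single edit moves the parser forward
        text = best

    return text if failure_position(text) is None else None


if __name__ == "__main__":
    # The array is closed with '}' instead of ']'.
    print(repair('{"a": 1, "b": [2, 3}'))
```

Because this sketch prefers an insertion that makes the input parse over a deletion that does not, the example keeps all of the original data. A greedy search of this kind can get stuck when the parser reports a fault far from the actual corruption (for instance, an unterminated string), so it should be read as a toy model of the feedback loop rather than a practical repair tool.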

In our evaluation, ϵREPAIR produces repairs of 2.6× higher quality than its closest competitor, DDMax, as measured by the number of edits required to restore corrupted data. It also reduces data loss by 2.8× compared to DDMax, with only a modest 1.4× runtime overhead.

This work presents a practical, robust, and flexible format-free data repair alternative to DDMax. Its applications extend to domains such as data science, software development, and other human-centric systems, where handling diverse and inconsistent datasets is critical.