Abstract: | Testing and static analysis tools can help root out bugs in programs, but not bugs in data. Checking data for errors is arguably as important as finding program errors, but lacks effective tool support. Previous approaches like data cleaning and statistical outlier analysis require either ground truth data for cross-validation, or that the data follow a known statistical distribution.
This paper introduces data debugging, an approach that combines data dependence analysis with statistical analysis to find potential data errors. Since it is impossible to know a priori whether data are erroneous or not, data debugging instead reveals data whose impact on the computation is unusual. Data debugging is particularly promising in the context of data-intensive programming environments that intertwine data with programs (in the form of queries or formulas).
This paper presents the first data debugging tool, CheckCell, an add-in for Microsoft Excel. CheckCell highlights values in shades proportional to the unusualness of their impact on the spreadsheet’s computation, which includes charts and formulas. CheckCell is efficient: the current prototype runs in seconds for most of the spreadsheets we examined. In a case study with workers recruited through a crowdsourcing platform, we show that CheckCell is effective at finding actual data entry errors. |
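
The core idea above, flagging inputs whose impact on a computation is unusual, can be illustrated with a small sketch. This is not the CheckCell algorithm, which the abstract does not specify. It is a simplified leave-one-out illustration under stated assumptions: the function names `impact_scores` and `flag_unusual`, the median-substitution strategy, and the 3.5 cutoff on a MAD-based score are all hypothetical choices made for this example.

```python
from statistics import median


def impact_scores(values, formula):
    """Score each input by how much the formula's output changes
    when that input is replaced by the median of the other inputs.
    (Assumed substitution strategy, chosen for illustration only.)"""
    baseline = formula(values)
    scores = []
    for i in range(len(values)):
        others = values[:i] + values[i + 1:]
        trial = values[:i] + [median(others)] + values[i + 1:]
        scores.append(abs(formula(trial) - baseline))
    return scores


def flag_unusual(values, formula, threshold=3.5):
    """Return indices of inputs whose impact score is an outlier
    among all impact scores, using a median-absolute-deviation test.
    The threshold is an assumed, conventional cutoff."""
    scores = impact_scores(values, formula)
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # Scores are (almost) all identical; flag only those above the median.
        return [i for i, s in enumerate(scores) if s > med]
    return [
        i for i, s in enumerate(scores)
        if 0.6745 * (s - med) / mad > threshold
    ]


if __name__ == "__main__":
    cells = [10, 12, 11, 13, 120, 12, 11]  # 120 could be a typo for 12
    print(flag_unusual(cells, sum))
```

With the sample column and `sum` as the formula, the sketch flags index 4 (the value 120). Note that the test is applied to each value's effect on the output rather than to the raw values, so no assumption is made about the distribution of the data itself.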