Diff in the Loop: Supporting Data Comparison in Exploratory Data Analysis
Published at CHI 2022 | New Orleans, LA
Abstract
Data science is characterized by evolution: since data science is exploratory,
results evolve from moment to moment; since it can be collaborative, results
evolve as the work changes hands. While existing tools help data scientists
track changes in code, they provide less support for understanding the iterative
changes that the code produces in the data. We explore the idea of visualizing
differences in datasets as a core feature of exploratory data analysis, a
concept we call Diff in the Loop (DITL). We evaluated DITL in a user study with
16 professional data scientists and found it helped them understand the
implications of their actions when manipulating data. We summarize these
findings and discuss how the approach can be generalized to different data
science workflows.
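
To make the idea of a dataset diff concrete, the following is a minimal sketch of comparing two snapshots of a table taken before and after a data manipulation step. It is not the DITL implementation, which the abstract does not describe; the function name `diff_snapshots`, the row-matching `key` column, and the sample data are all hypothetical choices made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def diff_snapshots(before: list[Row], after: list[Row], key: str) -> dict[str, Any]:
    """Compare two snapshots of a table, matching rows on the `key` column.

    Hypothetical helper for illustration only; assumes `key` is unique per row.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Rows present in only one of the two snapshots.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    # For rows present in both, record each cell whose value differs
    # as a (old_value, new_value) pair.
    changed: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        cells = {
            col: (old.get(col), new.get(col))
            for col in sorted(old.keys() | new.keys())
            if old.get(col) != new.get(col)
        }
        if cells:
            changed[k] = cells

    return {"added": added, "removed": removed, "changed": changed}


# Made-up snapshots: one row is dropped, one is edited, one is appended.
before = [
    {"id": 1, "city": "Austin", "temp_f": 95},
    {"id": 2, "city": "Boston", "temp_f": None},
    {"id": 3, "city": "Denver", "temp_f": 70},
]
after = [
    {"id": 1, "city": "Austin", "temp_f": 95},
    {"id": 3, "city": "Denver", "temp_f": 72},
    {"id": 4, "city": "Eugene", "temp_f": 64},
]

diff = diff_snapshots(before, after, key="id")
print(diff)
```

A summary of this kind (rows added, rows removed, cells changed) is one possible input to a visualization of what a step did to the data; how such differences are best presented to analysts is the question the paper studies.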