Strategies for Reuse and Sharing among Data Scientists in Software Teams
Published at ICSE | Pittsburgh, PA | 2022
Abstract
Effective sharing and reuse practices have long been hallmarks of proficient
software engineering. Yet the exploratory nature of data science presents new
challenges and opportunities to support sharing and reuse of analysis code. To
better understand current practices, we conducted interviews (N=17) and a survey
(N=132) with data scientists at Microsoft and extracted five commonly used
strategies for sharing and reuse of past work: personal analysis reuse, personal
utility libraries, team shared analysis code, team shared template notebooks,
and team shared libraries. We also identify factors that encourage data
scientists to share and reuse code or discourage them from doing so. Our
participants described obstacles
to reuse and sharing including a lack of incentives to create shared code,
difficulties in making data science code modular, and a lack of tool
interoperability. We discuss how future tools might help meet these needs.