Interactive Data Science
Additional course information available on Canvas.
The goal of this course is to provide you with the tools to understand data and build data-driven interactive systems. You will learn to tell a story with data and explore opportunities enabled by interactive data analysis through a combination of lectures, readings of current literature, and practical skills development. Over the course of the semester, you will learn about data science and the entire data pipeline, from collecting and analyzing data to interacting with it. We will also cover human-centered aspects of data science and how HCI methods can enhance the interpretation of data. This course requires comfort with programming, as required projects make use of Python and Git. A series of homework assignments helps lay the groundwork for a larger final group project.
Schedule and Readings
Subject to modification
Introduction and the Data Science Pipeline Slides
- Required Data Analysis and Statistics: An Expository Overview by J.W. Tukey and M.B. Wilk 1966
Value of Visualization Slides
- Required Information Visualization (Chapter 1) by Stuart Card, Jock Mackinlay, and Ben Shneiderman 1999
- Optional The Value of Visualization by Jarke van Wijk 2005
Sketching Slides
- Required The Anatomy of Sketching (Chapter 9) by Bill Buxton in Sketching User Experiences
- Required Sketching with Data Opens the Mind's Eye by Giorgia Lupi
- Optional The Shape of My Thoughts by Giorgia Lupi in Eyeo 2014
Visual Encodings with Colab and Altair Slides
- Required Getting Started by Marian Dörk
- Required Data Types, Graphical Marks, and Visual Encoding Channels by Jeffrey Heer, Dominik Moritz, Jake VanderPlas, and Brock Craft
Data Quality Slides
- Required Voyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations by Kanit Wongsuphasawat, Dominik Moritz, Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer in IEEE Transactions on Visualization and Computer Graphics 2016
- Optional Live-coding Altair from Tuesday's Class (Feb 1)
Interactivity 1 Slides
- Required Toward a Deeper Understanding of the Role of Interaction in Information Visualization by Ji Soo Yi, Youn ah Kang, John T. Stasko and Julie A. Jacko in IEEE Transactions on Visualization and Computer Graphics 2007
- Optional Live-coding Data Quality with Pandas from Thursday's Class (Feb 3)
Practical Machine Learning Slides
- Required Black Boxes are not Required by Cynthia Rudin in Data Skeptic Podcast
- Optional Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead by Cynthia Rudin in Nature Machine Intelligence
Interpretability 2 Slides
Perception Slides
- Required Layering and Separation (Chapter 3) by Edward Tufte in Envisioning Information
- Optional Live-coding Practical+Interpretable ML from Thursday's Class (Feb 24)
Final Project Introduction + Designing with Effective Visual Encodings Slides
- Required Basic Principles of Visualization (Chapter 5) by Alberto Cairo in The Truthful Art
- Optional What to consider when choosing colors for data visualization by Lisa Charlotte Muth 2018
No class - Spring Break Slides
No class - Spring Break Slides
Telling Stories with Data Slides
- Required Chapter 8 - Storytelling with Data by Cecilia Aragon, Shion Guha, Marina Kogan, Michael Muller and Gina Neff in Human-Centered Data Science: An Introduction
- Optional Narrative Visualization: Telling Stories with Data by Edward Segel, Jeffrey Heer in IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis) 2010
Natural Language Processing (guest lecture by Dr. Hendrik Strobelt @ MIT-IBM Watson AI Lab) Slides
Data Science Ethics Slides
- Required Introduction by Catherine D'Ignazio and Lauren F. Klein in Data Feminism
- Optional Chapter 1 by Catherine D'Ignazio and Lauren F. Klein in Data Feminism
Critique Workshop 1 Slides
Critique Workshop 2 Slides
Final Project Feedback Session (optional) Slides
No class - Spring Carnival Slides
Controlled Experiments + Evaluation Slides
Uncertainty Slides
- Required The Curious Absence of Uncertainty from Many (Most?) Visualizations by Jessica Hullman 2019
- Optional The effects of communicating uncertainty on public trust in facts and numbers by Anne Marthe van der Bles, Sander van der Linden, Alexandra L. J. Freeman, and David J. Spiegelhalter 2020
Fairness (guest lecture by Prof. Ken Holstein) Slides
- Required Chapter 1: Introduction by Solon Barocas, Moritz Hardt, Arvind Narayanan in Fairness and Machine Learning: Limitations and Opportunities 2019
- Required Predictions — whether algorithmic or human — may not be fair by Sharad Goel, Julian Nyarko, Roseanna Sommers in The Boston Globe 2020
Visualization and Machine Learning (guest lecture by Dr. Fred Hohman @ Apple) Slides
- Required The Myth of the Impartial Machine by Alice Feng, Shuyan Wu 2019
- Required The Beginner's Guide to Dimensionality Reduction by Matthew Conlen, Fred Hohman 2018
- Optional Gamut: A Design Probe to Understand How Data Scientists Understand Machine Learning Models by Fred Hohman, Andrew Head, Rich Caruana, Robert DeLine, Steven Drucker in ACM CHI 2019
Syllabus
Course Goals
The learning goals of the course are as follows:
- To be able to analyze a dataset, evaluate potential insights, and identify specific questions.
- To introduce the value of data visualization and its principles for designing effective interactive visualizations (e.g. human perception, color theory, storytelling techniques)
- To have a working ability to obtain, analyze, manipulate, transform, and distribute data.
- To introduce common problems with data such as structural problems, outliers, incomplete data, and dirty data
- To introduce basic concepts in data interpretation including feature generation, statistical analysis and classification (e.g. assumptions of data, data quality, missing data, outliers)
- To introduce basic concepts in data collection including data formats, parsing and sources of data (Data Structure and Storage)
- To understand and implement basic A/B experiments and to understand experimental reliability and validity
- To introduce human-centered data science topics including ethics, fairness, and interpretability
- To provide practical applied examples of the data pipeline through an examination of current literature
- To provide hands-on experience with creating data-driven applications and to produce a portfolio of such applications
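As a rough illustration of the A/B experiment goal above, here is a minimal sketch of a permutation test on the difference in conversion rates between two variants. It is not taken from the course materials: the data are made up, and the names `permutation_test`, `variant_a`, and `variant_b` are hypothetical. It uses only the Python standard library.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means (b minus a).

    Returns the observed difference and an estimated p-value: the share of
    random relabelings whose difference is at least as large in magnitude.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Relabel: the first len(a) items play the role of group A.
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical outcomes (assumption, not real data): 1 = converted, 0 = did not.
variant_a = [1] * 12 + [0] * 88  # 12% conversion
variant_b = [1] * 30 + [0] * 70  # 30% conversion

observed, p_value = permutation_test(variant_a, variant_b)
print(f"difference in conversion rate: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value only says the difference is unlikely under random assignment. Whether the experiment is reliable and valid also depends on how participants were sampled and assigned, which the code does not check.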
Concepts
- Structured vs unstructured data
- Dealing with heterogeneous data
- Sampling and Bias in Data Collection
- Data transformation and analysis
- Data visualization
- Current research in information driven interfaces
Skills
- Getting Web data
- Dealing with APIs
- Common data formats
- Data parsing
- Common problems with data
- Tools for analyzing data
- Tools for visualizing data
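To make the parsing and data-problem skills above more concrete, here is a small sketch that parses CSV text, converts it to JSON, and flags missing values and outliers. The dataset, the column names, and the helpers `parse_rows` and `flag_outliers` are all hypothetical. The outlier rule (a modified z-score based on the median absolute deviation, with a 3.5 cutoff) is one common heuristic, not a method prescribed by the course.

```python
import csv
import io
import json
from statistics import median

# Hypothetical raw data with one missing value and one implausible reading.
RAW_CSV = """city,temperature_c
Pittsburgh,21.5
Boston,
Seattle,19.0
Austin,250.0
Denver,20.0
"""


def parse_rows(text):
    """Parse CSV text into dicts, turning empty temperature cells into None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        raw = record["temperature_c"].strip()
        rows.append({
            "city": record["city"],
            "temperature_c": float(raw) if raw else None,
        })
    return rows


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score (MAD-based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be flagged by this rule
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


rows = parse_rows(RAW_CSV)
missing = [r["city"] for r in rows if r["temperature_c"] is None]
present = [r["temperature_c"] for r in rows if r["temperature_c"] is not None]
outliers = flag_outliers(present)

# The same records in another common format: JSON (None becomes null).
as_json = json.dumps(rows, indent=2)

print("missing:", missing)
print("outliers:", outliers)
```

Flagging is deliberately separate from fixing: whether to drop, correct, or keep a flagged value depends on the dataset and the question being asked.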
Some of the specific skills that will be covered in projects include:
- Display data from an API on a data-driven application you create
- Create interactive visualizations of data
- Answer a series of intriguing questions from both the data and corresponding visualizations
Prerequisites
The class will involve programming and debugging. If you find programming or debugging extremely difficult, this course may not be for you, as you will have to master several very different programming languages, libraries, and concepts in very short order (projects make use of libraries and frameworks including Pandas, Altair, and Streamlit, and multiple languages including Python, JavaScript, and SQL). That being said, the assignments will mostly require only Python unless you choose a project that uses another language.
Projects
The course is project-oriented. It includes a large final group-defined project along with 2 homework assignments designed to provide the stepping stones needed to complete the final project. Tentative due dates for these projects can be found in the schedule above. Your work will be evaluated relative to your background and level of effort. This is a graduate-level class, and the assumption is that you are a mature and motivated student, and that you will define your work so that you learn and grow, given your background.
All homework assignments are to be done as individual work. It is expected that students may assist each other with conceptual issues, but not provide code. If you use example code, you must explicitly acknowledge this in your assignment submission. If you are unsure about these boundaries, please ask the instructors.
Work Required
This will not be an exam-heavy course. Instead, much of the work will focus on projects. The course will focus on understanding the techniques of data science and visualization through developing creative analyses and visualizations using tools to solve defined problems.
There is no final exam in this course. Students who do well will be invited to continue on an independent project on topics related to the course, working with Prof. Perer or others in the DIG lab during a future semester.
Course Material
Readings will be made available on the schedule listed above.
You will be expected to read assigned readings before the lecture they pertain to. These may include chapters drawn from textbooks about data, or readings from the research literature. To incentivize this, each student will be required to make at least one relevant posting to the discussion group before the class for which each reading is due. This participation will count toward the Participation and Attendance portion of their grade.
All students are required to submit at least 1 substantive discussion post per lecture on Canvas related to the course readings. Each student has 1 pass for skipping comments.
Good comments typically exhibit one or more of the following:
- Critiques of arguments made in the papers
- Analysis of implications or future directions for work discussed in lecture or readings
- Clarification of some point or detail presented in the class
- Insightful questions about the readings or answers to other people’s questions
- Links to web resources or examples that pertain to a lecture or reading
Grades
The tentative breakdown for grading is below. As a reminder, here is the university policy on academic integrity.
- 30% Homework Assignments
- 60% Final Project
- 10% Participation and Attendance
Respect for Diversity
It is our intent that students from all diverse backgrounds and perspectives be well served by this course, that students’ learning needs be addressed both in and out of class, and that the diversity that students bring to this class be viewed as a resource, strength and benefit. It is our intent to present materials and activities that are respectful of diversity: gender, sexuality, disability, age, socioeconomic status, ethnicity, race, and culture. Your suggestions are encouraged and appreciated. Please let us know ways to improve the effectiveness of the course for you personally or for other students or student groups. In addition, if any of our class meetings conflict with your religious events, please let us know so that we can make arrangements for you.
Accommodations for Students with Disabilities
If you have a disability and are registered with the Office of Disability Resources, we encourage you to use their online system to notify us of your accommodations and discuss your needs with us as early in the semester as possible. We will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, we encourage you to contact them at access@andrew.cmu.edu.
Health and Well-being
If you are experiencing COVID-like symptoms or have had a recent COVID exposure, do not attend class if we are meeting in person. Please email the instructors for accommodations.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help; call 412-268-2922 and visit their website at www.cmu.edu/counseling/. Consider reaching out to a friend, faculty or family member you trust for help getting connected to the support that can help. If you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night:
- CaPS: 412-268-2922
- Re:solve Crisis Network: 888-796-8226
If the situation is life threatening, call the police. On campus call CMU Police: 412-268-2323. Off campus: 911.
If you have questions about this or your coursework, please let the instructors know. Thank you, and have a great semester.