05-898B Mini Data Science for Product Managers
Sherry Tongshuang Wu
sherryw@cs.cmu.edu
2025/03/10
Meet your instructor! (Me)
Sherry Wu (She/Her/Hers)
Assistant Prof. @ CMU HCII
Work on HCI and NLP!
I study how humans interact with AI systems.
Office hour: Talk to me after class
Email: sherryw@cs.cmu.edu
Teaching Assistants
Jaehee Kim
Senior in School of Computer Science
AI major w/ concentration in HCI
Email: jaeheek@andrew.cmu.edu
Office hour: Wednesdays, 2:30 - 3:30PM, GHC 7501
Teaching Assistants
Yeonji Baek
Junior studying Statistics and ML
Minor in Business Analytics + HCI
Office hour: Fridays, 2-3PM, GHC 7101
What is Data Science and why does it matter?
CLASS QUESTION
What is Data Science?
“The sexiest job of the 21st century”
Harvard Business Review
What is Data Science?
A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
What is Data Science?
“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, … because now we really do have essentially free and ubiquitous data.”
Hal Varian, Google’s Chief Economist - The McKinsey Quarterly, Jan 2009
What is Data Science?
Manipulate & hack data,
understand operations
Have in-depth questions you want to answer / hypotheses you want to test!
Know basic analytics & can interpret data!
Source: Drew Conway
Why do we care?
CLASS QUESTION
Why do we care?
Data is everywhere! 2017: 2.5 exabytes (quintillion bytes) of data per day, largely unstructured —DOMO
Business? Data-centered & computational!
Biology? Data-centered & computational!
Physics? Data-centered & computational!
Medicine? Data-centered & computational!
Social Sciences? Data-centered & computational!
How can we leverage data?
Improve your fitness by targeted training
Improve your product by targeting your audience
Make better decisions (e.g. choose right medication, pick good restaurant)
Predict elections, events, crowd behavior, etc.
Many more applications...
Why do we (as PMs) care?
CLASS QUESTION
Let’s preview the course with an in-class breakout activity!
You will have many of these throughout the class.
Class Challenge
5mins!
Form a group of 2-3 people.
Help each other sign up for the Slack workspace.
Navigate to the #lecture channel.
This is how we take class attendance.
Post answer to #lecture on Slack & tag all members.
[Slack invitation link on canvas]
https://bit.ly/2025s-pmds-slack
CLASS CHALLENGE
Look at this visualization on the left (from a data scientist on your team), and discuss with your neighbor:
Post your answer to #lecture!
“I trained a model and it has 98% accuracy”
-Data Scientist on your team
CLASS QUESTION
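One reason to probe a claim like this: accuracy alone can hide a useless model when the classes are imbalanced. A minimal sketch, assuming a hypothetical dataset where only 2 of 100 users churn (the numbers and the always-predict-the-majority "model" are illustrative, not from the slide):

```python
# Hypothetical labels: 1 = churned, 0 = stayed. Only 2 of 100 users churn.
labels = [0] * 98 + [1] * 2

# A "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: of the users who actually churned, how many did the model catch?
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.0%}, recall = {recall:.0%}")
```

This prints "accuracy = 98%, recall = 0%": the headline number looks great even though the model never finds a single churned user.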
Why Data Science for Product Managers?
You can improve your product with data
But if your product relies on data, do YOU have the skills to interrogate it effectively? Can you interpret the data, understand the analysis your teammates give you, and spot the errors your teammates made?
The Data Pipeline
Question
What problem do you want to solve?
Is it the right question?
Is it answerable?
Question
Collection
Cleaning
Integration
Analysis
Visualization
Presentation
Dissemination
Collection: How to collect data?
Is it the right data for the question? How hard/easy to collect?
Cleaning: How dirty is real data?
Jan 19, 2020
January 19, 20
1/19/20
2020-01-19
19/1/20
What flaws exist in the data?
How do we address them?
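One possible way to address the date mess above is to try a list of known formats and normalize everything to ISO 8601. This is a sketch using only the Python standard library; the `normalize_date` helper and the order of `KNOWN_FORMATS` are assumptions for illustration, and the order matters for ambiguous strings such as "1/2/20".

```python
from datetime import datetime

# Formats are tried in order. The order is an assumption: for an ambiguous
# string such as "1/2/20", the month-first reading wins.
KNOWN_FORMATS = [
    "%Y-%m-%d",   # 2020-01-19
    "%b %d, %Y",  # Jan 19, 2020
    "%B %d, %y",  # January 19, 20
    "%m/%d/%y",   # 1/19/20
    "%d/%m/%y",   # 19/1/20
]

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

raw_dates = ["Jan 19, 2020", "January 19, 20", "1/19/20", "2020-01-19", "19/1/20"]
cleaned = [normalize_date(d) for d in raw_dates]
print(cleaned)
```

All five spellings collapse to "2020-01-19". Returning None for unparseable values (rather than guessing) keeps the remaining flaws visible so someone can decide how to handle them.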
Integration
How do you combine data from multiple sources to provide the user with a unified view?
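A small sketch of what "a unified view" can mean in practice: joining records from two sources on a shared key. Both sources, their field names, and the `integrate` helper are hypothetical; real pipelines often use a database or a dataframe library for this step.

```python
# Hypothetical source 1: account records from a CRM export.
crm = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
    {"user_id": 3, "plan": "pro"},
]

# Hypothetical source 2: usage logs from product analytics.
usage = [
    {"user_id": 1, "sessions": 4},
    {"user_id": 3, "sessions": 12},
    {"user_id": 4, "sessions": 7},  # no matching CRM record
]

def integrate(left, right, key):
    """Outer-join two lists of records on `key`; missing fields stay absent."""
    merged = {}
    for record in left + right:
        merged.setdefault(record[key], {}).update(record)
    return [merged[k] for k in sorted(merged)]

unified = integrate(crm, usage, "user_id")
for row in unified:
    print(row)
```

Users 2 and 4 appear in only one source, so their unified rows are incomplete. Deciding whether to keep, drop, or fill such rows is part of the integration problem.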
Analysis: How will you analyze the data?
Classification: Predict which of a set of classes an entity belongs to
Regression: Predict the numerical value of some variable for an entity
Similarity Matching: Find similar entities based on what we know about them
Clustering: Group entities together by their similarity
And lots more (co-occurrence grouping, pattern mining, link prediction, data reduction, etc.)
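To make the first three task types concrete, here is a toy sketch that answers each with the same nearest-neighbor idea. The users, features, and labels are made up, and nearest neighbors is just one simple technique chosen for illustration; clustering is left out to keep the example short.

```python
import math

# Hypothetical users: features are (sessions per week, minutes per session).
users = [
    {"name": "a", "features": (1.0, 5.0), "churned": True, "spend": 0.0},
    {"name": "b", "features": (2.0, 6.0), "churned": True, "spend": 5.0},
    {"name": "c", "features": (9.0, 30.0), "churned": False, "spend": 40.0},
    {"name": "d", "features": (10.0, 28.0), "churned": False, "spend": 50.0},
]

def nearest(query, k=1):
    """Return the k users whose features are closest to `query` (Euclidean)."""
    return sorted(users, key=lambda u: math.dist(u["features"], query))[:k]

new_user = (8.0, 27.0)

# Similarity matching: which known user looks most like the new one?
most_similar = nearest(new_user)[0]["name"]

# Classification: predict a class (churned or not) from the nearest neighbor.
predicted_churn = nearest(new_user)[0]["churned"]

# Regression: predict a number (monthly spend) as the mean of the 2 nearest.
neighbors = nearest(new_user, k=2)
predicted_spend = sum(u["spend"] for u in neighbors) / len(neighbors)

print(most_similar, predicted_churn, predicted_spend)
```

The same data supports all three questions. What changes is the kind of answer: an entity, a class, or a number.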
Visualization
“The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.”
Presentation
Papers
Talks
Videos
Blog Posts
Interactive Notebooks
Explorables
Dissemination
Source Code
Web Applications
Products
COVIDcast
Question: What data are good indicators of COVID-19?
Collection: APIs, Datasets
Cleaning: Messy geographic data, backfills
Integration: Merge data from all of the indicators
Analysis: Forecasting, hotspot detection
Visualization: Map, Small Multiples, Animation
Presentation: Blogs, Social Media, Notebooks, Source code
Dissemination: Web application, Public APIs
Questions?
Which ones will we focus on?
THIS CLASS: The Content
Preview: Data Quality and Wrangling
“In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.”
Preview: Data Collection, Biases, and Provenance
“Garbage-in, garbage-out!”
Preview: Exploratory Data Analysis
"Exploratory data analysis is detective work. It involves looking at data in many different ways, digging deep, and uncovering hidden insights."
Preview: Visualization
“The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.”
Preview: Feature Engineering and Stats
Find dimensions that matter to the questions being asked.
Preview: Machine Learning
Find dimensions that matter to the questions being asked.
Preview: Communication
Data storytelling
Interpretation methods
THIS CLASS: The logistics
General information
Syllabus and Class Structure
05-898 B4, Spring 2025, 6 units mini course
Mondays/Wednesdays 12:30-1:50pm
Syllabus on course webpage (link on Canvas)
Slides posted after each lecture
Check the schedule of topics (may change!)
Be familiar with the class rules
https://bit.ly/2025s-pmds-syllabus
Active lecture
Case study driven
Discussions highly encouraged
Regular in-class activities, breakouts
Set up the ability to read/post to Slack during lecture
Contribute your own experience!
Discussions over definitions
Recordings and Attendance
Try to attend lecture -- discussions are important to learning, especially for this topic
Participation is part of your grade (more on this later!)
Slides will be released after class
No lecture recordings by default
Contact me for accommodations (illness, interview travel, unforeseen events) or have your advisor reach out. I try to be flexible.
Communication
Assignments, quizzes, and grades will be posted to Canvas
Course schedule and slides will be posted on the webpage.
All announcements through Slack #announcements
Post questions on Slack: Please use #general or #assignments and post publicly if possible; your classmates will benefit from your Q&A!
Invite link on Canvas
We also just set it up today!
Grading
A+ (97-100%) Professional level work, showing highest level of achievement
A (93–96.9%) Extraordinarily high achievement and command of subject matter
A- (90–92.9%) Excellent and thorough knowledge of the subject matter
B+ (87–89.9%) Full understanding of material; quality work
B (83–86.9%) Above average fulfillment of all course requirements
B- (80–82.9%) Fulfillment of all course requirements, acceptable work
C+ (77–79.9%) Satisfactory quality of work
C (73–76.9%) Minimally acceptable performance and quality of work
C- (70–72.9%) Unacceptable work, does not demonstrate mastery
D+ (65–69.9%) Unacceptable work
Below 65 Failure
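The scale above can be read as a lookup on lower bounds. A sketch, assuming scores that fall between the listed bands (for example 96.95%) go to the lower band; `letter_grade` is an illustrative helper, not an official grading tool.

```python
# Lower bound (in percent) for each letter, taken from the scale above.
GRADE_FLOORS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (65, "D+"),
]

def letter_grade(percent):
    """Map a course percentage to the letter whose lower bound it meets."""
    for floor, letter in GRADE_FLOORS:
        if percent >= floor:
            return letter
    return "Failure"  # below 65

for score in (98, 91.5, 86.9, 64):
    print(score, letter_grade(score))
```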
Grade Breakdown
Quizzes 10%
Participation 10%
Assignments 80%
(Bonus points) 5%
There won’t be a final exam.
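A sketch of how the breakdown above combines into a course total. The weights come from the slide; treating the bonus as percentage points added on top of the weighted total is an assumption, and `course_total` is a hypothetical helper.

```python
# Weights in percent, from the breakdown above.
WEIGHTS = {"quizzes": 10, "participation": 10, "assignments": 80}

def course_total(scores, bonus=0.0):
    """Combine component scores (each 0-100) into a course percentage.

    `bonus` is added on top as percentage points (an assumption).
    """
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
    return weighted + bonus

example = {"quizzes": 90, "participation": 100, "assignments": 85}
print(course_total(example))
print(course_total(example, bonus=5))
```

With assignments at 80% of the grade, a 5-point swing in the assignment average moves the total by 4 points, while the same swing in quizzes moves it by only 0.5.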
Quizzes, bi-weekly
Very easy, mostly multiple-choice questions that test your understanding of simple concepts; the answers are all in the lecture
Participation
Participation != Attendance
Grading:
100%: Participates actively at least once in most lectures by (1) asking or responding to questions or (2) contributing to breakout discussions
75%: Participates actively at least once in two thirds of the lectures
50%: Participates actively at least once in over half of the lectures
25%: Participates actively at least once in one quarter of the lectures
Assignments, (almost) weekly
A series of assignments built on each other
Essentially, 4 steps in data science
HW0 is a preview on a simple dataset and will help you set up the necessary environment. Come to this Wednesday's class!
Be careful of error propagation, fix things early!
Will have a final presentation
Research in this Course
We want to know what makes an effective/happy human-AI pair for DS tasks!
You will be able to use GenAI across all homeworks, on Google Colab.
The PhD student will help you set up the homework environment on Wednesday and explain more. Bring your laptop!
All data will be anonymized.
You can opt out & the instructing team won’t know & it won’t affect your grade.
If you sign up for a 60-90 min think-aloud interview with the researcher, you can also get bonus points (+5). First come, first served!
Assignments - Late Policy
Each day late is 10% off (up to maximum of 50%) - Automatic
If you have questions, contact me or the TA early (not after the assignment is due!)
Submitted via Canvas - make sure you can log in!
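The late policy above works out as simple arithmetic. A sketch, assuming whole days late and that the deduction scales the score you earned (the slide does not say exactly how the 10% is applied); both helpers are hypothetical.

```python
def late_penalty(days_late):
    """Percent deducted: 10% per day late, capped at 50%."""
    return min(10 * max(days_late, 0), 50)

def score_after_penalty(raw_score, days_late):
    # Assumption: the penalty scales the earned score.
    return raw_score * (100 - late_penalty(days_late)) / 100

for days in (0, 2, 5, 9):
    print(days, late_penalty(days), score_after_penalty(90, days))
```

Under these assumptions the penalty stops growing after the fifth day, so a very late submission still earns up to half credit.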
Assignment #0 – Pre-survey
Due: Wed (March 12) at 12:30 PM Eastern Time
Already posted on Canvas
HOMEWORK
Academic honesty
See web page
In a nutshell: do not copy from other students, do not lie, do not share or publicly release your solutions
If you feel overwhelmed or stressed, please come and talk to us (see syllabus for other support opportunities)
Introductions
Before the next lecture, introduce yourself in Slack channel #social:
See you Wed!
Check Canvas access
Join the Slack workspace