Course Overview
An overview of data science, Data 100/200, and the data science lifecycle.
Data 100/Data 200, Fall 2022 @ UC Berkeley
Will Fithian and Fernando Pérez
Content credit: Lisa Yan, Josh Hug
LECTURE 1
Roadmap
Lecture 01, Data 100 Fall 2022
Intros - Fernando Pérez
Intro - Will Fithian
What is Data Science?
Why Data Science Matters
Why we need data science
The world is complicated, decisions are hard
A Covid story (with a happy ending)
June 2, 2022: My wife Kari, 8.5 months pregnant, tests positive for Covid-19
Doctor prescribes Paxlovid off-label
First question: how much danger is Kari in?
But how?
Fatality rate
Immediate question: how much danger is Kari in?
Meta-analysis combining 24 studies from different regions
Any problems with using this estimate?
Fatality rate: take two
Kari is:
No one has done a study on this population!
Informally, seems that Kari & baby are at low risk of death from Covid. But what about:
The happy ending
Belief is Social
Data is a Tool for Finding Truth
What about vaccine mandates?
Question: should schools / employers / restaurants require proof of vaccination?
Many of the same challenges:
Additional challenges:
Why study data science?
Data is used everywhere to answer hard questions and make tough decisions:
Claims about data come up in discussing almost any important issue
The world is complicated, decisions are hard
What is Data Science?
PRINCIPLES AND TECHNIQUES OF DATA SCIENCE
Data is changing the world
From Joey Gonzalez.
Data science is a fundamentally interdisciplinary field
Joey Gonzalez (co-creator of this course)
Data Science is the application of data-centric, computational, and inferential thinking to:
Data Science Venn Diagram
by Drew Conway in 2010
Data science in industry
The tasks that data scientists say they work on regularly. Self-reported. Based on the results of the 2016 Data Science Salary Survey.
Insight
Good data analysis is not:
There are many tools out there for data science, but they are merely tools.
“The purpose of computing is insight, not numbers.”
R. Hamming. Numerical Methods for Scientists and Engineers (1962).
Example Questions in Data Science
Some (broad) questions we might try to answer with data science:
What will you learn in this class?
What are the Principles and Techniques that We’ll Learn?
PRINCIPLES AND TECHNIQUES OF DATA SCIENCE
Course goals
Prepare
Enable
Empower
Prepare students for advanced Berkeley courses in data management, machine learning, and statistics, by providing the necessary foundation and context.
Enable students to start careers as data scientists by providing experience working with real-world data, tools, and techniques.
Empower students to apply computational and inferential thinking to address real-world problems.
Tentative List of Topics to be Covered in Data 100
Prerequisites
Official prerequisites for this course:
The prereqs are being strictly enforced! We will not be teaching:
Homework 1 and Lab 1 will help calibrate your background.
Course Overview
Staff
GSIs
GSIs teach discussion, hold office hours, and help create assignments and exams. Contact info: ds100.org/fa22/staff.
Jimmy Butler
Bella Crouch
Kanu Grover
Connie Huang
Samantha Hing
Shiangyi Lin
Dominic Liu
Vasanth Madhavan
Minh Phan
Siddhant Satapathy
Stella Wang
Eric Hao
Alina Herri
Rohan Jha
Ishaan Mishra
Pragnay Nevatia
Yiming Ni
Heather Sizlo
Verona Teo
Arda Ulug
Shiny Weng
Samantha Wray
Nancy Xu
Jacob Yim
Michael Zhu
Bold denotes 20 hour GSI.
Readers
Readers hold office hours and grade the written components of homeworks and projects. Contact info: ds100.org/fa22/staff.
Natalie Chan
Kishore Chidambaram
Floyd Fang
Mary Guo
Wesley Little
Zaid Maayah
Ruchi Maheshwari
Mihran Miroyan
Elaine Qian
Milad Shafaie
Yaqian Tang
Yuerou Tang
Course Websites / Platforms
Online platforms
Course website (ds100.org/fa22)
DataHub (data100.datahub.berkeley.edu)
Ed (https://edstem.org/us/courses/25695)
Gradescope (gradescope.com, by invitation)
Textbook (www.textbook.ds100.org)
Programming Environment for our Course: JupyterLab
Learning Advanced JupyterLab
JupyterLab offers notebooks and more tools for data science.
We’ll be accessing JupyterLab using DataHub (data100.datahub.berkeley.edu).
Resources for learning fancier JupyterLab functionality:
Course Logistics
Content and workflow
Note: See online syllabus at https://ds100.org/fa22/syllabus/ and Ed announcements for complete information
Weekly Flow
All deadlines subject to change
Lectures
Two lectures per week.
Discussion Section
Weekly live discussion sections
Graded for attendance (0/1 each week): 5% of final grade
Section sign-ups
Homework and Projects
Homeworks and Projects: Assignments for in-depth understanding and synthesis.
Graduate final project for Data 200: details TBA
Labs
Labs: short weekly programming assignments to give you familiarity with new concepts.
Quick Checks
Weekly short assignments to check you are keeping up with lectures
Office hours and communication
Office hours are listed on the calendar, mainly in person but with some virtual options
Please check Ed or the FAQ page before emailing instructors
Email options
Exams
Two exams:
Alternate exam policies:
Grading
Grading Logistics
Grades will be posted on Gradescope (including discussion attendance if applicable).
Deadlines are firm at 11:59PM.
If you have DSP accommodations, you should receive an email from us shortly.
Collaboration and Academic Dishonesty
We will be following the EECS Department Policy on Academic Dishonesty, which states that using work or resources that are not your own or permitted by the course constitutes plagiarism and may lead to disciplinary actions.
Assignments
Data science is a collaborative activity! It is okay to discuss problems with friends.
Exams
Weekly Announcements
Weekly announcements will appear on Ed only
We are Here to Help!
We want you to succeed!
Welcome to Data 100/Data 200!
Data Science Lifecycle
The “data science lifecycle” you will see out in the wild may be slightly different than the one we teach you, but the core ideas are all the same.
Data science lifecycle
The data science lifecycle is a high-level description of the data science workflow.
Note the two distinct entry points!
Ask a Question
Obtain Data
Understand the Data
Understand the World
Reports, Decisions, and Solutions
1. Question/Problem Formulation
2. Data Acquisition and Cleaning
3. Exploratory Data Analysis & Visualization
4. Prediction and Inference
Demo: The Data Science Lifecycle
[1] Ask a Question: Who are you?
Demo Slides
[2] Data Acquisition and Cleaning
[3] Exploratory Data Analysis and Visualization
Let’s understand what our data tells us, and let’s clean the data while we’re at it.
[3] Exploratory Data Analysis and Visualization
Population: Data 100 students, Fall 2022
Some sub-questions:
[3] A harder direction to explore
Diversity ...?
Unfortunately, surveys of data scientists suggest that there are far fewer women:
To learn more, check out the Kaggle Executive Summary or study the Raw Data.
[4, 1] “What fraction of the students are female?”
This is a complex question. Are we asking about sex (biological trait) or gender (individual, social, cultural identity)?
The Data Science Program wants to improve gender diversity.
What is the gender diversity of this class?
We don’t currently have data to answer this question. We could either:
*Do not attempt #2 alone; it is flawed in many ways (we’ll discuss this later).
We are only exploring #2 in this lecture to illustrate inferential modeling and combining multiple data sources to reason about something we haven’t measured.
[1, 2] (again, but for Baby Names Data)
1. Can we estimate a person’s sex using their name?
2. Obtain more data: SSN Baby Names
Discuss: Based on the description of the SSN data: What are some limitations of this data source? What limitations might it have with respect to our original task?
We’ll come back to this…
[2, 3] (again, but for Baby Names Data)
What does each row/column represent?
What can you observe about how U.S. baby names have changed over time?
[4] Prediction and Inference: Simple Classifier
Let’s use this data to estimate the fraction of female students in the class.
Simple classifier:
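The bullet points for this slide are not reproduced here, so the following is only a minimal sketch of one plausible name-based classifier, not necessarily the one used in the demo. It assumes the baby names data can be reduced to (name, sex, count) rows. The toy rows, the roster, the 0.5 threshold, and all function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows in the shape (name, sex, count); not real figures.
BABY_NAMES = [
    ("Alex", "F", 20), ("Alex", "M", 80),
    ("Jordan", "F", 45), ("Jordan", "M", 55),
    ("Maria", "F", 99), ("Maria", "M", 1),
    ("Sam", "F", 30), ("Sam", "M", 70),
]


def proportion_female(rows):
    """Map each lowercase name to the share of its records marked 'F'."""
    female = defaultdict(int)
    total = defaultdict(int)
    for name, sex, count in rows:
        key = name.lower()
        total[key] += count
        if sex == "F":
            female[key] += count
    return {name: female[name] / total[name] for name in total}


def simple_classifier(first_name, prop_female, threshold=0.5):
    """Return 'F', 'M', or None when the name is absent from the data."""
    p = prop_female.get(first_name.lower())
    if p is None:
        return None
    return "F" if p > threshold else "M"


def estimated_fraction_female(roster, prop_female):
    """Fraction labeled 'F' among roster names that could be classified."""
    labels = [simple_classifier(name, prop_female) for name in roster]
    known = [label for label in labels if label is not None]
    if not known:
        return None
    return sum(label == "F" for label in known) / len(known)


PROP_FEMALE = proportion_female(BABY_NAMES)
ROSTER = ["Maria", "Alex", "Jordan", "Sam", "Zyx"]  # hypothetical roster
ESTIMATE = estimated_fraction_female(ROSTER, PROP_FEMALE)
print(ESTIMATE)  # 0.25 with the toy rows above
```

In this sketch the unmatched name is silently dropped and a name with a 45% share is labeled exactly like one with a 1% share. Both are modeling choices that can bias the estimate.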
1. How do you feel about the estimated proportion of females in this class?
2. Do you trust it?
A Classifier that Captures Uncertainty
Our current model doesn’t capture the uncertainty we saw in the data. We can use simulation to provide a better distributional estimate.
Updated classifier:
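The slide's description of the updated classifier is not reproduced here. One common way to use simulation for this, sketched below under assumptions, is to treat each student's label as a random draw whose probability equals the share of records for their name marked female, then repeat the draw many times to get a distribution of class-level estimates. The proportions, the number of simulations, the seed, and the function name are all hypothetical.

```python
import random

# Hypothetical per-student proportions: for each student's first name,
# the share of baby-name records marked 'F'. Not real figures.
PROPS = [0.99, 0.20, 0.45, 0.30]


def simulate_fraction_female(props, n_sims=1000, seed=0):
    """Simulate the class-level fraction labeled 'F' n_sims times.

    Each student is labeled 'F' with probability equal to their name's
    proportion, so ambiguous names contribute spread to the estimate.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(n_sims):
        draws = [rng.random() < p for p in props]
        results.append(sum(draws) / len(draws))
    return results


SIMS = simulate_fraction_female(PROPS)
MEAN = sum(SIMS) / len(SIMS)

# Rough central 95% range from the sorted simulated fractions.
SORTED = sorted(SIMS)
LOW = SORTED[int(0.025 * len(SORTED))]
HIGH = SORTED[int(0.975 * len(SORTED)) - 1]
print(MEAN, LOW, HIGH)
```

Reporting the range alongside the mean shows how much the estimate depends on ambiguous names. It does not address the deeper problems with inferring sex or gender from a name.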
1. How do you feel about the estimated proportion of females in this class?
2. Do you trust it?
Recap of what we just saw:
Find Fall 2022 DS100 data
Explore interesting things about our class: names, majors, counts
Find more data: Baby Names (U.S. SSN)
Create a classifier
Gut check: How comfortable were you being the data subject in this study?
Reality check: What about those limitations we talked about?
What are some limitations of our analysis?
Possible limitations:
How might this impact our analysis?
Human Contexts in Data Science
Representation: How does data stand in for complex phenomena in the world?
Identity: What kinds of identities are involved in the data? Whose? What happens to identity in the process of data analysis?
In our (faulty) analysis: Name → Sex → Gender
Reductions of Identity based on Name have historically reproduced existing social bias against minoritized groups:
Job seekers with White-sounding first names received 50% more callbacks from employers than job seekers with Black-sounding names. [Bertrand & Mullainathan, 2003]
How can we fix these flaws?
Our original question:
What is the gender diversity of our class?
We didn’t have data to answer this question. We could either:
What you learn in Data 100 will help you explore, challenge, and justify these beliefs in every step of the Data Science Lifecycle.
…And sometimes the takeaway is that we need to collect better data.
What’s the point of this demo?
There are many assumptions in data science:
Data Science does not and cannot live in a theoretical vacuum. Data Science is a human-centered technical practice.
See you soon!
Course Overview
Content credit: Suraj Rampure, Allen Shen, Joey Gonzalez, Josh Hug, and Sam Lau