1 of 75

Course Overview

An overview of data science, Data 100/200, and the data science lifecycle.

Data 100/Data 200, Fall 2022 @ UC Berkeley

Will Fithian and Fernando Pérez

Content credit: Lisa Yan, Josh Hug

LECTURE 1

2 of 75

Roadmap

Lecture 01, Data 100 Fall 2022

  • Intros
  • What is data science?
  • What will you learn in this class?
  • Course overview
    • Lots of important details
  • Data Science Lifecycle
  • Demo

3 of 75

Intros - Fernando Pérez

  • Faculty in Statistics since 2017 (419 Evans Hall, fperez.org).
  • @ Berkeley since 2008, formerly building tools for neuroscience.
  • Background: Physics PhD, then Applied Mathematics (numerical methods).
  • Big focus of my career: open source tools for science, especially in Python.
    • 2001: Created IPython as a graduate student.
    • 2012: Co-founded NumFOCUS, non-profit; open source for Science & Education.
    • 2014: Co-founded IPython's evolution, Project Jupyter.
    • 2020: Co-founded 2i2c.org, non-profit; interactive computing in S & E.
  • Research interests
    • Interactive computational tools for scientific discovery and education.
      • Collaborative, open and reproducible research.
    • Data Science and Machine Learning in the physical sciences, esp. geoscience.
    • Applications in Cryosphere science (ice!), climate change & the environment.

4 of 75

Intro - Will Fithian

  • Faculty in statistics since 2015.
  • Background: PhD in stats from Stanford
  • First time teaching Data 100. Also have taught:
    • Data 8
    • Stat 131A
    • Stat 210A, 6 times.
  • Research interests in
    • Multiple testing
    • Replicability and “meta-science”
    • Ecological statistics & microbiome data
  • When / how can we draw reliable inferences from:
    • Non-representative or flawed data sources
    • Bodies of literature with severe publication bias
    • Research on questions that were inspired by the data

5 of 75

What is Data Science?

6 of 75

Why Data Science Matters

7 of 75

Why we need data science

The world is complicated, decisions are hard

  • To make wise decisions, we must quantitatively balance tradeoffs
  • To quantify things reliably we must
    • Find relevant data
    • Recognize its limitations
    • Ask the right questions
    • Make reasonable assumptions
    • Conduct an appropriate analysis
    • Synthesize and explain our insights
  • Apply critical thinking and skepticism at every step
  • Consider how our decisions affect others

8 of 75

A Covid story (with a happy ending)

June 2, 2022: My wife Kari, 8.5 months pregnant, tests positive for Covid-19

Doctor prescribes Paxlovid off-label

  • Warns it has not been tested in pregnant women
  • Tells us: our call
  • “Need to balance risks both to mother and baby”

First question: how much danger is Kari in?

But how?

9 of 75

Fatality rate

Immediate question: how much danger is Kari in?

  • Infection fatality rate = deaths / cases

Meta-analysis combining 24 studies from different regions

  • Hard to assess biases of each study:
    • How many cases (or deaths) went unrecorded?
    • Who was getting sick?
    • How representative was the sample?
  • Some studies fiercely debated
  • Many assumptions involved in combining studies
  • Covid-19 infection fatality rate: about 0.68%
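The ratio above is sensitive to how many infections go unrecorded. A minimal sketch of that effect follows; every number in it is a hypothetical placeholder invented for illustration, not a value from the meta-analysis or any study.

```python
def infection_fatality_rate(deaths: int, infections: int) -> float:
    """Return deaths divided by infections."""
    if infections <= 0:
        raise ValueError("infections must be positive")
    return deaths / infections


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
recorded_deaths = 50
recorded_cases = 5_000     # infections that were detected and recorded
true_infections = 10_000   # assumed total, including unrecorded infections

# Dividing by recorded cases only overstates the rate when cases are missed.
naive = infection_fatality_rate(recorded_deaths, recorded_cases)
adjusted = infection_fatality_rate(recorded_deaths, true_infections)

print(f"using recorded cases only: {naive:.2%}")
print(f"using all infections:      {adjusted:.2%}")
```

Missed deaths would push the estimate in the opposite direction, which is one reason each study's biases are hard to assess.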

Any problems with using this estimate?

10 of 75

Fatality rate: take two

Kari is:

  • A young, healthy woman (- mortality)
  • Pregnant → suppressed immune system (+ mortality)
  • Fully vaccinated and recently boosted (- mortality)
  • Probably infected with Omicron, not Delta (?)

No one has done a study on this population!

  • But (sometimes conflicting) studies exist about effect of each of these factors
  • Can we assume the effects “stack” on each other?
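One way to make the "stacking" assumption concrete is to multiply a baseline risk by a relative risk for each factor. The sketch below does exactly that, under the strong assumption that the factors act independently. The function name `stacked_risk` and all numbers are hypothetical placeholders, not estimates from any study.

```python
from math import prod
from typing import Iterable


def stacked_risk(baseline: float, relative_risks: Iterable[float]) -> float:
    """Multiply a baseline risk by each relative risk (assumes independent effects)."""
    return baseline * prod(relative_risks)


# Hypothetical values for illustration only.
baseline = 0.005
relative_risks = {
    "young and healthy": 0.1,        # lowers risk
    "pregnant": 2.0,                 # raises risk
    "vaccinated and boosted": 0.2,   # lowers risk
}

combined = stacked_risk(baseline, relative_risks.values())
print(f"combined risk under the stacking assumption: {combined:.4%}")
```

If the factors interact (for example, if vaccination works differently during pregnancy), the product can be far from the truth, which is why the assumption needs scrutiny.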

Informally, seems that Kari & baby are at low risk of death from Covid. But what about:

  • Complications of labor → stress on infant
  • Long covid for Kari
  • Exposing baby to new medication
  • Implications for lactation (during infant formula shortage!)

11 of 75

The happy ending

12 of 75

Belief is Social

13 of 75

Data is a Tool for Finding Truth

14 of 75

What about vaccine mandates?

Question: should schools / employers / restaurants require proof of vaccination?

Many of the same challenges:

  • Should risks be assessed differently for
    • People with prior infections?
    • Children and young adults?
  • How does risk change with time?

Additional challenges:

  • Making decisions for others
  • With different values
  • Many of whom may deeply mistrust you
  • Questions of privacy and bodily autonomy

15 of 75

Why study data science

Data is used everywhere to answer hard questions and make tough decisions:

  • Science
  • Medicine
  • Social science
  • Engineering
  • Sports
  • Journalism
  • Advocacy
  • Government
  • Business
  • Personal decision-making

Claims about data come up in discussing almost any important issue

  • Instead of “Aquinas says,” now it’s “the data says”
  • It is usually not easy to tell what the data “says”
  • Empower yourself to participate in the arguments that shape your life and your society

16 of 75

What is Data Science?

PRINCIPLES AND TECHNIQUES OF DATA SCIENCE

17 of 75

Data is changing the world

From Joey Gonzalez.

18 of 75

Data science is a fundamentally interdisciplinary field

Data Science is the application of data-centric, computational, and inferential thinking to:

  • Understand the world (science).
  • Solve problems (engineering).

Joey Gonzalez (co-creator of this course)

19 of 75

Data Science Venn Diagram

By Drew Conway, 2010.

20 of 75

Data science in industry

The tasks that data scientists say they work on regularly. Self-reported. Based on the results of the 2016 Data Science Salary Survey.

21 of 75

Insight

Good data analysis is not:

  • Simple application of a statistics recipe.
  • Simple application of statistical software.

There are many tools out there for data science, but they are merely tools.

  • They don’t do any of the important thinking!

“The purpose of computing is insight, not numbers.”

R. Hamming. Numerical Methods for Scientists and Engineers (1962).

22 of 75

Example Questions in Data Science

Some (broad) questions we might try to answer with data science:

  • What show should we recommend to our user to watch?
  • In which markets should we focus our advertising campaign?
  • Is the use of the COMPAS algorithm for prison sentencing fair?
  • Should I send my kids to daycare?
  • Is the world getting better or worse?
  • What areas of the world are at higher risks for climate change impact in 10 years? 20?
  • Where should we put docking ports for our bikes?
  • What should we eat to avoid dying early of heart disease?
  • Do immigrants from poor countries have a positive or negative impact on the economy?

23 of 75

What will you learn in this class?

24 of 75

What are the Principles and Techniques that We’ll Learn?

PRINCIPLES AND TECHNIQUES OF DATA SCIENCE

25 of 75

Course goals

Prepare students for advanced Berkeley courses in data management, machine learning, and statistics, by providing the necessary foundation and context.

Enable students to start careers as data scientists by providing experience working with real-world data, tools, and techniques.

Empower students to apply computational and inferential thinking to address real-world problems.

26 of 75

Tentative List of Topics to be Covered in Data 100

  • Pandas and NumPy
  • Relational Databases & SQL
  • Exploratory Data Analysis
  • Regular Expressions
  • Visualization
    • matplotlib
    • Seaborn
    • plotly
  • Sampling
  • Probability and random variables
  • Model design and loss formulation

  • Linear Regression
  • Feature Engineering
  • Regularization, Bias-Variance Tradeoff, Cross-Validation
  • Gradient Descent
  • Data science in the physical world
  • Causality
  • Logistic Regression
  • Decision Trees and Random Forests
  • PCA

27 of 75

Prerequisites

Official prerequisites for this course:

  • Completion of Data 8.
  • Completion of CS 61A or CS 88.
  • Co-enrollment in EE 16A or Math 54 or Stat 89A.

The prereqs are being strictly enforced! We will not be teaching:

  • How to use Python.
  • How to use Jupyter notebooks.
  • Inference from Data 8.
  • Linear algebra (though we will review this topic to a greater degree since linear algebra is a corequisite, not prerequisite).

Homework 1 and Lab 1 will help calibrate your background.

  • For Homework 1, the Data 8 textbook will be helpful.

28 of 75

Course Overview

29 of 75

Staff

30 of 75

GSIs

GSIs teach discussion, hold office hours, and help create assignments and exams. Contact info: ds100.org/fa22/staff.

Jimmy Butler

Bella Crouch

Kanu Grover

Connie Huang

Samantha Hing

Shiangyi Lin

Dominic Liu

Vasanth Madhavan

Minh Phan

Siddhant Satapathy

Stella Wang

Eric Hao

Alina Herri

Rohan Jha

Ishaan Mishra

Pragnay Nevatia

Yiming Ni

Heather Sizlo

Verona Teo

Arda Ulug

Shiny Weng

Samantha Wray

Nancy Xu

Jacob Yim

Michael Zhu

Bold denotes 20 hour GSI.

31 of 75

Readers

Readers hold office hours and grade the written components of homeworks and projects. Contact info: ds100.org/fa22/staff.

Natalie Chan

Kishore Chidambaram

Floyd Fang

Mary Guo

Wesley Little

Zaid Maayah

Ruchi Maheshwari

Mihran Miroyan

Elaine Qian

Milad Shafaie

Yaqian Tang

Yuerou Tang

32 of 75

Course Websites / Platforms

33 of 75

Online platforms

Course website (ds100.org/fa22)

  • Where all lectures, assignments, and discussions are posted.

DataHub (data100.datahub.berkeley.edu)

  • Where you will work on all assignments (links on the course website automatically take you here).

Ed (https://edstem.org/us/courses/25695)

  • A place to ask and answer questions about assignments and concepts.
  • Where all announcements are posted (exam logistics, new assignment released, etc).

Gradescope (gradescope.com, by invitation)

  • Where all assignments are submitted, and where all of your grades in this course will live.

Textbook (www.textbook.ds100.org)

  • Supplemental reading.

34 of 75

Programming Environment for our Course: JupyterLab

35 of 75

Learning Advanced JupyterLab

JupyterLab offers notebooks and more tools for data science.

We’ll be accessing JupyterLab using DataHub (data100.datahub.berkeley.edu).

  • At the end of the semester we’ll tell you how to use JupyterLab locally on your own machine.

Resources for learning fancier JupyterLab functionality:

  • The quickest intro is this great 2-minute overview by Serena Bonaretti.
    • Note: Unlike Serena’s example, in our course we’re using JupyterLab notebooks hosted on the internet, not on your own local computer.
  • The interface overview from the official docs has more details and short, embedded videos.
  • A more detailed discussion from a bio/data angle: ~45 minute video.
  • Full ~3h in-depth tutorial is available from the core team.

36 of 75

Course Logistics

Content and workflow

Note: See online syllabus at https://ds100.org/fa22/syllabus/ and Ed announcements for complete information

37 of 75

Weekly Flow

All deadlines subject to change

38 of 75

Lectures

Two lectures per week.

  • Tuesday/Thursday 9:40 - 11:00am.
  • Options:
    • Attend in person in Wheeler 150
    • Watch live broadcast on Zoom.
    • Watch recording afterwards (posted by the following morning).
  • Links to slides + supplementary code.
  • Posted on bCourses.

39 of 75

Discussion Section

Weekly live discussion sections

  • Every Tuesday, for one hour
  • Some in-person, some online

Graded for attendance (0/1 each week): 5% of final grade

  • Mandatory for Data 100, optional for Data 200
  • Weeks 1-3: you may attend any section you’re signed up for
    • You may switch sections if there is space
  • Weeks 4+: you must attend the single section you’re signed up for
    • You may no longer switch your section

Section sign-ups

  • Sign-ups will be posted on Ed today at noon
  • One section will be recorded and posted. Only sign up for this section if you are OK with being recorded.

40 of 75

Homework and Projects

Homeworks and Projects: Assignments for in-depth understanding and synthesis.

  • Homeworks: typically released on Friday and due the following Thursday
  • Projects: less frequent week-long assignments
  • Can get homework help in office hours and Ed.
  • Autograded and manually graded. Contain hidden test cases.
  • Must be completed individually (for details, see the Collaboration Policy).
  • Homework / project parties Wednesday 5-8pm

Graduate final project for Data 200: details TBA

41 of 75

Labs

Labs: short weekly programming assignments to give you familiarity with new concepts.

  • Typically released on Friday and due the following Tuesday.
  • All lab autograder tests are visible.
  • Extensive lab support provided on Ed (no lab sections)
  • Designed to take ~1 hour
  • Walkthroughs released after due date

42 of 75

Quick Checks

Weekly short assignments to check you are keeping up with lectures

  • Assigned on Gradescope, mainly multiple choice
  • Should take about 10 minutes to complete
  • Released on Fridays and due the following Monday.
  • Lowest three grades are dropped

43 of 75

Office hours and communication

Office hours are listed on the calendar, mainly in person but with some virtual options

  • These are led by GSIs and readers.
  • Come to get help on assignments – labs, homeworks, and projects – and concepts.
  • Office Hour queues at oh.ds100.org.
    • When joining the queue, specify which assignment and question you need help with

Please check Ed or the FAQ page first before emailing instructors

Email options

  • Preferred email: data100.instructors@berkeley.edu (timeliest, monitored by entire team)
  • Email lead TAs Kanu Grover and Dominic Liu with sensitive issues
  • Email Fernando and Will only for matters requiring strict privacy and their direct attention

44 of 75

Exams

Two exams:

  • Midterm: Wednesday October 19th, 7-9PM Pacific.
  • Final: Tuesday, December 13, 3-6pm Pacific (exam group 7).

Alternate exam policies:

  • Midterm TBA.
  • The only course approved for final exam time conflicts is CS 70
  • Alternate exam will be offered only once, after the main final

45 of 75

Grading

46 of 75

Grading Logistics

Grades will be posted on Gradescope (including discussion attendance if applicable).

Deadlines are firm at 11:59PM.

  • 5 slip days total for all assignments (Homeworks, Projects, and Labs).
    • We will not accept work that would bring your slip day total above 5
    • No assignment may be extended >5 days (incl slip + DSP accommodations)
  • Extensions provided only for DSP or for truly exceptional circumstances
    • Contact Samantha Hing (DSP coordinator) for DSP-related extensions

If you have DSP accommodations, you should receive an email from us shortly.

47 of 75

Collaboration and Academic Dishonesty

We will be following the EECS Department Policy on Academic Dishonesty, which states that using work or resources that are not your own or permitted by the course constitutes plagiarism and may lead to disciplinary actions.

Assignments

Data science is a collaborative activity! It is okay to discuss problems with friends.

  • List their names at the top of your assignments. We provide a place to do this.
  • You must write your solutions individually! Do not copy any other student’s work.
  • If we suspect that you have submitted plagiarized work, we will call you in for a meeting. If we then determine that plagiarism has occurred, we reserve the right to give you a negative full score (-100%) or lower on the assignments in question, along with reporting your offense to the Center for Student Conduct.

Exams

  • Cheating on exams is a serious offense. We will have proctoring in place and will prosecute those caught cheating, with serious consequences for your career – so don’t do it!

48 of 75

Weekly Announcements

Weekly announcements will appear on Ed only

  • You should receive emails from Ed announcements. You are responsible for reading them
  • We will also try to cover announcements in lecture
  • Ed posts + course website are authoritative

49 of 75

We are Here to Help!

We want you to succeed!

  • These policies are intended to keep you on track and learning efficiently.
  • But exceptions are possible and conditions change
  • We can change course if something needs to change
    • Feel free to reach out to staff with comments or concerns!

Welcome to Data 100/Data 200!

50 of 75

Data Science Lifecycle

51 of 75

The “data science lifecycle” you will see out in the wild may be slightly different from the one we teach you, but the core ideas are all the same.

52 of 75

Data science lifecycle

The data science lifecycle is a high-level description of the data science workflow.

Note the two distinct entry points!

Ask a Question

Obtain Data

Understand the Data

Understand the World

Reports, Decisions, and Solutions

53 of 75

1. Question/Problem Formulation

  • What do we want to know?
  • What problems are we trying to solve?
  • What are the hypotheses we want to test?
  • What are our metrics for success?

54 of 75

2. Data Acquisition and Cleaning

  • What data do we have and what data do we need?
  • How will we sample more data?
  • Is our data representative of the population we want to study?

55 of 75

3. Exploratory Data Analysis & Visualization

  • How is our data organized and what does it contain?
  • Do we already have relevant data?
  • What are the biases, anomalies, or other issues with the data?
  • How do we transform the data to enable effective analysis?

56 of 75

4. Prediction and Inference

  • What does the data say about the world?
  • Does it answer our questions or accurately solve the problem?
  • How robust are our conclusions and can we trust the predictions?

57 of 75

Demo: The Data Science Lifecycle

Available on the course website:

https://ds100.org/fa22/lecture/lec01

58 of 75

[1] Ask a Question: Who are you?

Demo Slides

59 of 75

[2] Data Acquisition and Cleaning

60 of 75

[3] Exploratory Data Analysis and Visualization

Let’s understand what our data tells us, and let’s clean the data while we’re at it.

61 of 75

[3] Exploratory Data Analysis and Visualization

Population: Data 100 students, Fall 2022

Some sub-questions:

  1. How many students are in the class?
  2. What are your majors?
  3. What year are you?
  4. Diversity ...?
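The first three sub-questions map onto simple counting operations in pandas. The sketch below uses a tiny made-up roster; the column names (`name`, `major`, `year`) and every row are assumptions for illustration, and the real class data may be organized differently.

```python
import pandas as pd

# Hypothetical roster standing in for the class data.
roster = pd.DataFrame({
    "name": ["Avery", "Jordan", "Sam", "Riley", "Casey"],
    "major": ["Data Science", "Economics", "Data Science",
              "Computer Science", "Data Science"],
    "year": ["Junior", "Senior", "Junior", "Sophomore", "Senior"],
})

# 1. How many students are in the class? One row per student.
n_students = len(roster)

# 2. What are your majors? Count the rows for each distinct major.
major_counts = roster["major"].value_counts()

# 3. What year are you? Same idea on the year column.
year_counts = roster["year"].value_counts()

print(n_students)
print(major_counts)
print(year_counts)
```

The fourth sub-question has no column to count, which is what makes it a harder direction to explore.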

62 of 75

[3] A harder direction to explore

Diversity ...?

Unfortunately, surveys of data scientists suggest that there are far fewer women than men in the field.

To learn more, check out the Kaggle Executive Summary or study the Raw Data.

63 of 75

[4, 1] “What fraction of the students are female?”

This is a complex question. Are we asking about sex (biological trait) or gender (individual, social, cultural identity)?

The Data Science Program wants to improve gender diversity.

64 of 75

What is the gender diversity of this class?

We don’t currently have data to answer this question. We could either:

  1. Survey the students, or…
  2. …Use the data we have to estimate the sex of the students as a proxy for gender???*

*Do not attempt #2 alone; it is flawed in many ways (we’ll discuss this later).

We are only exploring #2 in this lecture to illustrate inferential modeling and combining multiple data sources to reason about something we haven’t measured.

65 of 75

[1, 2] (again, but for Baby Names Data)

1. Can we estimate a person’s sex using their name?

2. Obtain more data: SSN Baby Names

Discuss: Based on the description of the SSN data, what are some limitations of this data source? What limitations might it have with respect to our original task?

We’ll come back to this…

🤔

66 of 75

[2, 3] (again, but for Baby Names Data)

What does each row/column represent?

What can you observe about how U.S. baby names have changed over time?

67 of 75

[4] Prediction and Inference: Simple Classifier

Let’s use this data to estimate the fraction of female students in the class.

Simple classifier:

  1. SSN: Proportion of F babies per name
  2. Use step 1 to classify each student name as F, M, or Unknown
  3. Average step 2 to get a class prop. F
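The three steps can be sketched in pandas as follows. This is not the demo notebook's code: the miniature `babynames` table, its column names, the 0.5 threshold, and the choice to leave Unknown names out of the average are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the baby names data.
babynames = pd.DataFrame({
    "Name":  ["Alex", "Alex", "Maria", "John", "John"],
    "Sex":   ["F", "M", "F", "F", "M"],
    "Count": [40, 60, 100, 1, 99],
})

# Step 1: proportion of F babies per name.
totals = babynames.groupby("Name")["Count"].sum()
f_counts = babynames[babynames["Sex"] == "F"].groupby("Name")["Count"].sum()
prop_f = f_counts.reindex(totals.index, fill_value=0) / totals


def classify(name: str) -> str:
    """Step 2: label a name F, M, or Unknown (0.5 threshold is an assumption)."""
    if name not in prop_f.index:
        return "Unknown"
    return "F" if prop_f[name] > 0.5 else "M"


# Step 3: average the labels over the names we could classify.
students = ["Maria", "John", "Alex", "Zyx"]   # hypothetical student names
labels = [classify(name) for name in students]
known = [label for label in labels if label != "Unknown"]
class_prop_f = sum(label == "F" for label in known) / len(known)

print(labels)
print(class_prop_f)
```

Note that "Alex" is labeled M with full confidence even though 40% of the babies with that name in this toy table are F; the hard threshold discards that information.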

  1. How do you feel about the estimated proportion of females in this class?
  2. Do you trust it?

68 of 75

A Classifier that Captures Uncertainty

Our current model doesn’t capture the uncertainty we saw in the data. We can use simulation to provide a better distributional estimate.

Updated classifier:

  1. SSN: Proportion of F babies per name
  2. For each student name, using the proportion from step 1:
    a. Pick a random number in [0.0, 1.0)
    b. If the number from 2a is less than the SSN prop. F (or 0.5 for Unknowns), classify the student as F. Otherwise, classify as M.
  3. Average step 2 to get a class prop. F
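A minimal NumPy sketch of the randomized classifier follows. The per-name proportions, the student names, the seed, and the number of repetitions are hypothetical choices for illustration; repeating the simulation many times is one way (assumed here) to turn the single estimate into a distribution.

```python
import numpy as np

# Hypothetical per-name proportions of F babies; not real values.
prop_f = {"Alex": 0.4, "Maria": 1.0, "John": 0.01}
students = ["Maria", "John", "Alex", "Zyx"]   # hypothetical student names

# Names missing from the table get 0.5, as stated for Unknowns.
p = np.array([prop_f.get(name, 0.5) for name in students])

rng = np.random.default_rng(seed=0)   # arbitrary seed, for repeatability only


def simulate_class_prop_f(p: np.ndarray, rng: np.random.Generator) -> float:
    """Draw one number in [0, 1) per student; label F when it is below prop F."""
    draws = rng.random(len(p))
    return float(np.mean(draws < p))


# Repeat the simulation to see how much the class estimate varies.
n_trials = 10_000
simulated = np.array([simulate_class_prop_f(p, rng) for _ in range(n_trials)])

print("mean of simulated proportions:", simulated.mean())
print("middle 95% of simulations:", np.percentile(simulated, [2.5, 97.5]))
```

The spread of `simulated` reflects only the randomness in the name-based labels; it says nothing about the limitations of the name data itself.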

  1. How do you feel about the estimated proportion of females in this class?
  2. Do you trust it?

69 of 75

Recap of what we just saw:

Find Fall 2022 DS100 data

Explore interesting things about our class: names, majors, counts

  • Get stuck on a question: gender diversity

Find more data: Baby Names (U.S. SSN)

  • Approximate gender with sex

Create a classifier

  • Simple classifier: names are exactly F/M
  • Random classifier: all names have some probability of F

Gut check: How comfortable were you being the data subject in this study?

Reality check: What about those limitations we talked about?

🤔

70 of 75

What are some limitations of our analysis?

Possible limitations:

  • U.S. name data, not global data
  • Everyone born since 1937
  • No “rare” names
  • Sex as a proxy for gender

How might this impact our analysis?

  • UC Berkeley students are from around the world
  • Most of our class was born around 2000
  • Gender has been proxied to a binary classification (Learn more: GenEq)
  • A lot is encoded in a name. Maybe our class data was fundamentally insufficient to answer our original question on gender diversity.

71 of 75

Human Contexts in Data Science

Representation: How does data stand in for complex phenomena in the world?

Identity: What kinds of identities are involved in the data? Whose? What happens to identity in the process of data analysis?

In our (faulty) analysis: Name → Sex → Gender

Reductions of Identity based on Name have historically reproduced existing social bias against minoritized groups:

Job seekers with White-sounding first names received 50% more callbacks from employers than job seekers with Black-sounding names [Bertrand & Mullainathan, 2003].

72 of 75

How can we fix these flaws?

Our original question:

What is the gender diversity of our class?

We didn’t have data to answer this question. We could either:

  • Survey the students, or…
  • …Use the data we have to estimate the sex of the students as a proxy for gender???

What you learn in Data 100 will help you explore, challenge, and justify these beliefs in every step of the Data Science Lifecycle.

…And sometimes the takeaway is that we need to collect better data.

73 of 75

What’s the point of this demo?

There are many assumptions in data science:

  • Whether the data is representative:
    • Of the question being asked
    • Of the world and its implications
  • Beliefs/backgrounds of data collectors
  • Beliefs/backgrounds of data analysts
  • Beliefs/backgrounds of the population

Data Science does not and cannot live in a theoretical vacuum. Data Science is a human-centered technical practice.

74 of 75

See you soon!

Pre-Semester Survey (due Monday by 11:59 PM)

https://forms.gle/wZhCTfFsmYTSfHgU7

75 of 75

Course Overview

Content credit: Suraj Rampure, Allen Shen, Joey Gonzalez, Josh Hug, and Sam Lau
