1 of 22

Conclusion

What you learned in this class, and what you will learn in the future. A farewell.

Data 100/Data 200, Spring 2022 @ UC Berkeley

Josh Hug and Lisa Yan

LECTURE 26

2 of 22

Announcements

Final Exam Logistics: Friday, May 13th, 7:00 PM - 10:00 PM

Final Exam Review Sessions:

Tuesday (May 3rd): Pre-MT2 content

Thursday (May 5th): Post-MT2 content

3:30 PM - 5:00 PM in Li Ka Shing 245 and on Lecture Zoom

Online Final Exam Accommodation Form (due Mon, May 2 11:59 PM)

Course evaluations extra credit (Ed post)

  • 1 point of EC for completing the internal final survey AND official course evaluations
  • 2 points of EC if 80% of the class fills out both.

3 of 22

What did you learn in Data C100/C200?

Lecture 26, Data 100 Spring 2022

Logistics

What did you learn in Data C100/C200?

What’s Next?

Join course staff!

4 of 22

What were we supposed to teach you?

Prepare students for advanced Berkeley courses in data management, machine learning, and statistics, by providing the necessary foundation and context.

Enable students to start careers as data scientists by providing experience working with real-world data, tools, and techniques.

Empower students to apply computational and inferential thinking to address real-world problems.

5 of 22

The Data Science Lifecycle

  • Question & Problem Formulation
  • Data Acquisition
  • Exploratory Data Analysis
  • Prediction and Inference
  • Reports, Decisions, and Solutions

8 of 22

The Data Science Lifecycle

  • Question & Problem Formulation
  • Data Acquisition
  • Exploratory Data Analysis
  • Prediction and Inference
  • Reports, Decisions, and Solutions
  • Feature Engineering

9 of 22

The Data Science Lifecycle

  • Question & Problem Formulation
  • Data Acquisition
  • Exploratory Data Analysis
  • Prediction and Inference
  • Reports, Decisions, and Solutions
  • Modeling

10 of 22

The Data Science Lifecycle

  • Question & Problem Formulation
  • Data Acquisition
  • Exploratory Data Analysis
  • Reports, Decisions, and Solutions
  • Prediction and Inference
  • Modeling

11 of 22

What’s Next?

12 of 22

What Didn’t We Focus on in Data 100?

  • Causal inference.
    • How do we establish causality when we identify a correlation observed during EDA?
  • Deep learning.
    • How can the machine do the hard work of picking the right features instead of requiring humans to pick them in advance?
  • Decision making.
    • After we build our fancy regression/classification/clustering algorithms, what do we do next?
    • How do we avoid false discoveries?
  • Time series analysis.
  • Other flavors of machine learning (e.g. reinforcement learning).
  • Open-ended exploration of problems and datasets picked by you.
  • Non-tabular data (e.g. images/video, sensor-generated/spatial data, natural language...).

13 of 22

Classes to Take Next

Data Management:

  • DATA 101: Data Engineering: “principles and practices of managing data at scale”.
  • CS 186: A look at how databases are structured and how they work internally. Less useful for data science (IMO) than DATA 101.

Statistics:

  • STAT 134 or STAT 140: Deeper study of probability theory. 140 is targeted at data scientists.
  • EECS 126: Deeper study of probability theory, with engineering applications and some inference.
  • STAT 135: Deeper introduction to statistics (parameter estimation, bootstrapping, etc.).
  • STAT 151A: In-depth study of linear modeling (assumptions of linear models, statistical significance of parameters, etc.).
  • STAT 153: Time-series analysis.

14 of 22

Classes to Take Next

Machine Learning, Optimization, and Artificial Intelligence:

  • CS 182: Neural networks and deep learning. Very project-heavy.
    • Extension of logistic regression!
  • CS 188: Artificial intelligence (which is a superset of machine learning).
  • CS 189 or STAT 154: A rigorous study of machine learning. Lots of overlap with 100, but more mathematical.
  • EECS 127: Optimization Models.
  • DATA 102: Data, inference, and decisions.
    • How do you find the best layout for your website?
    • How do you train a classifier on sensitive data?
    • How do you account for very different training and testing data?
    • Very popular class but not really a “machine learning” class.

15 of 22

Real-world Applications

  • Question & Problem Formulation
  • Data Acquisition
  • Exploratory Data Analysis
  • Prediction and Inference
  • Reports, Decisions, and Solutions

16 of 22

Real-world Applications

A great way to strengthen your knowledge of the ideas you learned in this class (and to build a portfolio to help you find jobs!) is to use your new skills to analyze real-world data.

  • There are countless sources of data available on the internet.
  • Find one, load it into a notebook, and get to work!

Places to look for data and applications of data science:

17 of 22

Real-world Applications

Even though you’re new to the discipline, you are also among the most skilled people in the world at data science. Use your power wisely!

Let us know what you do next!

Spider-Man (2002), or 1st century BC [Wikipedia]

18 of 22

Join course staff!

19 of 22

Data C100/C200 Spring 2022 Staff

This class would not have been possible without our GSIs, tutors, and academic interns.

Kunal Agarwal, Anirudhan Badrinath, Parth Baokar, Bella Crouch, Jay Feng, Francis Geng, Kanu Grover, Kelly Han, Neha Haq, Samantha Hing, Aaron Huang, Priyanka Kargupta, Andrew Lenz, Michelle Li, Wallace Lim, Yulei Lin, Dominic Liu, Lucy Liu, Vasanth Madhavan, Mrunali Manjrekar, Minh Phan, James Susilo, Arda Ulug, Zachary Wu, Xinqi Yu, Ayela Chughtai, Eric Hao, Alina Herri, Jenny Jiang, Neal Kothari, Emily Le, Shiangyi Lin, Rachel McCarty, Conan Minihan, Ishaan Mishra, Pragnay Nevatia, Yiming Ni, Siddhant Satapathy, Yike Wang, Nancy Xu, William Xu, Jacob Yim, Natalie Chan, Tingyue Cui, Ziyi Ding, Nei Fang, Floyd Fang, Mary Guo, Brandon Hong, Daniel Huang, Arya Krishnan, Arjun Kshirsagar, Tanish Kumar, Ariel Kuo, Angela Lin, Wesley Little, Ruchi Maheshwari, Kaiona Martinson, Staten Maughan, Saurabh Narain, Jeanice Santosa, Amber Shao, Heather Sizlo, Jenea Spinks, Verona Teo, Albert Tran, Lili Wang, Andrea Wang, Winnie Xiao, Jennifer Yang, Andrew Zhang, Michael Zhu

20 of 22

Helping out with Data 100

Teaching is a great way to deeply understand course material.

  • Data 100 has grown rapidly.
  • This growth is only possible due to the help of students.
  • The path to course staff starts as an academic intern.

We need you!

https://data.berkeley.edu/joining-data-course-staff

Academic Intern applications will be released closer to the start of the Fall semester.

21 of 22

Thank you!

Data 100/Data 200, Spring 2022 @ UC Berkeley

Josh Hug and Lisa Yan

22 of 22

Bonus links - demos at the end
