1 of 45

Pre-Announcement

databears.org/join

  • For students who enjoy working on data science projects of general interest.
  • 1st mission: Working on projects like infrastructure, campus consulting, and tinkering with various datasets.
  • 2nd mission: Teaching data science at the high school level.

2 of 45

DS100: Fall 2018

Lecture 29 (Josh Hug / Fernando Perez): Conclusion

  • What We’ve Done So Far
  • What Was New This Time
  • Ask Us Anything
  • What’s Next

3 of 45

A Brief Look at What We’ve Done in this Class

4 of 45

Data Science Lifecycle

  • Formulate a question or problem
  • Acquire and clean data
  • Exploratory data analysis
  • Draw conclusions from prediction and inference
  • Reports, decisions, and solutions

5 of 45

Quick Fun Demo

Let’s look at that arbitrary DS100 dataset we had you create at the beginning of the semester.

6 of 45

Useful Libraries and Programming Tools

Core libraries and tools we used:

  • numpy
  • Jupyter / IPython
  • matplotlib
  • pandas
  • seaborn
  • regular expressions
  • scipy.optimize.minimize
  • sklearn ← You will probably use this a lot.
  • SQL / sqlalchemy

We got less practice with these, but also touched on:

  • Ray (hw6), Spark (lecture), Plotly (lecture), Google Data Commons (lab 12)

7 of 45

Key Concepts

Sampling: Simple random / cluster / stratified samples.

  • Saw that a small but good random sample can be MUCH better than a large but biased sample. An 80% response rate does not mean unbiased.
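A minimal simulation sketch of this point (the population, the response mechanism, and all specific numbers below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    population = rng.normal(loc=50, scale=10, size=100_000)  # hypothetical population

    # Small simple random sample: unbiased estimate of the mean.
    srs = rng.choice(population, size=200, replace=False)

    # Large but biased sample: only units with high values "respond".
    responders = population[population > 45]
    biased = rng.choice(responders, size=20_000, replace=False)

    print(population.mean(), srs.mean(), biased.mean())
    # The small random sample lands near the truth; the big biased one does not.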

Probability:

  • Random variables.
  • Expectation and variance of RVs.

The regression problem:

  • Given a matrix of features, compute a real number prediction for each row.

8 of 45

Key Concepts

Loss function:

  • Measurement of the quality of our predictions for regression (or other machine learning tasks like classification).

Specific loss functions:

  • L1 loss: |ŷ - y|
  • L2 loss: (ŷ - y)²
  • Huber loss: Smooth combination of L1 and L2 loss.
  • Kullback-Leibler divergence / cross entropy loss: Used in logistic regression.
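As a reference, here is a quick numpy sketch of these losses (the Huber delta and the binary form of cross entropy are illustrative choices):

    import numpy as np

    def l1_loss(y_hat, y):
        return np.abs(y_hat - y)

    def l2_loss(y_hat, y):
        return (y_hat - y) ** 2

    def huber_loss(y_hat, y, delta=1.0):
        # Quadratic near zero, linear in the tails.
        err = y_hat - y
        return np.where(np.abs(err) <= delta,
                        0.5 * err ** 2,
                        delta * (np.abs(err) - 0.5 * delta))

    def cross_entropy_loss(p_hat, y):
        # Binary cross entropy: y is 0/1, p_hat is the predicted P(y = 1).
        return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))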

9 of 45

Key Concepts

Linear regression models:

  • Predictions are a linear combination of the columns of a feature matrix.
  • Using the average L2 loss, solution given by the normal equation.
    • Analytic and geometric derivations of the normal equation.
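A small numpy sketch of solving the normal equation directly (X and y here are synthetic stand-ins for a real feature matrix and response):

    import numpy as np

    # Synthetic feature matrix (n rows, d columns) and response vector.
    rng = np.random.default_rng(seed=100)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    # Normal equation: X^T X theta = X^T y minimizes the average L2 loss.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)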

Surprisingly, the data to the right is linear!

  • Just need the right features in your feature matrix.

Training vs. test set:

  • Use only the training data to train.
  • Test data gives a sense of how the model generalizes.

10 of 45

Key Concepts

Gradient descent: Descend the gradient of the average loss over all data.

  • If the loss function is non-convex, gradient descent may get stuck in a local minimum.
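A bare-bones sketch of batch gradient descent for the average L2 loss of a linear model (the learning rate and iteration count are arbitrary illustrative values):

    import numpy as np

    def grad_avg_l2(theta, X, y):
        # Gradient of the average L2 loss (1/n) * sum((X @ theta - y)^2).
        return 2 / len(y) * X.T @ (X @ theta - y)

    def gradient_descent(X, y, lr=0.1, n_iters=1000):
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            theta = theta - lr * grad_avg_l2(theta, X, y)
        return theta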

Convexity:

  • A function is convex if, for any two points on the function, the line segment connecting them lies on or above the function.
  • A local minimum of a convex function is also a global minimum.
  • Convex loss functions are nice: Minimization algorithms run quickly.

11 of 45

Key Concepts

Bias/Variance tradeoff:

  • Expected prediction error decomposes into bias, variance, and irreducible noise; more complex models tend to have lower bias but higher variance.

Regularization and cross validation:

  • Regularization lets us limit the complexity of our models, reducing overfitting.
  • Cross-validation is used to learn our regularization parameter(s).
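A sketch of using cross-validation to pick a ridge regularization strength with sklearn (the data and the candidate alphas are placeholders):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Placeholder training data.
    rng = np.random.default_rng(seed=0)
    X = rng.normal(size=(200, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

    # Average validation MSE for each candidate regularization strength.
    alphas = [0.01, 0.1, 1, 10, 100]
    cv_mse = [-cross_val_score(Ridge(alpha=a), X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
              for a in alphas]
    best_alpha = alphas[int(np.argmin(cv_mse))]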

12 of 45

Key Concepts

The classification problem:

  • Given a matrix of features, predict the class that each row belongs to.

Logistic regression:

  • Can think of it as regression where the prediction is the probability of belonging to a class.

Example where we have 2 features and 2 classes.
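A minimal sklearn sketch of that setup, with an invented 2-feature, 2-class toy dataset:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented toy data: 2 features, 2 classes.
    rng = np.random.default_rng(seed=1)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    probs = model.predict_proba(X)[:, 1]   # P(class = 1) for each row
    labels = model.predict(X)              # thresholded at 0.5 by default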

13 of 45

Key Concepts

Evaluating classifiers:

  • Confusion matrix.
  • Precision and recall.
  • ROC curve (see textbook) / area under the curve (AUC).
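A quick sklearn sketch of these metrics (the labels and scores below are made-up placeholders):

    import numpy as np
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, roc_auc_score)

    # Hypothetical true labels, hard predictions, and predicted probabilities.
    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
    scores = np.array([0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3])

    print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(roc_auc_score(y_true, scores))      # area under the ROC curve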

14 of 45

Key Concepts

Bootstrap: Lets us estimate our confidence in a population statistic using only one sample.

  • Caveat: the minimum sample size needed depends on the population distribution!
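A minimal bootstrap sketch, assuming a single hypothetical sample and estimating a confidence interval for its mean:

    import numpy as np

    rng = np.random.default_rng(seed=42)
    sample = rng.exponential(scale=2.0, size=500)  # hypothetical single sample

    # Resample with replacement many times and recompute the statistic.
    boot_means = np.array([
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(10_000)
    ])
    ci = np.percentile(boot_means, [2.5, 97.5])  # approximate 95% CI for the mean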

Pseudorandom number generator: Generates a sequence of "random-looking" numbers from a random seed.

Hypothesis testing:

  • The p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one we computed.
  • Prosecutor's fallacy: misinterpreting 1 - p as the probability that a specific non-null hypothesis is true.
  • Permutation testing is a particularly neat tool for computing p-values.
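A sketch of a permutation test for a difference in group means (the two groups and the number of shuffles are illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    group_a = rng.normal(0.0, 1.0, size=100)   # hypothetical control group
    group_b = rng.normal(0.3, 1.0, size=100)   # hypothetical treatment group

    observed = group_b.mean() - group_a.mean()
    pooled = np.concatenate([group_a, group_b])

    # Shuffle group labels many times and recompute the test statistic.
    diffs = []
    for _ in range(10_000):
        perm = rng.permutation(pooled)
        diffs.append(perm[:100].mean() - perm[100:].mean())

    # p-value: fraction of shuffles at least as extreme as what we observed.
    p_value = np.mean(np.abs(diffs) >= abs(observed))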

15 of 45

A few (important!) odds and ends

Numerical issues

  • In floating point arithmetic, (a + b) + c != a + (b + c)...

Condition number:

  • A matrix is singular (can't be inverted) if its determinant is zero.
  • BUT: in practice, we care about whether we can invert NUMERICALLY
  • Condition number: a measure of how close your matrix is to being singular, i.e. how much numerical error inversion can amplify.
  • np.linalg.cond(mat)
  • http://www.ohiouniversityfaculty.com/youngt/IntNumMeth/lecture11.pdf
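Two quick numpy illustrations of these points (the matrix below is a contrived near-singular example):

    import numpy as np

    # Floating point addition is not associative.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

    # A nearly singular (ill-conditioned) matrix has a huge condition number.
    A = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-12]])
    print(np.linalg.cond(A))  # ~4e12: numerically risky to invert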

Higher dimensions:

  • Low-d intuition is NOT enough in high-d.
  • Weird things happen: e.g. all volume concentrates "at the edges"
  • https://www.youtube.com/watch?v=zwAD6dRSVyI (watch it!!)

16 of 45

Computational Topics That Concluded The Course

  • Basic SQL: The structure of basic SQL select statements.
  • Advanced SQL: Different flavors of joins. Fancy things you can do with SQL.
  • Big Data: The story of the data lake. Using Spark for high level parallelization.
  • Distributed Computing and Ray: A brief overview of distributed systems. Ray for lower-level parallelization.
  • A/B Testing: Carefully designed, randomized controlled experiments.
  • Google Data Commons: A resource for transparently interacting with lots of datasets.
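For reference, a self-contained sketch of running a basic SELECT from Python (the table, columns, and values are invented; the course used sqlalchemy against the FEC database, while this uses the sqlite3 standard library for brevity):

    import sqlite3
    import pandas as pd

    # Invented in-memory table, just to show the shape of a basic SELECT.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE donations (donor TEXT, state TEXT, amount REAL)")
    conn.executemany("INSERT INTO donations VALUES (?, ?, ?)",
                     [("Ann", "CA", 50.0), ("Bob", "CA", 20.0), ("Cruz", "NY", 75.0)])

    query = """
        SELECT state, SUM(amount) AS total
        FROM donations
        WHERE amount > 10
        GROUP BY state
        ORDER BY total DESC
    """
    print(pd.read_sql(query, conn))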

17 of 45

Labs

  • Lab 1: Matplotlib.
  • Lab 2: Pandas.
  • Lab 3: Data cleaning and visualization.
  • Lab 4: Visualizations, data transformations, and kernel density estimators.
  • Lab 5: Implementing linear models and loss functions, scipy.optimize.
  • Lab 6: Multiple linear regression and feature engineering on tips dataset.
  • Lab 7: Feature engineering and cross validation (sklearn) on the Boston house prices dataset.
  • Lab 8: Midterm review.
  • Lab 9: Logistic regression (sklearn) on breast cancer dataset.
  • Lab 10: Bootstrap estimation of mean and variance on tips dataset.
  • Lab 11: SQL and the FEC political donation dataset.
  • Lab 12: Google Data Commons.

18 of 45

HWs and Projects

  • HW1: Cleaning and EDA on SF restaurant food safety scores.
  • HW2: EDA and Visualization of SF bike sharing data.
  • HW3: Modeling, Estimation, and Gradient Descent on toy data.
  • HW4: Logistic Regression on Spam/Ham email.
  • HW5: Hypothesis Testing the hot hand effect from basketball data.
  • HW6: Ray and Map Reduce.

Data science lifecycle projects:

  • Project 1: Trump tweets.
  • Project 2: NYC taxi and accident data sets.

Grad project: Computer vision / image classification.

19 of 45

Course Reflections

20 of 45

Workflow Changes

Workflow changes this semester:

  • Gradescope-based grading.
    • Sorry this was rocky. Being half in nbgrader and half in Gradescope was not ideal for many reasons.
  • Doing all your work on DataHub.
    • Unfortunately, DataHub did not cope particularly well with DS100’s scale. A bug in the storage configuration was found fairly late in the semester.
      • Sorry about this one; we're learning. The tool itself works very reliably, but each cloud provider has unique quirks we're still discovering.

21 of 45

Curriculum Changes

Curriculum changes this semester:

  • Reordering of the course. Biggest change: linear models came much earlier.
  • Tighter integration of homeworks/labs with lecture.
    • For those of you who kept up with lecture, hopefully you didn’t have many “how was I supposed to know that?” moments.
  • Different set of guest lectures.
  • Harder exams.
    • This is a good thing. Last semester, grades were (probably) noisy on the high end. We’ll keep the same grade distribution as in past semesters.
  • Completely new HW5 (basketball), HW6 (Ray), and project 2 (taxi dataset).
  • Significant modifications to most assignments.

22 of 45

Things We’d Like To Do Next Time

  • Increase emphasis on real world tools like sklearn.
  • Have more engaging and socially relevant data sets to explore.
  • HWs and projects should result in more interesting, meaningful, and surprising results.
  • Expose potential data science pitfalls (e.g. Simpson’s Paradox, algorithmic bias) explicitly a few times throughout the homeworks.
  • More strongly emphasize avoidance of for loops. Only allow them where explicitly stated.
  • More tightly couple the textbook with the course.
  • Create a set of exam practice problems.
  • Create more small, in-class mini-exercises (both mental and computer-based).
  • Bring some non-tabular data examples (images, climate data).
  • Weave in some reflection on human/ethical context within class/HW.

23 of 45

Things We’d Like to Hear About

HKN Survey coming soon. Here are some things we’d like to hear from you:

  • My impression is that this course is somewhat less challenging than 61A or 61B, but not by much. Let us know what you thought of the workload.
  • Do you feel like you walked away with a coherent picture of data science, or more like a toolbox of vaguely connected concepts?
  • What would have made the course more inspiring for you?
  • What parts of the course seemed like a poor use of your time?

24 of 45

HKN Survey [~7:10/7:20 PM]

25 of 45

Ask Us Anything

Attendance:

yellkey.com/plan

26 of 45

What’s Next

27 of 45

Beyond Data 100

Things we didn’t focus on in Data 100:

  • Interpreting learned model parameters.
    • What does it mean if the result of training our logistic regression model is a huge value for θ1?
  • Causal inference.
    • How do we establish causality when we identify a correlation observed during EDA?
  • Deep learning.
    • How can we actually learn the right features instead of picking them in advance?
  • Web scraping.
    • How can we get data from the web when it’s in an unfriendly format?

28 of 45

Beyond Data 100

Things we didn’t focus on in Data 100:

  • Time series analysis.
    • How do we analyze data where each point corresponds to a moment in time, e.g. detecting computer network traffic indicative of hackers.
  • Other flavors of machine learning (dimensionality reduction, reinforcement learning, clustering).
  • Open-ended exploration of problems and datasets picked by you.
  • Non-tabular data, often coming from the sciences, e.g.:
    • Images (satellite, medical, etc)
    • Sensor-generated data: stream flow, earthquakes, air quality, …

29 of 45

Machine Learning

30 of 45

Recommended Courses

Machine learning courses:

  • Info 154: A practical course in using ML.
  • CS188: A survey of AI including machine learning.
  • CS182: New course focused on deep learning. Don’t know much about it.
  • CS189: A deeper and more mathematically intense view of ML.
  • Stat 154: Seems similar to CS189. Also seems less popular?
  • Stat 157 (Spring 2019): Beginner introduction to deep learning.
  • DS102: The official data science ML course. Coming Fall 2019.

Databases:

  • CS186: A look at how databases are structured and how they work internally.

31 of 45

Intro to Deep Learning course - STAT 157, Spring 2019

Stat 157 Topics in Probability & Statistics: "Introduction to Deep Learning" (3 units)

TTh 3:30-5:00pm

The topic for this semester is Introduction to Deep Learning. This class provides a practical introduction to deep learning, including theoretical motivations and how to implement it in practice. As part of the course we will cover multilayer perceptrons, backpropagation, automatic differentiation, and stochastic gradient descent. We also introduce convolutional networks for image processing, starting from the simple LeNet and moving to more recent architectures such as ResNet for highly accurate models. We then discuss sequence models and recurrent networks, such as LSTMs, GRUs, and the attention mechanism. Throughout the course we emphasize efficient implementation, optimization, and scalability, e.g. to multiple GPUs and multiple machines. The goal of the course is to provide both a good understanding of and the ability to build modern nonparametric estimators. The entire course is based on Jupyter notebooks to allow students to gain experience quickly. Supporting material can be found at www.diveintodeeplearning.org.

Instructors:

  • Alexander Smola, Director, Amazon Web Services
  • Mu Li, Principal Scientist, Amazon Web Services

See Piazza (https://piazza.com/class/jkopvsyuy7g3u0?cid=1476) for more.

32 of 45

INFO 154: Data Mining and Analytics

  • Practically-oriented course.
  • An overview of tools and techniques in ML.
  • Picks up right where we leave off in Data 100.

https://www.ischool.berkeley.edu/courses/info/154

33 of 45

More mathematical: CS 189, STAT 154

34 of 45

Recommended Courses

Probability/statistics foundations courses:

  • Stat 134 or Stat 140: Deeper introduction to probability theory (e.g. expectation, independence, central limit theorem, etc). Stat 140 is targeted at data scientists.
  • Stat 135: Deeper introduction to statistics (e.g. mean, median, parameter estimation, bootstrapping, etc).
  • EECS 126: Deeper introduction to probability theory. Also briefly touches on entropy and information theory.

GSI recommended courses:

  • IEOR 135: Applied Data Science with Venture Applications.
  • IEOR 172: Probability and Risk Analysis for Engineers: Modeling of uncertain systems.
  • INFO 159: Natural language processing.

35 of 45

A few great (FREE) resources

A few more (from our Stanford colleagues):

http://www.ds100.org/fa18/

  • Lecture 15 / 16 slides give overview.
  • Let me know if you want a 2nd half overview.

36 of 45

Online Communities

GSIs recommended:

37 of 45

Applying Your Knowledge

There are countless untold stories lurking in publicly available datasets.

  • Example below is mostly artistic, but there are lots of insights to be gleaned.

Source [Link]

38 of 45

Applying Your Knowledge

There are countless untold stories lurking in publicly available datasets.

Source [Link]

39 of 45

Applying Your Knowledge

There are countless untold stories lurking in publicly available datasets.

Let’s look at an example from a Democracy Working Group that I’m part of.

40 of 45

Applying Your Knowledge

Even though you are new to data science, you are also among the most skilled people in the world at data science.

Use your power wisely.

41 of 45

Student opportunities for spring 2019 (remember Tuesday?)

42 of 45

Don't forget the Data 001 Piazza!

https://piazza.com/class/j7s01y165odq5

43 of 45

Helping Out with Data100

Data 100 needs you!

  • Teaching is a great way to deeply understand course material.
  • Data 100 has grown rapidly.
  • Growth only possible due to help from students.
  • The path starts as an academic intern.

44 of 45

Thanks to our incredible (u)GSI team!!!

This course is literally impossible to run without a team like this.

Please thank them, be kind to them, and we hope some of you will be in this same slide next semester!

Aakash, Allen, Aman, Ananth, Andrew, Caleb, Daniel, Ed, Junseo, Manana, Mian, Neil, Patrick, Sasank, Scott, Sona, Simon, Sumukh, Suraj, Tiffany, Tony, William

45 of 45

END