1 of 46

Stat Comp and Intro to Data Science

Wayne Tai Lee

2 of 46

Agenda

  • Introductions
  • Data science and the need for computing
  • Course logistics
  • Data science case study

3 of 46

Wayne Tai Lee

4 of 46

Stat Computing and Intro to Data Science

  • Some data science but mostly programming with data
  • What is “data science”?
  • Why do we need to know computing?

5 of 46

Universal definition: DataScience(data) = $

  • Data has been around for a long time, though. What’s new?

6 of 46

Data and services for data are now primary assets

7 of 46

A more recent example

Sources: The Verge (two items), NYTimes

8 of 46

Others have “data” in their title

  • More than one role generates value out of data
    • Data analyst
    • Data engineer
    • Data architect
    • MLOps

9 of 46

Some notable trends in data science

  • Data quality is becoming an issue
  • Each team only needs at most one data scientist
  • “Research in industry” is not viable in the long term
  • Tenure at each job can be short
  • Ads remain a big revenue source for data

10 of 46

Data scientists are often trained only in modeling

Model building

11 of 46

You definitely need to know analytics as well

Model building

Analyze behavior

12 of 46

If you’re lucky, you get to define the data

Data Protocol

Model building

Analyze behavior

13 of 46

How will people consume your output?

Data Protocol

Model building

Productization

Analyze behavior

14 of 46

Pre-processing is more important than you think

Data Protocol

Pre-processing

Model building

Productization

Analyze behavior

Support

15 of 46

Data scientists have an end-to-end view from a quantitatively rigorous perspective

Data Protocol

Pre-processing

Model building

Productization

Analyze behavior

Support

16 of 46

If you were concerned about a career in data science...

  • Big data is only becoming more complex
  • Data engineering is becoming more automated, but insights and innovation still require humans (some companies would disagree)
  • The key is to identify and solve problems, not to list off your skills

17 of 46

Why learn computing? Efficiency gains

18 of 46

Why learn computing? Verify statistical theorems
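As an illustration (not from the original slides), a short simulation can check a textbook result numerically. The sketch below checks that the standard deviation of the sample mean is close to sigma / sqrt(n). The Exponential(1) population, the sample size, the number of repetitions, and the function name are all arbitrary choices made for this example.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` from an Exponential(1) population
    and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

n, reps = 50, 5000
means = simulate_sample_means(n, reps)

# Exponential(1) has mean 1 and standard deviation 1,
# so theory says the sample mean has standard deviation 1 / sqrt(n).
theory_sd = 1.0 / math.sqrt(n)
observed_sd = statistics.stdev(means)

print(f"average of sample means: {statistics.fmean(means):.3f} (theory: 1.000)")
print(f"sd of sample means:      {observed_sd:.3f} (theory: {theory_sd:.3f})")
```

The simulated values should land close to the theoretical ones, and changing `n` shows how the spread shrinks as the sample size grows.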

19 of 46

Why learn computing? Allows diverse approaches

Permutation test instead of a two-sample t-test: no longer as dependent on the Normal distribution
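A minimal sketch of a permutation test for a difference in means, assuming two independent groups. The data, the function name, and the number of shuffles are made up for illustration. The idea is that if the group labels do not matter, shuffling them should produce differences as large as the observed one fairly often.

```python
import random
import statistics

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Shuffles the pooled data `n_perm` times and counts how often the
    absolute difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_perm + 1)

# Made-up data for illustration only.
group_a = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5]
group_b = [14.0, 13.1, 15.2, 12.8, 14.7, 13.9, 15.5, 14.3]

p_value = permutation_test(group_a, group_b)
print(f"permutation p-value: {p_value:.4f}")
```

No Normal-distribution formula appears anywhere in the calculation. The reference distribution comes entirely from reshuffling the observed data.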

20 of 46

Why learn computing? Reproducible + readable

Excel

21 of 46

Why learn computing? Reproducible + readable

Excel

Coding

22 of 46

Why Python? Popularity = Support

Source (methodology not verified)

23 of 46

Expectations for 4000+ level courses

  • 1000+ / 2000+
    • Learn through imitation
  • 3000+
    • Learn by building on previous foundations
  • 4000/5000+
    • Learn through translation
      • From idea to execution
      • Articulating your thoughts

24 of 46

How to take this class? First half focuses on data science

Data science

  • “Case study”
  • Translate the problem into concrete problems
  • Formulate problem solutions
  • Talk about the code necessary to solve the problem

25 of 46

How to take this class? Second half focuses on coding

Data science

  • “Case study”
  • Translate the problem into concrete problems
  • Formulate problem solutions
  • Talk about the code necessary to solve the problem

Coding

  • Review coding concepts
  • Review the problem
  • Review the solution
  • Apply lecture content to the solution -- In groups!

26 of 46

You should study the tutorials at home

  • Week 1: Introduction, Case study
  • Week 2: Review + Coding, Case study continued
    • Study tutorials at home!
  • ….
  • Week 3: Review + Coding, Case study continued
    • Study tutorials at home!

27 of 46

Course logistics

See syllabus on Canvas (slight difference across sections)

  • Homeworks are released and graded on Ed
  • Exams are in-person
  • Examples will mostly be in Colab
  • Discussions are on Ed
  • Participation is measured via class exercises on CourseWorks
  • Office hours TBD
  • Is it okay to use ChatGPT?

28 of 46

Ed Logistics - HW0 Demo

  • Ed Homework FAQ

29 of 46

How to ask questions online

Meaningful title for others to find

30 of 46

How to ask questions online

What are you trying to do

31 of 46

How to ask questions online

How are you doing it

32 of 46

How to ask questions online

Test it out with small data

33 of 46

How to ask questions online

What you expect vs what you’re seeing
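A hypothetical sketch of what the code part of a good question could look like: a goal stated in a comment, a tiny made-up dataset that anyone can paste and run, and the expected result printed next to the actual one. The task, data, and function name are invented for illustration.

```python
# Goal: compute the average score for each group.
# Small made-up data that can be run without the full dataset.
records = [
    ("A", 90),
    ("A", 80),
    ("B", 70),
]

def group_averages(rows):
    """Return {group: average score}."""
    totals, counts = {}, {}
    for group, score in rows:
        totals[group] = totals.get(group, 0) + score
        counts[group] = counts.get(group, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

# What I expect, computed by hand from the three rows above.
expected = {"A": 85.0, "B": 70.0}
# What I am actually seeing.
actual = group_averages(records)

print("expected:", expected)
print("actual:  ", actual)
print("match:   ", actual == expected)
```

With only three rows, a reader can verify the expected values by hand and see immediately whether the output disagrees.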

34 of 46

How to ask questions online

Be nice!

35 of 46

How to ask questions online

  • Everyone Googles when they see unfamiliar errors
    • stackoverflow.com is very good!
  • Copy the error and the command, not the code

If you’re using AI tools, here’s a prompt:

"""
You are a college instructor helping students with an assignment. Your job is to help clarify and guide my thinking by asking questions back without giving me the answers to the problem. Here are 2 examples:

Question: Create a simulation that demonstrates the sample average is unbiased for estimating the population mean.
Your answer: What does unbiased mean? Would you expect a single sample average to be exactly the same as the population mean?

Question: How should we evaluate a model?
Your answer: What is the purpose of the model? How would you know if the model was bad? What is the model being compared to?
"""

36 of 46

ChatGPT

  • What is hallucination?
    • In legal
    • In coding
  • Best practice:
    • Know what you want AND how to test it!
    • Let it ask you questions back!
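One way to "know how to test it", sketched under assumptions: write a few checks with hand-computed answers before accepting any suggested code. The median task, the function names, and the `suggested_median` stand-in for AI-generated code are hypothetical.

```python
def check_median(median_fn):
    """Run a few hand-computed checks against any candidate median function."""
    assert median_fn([3, 1, 2]) == 2        # odd length
    assert median_fn([4, 1, 3, 2]) == 2.5   # even length: average the middle two
    assert median_fn([7]) == 7              # single value
    data = [5, 1, 4]
    median_fn(data)
    assert data == [5, 1, 4]                # the input should not be modified
    return True

# Stand-in for code suggested by an AI tool (hypothetical).
def suggested_median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print("all checks passed:", check_median(suggested_median))
```

If a suggested function is wrong, `check_median` raises an `AssertionError` at the first failing case, which points at what to ask about next.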

37 of 46

Google Colab Demo

Please use your LionMail account!

colab.research.google.com/

38 of 46

How to ask questions online - NO screenshots of code

  • It’s hard to copy/paste
  • Identifying the error is part of the learning

Screenshots are reasonable for:

  • Platform errors
  • Errors you cannot reproduce

39 of 46

Random tangent - how to differentiate yourself

  • Data challenges with prizes
  • Develop a pet project
  • Do research (academia is the optimal environment!)

40 of 46

First question in data science

  • What would your priorities be if you were the first “data” person in a startup?

41 of 46

Lessons from installation - why conda?

  • Command line interface
  • Manage computing environments and dependencies with conda to avoid:

42 of 46

Where is my computer? - notebooks vs Python program

Browser

Colab notebook (similar to Jupyter Notebooks)

Google Colab Servers

Python

  • control over packages
  • dependence on internet?
  • access your local files?
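A small sketch for answering "where is my computer?" from inside a notebook or a script: ask Python which machine and which interpreter are running the code. The function name is made up, and the Colab check is a common heuristic, not an official guarantee.

```python
import os
import platform
import sys

def describe_runtime():
    """Collect a few facts about the machine that is executing this code."""
    return {
        "hostname": platform.node(),
        "operating_system": platform.system(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        "working_directory": os.getcwd(),
        # Heuristic (assumption): the google.colab module is loaded in Colab runtimes.
        "running_in_colab": "google.colab" in sys.modules,
    }

for key, value in describe_runtime().items():
    print(f"{key}: {value}")
```

Running the same cell in Colab and on a laptop should show different hostnames, paths, and possibly Python versions, which makes it concrete that the notebook in the browser is executing on someone else's machine.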

43 of 46

Common mistakes

  • Files on your computer are not on Colab
  • Files uploaded to Colab will disappear once your runtime (the session's virtual machine) is disconnected or deleted
    • I recommend uploading to Google Drive, then “mounting” your Google Drive in your Colab notebook
  • You should save frequently with Colab
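A minimal sketch of mounting Google Drive from a Colab notebook. `drive.mount` comes from the `google.colab` package, which is available only inside Colab, so the code falls back gracefully elsewhere. The helper name and the `/content/drive` mount point are choices made for this example.

```python
import os

try:
    # This module exists only inside a Google Colab runtime.
    from google.colab import drive
except ImportError:
    drive = None

def mount_google_drive(mount_point="/content/drive"):
    """Mount Google Drive in Colab; return the mount point, or None elsewhere."""
    if drive is None:
        print("Not running in Colab; nothing to mount.")
        return None
    drive.mount(mount_point)  # Colab asks for permission the first time
    return mount_point

mounted_at = mount_google_drive()
if mounted_at is not None:
    # Files in your Drive typically appear under the "MyDrive" folder.
    print(os.listdir(os.path.join(mounted_at, "MyDrive"))[:5])
```

Files read from or written to the mounted folder live in Google Drive, so they remain available in later sessions.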

44 of 46

Lessons from installation - when working with Ed

Browser

Jupyter notebook

Some computer managed by Ed

Python

  • Whole class shares the same setup
  • Depends on internet connectivity
  • Cannot see your local files

45 of 46

How I use Python

46 of 46

Python Basics -