1 of 15

CSE 163

Missing Data & Time Series

Hunter Schafer

2 of 15

DataFrame

  • One of the basic data types from pandas is a DataFrame
    • It’s essentially a table with columns and rows!

     id          year  month  day  latitude   longitude    name        magnitude
0    nc72666881  2016  7      27   37.672333  -121.619000  California  1.43
1    us20006i0y  2016  7      27   21.514600  94.572100    Burma       4.90
2    nc72666891  2016  7      27   37.576500  -118.859167  California  0.06

Columns: the labels across the top (id, year, ..., magnitude)

Index (row): the labels down the left side (0, 1, 2)
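
As a rough sketch, the example table can be built by hand to see how columns and the row index show up in code. The values are copied from the slide; the variable names `magnitudes` and `second_row` are only illustrative.

```python
import pandas as pd

# Values copied from the example table; each dict key becomes a column.
data = pd.DataFrame({
    'id': ['nc72666881', 'us20006i0y', 'nc72666891'],
    'year': [2016, 2016, 2016],
    'month': [7, 7, 7],
    'day': [27, 27, 27],
    'latitude': [37.672333, 21.514600, 37.576500],
    'longitude': [-121.619000, 94.572100, -118.859167],
    'name': ['California', 'Burma', 'California'],
    'magnitude': [1.43, 4.90, 0.06],
})

print(data.columns.tolist())  # the column labels
print(data.index.tolist())    # the row labels: 0, 1, 2 by default

magnitudes = data['magnitude']  # select one column (a Series)
second_row = data.loc[1]        # select one row by its index label
```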

3 of 15

Group By

data.groupby('col1')['col2'].sum()

Data

col1  col2
A     1
B     2
C     3
A     4
C     5

Split (one group per key)

key A: col2 = 1, 4
key B: col2 = 2
key C: col2 = 3, 5

Apply (sum)

key A: 5
key B: 2
key C: 8

Combine

key  col2
A    5
B    2
C    8
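
A minimal sketch of the same split, apply, combine steps on the five-row example. The loop is only there to make the three steps visible; in practice the one-line `groupby` call does all of it. The names `totals` and `manual` are illustrative.

```python
import pandas as pd

# The five rows from the diagram.
data = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'A', 'C'],
    'col2': [1, 2, 3, 4, 5],
})

# One call does split, apply, and combine.
totals = data.groupby('col1')['col2'].sum()
print(totals)

# The same steps spelled out by hand.
manual = {}
for key, group in data.groupby('col1'):    # split: one sub-table per key
    manual[key] = int(group['col2'].sum()) # apply: sum within the group
print(manual)                              # combine: one value per key
```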

4 of 15

This Week

Data Science Libraries

  • Monday
    • Missing Data
    • Time Series
    • Library: pandas
  • Wednesday
    • Data Visualization
    • Library: seaborn
  • Friday
    • Machine Learning
    • Library: scikit-learn

5 of 15

What to Learn

  • This week, we are learning more about pandas and learning 2 new libraries
  • Memorizing the function calls and parameters is ridiculous
    • No one memorizes this stuff!
    • This is what documentation is for!
  • Much more important to understand the big ideas behind what the library call is doing
    • You might use a different library in the future
    • They change from version to version
  • Don’t try to write every bit of syntax down. Focus on the big ideas behind the problems we are trying to solve, and use the slides and lecture notes as a resource.
  • On the exam, we will provide shortened documentation so you don’t have to memorize the method calls

6 of 15

fMRI

7 of 15

Missing Data

  • Most data in the world is messy and not in a form you want
    • Most common: Missing data
  • Pandas uses “Not a Number” (NaN) to represent missing data
  • Most pandas computations (e.g. sum, mean) skip NaN values by default, but NaN can still be a common source of bugs!
  • Useful pandas functions
    • Detecting missing data: isnull(), notnull()
    • Changing/removing missing data: dropna(), fillna()
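
A small sketch of the four functions on made-up data (the `name` and `score` columns are hypothetical). It also shows that aggregations skip NaN by default, and one way NaN can cause bugs: NaN is never equal to itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing scores.
data = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'score': [1.0, np.nan, 3.0, np.nan],
})

missing = data['score'].isnull()    # True where the value is missing
present = data['score'].notnull()   # the opposite mask

total = data['score'].sum()         # NaN is skipped by default
mean = data['score'].mean()         # averages only the present values

dropped = data.dropna()             # new DataFrame without rows containing NaN
filled = data.fillna({'score': 0.0})  # new DataFrame with NaN scores replaced

# A common bug: comparing against NaN with == never matches.
nan_equals_itself = (np.nan == np.nan)
print(missing.tolist(), total, mean, nan_equals_itself)
```

Note that filling with 0 changes the mean, so whether to drop or fill depends on what the missing values mean for the analysis.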

8 of 15

Sorting

  • Sorting your data is a very common task
    • Either for presentation or finding the top-k
  • Very easy in pandas!
    • Note: All of these return new DataFrames

# Sort data
data.sort_values('column')
data.sort_index()

# Find top-k
data.nlargest(10, 'column')
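
A short sketch of these calls on a tiny made-up table (the `name` and `magnitude` columns are assumptions for illustration). It also shows that the original DataFrame is left unchanged, since each call returns a new one.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['California', 'Burma', 'California'],
    'magnitude': [1.43, 4.90, 0.06],
})

by_mag = data.sort_values('magnitude')                 # ascending by default
descending = data.sort_values('magnitude', ascending=False)
restored = by_mag.sort_index()                         # back to index order
top2 = data.nlargest(2, 'magnitude')                   # the 2 largest rows

print(by_mag)
print(top2)
print(data.index.tolist())  # the original is untouched
```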

9 of 15

Keyword Arguments

  • In the call div(2, 3), how does Python know that a is 2 and b is 3?
    • Arguments are matched to parameters by position
  • Python also allows you to pass arguments by name instead
    • Library calls usually take MANY arguments (with defaults), so it is much more convenient to specify only the ones you need by name

def div(a, b):
    return a / b

div(2, 3)       # by position
div(b=3, a=2)   # by name, same result
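
To illustrate why naming arguments helps when a function has many defaults, here is a sketch with a hypothetical helper, `describe`, that is not from any library. Only the argument being changed needs to be spelled out.

```python
def div(a, b):
    return a / b


# Hypothetical function with several default parameters.
def describe(value, precision=2, prefix='', suffix=''):
    return f'{prefix}{value:.{precision}f}{suffix}'


positional = div(2, 3)     # a=2, b=3 by position
by_name = div(b=3, a=2)    # same call, order no longer matters

# Skip precision and prefix; set only suffix by name.
label = describe(div(2, 3), suffix=' units')
print(positional == by_name, label)
```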

10 of 15

Brain Break

11 of 15

Time Series

12 of 15

Fremont Bridge

13 of 15

Time Series

  • Context: A bit more advanced than what you will need on your homework for this week, but can be helpful for your project!
  • Common to change index of your data to be the timestamp
    • This allows easy querying by date

import pandas as pd

# Read in data with timestamp
data = pd.read_csv('data.csv', index_col='col', parse_dates=True)

# Query for certain dates
data.loc['2017-03-06']     # one day
data.loc['2018-06']        # a month
data.loc['2019']           # a year
data.loc['2017':'2019']    # a range of time
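
A self-contained sketch of the same idea, reading a small in-memory CSV instead of a file. The `timestamp` and `count` column names and the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents; a real file path would work the same way.
csv_text = """timestamp,count
2017-03-06 08:00,10
2017-03-06 17:00,20
2018-06-15 08:00,30
2018-07-01 08:00,40
2019-01-02 08:00,50
"""

# index_col picks the timestamp column; parse_dates=True parses the index.
data = pd.read_csv(io.StringIO(csv_text), index_col='timestamp',
                   parse_dates=True)

one_day = data.loc['2017-03-06']    # every row on that day
one_month = data.loc['2018-06']     # every row in June 2018
one_year = data.loc['2019']         # every row in 2019
span = data.loc['2017':'2018']      # 2017 through the end of 2018

print(type(data.index).__name__)
print(len(one_day), len(one_month), len(one_year), len(span))
```

Date ranges with `.loc` include both endpoints, so `'2017':'2018'` covers all of 2018 as well.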

14 of 15

Granularity Matters

  • Your data will have a certain granularity to its time
    • e.g., a row per second, a row per hour, a row per year
  • Your application might require a different granularity
    • You can downsample by combining values
    • You can upsample by creating new values
    • Both are done using resample
  • Use codes to describe frequency
    • D = day, W = week, M = month, A = year, …
    • Many more codes are listed in the pandas documentation.
  • You can also groupby with time, but that is not the same as downsampling.
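
A sketch of downsampling, upsampling, and how a time-based groupby differs, using made-up daily counts. Only the `D` and `W` codes are used here; some other codes have been renamed in newer pandas releases, so it is worth checking the documentation for the installed version.

```python
import pandas as pd

# Hypothetical data: 14 daily counts starting on Monday 2019-01-07.
index = pd.date_range('2019-01-07', periods=14, freq='D')
data = pd.DataFrame({'count': range(1, 15)}, index=index)

# Downsample: combine each week's 7 daily values into one row.
weekly = data.resample('W').sum()

# Upsample: go back to daily rows, which creates new values.
gaps = weekly.resample('D').asfreq()        # new rows are NaN
daily_again = weekly.resample('D').ffill()  # new rows copy the last known value

# Groupby with time: all Mondays together, all Tuesdays together, ...
# This ignores which week a row came from, unlike resample.
by_weekday = data.groupby(data.index.dayofweek)['count'].sum()

print(weekly)
print(daily_again)
print(by_weekday)
```

Resampling keeps the buckets in chronological order (one row per week), while the groupby above collapses both weeks into seven weekday totals.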

15 of 15

Before Next Time

  • If you haven’t started HW2, now is a great time!

Next Time

  • Focus on data visualization
    • How to do it in Python
    • What is a “good” data viz
