CSE 163
Missing Data & Time Series
Hunter Schafer
DataFrame
|   | id | year | month | day | latitude | longitude | name | magnitude |
| 0 | nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 |
| 1 | us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 |
| 2 | nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 |
Columns
Index (row)
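A minimal sketch of how the table above could be built and inspected in pandas. The rows are copied from the slide; the variable names (`columns`, `index`, `magnitudes`, `second_row`) are only illustrative.

```python
import pandas as pd

# Rows copied from the earthquake table on the slide
data = pd.DataFrame({
    'id': ['nc72666881', 'us20006i0y', 'nc72666891'],
    'year': [2016, 2016, 2016],
    'month': [7, 7, 7],
    'day': [27, 27, 27],
    'latitude': [37.672333, 21.514600, 37.576500],
    'longitude': [-121.619000, 94.572100, -118.859167],
    'name': ['California', 'Burma', 'California'],
    'magnitude': [1.43, 4.90, 0.06],
})

columns = list(data.columns)     # column labels: 'id', 'year', ...
index = list(data.index)         # row labels: 0, 1, 2 by default
magnitudes = data['magnitude']   # a single column is a Series
second_row = data.loc[1]         # a single row, looked up by index label
```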
Group By
data.groupby('col1')['col2'].sum()
Data
| col1 | col2 |
| A | 1 |
| B | 2 |
| C | 3 |
| A | 4 |
| C | 5 |
Split (one group per key)
| key | col2 |
| A | 1, 4 |
| B | 2 |
| C | 3, 5 |
Apply (sum)
| key | sum |
| A | 1 + 4 = 5 |
| B | 2 |
| C | 3 + 5 = 8 |
Combine
| key | col2 |
| A | 5 |
| B | 2 |
| C | 8 |
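A small runnable sketch of the split / apply / combine steps, using the same five rows as the diagram. The manual loop is only there to make the three steps visible; it is not how pandas does the work internally.

```python
import pandas as pd

data = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'A', 'C'],
    'col2': [1, 2, 3, 4, 5],
})

# Split by col1, sum col2 within each group, combine into one Series
totals = data.groupby('col1')['col2'].sum()
print(totals)

# The same three steps written out by hand
manual = {}
for key in data['col1'].unique():
    group = data[data['col1'] == key]        # split: rows for this key
    manual[key] = int(group['col2'].sum())   # apply: sum within the group
# manual now holds the combined result, one entry per key
```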
This Week
Data Science Libraries
What to Learn
fMRI
Missing Data
| Detecting missing data | isnull(), notnull() |
| Changing/removing missing data | dropna(), fillna() |
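A short sketch of the four methods above on a tiny DataFrame. The missing magnitude is invented for illustration, and filling with `0.0` is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up example: the second magnitude is deliberately missing
data = pd.DataFrame({
    'name': ['California', 'Burma', 'California'],
    'magnitude': [1.43, np.nan, 0.06],
})

missing = data['magnitude'].isnull()      # True where the value is missing
present = data['magnitude'].notnull()     # True where the value is present
dropped = data.dropna()                   # new DataFrame without rows containing NaN
filled = data.fillna({'magnitude': 0.0})  # new DataFrame with NaN replaced by 0.0
```

Both `dropna()` and `fillna()` return new DataFrames here; `data` itself still contains the missing value.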
Sorting
# Sort data
data.sort_values('column')
data.sort_index()
# Find top-k
data.nlargest(10, 'column')
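A minimal sketch of the three calls above on a three-row DataFrame. The index labels and the column name `magnitude` are made up so that sorting by value and sorting by index give different orders.

```python
import pandas as pd

# Made-up data with a non-sorted string index
data = pd.DataFrame(
    {'magnitude': [1.43, 4.90, 0.06]},
    index=['b', 'a', 'c'],
)

by_value = data.sort_values('magnitude')   # ascending by column value
by_index = data.sort_index()               # ascending by row label
top_2 = data.nlargest(2, 'magnitude')      # the 2 rows with the largest magnitude
```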
Keyword Arguments
def div(a, b):
    return a / b

div(2, 3)      # positional arguments: a=2, b=3
div(b=3, a=2)  # keyword arguments: order does not matter, same result
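A brief sketch of why keyword arguments matter in practice: many pandas methods take optional settings by name. The DataFrame below is made up; `by` and `ascending` are real parameters of `sort_values`.

```python
import pandas as pd

def div(a, b):
    return a / b

same = div(2, 3) == div(b=3, a=2)   # keywords can be passed in any order
swapped = div(3, 2)                 # positional order does matter

# Made-up data; keyword arguments select optional behavior by name
data = pd.DataFrame({'column': [3, 1, 2]})
descending = data.sort_values(by='column', ascending=False)
```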
Brain Break
Time Series
Fremont Bridge
Time Series
import pandas as pd

# Read in data, using the timestamp column as the index
data = pd.read_csv('data.csv', index_col='col',
                   parse_dates=True)
# Query for certain dates
data.loc['2017-03-06']    # one day
data.loc['2018-06']       # a month
data.loc['2019']          # a year
data.loc['2017':'2019']   # a range of time (both ends inclusive)
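A self-contained sketch of the date queries above. The CSV text, the `count` column, and its values are invented so the example runs without a file; `io.StringIO` stands in for `'data.csv'`.

```python
import io
import pandas as pd

# Made-up hourly readings standing in for a real data.csv
csv_text = """col,count
2017-03-06 08:00,10
2017-03-06 17:00,12
2018-06-15 08:00,20
2018-06-16 08:00,22
2019-01-01 08:00,30
"""
data = pd.read_csv(io.StringIO(csv_text), index_col='col',
                   parse_dates=True)

one_day = data.loc['2017-03-06']      # every row on that day
one_month = data.loc['2018-06']       # every row in June 2018
one_year = data.loc['2019']           # every row in 2019
span = data.loc['2017':'2019']        # 2017 through the end of 2019
```

Because the index holds parsed timestamps rather than strings, a partial date such as `'2018-06'` selects every row that falls inside that period.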
Granularity Matters
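The slide gives only the title, so the following is an assumption about what it refers to: the same series can look quite different depending on whether it is viewed per hour or per day. A minimal sketch using `resample`, with made-up hourly counts:

```python
import pandas as pd

# Made-up data: 48 hourly readings starting at midnight on 2019-01-01
index = pd.date_range('2019-01-01', periods=48,
                      freq=pd.Timedelta(hours=1))
data = pd.DataFrame({'count': list(range(48))}, index=index)

# Aggregate the hourly readings into one total per calendar day
daily = data.resample('D').sum()
```

Here `data` has 48 rows (hourly granularity) while `daily` has 2 rows (daily granularity), each the sum of that day's 24 readings.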
Before Next Time
Next Time