1 of 20

Warmup

Consider the following two tables:

What is the result of the following join?

1

a

b

A

1

B

2

c

d

3

X

2

Y

2

Z

table1

table2

table2.merge(table1, left_on='c', right_on='b',

how='left')

Pair

pollev.com/cse163

2 of 20

CSE 163

Joins / Spatial Indices

Hunter Schafer

3 of 20

Geopandas

  • Exactly like pandas, but with some extra features for geo data
  • Types
    • GeoDataFrame
    • GeoSeries

  • geopandas uses a geometry column to help draw places�
              • Demo by Wen Qiu

3

data = geopandas.read_file('data_file.shp')

4 of 20

Geometry

  • One very common notion in geospatial data is a geometry
  • This is what actually defines the shape of the countries from the reading
  • Types of geometry: �shapely.geometry

4

5 of 20

Matplotlib

  • Last time we showed how to plot 2 graphs on the same plot
  • Terminology is a bit important so I wanted to cover that again

There are two fundamental concepts for matplotlib

  • Figure (canvas to draw on / entire picture)
  • Axis (an individual plot inside the figure)

The subplots method conveniently returns a new figure and axis

5

fig, axs = plt.subplots(3)

fig, [ax1, ax2, ax3] = plt.subplots(3)

6 of 20

Dissolve

  • Exactly the same as a groupby for the the regular columns
    • For the geometry columns, overlays all of the geometries
  • Options for aggfunc
    • ‘first’
    • ‘last’
    • ‘min’
    • ‘max’
    • ‘sum’
    • ‘mean’
    • ‘median’

6

7 of 20

Separated Data

Imagine that your data is split into multiple DataFrames and you want to combine them.

How can I print out the grading assignments?

7

ta_name

ta_id

Ryan

1

James

2

Nicole

3

grader_id

student_name

2

Flora

3

Paul

1

Wen

3

Andrew

tas

grading

8 of 20

Combining Data

Basic Idea, write code to do something like (not real code)

This is called a join since we are joining the tables together

  • Find all combinations of rows that “line up” based on a value

This is such a common task, there is a method we can call to do this

8

for t in tas:

for g in grading:

if t['ta_id'] == g['grader_id']:

print(t, g)

9 of 20

Pandas Join

9

tas.merge(grading, left_on='ta_id',

right_on='grader_id')

ta_name

ta_id

Ryan

1

James

2

Nicole

3

grader_id

student_name

2

Flora

3

Paul

1

Wen

3

Andrew

ta_name

ta_id

grader_id

student_name

Ryan

1

1

Wen

James

2

2

Flora

Nicole

3

3

Paul

Nicole

3

3

Andrew

10 of 20

Type of Join

There are interesting questions of what happens if there are rows that don’t “line up”. Different type of joins differ in how to handle this case.

Types of Joins

left.merge(right, left_on=’lcol’, right_on=’rcol’,

how=’type’)

  • Inner (default): Both values must be present
  • Left: If a value from left has no match, add NaNs
  • Right: If a value from right has no match, add NaNs
  • Outer: If a value from either table has no match, add NaNs

10

11 of 20

Warmup

Consider the following two tables:

What is the result of the following join?

11

a

b

A

1

B

2

c

d

3

X

2

Y

2

Z

table1

table2

table2.merge(table1, left_on='c', right_on='b',

how='left')

Pair

pollev.com/cse163

12 of 20

1 min

Suppose we had two tables

Suppose we wanted to compute the number of ratings each instructors received. From rows shown above, the result should print out that Hunter received 0, Rit received 2, and Aleks received 1.

There are multiple steps, to solve this: Join the data and then a groupby instructor to get counts.

We want you to figure out the join.

  • Feel free to try the rest if you finish early!

12

name

id

Hunter

1

Rit

2

Aleks

3

...

...

instructor_id

rating

2

5

3

5

2

5

...

...

instructors

evals

Think

pollev.com/cse163

13 of 20

2 min

Suppose we had two tables

Suppose we wanted to compute the number of ratings each instructors received. From rows shown above, the result should print out that Hunter received 0, Rit received 2, and Aleks received 1.

There are multiple steps, to solve this: Join the data and then a groupby instructor to get counts.

We want you to figure out the join.

  • Feel free to try the rest if you finish early!

13

name

id

Hunter

1

Rit

2

Aleks

3

...

...

instructor_id

rating

2

5

3

5

2

5

...

...

instructors

evals

Pair

pollev.com/cse163

14 of 20

Geospatial Join

Goals

  • Plot multiple layers on a map
  • Find all the states that intersect the path of the hurricane

Notes

  • Plot layers by plotting on the same axes
  • Need to use a spatial join to join on geo-spatial features

14

15 of 20

Spatial Joins

Very similar to plain old joins (how)

  • Same distinction between inner, left, right
  • Find all pairs of matches

The key difference are you match by geo-spatial relation (op)

  • Most commonly will just use op=’intersects’, but there are other ways to determine a match

15

gpd.sjoin(mainland, florence, � how='inner', op='intersects')

16 of 20

Spatial Join - Points

Imagine we had a dataset of where�people live in England.��Want to know if someone lives�at coordinates (x, y)��How many points would we �have to search through?

16

[Point1, Point2, Point3, …]

17 of 20

Spatial Index

17

18 of 20

Spatial Index

This is a tree!

To find if a Point is in the dataset, follow the tree!

18

19 of 20

Spatial Index Performance

  • Say we are looking for a single Point
  • How much work is it if we have n rows?
    • Without a spatial index: O(n)
    • With a spatial index? O(height of tree)
  • How tall is the tree? How many times can we divide 1 million points in half?
    • 1,000,000 / 2 / 2 / 2 / 2 / … / 2 = 1
    • 2k = 1,000,000
    • k = log2(1,000,000) ≅ 20
    • This means the lookup is O(log n)
  • A million doesn’t sound that big. What about a tree of the US?
    • 327.2 million people
    • log2(327,200,000) ≅ 28
  • The US isn’t THAT big. What about a tree of China?
    • 1.386 billion people
    • log2(1,386,000,000) ≅ 30

19

20 of 20

Technicalities

  • Now this isn’t for free, the spatial index takes time to build and also takes up extra space
    • O(knlog(n)) where k is the dimension, n is num points
  • Indices are great for quickly accessing data, but they can be harder to update
    • It would seem easy to update the tree by adding more points, but we were assuming the tree was balanced.
    • In general, adding an index to your data makes it faster to read but harder to update

20