1 of 19

Introduction to Pandas and GeoPandas

Ujaval Gandhi

ujaval@spatialthoughts.com

Python Foundation for Spatial Analysis

2 of 19

Pandas Basics

3 of 19

Pandas

  • Pandas is a powerful library for working with tabular data.
  • It provides fast, easy-to-use functions for reading data from files and analyzing it.
  • It can read data from CSV, Excel, HDF5 and many other formats.
  • Built-in functions for
    • Data cleaning (remove duplicates, remove nulls)
    • Data transformation (pivot, unpivot, merge, remove)
    • Group statistics
    • Time-series processing (interpolation, gap-filling)
  • Pandas is fast and efficient: its operations are vectorized and built on NumPy.
  • Pandas allows for simpler code and quick data processing.
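
A minimal sketch of the read, clean and summarize workflow listed above. The CSV content, the column names (station, day, value) and the variable names are hypothetical and chosen only for illustration; the data is kept in memory so that no file is needed.

```python
import io

import pandas as pd

# Hypothetical CSV content (made-up stations and readings).
csv_text = """station,day,value
A,1,10.0
A,1,10.0
A,2,14.0
B,1,7.0
B,2,
"""

# Reading: read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Data cleaning: drop exact duplicate rows, then rows with missing values.
clean = df.drop_duplicates().dropna()

# Group statistics: mean value per station.
mean_by_station = clean.groupby("station")["value"].mean()
print(mean_by_station)
```

The same calls work on a DataFrame read from a real file by passing its path to pd.read_csv().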

4 of 19

Basic Terminology

  • DataFrame: two-dimensional data structure with rows and columns
    • You can think of a DataFrame as equivalent to a spreadsheet or the attribute table of a GIS layer.
  • Series: one-dimensional data structure with labeled data
    • You can think of a Series as a single row or a single column
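
The two structures can be sketched as follows, assuming a small made-up table; the column names (name, value) are hypothetical.

```python
import pandas as pd

# A DataFrame: two-dimensional, with rows and labeled columns.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

# A single column is a Series, labeled by the row index.
column = df["value"]

# A single row is also a Series, labeled by the column names.
row = df.iloc[0]

print(type(df).__name__, type(column).__name__, type(row).__name__)
```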

5 of 19

[Figure: a grid of columns (col 1, col 2, col 3, ..., col n) and rows (row 1, row 2, row 3, ..., row n), labeled DataFrame and Series]

6 of 19

Pandas Selection

7 of 19

[Figure: a grid of columns (col 1, col 2, col 3, ..., col n) and rows (row 1, row 2, row 3, ..., row n), with an index numbered 0, 1, 2, ..., n]

  • Columns have an index and a label
  • Rows have an index*

*You can optionally assign a column to be the row labels

8 of 19

Pandas Index

  • Each row and column in a DataFrame has an integer position (index) and a label
  • To select rows/columns by index, use .iloc[row_index, column_index]
  • To select rows/columns by label, use .loc[row_label, column_label]
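
A short sketch contrasting the two accessors, assuming a made-up table. The column named code and its values are hypothetical; set_index() is used to assign that column as the row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"code": ["x", "y", "z"], "col 1": [10, 20, 30], "col 2": [1.5, 2.5, 3.5]}
)

# By index: row at position 1, column at position 2 (counting from 0).
by_position = df.iloc[1, 2]

# Assign the 'code' column to be the row labels.
labeled = df.set_index("code")

# By label: row labeled 'y', column labeled 'col 2'.
by_label = labeled.loc["y", "col 2"]

print(by_position, by_label)
```

Both selections return the same cell; only the way it is addressed differs.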

9 of 19

Select by Index

  • Select first 5 rows
    • df.iloc[0:5]
  • Select all rows and first 5 columns
    • df.iloc[:, 0:5]
  • Select every alternate row in reverse order
    • df.iloc[::-2]

Pandas provides a shortcut for selecting rows by index

  • df[0:5]
  • df[::-1]
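
The selections above can be tried on a small made-up DataFrame. The 8 x 6 shape and the generated column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table with 8 rows and 6 columns ('col 1' ... 'col 6').
df = pd.DataFrame({f"col {i}": list(range(i, i + 8)) for i in range(1, 7)})

first_rows = df.iloc[0:5]      # first 5 rows, all columns
first_cols = df.iloc[:, 0:5]   # all rows, first 5 columns
alt_reversed = df.iloc[::-2]   # every alternate row, starting from the last

# The shortcut df[0:5] slices rows by position, like df.iloc[0:5].
shortcut_rows = df[0:5]

print(first_rows.shape, first_cols.shape, list(alt_reversed.index))
```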

10 of 19

Select by Label

  • Select first column
    • df.loc[:, 'col 1']
  • Select first and second columns
    • df.loc[:, ['col 1', 'col 2']]

Pandas provides a shortcut for selecting columns by labels

  • Select first column
    • df['col 1']
  • Select first and second columns
    • df[['col 1', 'col 2']]
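
A minimal sketch showing that the shortcut and the .loc form return the same result, assuming a made-up three-column table.

```python
import pandas as pd

df = pd.DataFrame({"col 1": [1, 2], "col 2": [3, 4], "col 3": [5, 6]})

# A single label returns a Series.
one = df.loc[:, "col 1"]

# A list of labels returns a DataFrame.
two = df.loc[:, ["col 1", "col 2"]]

# The shortcut forms select the same data.
same_one = one.equals(df["col 1"])
same_two = two.equals(df[["col 1", "col 2"]])
print(same_one, same_two)
```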

11 of 19

In Summary

  • df.iloc[ ] selects rows or columns by index
  • df.loc[ ] selects rows or columns by label

In practice, we recommend

  • Use df.iloc[ ] for selecting rows
  • Use df['column'] for selecting a single column (equivalent to df.loc[:, 'column'])

12 of 19

Pandas Filtering

13 of 19

Filtering DataFrames

  • You can select records using the syntax: df[condition]
  • Logical Operators
    • df[df['col1'] > 10]
    • df[df['col1'] != 'x']
    • df[(df['col1'] == 'x') & (df['col2'] == 'y')]
  • str accessor
    • df[df['col1'].str.startswith('A')]
    • df[df['col1'].str.match('A.*')]
  • isin method
    • df[df['col1'].isin(['A', 'B'])]
  • Many other functions for filtering null values, duplicates etc.
    • query(), dropna(), nlargest() etc.
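
The filters above can be sketched on a small made-up table. The values are hypothetical, and unlike the one-column examples in the list, this sketch keeps text in col1 and numbers in col2. The na=False argument is an addition here: it treats the missing value as a non-match so that the condition stays boolean.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["Apple", "Avocado", "Banana", "Cherry", None],
        "col2": [12, 5, 30, 8, 20],
    }
)

# Comparison: rows where col2 is greater than 10.
over_ten = df[df["col2"] > 10]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
a_and_big = df[(df["col1"].str.startswith("A", na=False)) & (df["col2"] > 10)]

# isin: rows whose col1 is one of the listed values.
listed = df[df["col1"].isin(["Banana", "Cherry"])]

# Other helpers: drop rows with nulls, take the 2 largest, filter by expression.
no_nulls = df.dropna()
top_two = df.nlargest(2, "col2")
queried = df.query("col2 > 10")

print(list(over_ten.index), list(a_and_big.index), list(top_two.index))
```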

14 of 19

Pandas Calculations

15 of 19

Performing Calculations

  • Simple computations
    • df['col1'] * 10
    • df['col1'] + df['col2']
  • Simple Statistics
    • df['col1'].sum()
    • df['col1'].round()
  • Complex Calculations
    • df.apply(function, axis)
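
A short sketch of the simple computations and statistics above, assuming made-up numeric columns. Each operation works on the whole column at once, without an explicit loop.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.25, 2.75, 4.0], "col2": [10, 20, 30]})

# Simple computations return a new Series of the same length.
scaled = df["col1"] * 10
total = df["col1"] + df["col2"]

# The result can be stored as a new column.
df["total"] = total

# Simple statistics.
col_sum = df["col1"].sum()      # a single number
rounded = df["col1"].round()    # a Series of rounded values

print(col_sum)
print(rounded.tolist())
```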

16 of 19

apply()

  • Preferred way to run complex calculations on a DataFrame
    • Similar to Field Calculator in GIS or Cell Formula in Excel
  • Write a function and apply it to the DataFrame
  • Function can run on each row (axis=1) or each column (axis=0)
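
A minimal sketch of apply() along both axes. The column names (width, height) and the area function are hypothetical, chosen only to illustrate the pattern.

```python
import pandas as pd

df = pd.DataFrame({"width": [2.0, 3.0], "height": [4.0, 5.0]})


def area(row):
    """Receive one row as a Series and return a single value."""
    return row["width"] * row["height"]


# axis=1: the function runs once per row.
df["area"] = df.apply(area, axis=1)

# axis=0: the function runs once per column.
col_range = df[["width", "height"]].apply(lambda col: col.max() - col.min(), axis=0)

print(df["area"].tolist())
print(col_range.to_dict())
```

For arithmetic this simple, df["width"] * df["height"] gives the same result; apply() becomes useful when the per-row logic does not reduce to column arithmetic.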

17 of 19

GeoPandas Basics

18 of 19

GeoPandas

  • GeoPandas extends the Pandas library to enable spatial operations
  • GeoPandas is built on top of the following libraries that allow it to be spatially aware.
    • Shapely for geometric operations (e.g. buffer, intersection)
    • pyproj for working with projections
    • Fiona (or pyogrio, the default in recent GeoPandas releases) for file input and output; both are based on the widely used GDAL/OGR library
  • GeoPandas provides built-in functions for
    • Geoprocessing (buffer, intersection, union, dissolve etc.)
    • Joins (table join, spatial join etc.)
    • Transformations (reproject, filter etc.)
    • Format Conversion
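
A small sketch of a reprojection followed by a geoprocessing step, assuming GeoPandas and its dependencies are installed. The point coordinates, the names and the choice of EPSG:32643 (UTM zone 43N) as the metric CRS are assumptions for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical points given as longitude/latitude (EPSG:4326).
gdf = gpd.GeoDataFrame(
    {"name": ["p1", "p2"]},
    geometry=[Point(77.59, 12.97), Point(77.60, 12.98)],
    crs="EPSG:4326",
)

# Transformation: reproject to a metric CRS so distances are in meters.
projected = gdf.to_crs("EPSG:32643")

# Geoprocessing: buffer each point by 1000 meters.
buffers = projected.buffer(1000)

# Geometric predicate: do the two buffers intersect?
overlap = buffers.iloc[0].intersects(buffers.iloc[1])

print(projected.crs.to_epsg(), list(buffers.geom_type), overlap)
```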

19 of 19

Basic Terminology

  • GeoDataFrame: A tabular structure with a geometry column
    • You can think of a GeoDataFrame as equivalent to a vector layer in a GIS
  • GeoSeries: A structure that holds the geometry
    • Each GeoDataFrame has at least one GeoSeries, named 'geometry' by default, containing the geometry of each feature
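
The two terms can be sketched as follows, assuming GeoPandas is installed. The attribute column (name) and the point coordinates are made up for illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# A GeoDataFrame: attribute columns plus a geometry column.
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[Point(0, 0), Point(1, 1)],
)

# The geometry column is a GeoSeries.
geoms = gdf.geometry

# An attribute column is an ordinary pandas Series.
names = gdf["name"]

print(type(gdf).__name__, type(geoms).__name__, type(names).__name__)
```

Because a GeoDataFrame is also a pandas DataFrame, the selection, filtering and calculation techniques from the Pandas slides apply to it as well.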