1 of 60

Building a Scalable Modern GIS Infrastructure

Matt Forrest - I-GUIDE Forum 2023

Matt Forrest (he/him)

I-GUIDE Forum 2023

2 of 60

Overview

Matt Forrest - I-GUIDE Forum 2023

3 of 60

About me

Matt Forrest - I-GUIDE Forum 2023

  • Field CTO @ CARTO
  • B.A. in Geography - University of Wisconsin
  • Am not a climate scientist, researcher
  • Releasing Spatial SQL book this fall

4 of 60

Agenda

Matt Forrest - I-GUIDE Forum 2023

  • What is modern GIS
  • The Modern Data Stack
  • The Modern GIS Stack
    • Data Sources
    • Ingestion
    • Data Storage
    • Reverse ETL
    • Transformation

5 of 60

Exercises

Matt Forrest - I-GUIDE Forum 2023

  1. GeoParquet
  2. PostGIS (w/ Docker)
  3. DuckDB
  4. dbt
  5. H3

6 of 60

What you’ll need

Matt Forrest - I-GUIDE Forum 2023

https://github.com/mbforr/modern-gis-workshop

7 of 60

What you’ll need

Matt Forrest - I-GUIDE Forum 2023

  • VS Code (preferred but any IDE will work)
  • Docker
  • DuckDB
  • dbt
  • pgAdmin
  • Optional
    • GDAL (will install with Docker)
    • QGIS

8 of 60

What is Modern GIS?

Matt Forrest - I-GUIDE Forum 2023

9 of 60

Matt Forrest - I-GUIDE Forum 2023

Modern GIS is the process, systems, and technology used to derive insights from geospatial data. Modern GIS uses open, interoperable, and standards based technology. It can be run locally or in the cloud and can scale to work with many different types, velocities, and scales of data.

10 of 60

Comparison

Matt Forrest - I-GUIDE Forum 2023

Traditional

Modern

Standards

Platform and software-based

Open and standards-based

Cloud Access

Cloud-hosted or on-premises

Cloud-native

Deployment

Local software package up to enterprise software packages

Open-source local use up to full enterprise

Collaboration

Siloed

Interoperable

Scalability

Single-threaded

Serverless

Data

Limited data scale

Scalable, even further in the cloud

11 of 60

The Modern Data Stack

Matt Forrest - I-GUIDE Forum 2023

12 of 60

History and growth

Matt Forrest - I-GUIDE Forum 2023

13 of 60

Why?

Matt Forrest - I-GUIDE Forum 2023

  • Growth in the OLAP database
  • Distributed computing and regular releases
  • SQL as the lingua franca
  • Ecosystem tools to support

14 of 60

https://moderndata101.substack.com/p/evolution-of-the-data-stack-the-story

Matt Forrest - I-GUIDE Forum 2023

15 of 60

https://motherduck.com/blog/motherduck-open-for-all-with-series-b/

Matt Forrest - I-GUIDE Forum 2023

16 of 60

https://tanay.substack.com/p/understanding-the-modern-data-stack

Matt Forrest - I-GUIDE Forum 2023

17 of 60

The Modern GIS Stack

Matt Forrest - I-GUIDE Forum 2023

18 of 60

Modern Geospatial Data Stack

Data Sources

Ingestion

Reverse ETL

Storage

Transformation

Analytics

Data Science

Applications

Matt Forrest - forrest.nyc

🍦yogrt

Mapping

Early 2024

19 of 60

Modern Geospatial Data Stack

Data Lake

Transform

Processing

OLTP

Orchestration

Matt Forrest - forrest.nyc

Mid 2024

Formats

OLAP

Analytics

GIS

Python

Applications

20 of 60

Modern Geospatial Data Stack

Data Lake

Transform

Processing

OLTP

Matt Forrest - forrest.nyc

Mid 2024

Formats

OLAP

Analytics

GIS

Python

Applications

Orchestration

21 of 60

Modern Geospatial Data Stack

Data Sources

Ingestion

Reverse ETL

Storage

Transformation

Analytics

Data Science

Applications

Matt Forrest - forrest.nyc

🍦yogrt

Mapping

Mid 2024

22 of 60

The Modern GIS Stack

Data Sources

Ingestion

Reverse ETL

Storage

Transformation

GIS

Data Science

Applications

Matt Forrest - I-GUIDE Forum 2023

23 of 60

The Modern GIS Stack

Data Sources

Ingestion

Reverse ETL

Storage

Transformation

GIS

Data Science

Applications

Matt Forrest - I-GUIDE Forum 2023

24 of 60

The Modern GIS Stack

Data Sources

Ingestion

Reverse ETL

Storage

Transformation

GIS

Data Science

Applications

Matt Forrest - I-GUIDE Forum 2023

25 of 60

Data Sources

Matt Forrest - I-GUIDE Forum 2023

26 of 60

Matt Forrest - I-GUIDE Forum 2023

  • Not a file type, a data specification
  • Increases data transfer speeds between systems

The Apache Arrow project specifies a standardized language-independent columnar memory format. It enables shared computational libraries, zero-copy shared memory and streaming messaging, interprocess communication, and is supported by many programming languages and data libraries.

27 of 60

Matt Forrest - I-GUIDE Forum 2023

  • Optimized to work better with cloud storage systems and HTTP GET requests
  • Tiling: get access to just the parts you need
  • Overviews: Downsampled images that make rendering much faster
  • Example

28 of 60

Zarr

Matt Forrest - I-GUIDE Forum 2023

  • N-dimensional array storage format
  • Generally used with temporal data, but the third dimension can be anything
  • Video demo + presentation

29 of 60

Geoparquet

Matt Forrest - I-GUIDE Forum 2023

  • First cloud-native vector format
  • Based on Apache Parquet
    • Columnar based storage
  • v1 recently released
  • More performant, smaller files
  • Partitioning!
  • Site + OGC presentation

30 of 60

Geoparquet

Matt Forrest - I-GUIDE Forum 2023

31 of 60

Exercise 1: Using GeoParquet

Matt Forrest - I-GUIDE Forum 2023

32 of 60

Ingestion

Matt Forrest - I-GUIDE Forum 2023

33 of 60

GDAL

Matt Forrest - I-GUIDE Forum 2023

  • Tried and true
  • Continues to support old formats and new
  • Embedded in most geospatial tools
  • Never going out of style

34 of 60

Airflow

Matt Forrest - I-GUIDE Forum 2023

  • Easy to adapt with Python
  • Orchestrate data pipelines and manage issues
  • Can be used to transform data (T in the ETL)
  • Still popular despite new tools

35 of 60

Airflow

Matt Forrest - I-GUIDE Forum 2023

36 of 60

Airbyte

Matt Forrest - I-GUIDE Forum 2023

  • Open source and cloud (paid) version
  • Alternative to tools like Fivetran
  • Mostly focuses on a wide range of spatial sources
  • Good for connecting cloud files and other systems
  • dbt built in

37 of 60

Data Storage

Matt Forrest - I-GUIDE Forum 2023

38 of 60

1995�Oracle Spatial in Oracle 8i

Spatial SQL

ESRI started in 1969 and created commercial GIS

1990

2010

2020

2000

1994�Illustra Spatial launches

1995�Oracle Spatial Data Option (SDO)

1996�Informix acquires Illustra

Spatial Datablade launched

2001�PostGIS candidate

release

2002�Releases DB2

Spatial Extender

2003�IBM acquires Informix

Spatial Extender launched

2003�OGC adopts

ISO 19125

2005�PostGIS 1.0

released

2008�Spatial support

In MS SQL

Server

2008�Spatialite for

SQLite launched

2009�MySQL launches

spatial support

2015�Geospark

launched

2018�Spatial support

In BigQuery

2019�Spatial support

In Redshift

2020�Spatial support

In Snowflake

2021�Apache Sedona

launched

2021�Spatial support

in Apache Pinot

2022

H3 support

in Databricks

2023

Spatial support

in DuckDB

Traditional database era

The move to open

The modern data stack

39 of 60

PostGIS

Matt Forrest - I-GUIDE Forum 2023

  • Still the ruler in terms of functionality
  • Can scale, but still an OTAP
  • Easier to use than ever (especially with extensions)
  • Will remain a core component of the modern GIS stack for a long time

40 of 60

Cloud data warehouses

Matt Forrest - I-GUIDE Forum 2023

41 of 60

BigQuery

Matt Forrest - I-GUIDE Forum 2023

  • Strong spatial capabilities out of the box
  • Loading tools for GeoJSON (newline only)
  • Scalable from a serverless side
  • Great public data to begin with

42 of 60

Snowflake

Matt Forrest - I-GUIDE Forum 2023

  • Focused on the developer and analytics market
  • Incremental steps for compute
  • Added GEOMETRY and GEOGRAPHY
  • SnowPark for Python scalability and UDFs

43 of 60

Redshift

Matt Forrest - I-GUIDE Forum 2023

  • Strongest spatial capabilities, most similar to PostGIS
  • New serverless model
  • Close to open data in AWS Earth

44 of 60

Analytics Toolbox

Matt Forrest - I-GUIDE Forum 2023

  • Extends spatial functionality

45 of 60

Analytics Toolbox

Matt Forrest - I-GUIDE Forum 2023

  • Extends spatial functionality in all data warehouses
  • Open source tools include spatial indexing, processing, and other tools
  • License includes tiling, spatial statistics, raster ingest
  • Free accounts for students and educators

46 of 60

Analytics Toolbox

Matt Forrest - I-GUIDE Forum 2023

47 of 60

DuckDB

Matt Forrest - I-GUIDE Forum 2023

  • Columnar and vector OLAP with no dependencies
  • Zero dependencies
  • Runs locally or in Python
  • Incredibly fast on large amounts of data
  • Reads files from a file system

48 of 60

MotherDuck

Matt Forrest - I-GUIDE Forum 2023

  • Adds serverless capabilities with data in the cloud
  • Query between local and cloud data together (hybrid execution)
  • Data warehouses “for the rest of us”
  • Site + “Big data is dead

49 of 60

Hybrid Approach

Matt Forrest - I-GUIDE Forum 2023

Complete spatial tools

Raster support

Extensions

Large scale data

Parquet!

Fast processing

Sharable databases

50 of 60

Exercise 2: PostGIS with Docker

Matt Forrest - I-GUIDE Forum 2023

51 of 60

Exercise 3: Quack with DuckDB

Matt Forrest - I-GUIDE Forum 2023

52 of 60

Reverse ETL

Matt Forrest - I-GUIDE Forum 2023

53 of 60

What is Reverse ETL

Matt Forrest - I-GUIDE Forum 2023

  • Pushing processed or analyzed data back into data storage
  • Can be used to validate or make access easier
  • GDAL and DuckDB are the best for this if needed

54 of 60

Transformation

Matt Forrest - I-GUIDE Forum 2023

55 of 60

GDAL

Matt Forrest - I-GUIDE Forum 2023

  • Moving between formats is easy
  • Built in processing with Spatialite
  • Raster to vector and vice versa

56 of 60

H3 Indexing

Matt Forrest - I-GUIDE Forum 2023

  • Global discrete grid system developed by Uber
  • Maintains relationships between parents, children, neighbors
  • Measure paths, distances, areas, etc.
  • Rapidly speed up geospatial joins

57 of 60

dbt

Matt Forrest - I-GUIDE Forum 2023

  • “Data build tool”
  • Uses an ELT paradigm, handles the transform at the end
  • Multiple steps, create multiple outputs
  • Checks for various factors

58 of 60

Exercise 4: Process data with dbt

Matt Forrest - I-GUIDE Forum 2023

59 of 60

Exercise 5: Translating rasters to H3

Matt Forrest - I-GUIDE Forum 2023

60 of 60

Thank you!

Matt Forrest - I-GUIDE Forum 2023

LinkedIn @mbforr

forrest.nyc

matt@carto.com

spatial-sql.com�mattforrest.substack.com