1 of 66

Intro to Spatial Analysis & GIS for Spatial Epidemiology in R

SER Workshop

June 13, 2023��go.illinois.edu/SER2023

Marynia Kolak, PhD, MFA, MS

Qinyun Lin, PhD

2 of 66

Today’s workshop

Learning goals for today:

Understanding the basics of spatial analysis & mapping in R
Standardizing data at the neighborhood level
Integrating data from another source
Calculating new variables
Generating maps for analysis

3 of 66

Today’s workshop

1:00 - 1:30pm	Introduction to R Spatial	Marynia
1:30 - 2:00pm	Mapping Neighborhoods	Qinyun
2:00 - 2:20pm*	Spatial Cluster Detection	Marynia
2:30 - 3:00pm	Adding Health Resources	Marynia
3:00 - 3:20pm*	Calculating Spatial Metrics	Qinyun
3:30 - 4:00pm	Resource Sharing & Q&A	Qinyun

Follow along with live coding: https://makosak.github.io/Intro2RSpatialMed/

*We will take two 10-minute breaks

4 of 66

Environment Set-Up

Working version of RStudio
Install these packages: sf, tmap, tidygeocoder, & rgeoda
If you don’t have a working R-spatial environment on your computer, create a free account in Posit & utilize today

5 of 66

Introduction to R Spatial

6 of 66

What is Spatial Data?

Spatial Data = Information + Location 🎯�

Spatial data is data that contains information about specific locations.

Note: the information content of the data may change with location.�

On some occasions, spatial data may only include “location.”

But without “location,” the data is no longer spatial.

7 of 66

Spatial Data Types

/ Polygon

8 of 66

Non-Spatial & Spatial Data Formats

CSVs. Comma Separated Values (or other text files) can store spatial data if they have columns with latitude & longitude to represent coordinate locations

9 of 66

Non-Spatial & Spatial Data Formats

Shapefile. A proprietary data format by ESRI made of at least 4 extension files (.shp, .shx, .dbf, and .prj)

��

GeoJSON. An open standard format for encoding a variety of geographic data structures.

10 of 66

Non-Spatial & Spatial Data Formats

In R, we will work with the sf library. ��sf uses the following data structures:

data.frame - table or a two-dimensional array-like structure.
tibble - a “modern reimagining” of the data.frame (tidyverse)
sf - Simple features refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. (Types: point, line, polygon; multipoint, multipolygon)

11 of 66

Coordinate Reference Systems

A Coordinate Reference System (CRS) communicates what method should be used to flatten or project the Earth’s surface onto a 2-dimensional map.

Different CRS imply different ways of projections and can generate substantially different visualizations. For example, the following are some world maps using different projections →

12 of 66

13 of 66

Defining your CRS

CRSs can be referred to using a Spatial Reference System Identifier (SRID) including EPSG codes. The EPSG Database is one of the most comprehensive databases that store thousands of different projections.

EPSG 4326 is one of the most common projections used today.

In {sf}, you can use the function st_crs to check the CRS used in a data, & st_transform to re-project the data to a particular CRS.

14 of 66

R-spatial Ecosystem

An active and growing online community:

Geocomputation in R�https://geocompr.robinlovelace.ne t�
Spatial Data Science with R�https://rspatial.org/�
R Spatial Workshops from CSDS�https://spatialanalysis.github.io/tutorials/ �

The popularity of spatial packages in R. The y-axis shows average number of downloads per day, within a 30-day rolling window. Source: Geocomputation in R

15 of 66

Live demo

16 of 66

Mapping Neighborhoods

17 of 66

Thematic Mapping

Thematic or chloropleth maps represent quantitative data through colors or patterns in different geographic areas.

There are many different ways to classify & thematically map data:�

Data classification methods
Number of classes or bins

Image: Axis Maps

Equal interval classification with 5 bins

18 of 66

Data Classification Methods

Different methods arrange your data using different boundaries to separate classes.

In a quantile map, data is grouped into classes or bins that each have the same number of observations.

3 classes = tertile �4 classes = quartile�5 classes = quintile

19 of 66

Data Classification Methods

Different methods arrange your data using different boundaries to separate classes.

In a natural breaks map (also called Jenks map), data is grouped so that the within-group homogeneity is maximized.

Note the number of observations in each class can be highly unequal.

20 of 66

Data Classification Methods

Different methods arrange your data using different boundaries to separate classes.

A standard deviation map can be useful for helping identify outliers.

The variable is transformed to standard deviational units, covering the range from lowest to highest.

21 of 66

Data Classification Methods

Different methods arrange your data using different boundaries to separate classes.

Additional methods:

Equal intervals
Manual intervals
Box map

Image: Axis Maps

22 of 66

Which method is best?

This depends on your data -- and the findings you are trying to convey

Quantile

Natural Breaks

Standard Deviation

23 of 66

Live demo

24 of 66

Spatial Cluster Detection

25 of 66

Formalizing Tobler’s Law: Spatial Autocorrelation

26 of 66

Using Test Statistics

A statistic is any value that summarizes characteristics of a distribution
Calculated from the data and compared to reference distribution
How likely is the value if it had occurred under the null hypothesis?
When unlikely (low p-value), the null hypothesis is rejected

When considering spatial autocorrelation, we test against a null hypothesis of spatial randomness.

27 of 66

Operationalizing Spatial Randomness

under spatial randomness, the location of values may be altered without affecting the information content of the data
Test it: random permutation or reshuffling of values

From Anselin, Fall 2017, SOC 20253

28 of 66

Spatial Autocorrelation Statistic

Captures both attribute and locational similarity
Need to construct index from data that captures both

Attribute similarity

Location similarity

29 of 66

Defining Spatial Weights

Before we can test for spatial autocorrelation, we need to define neighbors.

We do this by creating a spatial weight, which is a matrix where neighbor relationships for each unit are recorded.

When creating a spatial weight, think about how far the impact can reach...

Europe, queen contiguity weights

30 of 66

Spatial Weight - 2nd Order Queen Contiguity

31 of 66

Same area, different spatial weight (w)

2nd order queen contiguity

1st order rook contiguity

32 of 66

Contiguity-Based Weights

Rook Contiguity Weight

Neighbors touch boundary, but not corners
Can be 1st order and above
Won’t connect islands or areas on their own

Queen Contiguity Weight

Neighbors touch boundary and corners
Can be 1st order and above
Won’t connect islands or areas on their own

33 of 66

Distance-Based Weights

KNN Weight

K-Nearest Neighbor Weight
K is a number you choose, like 4
If K=4, will find 4 nearest neighbors (based on distance of centroids)
Will connect all areas, even

Europe, KNN-4 contiguity weights

34 of 66

Matrix Representation

If 2 are neighbors, the value = 1. Otherwise = 0.

1	2	3
4	5	6
7	8	9

	1	2	3	4	5	6	7	8	9
1	1	1	1	0	0	0	0	0	0
2	1	1	1	0	1	0	0	0	0
3	0	1	1	0	0	1	0	0	0
4	1	0	0	1	1	0	1	0	0
5	0	1	0	1	1	1	0	1	0
6	0	0	1	0	1	1	0	0	1
7	0	0	0	1	0	0	1	1	0
8	0	0	0	0	1	0	1	1	1
9	0	0	0	0	0	1	0	1	1

35 of 66

Global Spatial Autocorrelation

Moran’s I Test Statistic:

zi = yi - mx : deviations from the mean

Cross product statistic ~ correlation stat.

z statistic: (observed - mean) / SD�Comparable across variables & spatial weights

Need to assess whether computed value of Moran’s I is significantly different from a value for a spatially random distribution:

Compute analytically using normal distribution, etc.
Compute computationally, comparing values to a reference distribution obtained from a series of random permutations

36 of 66

Moran’s I & the Permutation Approach

Moran’s I changes with different spatial weights; use z-scores to compare
We create a reference distribution and randomly reshuffle observations (using Monte Carlo permutations), computing Moran’s I each time
Compare observed value to reference dist.
High/Low Moran’s I; positive vs negative autocorrelation with 0 as the null hypothesis

Tip: More permutations (ex. n=999 vs 499), more stability in findings.

37 of 66

Spatial Autocorrelation Measures

Global Clustering Measure: Is there spatial clustering present?

Moran’s I

Local Clustering Measures: Where and what type of spatial clustering is present?

Geary’s C

Spatial clusters (hot spot, cold spot)�

LISA ( Local Indicator of Spatial Autocorrelation)

Spatial clusters (hot spot, cold spot)
Spatial outliers

38 of 66

Spatial Clusters

39 of 66

Spatial Outliers

40 of 66

Spatial Clusters

https://geodacenter.github.io/workbook/6a_local_auto/lab6a.html

41 of 66

Adding Resources

42 of 66

What is geocoding?

Geocoding is the process of converting addresses into geographic coordinates using a known coordinate reference system (CRS).

We can then use these coordinates (latitude, longitude) to spatially enable data.

Input	Output
Address	Latitude, longitude coordinates

43 of 66

In R, we use the tidygeocoder package

Read in csv or text file with addresses → data.frame

44 of 66

In R, we use the tidygeocoder package

Geocode addresses to generate latitude, longitude coordinates

Prepare data

45 of 66

In R, we use the tidygeocoder package

→ convert to spatial data sf

46 of 66

Mapping points

Now that addresses have been cleaned, geocoded, and spatially-enabled, they can be added as a point layer on a map

47 of 66

Live demo

48 of 66

Calculating Spatial Metrics

49 of 66

Accessibility Metrics: �How accessible is a resource?

Accessibility is a multidimensional concept, and spatial distance is only one component of accessibility. �

Classical model: (Penchansky & Thomas, 1981)

Access: a degree of “fit” between clients and the system
Availability, Accessibility, Affordability, Acceptability, Accommodation

50 of 66

Choosing the “right” measure is complicated

Location optimization considers:

Supply and demand values at destinations and origins
Optimal placement for facilities (coverage problem)
Allocation of demand to given facilities (location-allocation problem)

Different access measures could lead to different conclusions about the existence of spatial mismatch and inequity (Talen & Anselin, 1998)

It is important to consider what characterization of access is most suitable for a specific problem

51 of 66

Choosing the “right” measure is complicated

How should distance between the user/patient and the facility be characterized?

What assumptions about travel behaviors are most appropriate?

52 of 66

Access Measurements

Density:

Container method - buffers, point-in-polygon

Proximity:

Minimum distance
Travel cost
Travel time
Gravity potential

53 of 66

Container Method

A count of facilities (or measure of services) provided by any geographic unit.

For example:

How many methadone providers are in each zip code?
How many methadone providers are in a 1-mile buffer zone?
What percentage of area is within a threshold of the provider’s service?

54 of 66

Proximity Methods

Minimum distance: the distance from an origin location to the nearest destination (e.g. methadone providers)�

Travel cost: measure of the total or average distance (or travel time) between each origin and destination�

Gravity potential: resources are weighted by size and adjusted for “friction of distance”

Minimum distance to the nearest methadone provider for all contiguous US zip codes

55 of 66

Live demo

56 of 66

Wrap Up & Discussion

57 of 66

Additional R-Spatial Resources

Opioid Environment Toolkit (HEROP resource): https://geodacenter.github.io/opioid-environment-toolkit/index.html

R Spatial Workshops (CSDS/HEROP resource): https://spatialanalysis.github.io/tutorials/ ��Geocomputation in R (Lovelace et al, 2019): https://geocompr.robinlovelace.net��Spatial Data Science with R: https://rspatial.org/

58 of 66

59 of 66

Kernel density estimates (KDE)

Approximates the distribution of point data along a continuous surface, so that one could understand how likely an event is to occur across spaces, or the “intensity” of activities.

Resources

60 of 66

Spatial regression models (basics)

Null hypothesis

y = X𝛽 + u, where E[uu’] = σ²I

Alternative hypotheses

Spatial lag model (SAR)

y = ρWy + X𝛽 + u

Spatial error model (SEM)

E[uu’] = 𝚺 ≠ σ²I

61 of 66

Spatial regression models

Textbook: Anselin, L. and S. Rey. (2014) Modern Spatial Econometrics in Practice: A Guide to GeoDa, GeoDaSpace and PySAL. Chicago: GeoDa Press LLC, Sections 5.1 and 6.1.
Youtube: Spatial Regression Estimation
Implementation:

62 of 66

PCA followed by cluster analysis & spatial regression

Gather multidimensional SDoH variables based on theoretical framework
Principal component analysis to reduce all SDoH variables to a few PCs (each PC is orthogonal to each other)
Cluster analysis to identify areas with similar SDoH characteristics
Spatial regression analysis examining associations between SDoH (using PCs) and health outcomes

SAR or SEM for spatial dependence (e.g., virus is highly transmissible)
Spatial regime model to account for spatial heterogeneity (e.g., virus has also spread unevenly across place and populations)

63 of 66

SDOH Index	Description
Socioeconomic advantage index (SES)	Include socioeconomic status factors such as poverty and educational level; this measure is strongly correlated with the Sing et al (cite) areal deprivation index.
Limited mobility index (MOB)	Captures the proportion of older adults and persons with disabilities.
Urban core opportunity index (URB)	Reflects highly urbanized populations experiencing more opportunities and also high living costs.
Mixed immigrant cohesion and accessibility index (MICA)	Features immigrant populations with traditional family structures and multiple accessibility stressors.

Kolak, M., Bhatt, J., Park, Y. H., Padrón, N. A., & Molefe, A. (2020). Quantification of neighborhood-level social determinants of health in the continental United States. JAMA network open , 3 (1), e1919928-e1919928.

64 of 66

Seven SDOH neighborhood typologies were optimized across the continental United States, as presented in this screenshot from the SDOH web application generated to visualize results.

SDOH Neighborhood Typology.

65 of 66

The four indexes/PCs were also used later in spatial regression models to examine how SDoH associates with COVID-19 outcomes differently in rural, urban, suburban areas, accounting for spatial dependence.

Lin, Q., Paykin, S., Halpern, D., Martinez-Cardoso, A., & Kolak, M. (2022). Assessment of structural barriers and racial group disparities of COVID-19 mortality with spatial analysis. JAMA network open , 5 (3), e220984-e220984.

66 of 66

Thank you

Marynia Kolak

mkolak@illinois.edu��makosak

@MaryniaKolak

Qinyun Lin

qinyun.lin@gu.se �

qinyun-lin�@lin_qinyun

	1	2	3	4	5	6	7	8	9
1	1	1	1	0	0	0	0	0	0
2	1	1	1	0	1	0	0	0	0
3	0	1	1	0	0	1	0	0	0
4	1	0	0	1	1	0	1	0	0
5	0	1	0	1	1	1	0	1	0
6	0	0	1	0	1	1	0	0	1
7	0	0	0	1	0	0	1	1	0
8	0	0	0	0	1	0	1	1	1
9	0	0	0	0	0	1	0	1	1

	1	2	3	4	5	6	7	8	9
1	1	1	1	0	0	0	0	0	0
2	1	1	1	0	1	0	0	0	0
3	0	1	1	0	0	1	0	0	0
4	1	0	0	1	1	0	1	0	0
5	0	1	0	1	1	1	0	1	0
6	0	0	1	0	1	1	0	0	1
7	0	0	0	1	0	0	1	1	0
8	0	0	0	0	1	0	1	1	1
9	0	0	0	0	0	1	0	1	1

	1	2	3	4	5	6	7	8	9
1	1	1	1	0	0	0	0	0	0
2	1	1	1	0	1	0	0	0	0
3	0	1	1	0	0	1	0	0	0
4	1	0	0	1	1	0	1	0	0
5	0	1	0	1	1	1	0	1	0
6	0	0	1	0	1	1	0	0	1
7	0	0	0	1	0	0	1	1	0
8	0	0	0	0	1	0	1	1	1
9	0	0	0	0	0	1	0	1	1