1 of 15

Devika Kakkar

Centre for Geographic Analysis, Harvard University

2024 ESRI User Conference (UC)

Bridging the Gap: A Case Study of Integrating Social Media Big Data with Geography

2 of 15

Introduction: Social Media and Geography

Definition: Social media platforms generate data that serve as social sensors over space and time

Challenges: Traditional GIS systems struggle to handle big data

Opportunity: Real-time analysis in urban dynamics, public health and disaster management

3 of 15

Challenges in Big Geospatial Data Merging

  • Volume, Velocity and Veracity of Data: Storage, Transfer, Update, Streaming, Integration, Standardization, Accessibility, Security

  • Efficiency and Scalability of Systems: System deployment on cloud, Optimizing algorithms, Time-cost balancing

  • Analysis and Visualization Complexity: Coding Requirement, Resource Management, Visualization Customization

4 of 15

Approaches

Move geospatial processing from desktops to clouds/clusters

Manage data storage in proximity with tools for efficient access

Leverage GPU-based solutions for high-performance tasks

Containerize applications for rapid deployment and shutdown

Scale applications based on computing needs

5 of 15

CGA’s Geotweet Archive

Collection of 10 Billion geotweets

Stored on Harvard’s High Performance Cluster

Sentiment enrichment using BERT

Global coverage (164 countries)

Over 10 years of data (2010-2023)

5 TB dataset

Multilingual dataset

To the best of our knowledge, it is the first social media dataset of this scale and granularity!

In contrast to other datasets, it is not limited to a specific topic, period, or location.

6 of 15

U.S. Administrative Boundaries

  • State and County Boundaries: Sourced from the U.S. Census Bureau. Includes essential fields such as ID, fips, county name, state name, and geometry
  • Census Blocks: Sourced from the U.S. Census Bureau. Includes essential fields such as GEOID20, TRACTCE20, BLOCKCE20 and more
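
The block identifier fields listed above are related: a 2020 census block GEOID is a 15-digit string made up of the state FIPS code (2 digits), county code (3), tract code (6) and block code (4). The sketch below shows one way to split it, for example to derive the county key from a block-level record. The helper name `split_geoid20` and the sample GEOID are illustrative assumptions, not part of the archive's code.

```python
from typing import NamedTuple


class BlockId(NamedTuple):
    state_fips: str   # 2 digits
    county_fips: str  # 3 digits
    tract: str        # 6 digits, corresponds to TRACTCE20
    block: str        # 4 digits, corresponds to BLOCKCE20


def split_geoid20(geoid: str) -> BlockId:
    """Split a 15-digit 2020 census block GEOID into its components."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit GEOID, got {geoid!r}")
    return BlockId(geoid[:2], geoid[2:5], geoid[5:11], geoid[11:])


# Illustrative value only: 25 is Massachusetts and 017 is Middlesex County;
# the tract and block digits are made up for this example.
example = split_geoid20("250173537001000")

# The 5-digit county key is the state and county codes concatenated.
county_key = example.state_fips + example.county_fips
```

Keeping the GEOID as a string rather than an integer preserves leading zeros, which matter for states with FIPS codes below 10.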

7 of 15

Methodology: Computing Environment

Cluster Computing Infrastructure

Harvard University Research Computing Cluster (FASRC)

  • Established in 2007 under the Faculty of Arts & Sciences (FAS) Division of Science
  • Facilitates the advancement of complex research by providing leading-edge cluster computing services
  • Available for High Performance Computing (HPC) and big data analysis, visualization, and storage

8 of 15

Methodology: Geography Enrichment

Step 1 (GPU-based)

  • Start with state and county boundaries
  • GPU-based parallel processing of the spatial join

Step 2 (CPU-based)

  • Subset the merged tweets from step 1 by state
  • CPU-based spatial join with census blocks
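
The two-step structure can be sketched in a few lines: first assign each tweet to a county (and thereby a state), then join each state's subset only against that state's census blocks. This minimal sketch uses plain Python and a ray-casting point-in-polygon test. It is not the GPU or CPU implementation used for the archive, and all names, polygons and points below are toy assumptions chosen for illustration.

```python
from collections import defaultdict

Point = tuple[float, float]   # (lon, lat)
Polygon = list[Point]         # ring of vertices, last vertex not repeated


def contains(poly: Polygon, pt: Point) -> bool:
    """Ray-casting test: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points: dict, regions: dict) -> dict:
    """Map each point id to the key of the first region that contains it."""
    joined = {}
    for pid, pt in points.items():
        for key, poly in regions.items():
            if contains(poly, pt):
                joined[pid] = key
                break
    return joined


# Toy geometry: two square "counties", each in its own "state".
counties = {
    ("A", "A1"): [(0, 0), (2, 0), (2, 2), (0, 2)],
    ("B", "B1"): [(2, 0), (4, 0), (4, 2), (2, 2)],
}
blocks_by_state = {
    "A": {
        "A1-left": [(0, 0), (1, 0), (1, 2), (0, 2)],
        "A1-right": [(1, 0), (2, 0), (2, 2), (1, 2)],
    },
    "B": {"B1-all": [(2, 0), (4, 0), (4, 2), (2, 2)]},
}
tweets = {"t1": (0.5, 1.0), "t2": (1.5, 0.5), "t3": (3.0, 1.0), "t4": (9.0, 9.0)}

# Step 1: join tweets to state/county boundaries.
county_of = spatial_join(tweets, counties)

# Step 2: subset by state, then join each subset to that state's blocks only.
by_state = defaultdict(dict)
for tid, (state, _county) in county_of.items():
    by_state[state][tid] = tweets[tid]

block_of = {}
for state, subset in by_state.items():
    block_of.update(spatial_join(subset, blocks_by_state[state]))
```

Subsetting by state before the block-level join keeps each point from being tested against blocks in other states, which is what makes the second step tractable when the block layer is much larger than the county layer. The per-state subsets are also independent of one another, so they can be processed in parallel.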

9 of 15

Result: Geotweets Census Archive

3 billion tweets, 8.18 million U.S. census blocks

Multiprocessing system using CPUs/GPUs on HPC

Publicly available on Dataverse

Real-time analysis at census block level for over a decade

Novel dataset of this temporal and geographic granularity

10 of 15

Result: Open Access Dataset

Harvard Dataverse Archive

GitHub repository

11 of 15

Result: Political Science Use-Case

Real-time political expression across granular geographic levels

  • What is the impact of geography on political expression on social media?
  • How has political sentiment evolved globally over space and time?
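
Questions like these typically reduce to aggregating a per-tweet score by geographic unit and time period. The sketch below averages sentiment by county and year. The record layout, the score range and the values are hypothetical and do not reflect the archive's actual schema or data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical enriched records: (county FIPS, year, sentiment score).
# The values are invented purely to demonstrate the grouping.
records = [
    ("25017", 2012, 0.25),
    ("25017", 2012, 0.75),
    ("25017", 2020, 0.5),
    ("06037", 2020, 1.0),
]


def mean_sentiment(rows):
    """Average the sentiment score for each (county, year) pair."""
    groups = defaultdict(list)
    for fips, year, score in rows:
        groups[(fips, year)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


summary = mean_sentiment(records)
```

The same grouping works at the census block level by keying on the block GEOID instead of the county FIPS code.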

12 of 15

Results and Impact

Geo-located tweets provided us a unique opportunity to observe real-time political expressions across granular geographic units over a decade. It is unprecedented to observe public opinion with such temporal and geographic granularity over a decade. This data also became easy to use with the CGA's effort enriching the data with census information.

Sun Young Park

Doctoral Student

Department of Government

Park, S. (2024), "Political Animosity in Declining and Prospering Areas: An Analysis of 3 Billion Geo-located Tweets Spanning a Decade in Three Advanced Democracies", Working Paper.

Park, S., Brown, J., and Enos, R. (2024), "Does Where You Live Influence How You Talk About Politics Online?", Working Paper.

13 of 15

Future Work: GenAI-based User Interface

14 of 15

Acknowledgements

  • This work is partially supported by NSF Award #1841403 and the Department of Government, Harvard University.
  • We would like to thank the New England Research Cloud (NERC) and FASRC for the computing resources.

Co-authors

Xiaokang Fu, Jack Hayes

15 of 15

Contact us: Harvard CGA

or

Email: Devika Jain

Questions / Comments?