1 of 12

Mapping America’s Digital Dialects

Clayton Hamre

Using data science, social media, and spatial analysis to explore the English language

2 of 12

Research questions

  • Is Reddit a viable data source for the study of dialect variation?
  • Can lexical variation be detected by analyzing city subreddits?
  • What patterns of variation are revealed by city subreddit data?
  • I considered 127 lexical variables, including:
    • Disaster vs. catastrophe
    • Lawyer vs. attorney
    • Dinner vs. supper
    • Forest vs. woodland
    • Shit vs. crap
    • Phone vs. telephone
    • Agree vs. concur
    • Throw vs. toss
    • Buy vs. purchase
    • May vs. might
    • Improve vs. enhance
    • Useful vs. helpful
    • Strange vs. weird
    • Huge vs. gigantic
    • Relevant vs. pertinent
    • Cute vs. adorable
    • Ugly vs. hideous
    • Many vs. numerous
    • Often vs. frequently
    • Usually vs. typically
    • Possibly vs. potentially
    • Almost vs. nearly
    • Until vs. till
    • Toward vs. towards
    • Hi vs. hello
    • Lmao vs. lmfao
    • Everyone vs. everybody

3 of 12

Data extraction

  • I used the Python Reddit API Wrapper (PRAW) to extract text data from city subreddits, along with the date and username for each post/comment
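A minimal sketch of this step, assuming PRAW credentials are already configured. The helper names, record fields, and subreddit name are illustrative, not the exact script used:

```python
# Sketch of the extraction step. Helper names, record fields, and the
# subreddit name are illustrative; API credentials are assumed to be set up.
def record_from_item(item, subreddit_name):
    """Flatten one PRAW submission or comment into a row of text data."""
    return {
        "subreddit": subreddit_name,
        "author": str(item.author) if item.author else "[deleted]",
        "created_utc": item.created_utc,
        # Submissions carry selftext; comments carry body.
        "text": getattr(item, "selftext", None) or getattr(item, "body", ""),
    }

def scrape_city(reddit, subreddit_name, limit=1000):
    """Collect recent submissions and their comments from one city subreddit."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        rows.append(record_from_item(submission, subreddit_name))
        submission.comments.replace_more(limit=0)  # flatten the comment tree
        for comment in submission.comments.list():
            rows.append(record_from_item(comment, subreddit_name))
    return rows

if __name__ == "__main__":
    import praw  # pip install praw
    reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="dialect-study")
    rows = scrape_city(reddit, "boston")
```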

4 of 12

Data processing

  • I then used a Python script to extract word counts for my variables and calculate the ratio of each pair of variants for each city subreddit
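The counting step can be sketched roughly as follows; the two pairs shown stand in for the full set of 127, and the tokenizer and ratio definition are assumptions, not the original script:

```python
# Minimal sketch of the processing step: count each variant per subreddit
# and compute the share of the first variant in each pair.
import re
from collections import Counter

PAIRS = [("dinner", "supper"), ("toward", "towards")]  # two of the 127 pairs

def variant_ratios(texts):
    """texts: iterable of post/comment strings from one city subreddit."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    ratios = {}
    for a, b in PAIRS:
        total = counts[a] + counts[b]
        # Share of the first variant; None when neither word occurs.
        ratios[f"{a}/{b}"] = counts[a] / total if total else None
    return ratios
```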

5 of 12

Global spatial autocorrelation analysis

Global Moran’s I results (North America)

  • Moran's Index: 0.757749
  • Z-score: 13.67815
  • p-value: < 0.000001
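The reported values came from a GIS tool, but the statistic itself is simple to illustrate. A pure-Python sketch of global Moran's I, with a hypothetical toy weights matrix:

```python
# Illustrative pure-Python global Moran's I. The slide's value came from a
# GIS tool; the four-city adjacency matrix below is hypothetical.
def morans_i(values, weights):
    """values: list of n observations; weights: n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)

# Four cities on a line, rook-style adjacency; clustered values give I > 0.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(morans_i([1, 1, 10, 10], W))  # 1/3: positive, i.e. spatially clustered
```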

6 of 12

Variables with significant global spatial autocorrelation were included in a regional analysis:

  • “Raw” variable values → Getis-Ord Gi* Z-scores (local spatial autocorrelation)
  • Gi* Z-scores → principal components (principal component analysis in R)
  • Principal components → eight regional clusters (agglomerative cluster analysis in R)
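The clustering stage was run in R; as a rough illustration, here is a compact Python sketch of average-linkage agglomerative clustering (the input vectors stand in for per-city Gi* scores and are invented):

```python
# Compact sketch of agglomerative clustering with average linkage on
# Euclidean distance. The original analysis used R; this is illustrative only.
from math import dist

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average linkage: mean pairwise distance between the two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        ai, bi = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[ai] += clusters.pop(bi)
    return clusters
```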

7 of 12

[Map] Base map created in ArcGIS; map content created in Inkscape

8 of 12

Statistics for inter-country analysis

  • I ran ANOSIM (analysis of similarities) in R to test whether city subreddits in the US, Canada, and the UK differed overall in lexical usage
  • I visualized these overall differences using non-metric multidimensional scaling (NMDS)
  • I used violin plots to visualize the distribution of values for individual variables
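The ANOSIM R statistic itself is easy to compute from its definition. A pure-Python sketch (the points and group labels are made up; the actual analysis was run in R):

```python
# Illustrative ANOSIM R statistic, computed directly from its definition:
# R = (mean between-group rank - mean within-group rank) / (M / 2),
# where ranks are over all M pairwise distances. Data below are invented.
from math import dist

def anosim_r(points, groups):
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    dists = [dist(points[i], points[j]) for i, j in pairs]
    # Average ranks (1-based), handling ties.
    order = sorted(range(len(dists)), key=lambda k: dists[k])
    ranks = [0.0] * len(dists)
    k = 0
    while k < len(order):
        j = k
        while j + 1 < len(order) and dists[order[j + 1]] == dists[order[k]]:
            j += 1
        avg = (k + j) / 2 + 1  # average of 1-based positions k+1 .. j+1
        for idx in order[k : j + 1]:
            ranks[idx] = avg
        k = j + 1
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    m = len(pairs)
    return (sum(between) / len(between) - sum(within) / len(within)) / (m / 2)
```

Perfectly separated groups give R = 1; R near 0 means group labels explain nothing.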

9 of 12

Anyway vs. anyways in US, Canadian, and UK city subreddits

[Violin plots: the scale runs from equal usage of the two variants to greater usage of anyway]

All three countries used anyway more often than anyways, but they differed significantly in how strong that preference was; Canada used anyways the most.

10 of 12

Network analysis

  • Many users have posts that appear in more than one subreddit corpus. Is this linked with the structure of lexical variation?

User data → Python script → shared-users table
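The shared-users step can be sketched as below; the usernames and subreddit names are invented:

```python
# Sketch of the shared-users step: from per-subreddit user sets to a table
# of pairwise shared-user counts. All names here are invented examples.
from itertools import combinations

def shared_users_table(users_by_sub):
    """users_by_sub: dict mapping subreddit name -> set of usernames."""
    return {
        (a, b): len(users_by_sub[a] & users_by_sub[b])
        for a, b in combinations(sorted(users_by_sub), 2)
    }

users = {
    "boston": {"alice", "bob", "carol"},
    "nyc": {"bob", "carol", "dan"},
    "seattle": {"erin"},
}
table = shared_users_table(users)
```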

11 of 12

There is a strong correlation between lexical distance and shared users among subreddits
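As a rough illustration of measuring such a relationship, here is a plain Pearson correlation over paired values; the numbers are invented, and the original analysis may well have used a different procedure (e.g. a Mantel test, which is standard for distance matrices):

```python
# Illustrative Pearson r between two lists of pairwise values, e.g. lexical
# distance vs. shared-user count per subreddit pair. Data would be real pairs;
# the slide does not state which correlation procedure was actually used.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
```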

12 of 12

Takeaways

  • “Unstructured” data contains a lot of structure!
  • Unconventional and online data sources offer a look at questions that can’t easily be studied through traditional methods
  • You can analyze the same dataset in very different ways
    • Spatial autocorrelation vs. networks of shared users
    • And these analyses can combine to produce even more powerful insights!
  • Data analysis is iterative: Basic statistics open up the possibility for a range of more sophisticated analyses
    • This means it’s very important to ensure your input data is high-quality and error-free!