1 of 12

Mapping America’s Digital Dialects

Clayton Hamre

Using data science, social media, and spatial analysis to explore the English language

2 of 12

Research questions

  • Is Reddit a viable data source for the study of dialect variation?
  • Can lexical variation be detected by analyzing city subreddits?
  • What patterns of variation are revealed by city subreddit data?
  • I considered 127 lexical variables, including:
    • Disaster vs. catastrophe
    • Lawyer vs. attorney
    • Dinner vs. supper
    • Forest vs. woodland
    • Shit vs. crap
    • Phone vs. telephone
    • Agree vs. concur
    • Throw vs. toss
    • Buy vs. purchase
    • May vs. might
    • Improve vs. enhance
    • Useful vs. helpful
    • Strange vs. weird
    • Huge vs. gigantic
    • Relevant vs. pertinent
    • Cute vs. adorable
    • Ugly vs. hideous
    • Many vs. numerous
    • Often vs. frequently
    • Usually vs. typically
    • Possibly vs. potentially
    • Almost vs. nearly
    • Until vs. till
    • Toward vs. towards
    • Hi vs. hello
    • Lmao vs. lmfao
    • Everyone vs. everybody

3 of 12

Data extraction

  • I used the Python Reddit API Wrapper (PRAW) to extract text data from city subreddits, along with the date and username for each post/comment
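A minimal sketch of this step, assuming PRAW credentials are already configured. The helper names, record fields, and subreddit name are illustrative, not the exact script used:

```python
# Sketch of the extraction step. Helper names, record fields, and the
# subreddit name are illustrative; API credentials are assumed to be set up.
def record_from_item(item, subreddit_name):
    """Flatten one PRAW submission or comment into a row of text data."""
    return {
        "subreddit": subreddit_name,
        "author": str(item.author) if item.author else "[deleted]",
        "created_utc": item.created_utc,
        # Submissions carry selftext; comments carry body.
        "text": getattr(item, "selftext", None) or getattr(item, "body", ""),
    }

def scrape_city(reddit, subreddit_name, limit=1000):
    """Collect recent submissions and their comments from one city subreddit."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        rows.append(record_from_item(submission, subreddit_name))
        submission.comments.replace_more(limit=0)  # flatten the comment tree
        for comment in submission.comments.list():
            rows.append(record_from_item(comment, subreddit_name))
    return rows

if __name__ == "__main__":
    import praw  # pip install praw
    reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="dialect-study")
    rows = scrape_city(reddit, "boston")
```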

4 of 12

Data processing

  • I then used a Python script to extract word counts for my variables and calculate the ratio of each pair of variants for each city subreddit
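The counting step can be sketched roughly as follows; the two pairs shown stand in for the full set of 127, and the tokenizer and ratio definition are assumptions, not the original script:

```python
# Minimal sketch of the processing step: count each variant per subreddit
# and compute the share of the first variant in each pair.
import re
from collections import Counter

PAIRS = [("dinner", "supper"), ("toward", "towards")]  # two of the 127 pairs

def variant_ratios(texts):
    """texts: iterable of post/comment strings from one city subreddit."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    ratios = {}
    for a, b in PAIRS:
        total = counts[a] + counts[b]
        # Share of the first variant; None when neither word occurs.
        ratios[f"{a}/{b}"] = counts[a] / total if total else None
    return ratios
```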

5 of 12

Global spatial autocorrelation analysis

Global Moran’s I results (North America)

  • Moran's Index: 0.757749
  • Z-score: 13.67815
  • p-value: < 0.000001
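The reported values came from a GIS tool, but the statistic itself is simple to illustrate. A pure-Python sketch of global Moran's I, with a hypothetical toy weights matrix:

```python
# Illustrative pure-Python global Moran's I. The slide's value came from a
# GIS tool; the four-city adjacency matrix below is hypothetical.
def morans_i(values, weights):
    """values: list of n observations; weights: n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)

# Four cities on a line, rook-style adjacency; clustered values give I > 0.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(morans_i([1, 1, 10, 10], W))  # 1/3: positive, i.e. spatially clustered
```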

6 of 12

Variables with significant global spatial autocorrelation were included in a regional analysis:

  • “Raw” variable values → Getis-Ord Gi* Z-scores (local spatial autocorrelation)
  • Gi* Z-scores → principal components (principal component analysis in R)
  • Principal components → eight regional clusters (agglomerative cluster analysis in R)
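The clustering stage was run in R; as a rough illustration, here is a compact Python sketch of average-linkage agglomerative clustering (the input vectors stand in for per-city Gi* scores and are invented):

```python
# Compact sketch of agglomerative clustering with average linkage on
# Euclidean distance. The original analysis used R; this is illustrative only.
from math import dist

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average linkage: mean pairwise distance between the two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        ai, bi = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[ai] += clusters.pop(bi)
    return clusters
```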

7 of 12

[Map] Base map created in ArcGIS; map content created in Inkscape

8 of 12

Statistics for inter-country analysis

  • I ran ANOSIM (analysis of similarities) in R to test whether city subreddits in the US, Canada, and the UK differed overall in lexical usage
  • I visualized these overall differences using non-metric multidimensional scaling (NMDS)
  • I used violin plots to visualize the distribution of values for individual variables
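The ANOSIM R statistic itself is easy to compute from its definition. A pure-Python sketch (the points and group labels are made up; the actual analysis was run in R):

```python
# Illustrative ANOSIM R statistic, computed directly from its definition:
# R = (mean between-group rank - mean within-group rank) / (M / 2),
# where ranks are over all M pairwise distances. Data below are invented.
from math import dist

def anosim_r(points, groups):
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    dists = [dist(points[i], points[j]) for i, j in pairs]
    # Average ranks (1-based), handling ties.
    order = sorted(range(len(dists)), key=lambda k: dists[k])
    ranks = [0.0] * len(dists)
    k = 0
    while k < len(order):
        j = k
        while j + 1 < len(order) and dists[order[j + 1]] == dists[order[k]]:
            j += 1
        avg = (k + j) / 2 + 1  # average of 1-based positions k+1 .. j+1
        for idx in order[k : j + 1]:
            ranks[idx] = avg
        k = j + 1
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    m = len(pairs)
    return (sum(between) / len(between) - sum(within) / len(within)) / (m / 2)
```

Perfectly separated groups give R = 1; R near 0 means group labels explain nothing.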

9 of 12

Anyway vs. anyways in US, Canadian, and UK city subreddits

[Violin plots: the scale runs from equal usage of the two variants to greater usage of anyway]

All three countries used anyway more often than anyways, but they differed significantly in how strong that preference was; Canada used anyways the most.

10 of 12

Network analysis

  • Many users have posts that appear in more than one subreddit corpus. Is this linked with the structure of lexical variation?

User data → Python script → shared-users table
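The shared-users step can be sketched as below; the usernames and subreddit names are invented:

```python
# Sketch of the shared-users step: from per-subreddit user sets to a table
# of pairwise shared-user counts. All names here are invented examples.
from itertools import combinations

def shared_users_table(users_by_sub):
    """users_by_sub: dict mapping subreddit name -> set of usernames."""
    return {
        (a, b): len(users_by_sub[a] & users_by_sub[b])
        for a, b in combinations(sorted(users_by_sub), 2)
    }

users = {
    "boston": {"alice", "bob", "carol"},
    "nyc": {"bob", "carol", "dan"},
    "seattle": {"erin"},
}
table = shared_users_table(users)
```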

11 of 12

There is a strong correlation between lexical distance and shared users among subreddits
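As a rough illustration of measuring such a relationship, here is a plain Pearson correlation over paired values; the numbers are invented, and the original analysis may well have used a different procedure (e.g. a Mantel test, which is standard for distance matrices):

```python
# Illustrative Pearson r between two lists of pairwise values, e.g. lexical
# distance vs. shared-user count per subreddit pair. Data would be real pairs;
# the slide does not state which correlation procedure was actually used.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
```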

12 of 12

Takeaways

  • “Unstructured” data contains a lot of structure!
  • Unconventional and online data sources offer a look at questions that can’t easily be studied through traditional methods
  • You can analyze the same dataset in very different ways
    • Spatial autocorrelation vs. networks of shared users
    • And these analyses can combine to produce even more powerful insights!
  • Data analysis is iterative: Basic statistics open up the possibility for a range of more sophisticated analyses
    • This means it’s very important to ensure your input data is high-quality and error-free!