Understanding Communities
with Socially Aware NLP
Siddharth Mangalik & Salvatore Giorgi
Presenters
Salvatore Giorgi, PhD
UPenn, NIDA, WWBP
Siddharth Mangalik, PhD Candidate
SBU
This session
Here we will cover the basics of creating language estimates of spatial communities (e.g., states, provinces, counties).
At the end we will look at a code notebook to experiment with the proper methods for handling community-level text.
We will cover topics such as:
Surveyed communities for “What generic word do you use to describe carbonated soft drinks?”
Why Communities?
When working in interdisciplinary teams, our goals will often require understanding how spatial units of people use language.
This is particularly important with studies in
which all benefit from rigorously designed representations of communities.
Tracking the opioid epidemic
Why Community Language?
Using community language we can study
This carries large public health implications by enabling studies at scales beyond what demographers and census takers can do with surveying methods.
Language associated with higher future opioid mortality
From Documents to Communities
Consider three types of aggregation
Giorgi et al., 2018, The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions
Community
(Giorgi, Preotiuc-Pietro, Buffone, Reiman, Ungar, Schwartz, 2018)
(Giorgi, Lynn, Ungar, Schwartz, 2019)
Outcome
User to Community
(Giorgi, Preotiuc-Pietro, Buffone, Reiman, Ungar, Schwartz, 2018)
(Giorgi, Lynn, Ungar, Schwartz, 2019)
Outcome
Aggregation Results
Out of sample Pearson r
N = 2040 US counties
Traditional Approach: Word frequencies (aggregate words directly)
New Approach: Twitter Cohort (aggregate through people)
Life Satisfaction
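The contrast between aggregating words directly and aggregating through people can be illustrated with a minimal sketch. The messages, community names, and function names below are made up for illustration and are not taken from the cited papers; the point is only that one prolific user dominates the direct estimate, while the user-level estimate gives each person an equal say.

```python
from collections import Counter, defaultdict

# Toy messages: (community, user, text). All values are invented for illustration.
messages = [
    ("county_a", "user_1", "happy happy happy happy"),
    ("county_a", "user_1", "happy happy"),
    ("county_a", "user_2", "tired"),
    ("county_a", "user_3", "tired today"),
]


def relative_freqs(counts):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def direct_aggregation(messages):
    """Pool every word written in a community, then normalize once."""
    counts = defaultdict(Counter)
    for community, _user, text in messages:
        counts[community].update(text.lower().split())
    return {c: relative_freqs(cnt) for c, cnt in counts.items()}


def user_level_aggregation(messages):
    """Normalize within each user first, then average users equally."""
    user_counts = defaultdict(Counter)
    for community, user, text in messages:
        user_counts[(community, user)].update(text.lower().split())

    sums = defaultdict(Counter)  # community -> summed per-user relative freqs
    n_users = Counter()          # community -> number of distinct users
    for (community, _user), cnt in user_counts.items():
        n_users[community] += 1
        for word, freq in relative_freqs(cnt).items():
            sums[community][word] += freq
    return {
        c: {w: s / n_users[c] for w, s in words.items()}
        for c, words in sums.items()
    }


direct = direct_aggregation(messages)
by_user = user_level_aggregation(messages)
print(direct["county_a"])
print(by_user["county_a"])
```

In this toy data a single user writes all six occurrences of "happy", so the word makes up two thirds of the direct estimate but only one third of the user-level estimate.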
From Documents to Communities
These methods were compared on their ability to generate unigram-based predictions for:
Giorgi et al., 2018, The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions
From Documents to Communities Results
Giorgi et al., 2018, The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions
Selection Biases
Selection biases occur when our sample of data does not match the true distribution of the population.
When creating aggregate community data, any selection biases in our data collection will be reflected in our community findings.
The most common way to adjust for this is to weight individuals, adjusting each person's influence on the community estimate.
https://www.vectorstock.com/royalty-free-vector/sample-from-population-statistics-research-survey-vector-16452707
Selection Biases - Twitter Communities
Active American Twitter users, unlike the average American, are ___.
User scores can be re-weighted by socio-economics such that their contribution better represents their community per user-week. (2022, Giorgi S., ICWSM)
2019, Pew Research Center
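As a rough illustration of re-weighting user scores, here is a sketch of naive poststratification on a single variable. It is not the robust method from the ICWSM 2022 paper: the bins, scores, and population shares are hypothetical, and the weight for each bin is simply its population share divided by its sample share.

```python
from collections import Counter

# Hypothetical users in one community: (income_bin, language_score).
users = [
    ("low", 0.2), ("low", 0.4),
    ("high", 0.8), ("high", 0.9), ("high", 0.7), ("high", 0.8),
]

# Hypothetical population shares per bin (for example, from a census table).
population_share = {"low": 0.7, "high": 0.3}


def poststratification_weights(users, population_share):
    """Weight for a bin = population share / sample share of that bin."""
    bin_counts = Counter(income_bin for income_bin, _ in users)
    n = len(users)
    return {
        b: population_share[b] / (bin_counts[b] / n)
        for b in bin_counts
    }


def weighted_mean(users, weights):
    """Average user scores, with each user weighted by their bin's weight."""
    total_weight = sum(weights[b] for b, _ in users)
    return sum(weights[b] * score for b, score in users) / total_weight


weights = poststratification_weights(users, population_share)
unweighted = sum(score for _, score in users) / len(users)
weighted = weighted_mean(users, weights)
print(weights)
print(unweighted, weighted)
```

Here the sample over-represents the "high" bin, so the unweighted community score is pulled toward those users; after weighting, the estimate equals the population-share-weighted average of the bin means. Empty or very sparse bins would make these weights unstable, so this naive version should not be used as is on real data.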
Robust Poststratification
Reweighting is done through three methods which refine a set of bins
Correcting Sociodemographic Selection Biases for Population Prediction; Giorgi et al; ICWSM 2022
Reweighting Tolland County, Connecticut (09013)
Here we see the income bin for individuals earning >$200,000 in Tolland County, CT.
By applying user weighting, we are able to improve the income representation of earners.
Post Stratified Weighting Results
Correcting Sociodemographic Selection Biases for Population Prediction; Giorgi et al; ICWSM 2022
Ecological Fallacies
Language patterns found at the individual-level do not always hold at the community-level.
This might manifest as traits mistakenly assigned to individuals based on the groups they belong to.
We might avoid these confusions by properly modeling individual correlations and comparing those relationships to ecological correlations.
datasauRus’ Simpson’s Paradox data
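The difference between individual and ecological correlations can be shown with a small synthetic example. The data below are invented so that the pattern is extreme: within every community the two variables are negatively related, yet the community means are positively related.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up data: community -> list of (x, y) pairs, one pair per person.
communities = {
    "a": [(1, 3), (2, 2), (3, 1)],
    "b": [(4, 6), (5, 5), (6, 4)],
    "c": [(7, 9), (8, 8), (9, 7)],
}

# Ecological correlation: correlate the community-level means.
mean_x = [sum(x for x, _ in people) / len(people) for people in communities.values()]
mean_y = [sum(y for _, y in people) / len(people) for people in communities.values()]
ecological_r = pearson_r(mean_x, mean_y)

# Individual correlations: correlate people within each community.
within_r = {
    name: pearson_r([x for x, _ in people], [y for _, y in people])
    for name, people in communities.items()
}

print("ecological r:", ecological_r)
print("within-community r:", within_r)
```

Inferring from the community-level result that individuals with higher x also have higher y would be wrong in every one of these communities, which is the ecological fallacy in miniature.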
Ecological Fallacies - Example
Le Suicide: Étude de sociologie; Durkheim, Émile; 1897
Ecological Fallacies
In a study comparing language-based methods to survey-based methods for measuring well-being, Jaidka et al. ran an analogous, parallel analysis across a sample of 2,321 Facebook users who were surveyed directly.
This allowed for the investigation of the ecological effects of community-level aggregation on their results.
On the right we can see how strong County-level correlations did not carry over to Person-level correlations.
Estimating geographic subjective well-being from Twitter; Jaidka et al; PNAS 2023
Person
County
Ecological Fallacies
Estimating geographic subjective well-being from Twitter; Jaidka et al; PNAS 2023
Cultural Considerations
Multilingual Language Models are not Multicultural; Shreya Havaldar et al; WASSA 2023
The language of communities reflects the cultural differences embedded in their value systems.
When we explore multilingual communities, it is thus necessary to consider how our LM choices affect our interpretations of communities.
Multilingual LMs
Multilingual Language Models are not Multicultural; Shreya Havaldar et al; WASSA 2023
Both LMs (XLM-RoBERTa) and generative LMs (GPT-3, GPT-3.5, GPT-4) demonstrated:
Cultural Considerations - Embeddings
Multilingual Language Models are not Multicultural; Shreya Havaldar et al; WASSA 2023
For example, the embeddings for “Joy” in different languages can meaningfully differ in how emotions are encoded.
This could be relevant in fine-tuned NLP systems attempting to extract sentiment data, psychological traits, or expressed beliefs from non-English text.
As a Pipeline
Mangalik et al. Robust language-based mental health assessments in time and space. npj Digital Medicine 2024
Community Studies in Time
Mangalik et al. Robust language-based mental health assessments in time and space. npj Digital Medicine 2024
Takeaways
Community language has a distinct set of aggregation best practices that should be applied before analysis.
Communities share some of the traditional challenges that NLP tasks encounter; however, these challenges are addressed differently or are emergent.