1 of 28

Understanding Communities

with Socially Aware NLP

Siddharth Mangalik & Salvatore Giorgi

2 of 28

Presenters

Salvatore Giorgi, PhD

UPenn, NIDA, WWBP

Siddharth Mangalik, PhD Candidate

SBU

3 of 28

This session

Here we will cover the basics of creating language estimates for spatial communities (e.g., states, provinces, or counties).

At the end we will look at a code notebook to experiment with the proper methods for handling community-level text.

We will cover topics such as:

Aggregation
Selection biases
Ecological fallacies
Cultural Considerations

4 of 28

Communities surveyed on the question “What generic word do you use to describe carbonated soft drinks?”

5 of 28

Why Communities?

When working in interdisciplinary teams, our goals will often require understanding how spatially defined groups of people use language.

This is particularly important with studies in

Medical NLP
Social sciences
Economics
Geopsychology
Population health studies

which all benefit from rigorously designed representations of communities.

Tracking the opioid epidemic

6 of 28

Why Community Language?

Using community language we can study

Personality
Well-Being
Mental Health
Substance Use

This carries large public health implications by enabling studies at scales beyond what demographers and census takers can do with surveying methods.

Language associated with higher future opioid mortality

7 of 28

From Documents to Communities

Consider three types of aggregation

Community: The community is treated as a single “bag of words”
User to Community: Normalized features are extracted per user and then averaged within each community
Weighted Users to Community: Users are up- or down-weighted by how under- or over-represented they are in the sample
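
The three aggregation types can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `posts` data, the function names, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy data: (community, user, text). All values are made up for illustration.
posts = [
    ("county_a", "u1", "good day good coffee"),
    ("county_a", "u1", "good good good"),
    ("county_a", "u2", "bad traffic"),
    ("county_b", "u3", "good game"),
]

def community_bag_of_words(posts):
    """Pool every word in a community, then normalize once."""
    counts = defaultdict(Counter)
    for community, _user, text in posts:
        counts[community].update(text.split())
    return {
        community: {word: n / sum(cnt.values()) for word, n in cnt.items()}
        for community, cnt in counts.items()
    }

def user_to_community(posts, weights=None):
    """Normalize per user first, then take a (weighted) mean over users.

    `weights` is an optional {user: weight} dict; omitted users get weight 1.
    """
    user_counts = defaultdict(Counter)
    user_home = {}
    for community, user, text in posts:
        user_counts[user].update(text.split())
        user_home[user] = community

    totals = defaultdict(lambda: defaultdict(float))
    weight_sums = defaultdict(float)
    for user, cnt in user_counts.items():
        weight = 1.0 if weights is None else weights.get(user, 1.0)
        n_words = sum(cnt.values())
        community = user_home[user]
        weight_sums[community] += weight
        for word, n in cnt.items():
            totals[community][word] += weight * n / n_words

    return {
        community: {word: v / weight_sums[community] for word, v in feats.items()}
        for community, feats in totals.items()
    }

pooled = community_bag_of_words(posts)
per_user = user_to_community(posts)
reweighted = user_to_community(posts, weights={"u1": 1.0, "u2": 3.0})
```

On this toy data, "good" has relative frequency 5/9 in county_a under bag-of-words pooling but 5/14 under user-level aggregation, because the prolific user u1 no longer dominates the community estimate.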

Giorgi et al., 2018, The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

8 of 28

Community

(Giorgi, Preotiuc-Pietro, Buffone, Reiman, Ungar, Schwartz, 2018)

(Giorgi, Lynn, Ungar, Schwartz, 2019)

Outcome

9 of 28

User to Community

(Giorgi, Preotiuc-Pietro, Buffone, Reiman, Ungar, Schwartz, 2018)

(Giorgi, Lynn, Ungar, Schwartz, 2019)

Outcome

10 of 28

Aggregation Results

Out of sample Pearson r

N = 2040 US counties

Traditional Approach: Word frequencies (aggregate words directly)

New Approach: Twitter Cohort (aggregate through people)

Life Satisfaction

11 of 28

From Documents to Communities

These methods were compared on their ability to generate unigram-based predictions for:

Median Household Income
At Least a Bachelor’s Education
Questionnaire Measured Life Satisfaction
Age-Adjusted Mortality Rates from Heart Disease
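
An out-of-sample Pearson r of the kind reported in these comparisons can be computed by fitting on some communities and correlating predictions with outcomes on held-out communities. The sketch below is only an assumption-laden stand-in: it uses a single hypothetical language feature and a one-variable least-squares fit instead of a full unigram model, and the function names are invented for illustration.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def out_of_sample_r(xs, ys, k=5):
    """k-fold cross-validation: every prediction comes from a model
    that never saw that community during fitting."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = slope * xs[i] + intercept
    return pearson_r(preds, ys)

# Made-up community feature values and outcomes, for illustration only.
feature = [0.1, 0.4, 0.2, 0.8, 0.5, 0.9, 0.3, 0.7, 0.6, 1.0]
outcome = [1.2, 1.9, 1.3, 2.8, 2.1, 2.7, 1.6, 2.4, 2.3, 3.1]
r = out_of_sample_r(feature, outcome, k=5)
```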

Giorgi et al., 2018, The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

12 of 28

From Documents to Communities Results

Giorgi et al., 2018, The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

13 of 28

Selection Biases

Selection biases occur when our sample of data does not match the true distribution of the population.

When creating aggregate community data any selection biases in our data collection will be reflected in our community findings.

The most common way to adjust for this is to weight individuals so that over-represented groups count less and under-represented groups count more.

https://www.vectorstock.com/royalty-free-vector/sample-from-population-statistics-research-survey-vector-16452707

14 of 28

Selection Biases - Twitter Communities

Compared with the average American, active American Twitter users are:

more college educated (42% vs 31% have a Bachelor’s degree)
younger (median age 40 vs 47)
more likely to earn >$75,000 (41% vs 32%)
more likely to identify as Democrats (36% vs 30%)
more male (50% vs 48%)

User scores can be reweighted by socioeconomic characteristics so that each user's contribution better represents their community per user-week (Giorgi et al., ICWSM 2022).
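
As a rough sketch of basic poststratification, each user's weight can be set to the population share of their bin divided by the sample share of that bin. The example reuses the education percentages quoted above and treats them as if they described a single community, which is an assumption for illustration; the function names and the user scores are hypothetical.

```python
def poststratification_weights(sample_share, population_share):
    """Weight for each bin = population share / sample share."""
    return {b: population_share[b] / sample_share[b] for b in sample_share}

def weighted_mean(scores, bins, weights):
    """Weighted community score, where bins[i] is the bin of user i."""
    numerator = sum(weights[b] * s for s, b in zip(scores, bins))
    denominator = sum(weights[b] for b in bins)
    return numerator / denominator

# Education shares: Twitter sample vs general population (from the slide).
sample_share = {"bachelors": 0.42, "no_bachelors": 0.58}
population_share = {"bachelors": 0.31, "no_bachelors": 0.69}
weights = poststratification_weights(sample_share, population_share)

# Hypothetical per-user language scores and their education bins.
scores = [0.9, 0.7, 0.2, 0.4]
bins = ["bachelors", "bachelors", "no_bachelors", "no_bachelors"]
community_score = weighted_mean(scores, bins, weights)
```

Users with a Bachelor's degree receive a weight below 1 because they are over-represented in the sample, and users without one receive a weight above 1.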

2019, Pew Research Center

15 of 28

Robust Poststratification

Reweighting combines three methods that refine a set of socio-demographic bins:

Estimator Redistribution: Accounts for shrinking so that the population percentage in each source bin matches that of the corresponding target bin.
Adaptive Binning: Sets a minimum threshold on the number of observations within a given socio-demographic bin. Adjacent bins are iteratively combined until the threshold is met.
Informed Smoothing: Handles sparse socio-demographic estimates by padding each weight with a fraction of users from a known distribution.
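
The adaptive binning idea can be illustrated with a simplified sketch. It assumes a single left-to-right greedy merge of adjacent bins, which is not necessarily the procedure used in the paper, and it does not cover estimator redistribution or informed smoothing. The function names and counts are hypothetical.

```python
def adaptive_bins(sample_counts, population_counts, min_users):
    """Merge adjacent bins until each merged bin holds at least
    `min_users` sampled users. Returns [sample_count, population_count] pairs."""
    merged = []
    cur_sample, cur_pop = 0, 0
    for s, p in zip(sample_counts, population_counts):
        cur_sample += s
        cur_pop += p
        if cur_sample >= min_users:
            merged.append([cur_sample, cur_pop])
            cur_sample, cur_pop = 0, 0
    if cur_sample or cur_pop:
        # A leftover tail below the threshold is folded into the last bin.
        if merged:
            merged[-1][0] += cur_sample
            merged[-1][1] += cur_pop
        else:
            merged.append([cur_sample, cur_pop])
    return merged

def bin_weights(merged):
    """Poststratification weight per merged bin: population share / sample share."""
    total_sample = sum(s for s, _ in merged)
    total_pop = sum(p for _, p in merged)
    return [(p / total_pop) / (s / total_sample) for s, p in merged]

# Hypothetical income bins, ordered from lowest to highest.
sample_counts = [0, 3, 40, 57]
population_counts = [100, 200, 400, 300]
merged = adaptive_bins(sample_counts, population_counts, min_users=10)
weights = bin_weights(merged)
```

Here the two sparse low-income bins are absorbed into their neighbor, so no weight is computed from a bin with zero or very few sampled users.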

Correcting Sociodemographic Selection Biases for Population Prediction; Giorgi et al; ICWSM 2022

16 of 28

Reweighting Tolland County, Connecticut (FIPS 09013)

Here we see the income bin for individuals earning >$200,000 in Tolland County, CT.

By applying user weighting, we are able to improve how well these earners are represented.

17 of 28

Poststratified Weighting Results

Correcting Sociodemographic Selection Biases for Population Prediction; Giorgi et al; ICWSM 2022

18 of 28

Ecological Fallacies

Language patterns found at the individual-level do not always hold at the community-level.

This can manifest as traits being mistakenly assigned to individuals based on the groups they belong to.

We can guard against this confusion by modeling individual-level correlations directly and comparing them with ecological (group-level) correlations.
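
The comparison between individual and ecological correlations can be sketched with made-up data. In this toy example, constructed purely for illustration, the relationship is perfectly negative within every community yet perfectly positive across community means.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical (x, y) pairs per person, grouped by community.
communities = {
    "a": [(1, 3), (2, 2), (3, 1)],
    "b": [(4, 6), (5, 5), (6, 4)],
    "c": [(7, 9), (8, 8), (9, 7)],
}

# Individual-level correlation, computed separately inside each community.
within_r = {
    name: pearson_r([x for x, _ in people], [y for _, y in people])
    for name, people in communities.items()
}

# Ecological correlation, computed on community means only.
mean_x = [mean(x for x, _ in people) for people in communities.values()]
mean_y = [mean(y for _, y in people) for people in communities.values()]
ecological_r = pearson_r(mean_x, mean_y)

print(within_r)      # every community: -1.0
print(ecological_r)  # 1.0
```

Inferring from the community-level result that x and y rise together for individuals would be exactly the ecological fallacy described above.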

datasauRus’ Simpson’s Paradox data

19 of 28

Ecological Fallacies - Example

At the individual level, wealth is positively correlated with voting Republican in the United States. However, we observe that wealthier states more often vote for Democrats.
Protestant communities have been observed to have higher suicide rates than Catholic communities. However, higher community suicide rates do not imply that an individual's religion caused suicide; the difference may instead reflect record-keeping practices.

Le Suicide: Étude de sociologie; Durkheim, Émile; 1897

20 of 28

Ecological Fallacies

In a study comparing language-based and survey-based methods for measuring well-being, Jaidka et al. ran a parallel analysis on a sample of 2,321 Facebook users who were surveyed directly.

This allowed for the investigation of the ecological effects of community-level aggregation on their results.

On the right we can see that strong county-level correlations did not carry over to person-level correlations.

Estimating geographic subjective well-being from Twitter; Jaidka et al; PNAS 2023

Person

County

21 of 28

Ecological Fallacies

Estimating geographic subjective well-being from Twitter; Jaidka et al; PNAS 2023

22 of 28

Cultural Considerations

Multilingual Language Models are not Multicultural; Shreya Havaldar et al; WASSA 2023

The language of communities reflects the cultural differences embedded in their value systems.

When we explore multilingual communities, it is therefore necessary to consider how our choice of LM affects our interpretation of those communities.

23 of 28

Multilingual LMs

Multilingual Language Models are not Multicultural; Shreya Havaldar et al; WASSA 2023

Both encoder LMs (XLM-RoBERTa) and generative LMs (GPT-3, GPT-3.5, GPT-4) demonstrated:

The ability to respond in multilingual contexts
Failures to capture the emotion variations associated with different cultures
A preference for expressing the cultural values of the Western world

24 of 28

Cultural Considerations - Embeddings

Multilingual Language Models are not Multicultural; Shreya Havaldar et al; WASSA 2023

For example, the embeddings for “Joy” in different languages can differ meaningfully in how the emotion is encoded.

This could be relevant for fine-tuned NLP systems that attempt to extract sentiment, psychological traits, or expressed beliefs from non-English text.

25 of 28

As a Pipeline

Mangalik et al. Robust language-based mental health assessments in time and space. npj Digital Medicine 2024

26 of 28

Community Studies in Time

Mangalik et al. Robust language-based mental health assessments in time and space. npj Digital Medicine 2024

27 of 28

Takeaways

Community language has a distinct set of aggregation best practices that should be applied before analysis

Communities share some of the traditional challenges of NLP tasks, but these challenges are either addressed differently or emerge only at the community level

28 of 28

Code Demo on Colab

bit.ly/text2community