1 of 25

Social Influences on the Production of Text

AJ Alvero

SICSS-South Florida 2023

2 of 25

Roadmap for today's talk

An overview of computational text analysis
An example of connecting theory to method (as well as some of my personal background)
Introduction to two key methods used for computational text analysis (topic modeling and word embeddings)

These will feed directly into the lesson later today
Neither this talk nor the coding session will be mathy

Examples and results from a preprint
So this is a fairly applied talk

3 of 25

Computational text analysis for social science: big picture

There is a lot of interest in using computational tools/methods to analyze text. Why?

Availability of (big, novel, cool) data
Faster computational processing speed

Yeah yeah, but why social science? Why should we care about text?

This is a harder question, especially if you see yourself as a "numbers" person
My argument: text (including transcribed audio) captures or could capture insights into basically any social process and/or outcome

For example...

4 of 25

Example: Stereotype shifts over time in newspaper data (PNAS)

5 of 25

Example: Disparities in scientific uptake with dissertation abstracts (PNAS)

6 of 25

Example: Variation in rental postings across neighborhood racial demographics (Social Forces)

Results show that listings from White neighborhoods emphasize trust and connections to neighborhood history and culture, while listings from non-White neighborhoods offer more incentives and focus on transportation and development features, sundering these units from their surroundings.

7 of 25

Example: Tweets are a better predictor of county-level heart disease than common risk factors (Psychological Science)

A cross-sectional regression model based only on Twitter language predicted AHD mortality significantly better than did a model that combined 10 common demographic, socioeconomic, and health risk factors, including smoking, diabetes, hypertension, and obesity.

8 of 25

Computational text analysis for social science: big picture

These are just a snapshot though, and perhaps you've heard of these and other studies which used text data.
So let's brainstorm! We'll do a poll about everyone's research interests/topics and generate ideas about how we can study them using text data.

Sources of data?
Possible results/findings?
Experiments? Natural data?

pollev.com/ajalvero275

9 of 25

My background, interests, and theoretical perspectives

A lot of my past research: analyzing a dataset of 800k+ college admissions essays submitted to the University of California system

Used to be a high school English teacher, so I have a lot of experience helping students write these things

I also did a lot of after school tutoring for the SAT and ACT, but the essays were always a little confusing and tricky to navigate

When does a student's personal statement become "oversharing"?
With all the controversies about test scores in admissions, what about the essays?

To answer these questions, I focus on authorship patterns along various dimensions
I draw a lot of theory from an entire subfield dedicated to the relationship between the social and the linguistic: sociolinguistics

10 of 25

Example of theory-informed CSS: sociolinguistic perspective

The sociolinguistic perspective is that there is a strong relationship between one's identity and their use of language

Popularized by sociolinguists like William Labov

For example, Labov found that pronunciation of /r/ tracked with social class indicators, e.g., different brands of department stores
If these differences in spoken language are not coincidence but systematic like the sociolinguists suggest, I want to see if these patterns hold for text data. If they do, we might need to change how we think about text/writing

11 of 25

So how can we show these types of patterns in text?

There are many tools for showing patterns in text à la Labov, and many of them focus on word co-occurrences in different ways
In this talk I want to quickly go over two popular methods: topic modeling and word vectors
We're going to try them out in the coding session later today
In a few slides I will provide an example from my own work that used topic models to generate numerical representations of text

12 of 25

Topic modeling: document content via word co-occurrences

Document 1: The sport I love is baseball. I hit homeruns and doubles when I bat. I also put on my mitt to play catch.

Document 2: Basketball is my favorite sport. I hit three-pointers and alley-oop. I also dribble and catch passes.

Document 3: Tennis is my top sport. I hit and serve the ball in a way that they can't volley. Doubles is fun.

Output: numerical representations of "topics", i.e., word weights for each topic and topic weights for each document
Use these numerical representations for prediction, sorting, experiments, etc.
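
To make the idea of "numerical representations" concrete, here is a minimal sketch of a topic model on the three toy documents. It is a small collapsed Gibbs sampler for LDA written with the standard library only, not the code from the coding session or any paper. The stopword list, the number of topics, and the values of `alpha`, `beta`, and `n_iter` are all assumptions chosen for illustration.

```python
import random
import re
from collections import Counter

DOCS = [
    "The sport I love is baseball. I hit homeruns and doubles when I bat. "
    "I also put on my mitt to play catch.",
    "Basketball is my favorite sport. I hit three-pointers and alley-oop. "
    "I also dribble and catch passes.",
    "Tennis is my top sport. I hit and serve the ball in a way that they "
    "can't volley. Doubles is fun.",
]

# Hypothetical stopword list, chosen by hand for this toy example.
STOPWORDS = {"the", "i", "is", "my", "and", "a", "in", "to", "on", "also",
             "when", "that", "they", "can't", "way", "put"}


def tokenize(text):
    """Lowercase, keep hyphenated words and contractions, drop stopwords."""
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def fit_lda(docs_tokens, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs_tokens for w in doc})
    doc_topic = [[0] * n_topics for _ in docs_tokens]   # topic counts per document
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # total words per topic
    assignments = []

    # Start from a random topic assignment for every word token.
    for d, doc in enumerate(docs_tokens):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs_tokens):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document (each row sums to 1).
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs_tokens)
    ]
    return theta, topic_word


tokenized = [tokenize(d) for d in DOCS]
doc_topics, topic_words = fit_lda(tokenized)
for d, row in enumerate(doc_topics, start=1):
    print(f"Document {d}:", [round(p, 2) for p in row])
for t, counts in enumerate(topic_words):
    print(f"Topic {t} top words:", [w for w, _ in counts.most_common(4)])
```

Each row of `doc_topics` is the kind of per-document feature vector that can be passed to a regression or classifier. With only three short documents the topics themselves will be noisy, so in practice a library implementation and a much larger corpus would be the usual choice.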

13 of 25

Topic modeling example: what students write about is highly predictive of income and SAT score

[Figure: plots annotated with R² = .12, R² = .19, and R² = .49]

14 of 25

Word embedding: word meanings based on word neighbors

Document 1: The sport I love is baseball. I hit homeruns and doubles when I bat. I also put on my mitt to play catch.

Document 2: Basketball is my favorite sport. I hit three-pointers and alley-oop. I also dribble and catch passes.

Document 3: Tennis is my top sport. I hit and serve the ball in a way that they can't volley. Doubles is fun.

Output: vector representations of words given the words that appear near them
Use these vectors for prediction, sorting, experiments, etc. but also to model relationships between words (i.e., semantics, word similarities/differences)
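
As a rough illustration of "word meanings based on word neighbors", the sketch below builds a vector for each word by counting the words that appear within a small window around it, then compares words with cosine similarity. This is a count-based stand-in for the idea, not a trained embedding model such as word2vec or GloVe. The window size of 2 and the helper names are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

DOCS = [
    "The sport I love is baseball. I hit homeruns and doubles when I bat. "
    "I also put on my mitt to play catch.",
    "Basketball is my favorite sport. I hit three-pointers and alley-oop. "
    "I also dribble and catch passes.",
    "Tennis is my top sport. I hit and serve the ball in a way that they "
    "can't volley. Doubles is fun.",
]


def cooccurrence_vectors(docs, window=2):
    """Map each word to a Counter of the words seen within `window` positions.

    For simplicity the window ignores sentence boundaries inside a document.
    """
    vectors = defaultdict(Counter)
    for text in docs:
        tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(DOCS)
for a, b in [("baseball", "basketball"), ("baseball", "tennis"), ("hit", "catch")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 2))
```

Trained embedding models compress this kind of neighbor information into dense vectors, which is what makes word similarities and analogies workable on large corpora.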

15 of 25

Word embedding example: analogies

16 of 25

Using topic modeling to make predictions

My past work has shown strong relationships between application essays and different information about the applicant

Essays are strong predictors of gender, income, test scores, educational pathways, and other important pieces of information

In another paper ("Social Influences on Textual Production"), I analyze instead the relationship between college admissions essays and the geospatial context of the authors

How well can average document features (using topic models) in a given zipcode predict other things about that zipcode? For example, how well can they predict local economic exchanges?

Preprint also examines intersectional identities and will look very different from final version: https://osf.io/preprints/socarxiv/pt6b2/

17 of 25

Social Influences on Textual Production: Data

Two years of admissions essays submitted to the UC system

Matched applicant essays to their high school zipcodes
Also have information about gender, income, and first-gen status (more details of how we used these are in the preprint)
Applicants wrote 2 essays in first year, 4 essays in second year. So we analyzed them separately

18 of 25

Social Influences on Textual Production: Methods

Took the essays each applicant wrote and combined them into one super essay (so one document per student)
Then ran topic modeling on the combined essays and used the topics to predict different characteristics of the applicant's neighborhood

Prediction method: linear regression with 10-fold cross-validation (CV)
Primary result metric is R². Allows us to compare results with past work
Note: we take average topic features for each zipcode that appears in the data
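
The prediction steps above can be sketched roughly as follows: average a topic feature within each zipcode, fit a linear regression, and score it with 10-fold cross-validation using R². This is a simplified illustration on synthetic data, not the paper's pipeline. It uses a single topic feature where the actual analysis uses all topics, and the zipcode labels, sample sizes, and noise levels are invented for the example. R² is computed here on the pooled out-of-fold predictions, which is one common convention and an assumption about how the scores would be aggregated.

```python
import random
from collections import defaultdict
from statistics import mean


def average_by_zipcode(rows):
    """rows: (zipcode, topic_weight) per applicant -> {zipcode: mean weight}."""
    grouped = defaultdict(list)
    for zipcode, weight in rows:
        grouped[zipcode].append(weight)
    return {z: mean(ws) for z, ws in grouped.items()}


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def cross_validated_r2(xs, ys, k=10, seed=0):
    """R² of pooled out-of-fold predictions from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in fold:
            preds[i] = intercept + slope * xs[i]
    y_bar = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data: 50 hypothetical zipcodes with 5 applicants each.
rng = random.Random(0)
rows, outcome = [], {}
for n in range(50):
    zipcode = f"zip{n:02d}"                 # made-up label, not a real zipcode
    base = rng.uniform(0, 1)                # zipcode-level tendency for one topic
    for _ in range(5):
        rows.append((zipcode, base + rng.gauss(0, 0.05)))
    outcome[zipcode] = 3 * base + rng.gauss(0, 0.1)  # made-up zipcode feature

zip_means = average_by_zipcode(rows)
zipcodes = sorted(zip_means)
xs = [zip_means[z] for z in zipcodes]
ys = [outcome[z] for z in zipcodes]
r2 = cross_validated_r2(xs, ys, k=10)
print(f"10-fold CV R^2: {r2:.2f}")
```

Because the synthetic outcome is built directly from the topic feature, the R² here is high by construction and says nothing about the real results.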

Geospatial data

American Community Survey (ACS)
Data from Raj Chetty's social capital and economic mobility papers from 2022

19 of 25

Before the final analysis using all of the topics to predict features, we find that even individual topics are strongly correlated with ACS data

20 of 25

Same story for the social/economic features

Next step is to see how well all of the topics can predict these features

21 of 25

Strong predictors of geospatial information. Red line is from my work using essays to predict SAT scores; blue line is the relationship between income and test scores

22 of 25

Social Influences on Textual Production: Discussion

Some of the relationships between essays and geospatial characteristics are even higher than the relationship between an applicant's test score and their essays

Most of these are also stronger than the relationship between household income and test scores

Lots of possible future directions for this work, but it already provides another example of the strength of the relationship between social context and writing

Maybe we should reconsider our ideas of authorship in a general sense? If authorship is so tied to identity and context, what do we think we're getting from text?
Implications for the practice and process of college admissions as well

23 of 25

Wrapping up

I hope you can see the connection between the sociolinguistic perspective and the paper I presented

I also hope you can envision the connection between theories near and dear to you and text!

Looking ahead: life might end up becoming even more saturated in text with the advent of large language models

Careful, critical analysis typical of the social sciences will probably be needed more than ever

24 of 25

Questions/Comments/Concerns?

Thank you!

aalvero@ufl.edu

25 of 25

But what kinds of methodological frameworks can we use?

Text analysis fits into any kind of study design in social science

In this way, text analysis is for everyone! Can adapt to lots of different theoretical traditions too

Prediction
This text predicts Y better than this other X.
Adding text to the model increases prediction accuracy by X%

Quantitative
Text feature is correlated with X.
Treatment condition had effect of X on open ended survey responses.

Qualitative
X type of text used to signify Y
Image data described by AI using X kinds of language