1 of 25

Social Influences on the Production of Text

AJ Alvero

SICSS-South Florida 2023

2 of 25

Roadmap for today's talk

  • An overview of computational text analysis
  • An example of connecting theory to method (as well as some of my personal background)
  • Introduction to two key methods used for computational text analysis (topic modeling and word embeddings)
    • These will feed directly into the lesson later today
    • Neither this talk nor the coding session will be mathy
  • Examples and results from a preprint
  • So this is a fairly applied talk

3 of 25

Computational text analysis for social science: big picture

  • There is a lot of interest in using computational tools/methods to analyze text. Why?
    • Availability of (big, novel, cool) data
    • Faster computational processing speed
  • Yeah yeah, but why social science? Why should we care about text?
    • This is a harder question, especially if you see yourself as a "numbers" person
    • My argument: text (including transcribed audio) captures or could capture insights into basically any social process and/or outcome
  • For example...

4 of 25

Example: Stereotype shifts over time in newspaper data (PNAS)

5 of 25

Example: Disparities in scientific uptake with dissertation abstracts (PNAS)

6 of 25

Example: Variation in rental postings across neighborhood racial demographics (Social Forces)

Results show that listings from White neighborhoods emphasize trust and connections to neighborhood history and culture, while listings from non-White neighborhoods offer more incentives and focus on transportation and development features, sundering these units from their surroundings.

7 of 25

Example: Tweets are a better predictor of county-level heart disease mortality than common risk factors (Psychological Science)

A cross-sectional regression model based only on Twitter language predicted AHD mortality significantly better than did a model that combined 10 common demographic, socioeconomic, and health risk factors, including smoking, diabetes, hypertension, and obesity.

8 of 25

Computational text analysis for social science: big picture

  • These are just a snapshot though, and perhaps you've heard of these and other studies which used text data.
  • So let's brainstorm! We'll do a poll about everyone's research interests/topics and generate ideas about how we can study them using text data.
    • Sources of data?
    • Possible results/findings?
    • Experiments? Natural data?

pollev.com/ajalvero275

9 of 25

My background, interests, and theoretical perspectives

  • A lot of my past research: analyzing a dataset of 800k+ college admissions essays submitted to the University of California system
    • Used to be a high school English teacher, so I have a lot of experience helping students write these things
  • I also did a lot of after school tutoring for the SAT and ACT, but the essays were always a little confusing and tricky to navigate
    • When does a student's personal statement become "oversharing"?
    • With all the controversies about test scores in admissions, what about the essays?
  • To answer these questions, I focus on authorship patterns along various dimensions
  • I draw a lot of theory from an entire subfield dedicated to the relationship between the social and the linguistic: sociolinguistics

10 of 25

Example of theory-informed CSS: sociolinguistic perspective

  • The sociolinguistic perspective is that there is a strong relationship between one's identity and their use of language
    • Popularized by sociolinguists like William Labov
  • For example, Labov found that pronunciation of /r/ tracked with social class indicators, e.g., different brands of department stores
  • If these differences in spoken language are not coincidental but systematic, as sociolinguists suggest, I want to see whether the same patterns hold for text data. If they do, we might need to change how we think about text/writing

11 of 25

So how can we show these types of patterns in text?

  • How can we show patterns in text à la Labov? There are many tools to consider, many of which focus on word co-occurrences in different ways
  • In this talk I want to quickly go over two popular methods: topic modeling and word vectors
  • We're going to try them out in the coding session later today
  • In a few slides I will provide an example from my own work that used topic models to generate numerical representations of text

12 of 25

Topic modeling: document content via word co-occurrences

Document 1: The sport I love is baseball. I hit home runs and doubles when I bat. I also put on my mitt to play catch.

Document 2: Basketball is my favorite sport. I hit three-pointers and alley-oops. I also dribble and catch passes.

Document 3: Tennis is my top sport. I hit and serve the ball in a way that they can't volley. Doubles is fun.

  • Output: numerical representations of "topics": word weights for each topic and topic weights for each document
  • Use these numerical representations for prediction, sorting, experiments, etc.
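
To make the idea concrete, here is a minimal sketch of a topic model fit to the three toy documents. The talk does not say which topic model or software is used, so everything here is an assumption for illustration: it uses LDA fit with collapsed Gibbs sampling, two topics, a hand-picked stopword list, and hypothetical function names (`tokenize`, `fit_lda`). With only three short documents the topics will be noisy; the point is the shape of the output.

```python
import random
import re
from collections import Counter

# The three toy documents from the slide.
docs = [
    "The sport I love is baseball. I hit home runs and doubles when I bat. "
    "I also put on my mitt to play catch.",
    "Basketball is my favorite sport. I hit three-pointers and alley-oops. "
    "I also dribble and catch passes.",
    "Tennis is my top sport. I hit and serve the ball in a way that they "
    "can't volley. Doubles is fun.",
]

# Hypothetical stopword list, chosen by hand for this example.
STOPWORDS = {"the", "i", "is", "my", "and", "also", "a", "in", "on", "to",
             "that", "they", "when", "way", "can't", "put"}


def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    words = re.findall(r"[a-z'-]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def fit_lda(token_docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (assumed method, not the talk's)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in token_docs for w in doc})
    doc_topic = [[0] * n_topics for _ in token_docs]   # topic counts per doc
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(token_docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(token_docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions for each document (each row sums to 1).
    theta = [
        [(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
        for row in doc_topic
    ]
    return theta, topic_word


token_docs = [tokenize(d) for d in docs]
doc_topics, topic_words = fit_lda(token_docs)

for d, row in enumerate(doc_topics, start=1):
    print(f"Document {d}:", [round(p, 2) for p in row])
for k, counts in enumerate(topic_words):
    print(f"Topic {k}:", [w for w, c in counts.most_common(5) if c > 0])
```

The per-document rows in `doc_topics` are the numerical representations that can then be fed into prediction or sorting.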

13 of 25

Topic modeling example: what students write about is highly predictive of income and SAT score

[Figure: plot annotations R² = .12, R² = .19, R² = .49]

14 of 25

Word embedding: word meanings based on word neighbors

Document 1: The sport I love is baseball. I hit home runs and doubles when I bat. I also put on my mitt to play catch.

Document 2: Basketball is my favorite sport. I hit three-pointers and alley-oops. I also dribble and catch passes.

Document 3: Tennis is my top sport. I hit and serve the ball in a way that they can't volley. Doubles is fun.

  • Output: vector representations of words given the words that appear near them
  • Use these vectors for prediction, sorting, experiments, etc., but also to model relationships between words (i.e., semantics, word similarities/differences)
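
As a rough illustration of "word meanings based on word neighbors", the sketch below builds count-based co-occurrence vectors from the three toy documents and compares words with cosine similarity. This is a simple stand-in, not a learned embedding model such as word2vec, and the window size and helper names (`cooccurrence_vectors`, `cosine`) are assumptions made for this example. Each document is treated as one stream of words, so the window can cross sentence boundaries.

```python
import math
import re

# The three toy documents from the slide.
docs = [
    "The sport I love is baseball. I hit home runs and doubles when I bat. "
    "I also put on my mitt to play catch.",
    "Basketball is my favorite sport. I hit three-pointers and alley-oops. "
    "I also dribble and catch passes.",
    "Tennis is my top sport. I hit and serve the ball in a way that they "
    "can't volley. Doubles is fun.",
]


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z'-]+", text.lower())


def cooccurrence_vectors(token_docs, window=2):
    """Represent each word by counts of the words within `window` positions."""
    vocab = sorted({w for doc in token_docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for doc in token_docs:
        for i, w in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[doc[j]]] += 1
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab, vectors = cooccurrence_vectors([tokenize(d) for d in docs])

# Words that keep similar company get more similar vectors.
for a, b in [("baseball", "basketball"), ("baseball", "tennis"), ("hit", "catch")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 2))
```

With a corpus this small the similarities are not meaningful on their own; the same idea only becomes informative with far more text.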

15 of 25

Word embedding example: analogies
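
The slide's figure is not reproduced here, so the following is only a sketch of the usual analogy idea: subtract one word vector, add another, and look for the nearest remaining word. The three-dimensional vectors below are made up by hand for illustration (real embeddings are learned from text and have many more dimensions), and the `analogy` helper is hypothetical.

```python
import math

# Hand-made toy vectors (hypothetical values, not learned from any corpus).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
    "child": [0.1, 0.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, then pick the most similar remaining word.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "king", "woman", toy_vectors))
```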

16 of 25

Using topic modeling to make predictions

  • My past work has shown strong relationships between application essays and different information about the applicant
    • Essays are strong predictors of gender, income, test scores, educational pathways, and other important pieces of information
  • In another paper ("Social Influences on Textual Production"), I analyze instead the relationship between college admissions essays and the geospatial context of the authors
    • How well can average document features (using topic models) in a given zipcode predict other things about that zipcode? For example, how well can they predict local economic exchanges?
  • Preprint also examines intersectional identities and will look very different from final version: https://osf.io/preprints/socarxiv/pt6b2/

17 of 25

Social Influences on Textual Production: Data

  • Two years of admissions essays submitted to the UC system
  • Matched applicant essays to their high school zipcodes
  • Also have information about gender, income, and first-gen status (more details of how we used these are in the preprint)
  • Applicants wrote 2 essays in the first year and 4 essays in the second year, so we analyzed the two years separately

18 of 25

Social Influences on Textual Production: Methods

  • Took the essays each applicant wrote and combined them into one super essay (so one document per student)
  • Then ran topic modeling on the essays and used the topics to predict different characteristics of the applicant's neighborhood
    • Prediction method: linear regression with 10-fold cross-validation (CV)
    • Primary result metric is R². This allows us to compare results with past work
    • Note: we take average topic features for each zipcode that appears in the data
  • Geospatial data
    • American Community Survey (ACS)
    • Data from Raj Chetty's social capital and economic mobility papers from 2022
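
The pipeline above (average topic features per zipcode, linear regression, 10-fold CV, R²) can be sketched as follows. This is not the paper's code: the data are synthetic, the function names are hypothetical, and R² is computed once from the pooled out-of-fold predictions, which is one of several reasonable conventions. Because topic proportions sum to 1, a real analysis with an intercept would need to drop one topic or regularize; the synthetic features here are independent, so that issue does not arise.

```python
import random
from collections import defaultdict


def mean_by_zipcode(rows):
    """rows: (zipcode, topic_vector) per applicant -> {zipcode: mean vector}."""
    grouped = defaultdict(list)
    for zipcode, vec in rows:
        grouped[zipcode].append(vec)
    return {z: [sum(col) / len(vecs) for col in zip(*vecs)]
            for z, vecs in grouped.items()}


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_ols(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    Xb = [[1.0] + list(row) for row in X]
    p = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]


def cross_validated_r2(X, y, k=10, seed=0):
    """R^2 from out-of-fold predictions pooled across k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        coef = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(fold, predict(coef, [X[i] for i in fold])):
            preds[i] = p
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Synthetic stand-in data: 60 zipcodes, 5 applicants each, 2 topic features.
rng = random.Random(42)
rows = []
for z in range(60):
    for _ in range(5):
        rows.append((f"zip{z:03d}", [rng.random(), rng.random()]))

zip_means = mean_by_zipcode(rows)
zipcodes = sorted(zip_means)
X = [zip_means[z] for z in zipcodes]
# Made-up zipcode-level outcome: a linear signal plus a little noise.
y = [2.0 * t1 - 1.0 * t2 + rng.gauss(0, 0.05) for t1, t2 in X]

r2 = cross_validated_r2(X, y, k=10)
print(f"10-fold cross-validated R^2: {r2:.2f}")
```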

19 of 25

  • Before the final analysis using all of the topics to predict features, we find that even individual topics are strongly correlated with ACS data
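
A single-topic check like this amounts to a correlation between one zipcode-level topic average and one zipcode-level variable. The sketch below computes a Pearson correlation by hand; the numbers are invented for illustration and are not results from the paper, and the variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented example values, one entry per zipcode.
topic_share = [0.05, 0.10, 0.12, 0.20, 0.25, 0.31]  # mean weight of one topic
median_income = [41, 48, 55, 63, 70, 82]            # in thousands of dollars

print(round(pearson(topic_share, median_income), 2))
```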

20 of 25

  • Same story for the social/economic features
  • Next step is to see how well all of the topics can predict these features

21 of 25

  • Essay topics are strong predictors of geospatial information. Red line is the benchmark from my work using essays to predict SAT scores; blue line is the relationship between income and test scores

22 of 25

Social Influences on Textual Production: Discussion

  • Some of the relationships between essays and geospatial characteristics are even stronger than the relationship between an applicant's test score and their essays
    • Most of these are also stronger than the relationship between household income and test scores
  • Lots of possible future directions for this work, but it already provides another example of the strength of the relationship between social context and writing
    • Maybe we should reconsider our ideas of authorship in a general sense? If authorship is so tied to identity and context, what do we think we're getting from text?
    • Implications for the practice and process of college admissions as well

23 of 25

Wrapping up

  • I hope you can see the connection between the sociolinguistic perspective and the paper I presented
    • I also hope you can envision connections between text and the theories near and dear to you!
  • Looking ahead: life might end up becoming even more saturated in text with the advent of large language models
    • Careful, critical analysis typical of the social sciences will probably be needed more than ever

24 of 25

Questions/Comments/Concerns?

Thank you!

aalvero@ufl.edu

25 of 25

But what kinds of methodological frameworks can we use?

  • Text analysis fits into any kind of study design in social science
  • In this way, text analysis is for everyone! Can adapt to lots of different theoretical traditions too

  • Prediction
    • This text predicts Y better than this other X.
    • Adding text to the model increases prediction accuracy by X%.
  • Quantitative
    • Text feature is correlated with X.
    • Treatment condition had effect of X on open-ended survey responses.
  • Qualitative
    • X type of text used to signify Y.
    • Image data described by AI using X kinds of language.