1 of 53

Language and computers...and society!

Ian Stewart

College of Computing

13 Nov 2018

2 of 53

Who?

  • PhD student in Human-Centered Computing.
  • Interested in social applications of natural language processing systems.
  • How can we use NLP to find trends in language change?
  • How is political affiliation related to language choice?
  • When predicting the success of new words, is it more important to consider social or linguistic factors?

4 of 53

Is language cognitive or social?

  • Language is processed in the brain AND people use it to communicate.
  • People use language to (1) express themselves, (2) build relationships, (3) situate themselves in society.
  • We already know that computers can predict general language patterns…

Can computers help us understand the social side of language?

5 of 53

Overview

  • What can language tell us about the social world?
    • Speakers
    • Relationships
    • Society

7 of 53

Speaker identity and language use

  • Language is a tool for self-presentation: how you consciously and unconsciously present yourself to others (Goffman 1959).
  • How can someone’s identity affect how they speak?
    • Gender: socialized gender roles
    • Age: acquired language from childhood
    • Ethnicity: cultural heritage and background
    • Political stance: value system
    • Geography: regional dialects

8 of 53

Language and gender

  • Speaker gender is often reflected in language use, due to social norms based on gender roles (Eckert 1989).
    • If men are expected to act “tough”, this may result in more aggressive language, such as swearing (Bucholtz 1989).
  • Example: stereotypes about gender are often reflected in popular culture and in language style.

9 of 53

Speaker identification

  • Problem: Can we identify aspects of a speaker’s identity based on their language use?
  • Applications: targeted advertising, better demographic estimates for platform research

10 of 53

Speaker identification

  • Easy situation: Speaker self-discloses identity information.
    • “I am a man [he/him/his]”
  • Hard situation: Speaker does not self-disclose.
    • “Hey dude what’s up”
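
A minimal sketch of how a system might catch the easy case with regular expressions; the patterns are my own illustrative guesses, not an exhaustive list:

```python
import re

# Illustrative patterns (my own, not exhaustive) for explicit self-disclosure.
DISCLOSURE_PATTERNS = [
    re.compile(r"\bI(?:'m| am) a (?:man|woman)\b", re.IGNORECASE),
    re.compile(r"\b(?:he/him(?:/his)?|she/her(?:/hers)?|they/them(?:/theirs)?)\b", re.IGNORECASE),
]

def find_disclosures(text):
    """Return any explicit gender self-disclosures found in the text."""
    hits = []
    for pattern in DISCLOSURE_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits

print(find_disclosures("I am a man [he/him/his]"))  # easy case: two matches
print(find_disclosures("Hey dude what's up"))       # hard case: no matches
```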

15 of 53

Identity data 1: names

  • Can you guess someone’s gender from their name?
  • Exercise: https://gender-api.com/
  • Aspects of names that help computers:
    • Suffixes (-a => female, -o => male in Italian, Spanish)
    • Consonants (word-final stop => male in English: Brett, Bob, Nick)
  • Aspects of names that don’t help:
    • Short and ambiguous forms (Alex => Alexander or Alexandra)
    • Arbitrariness (Ashley => once mostly male, now mostly female)
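
A toy sketch of the helpful cues above written as rules; real services like gender-api.com rely on large name databases, and rules like these fail on exactly the ambiguous cases listed:

```python
# Toy rules for the cues above; None marks the ambiguous cases.
def guess_gender_from_name(name):
    name = name.strip().lower()
    if name.endswith("a"):       # -a => often female in Italian/Spanish (Maria, Paola)
        return "female"
    if name[-1] in "bdgkpt":     # word-final stop => often male in English (Brett, Bob, Nick)
        return "male"
    return None                  # short forms like "Alex" stay unresolved

for name in ["Paola", "Brett", "Alex"]:
    print(name, "=>", guess_gender_from_name(name))
```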

17 of 53

Identity data 2: speech style

  • Can you guess someone’s gender just based on their speech?
  • World Well-Being project: http://sgiorgi.pythonanywhere.com/
  • Aspects of language that help computers:
    • Exclamations (OMG, LOL)
    • Emoticons :-)
    • Pronouns (he, she, they)
    • Kinship terms (mom, kids, auntie)
    • Swears, taboo words
    • Grammar (French “je suis allée” => female)
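
A minimal sketch of extracting these style cues as counts; the word lists here are small illustrative samples, not the World Well-Being Project's actual features:

```python
import re
from collections import Counter

# Small illustrative cue lists, not the project's real features.
EXCLAMATIONS = {"omg", "lol"}
KINSHIP = {"mom", "kids", "auntie"}
PRONOUNS = {"he", "she", "they"}
EMOTICON = re.compile(r"[:;]-?[)(D]")

def style_features(text):
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return {
        "exclamations": sum(counts[w] for w in EXCLAMATIONS),
        "kinship": sum(counts[w] for w in KINSHIP),
        "pronouns": sum(counts[w] for w in PRONOUNS),
        "emoticons": len(EMOTICON.findall(text)),
    }

print(style_features("OMG my mom just texted me :-) lol"))
# {'exclamations': 2, 'kinship': 1, 'pronouns': 0, 'emoticons': 1}
```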

20 of 53

Identity data 2: speech style

  • Do these prediction factors imply that people are inherently different based on their gender?
  • No!
    • Accurate prediction ≠ self-perception.
    • Gender roles (like other aspects of identity) are socially constructed; there are plenty of examples of women who “speak like men” and vice versa.
    • Macro-level differences can overshadow micro-level effects such as accommodation (men who spend time with women may come to speak more like them).
    • Some of the differences are stereotypical: if “wrestling” predicts MALE, that doesn’t mean that all men like wrestling.

21 of 53

Typical identification system

  1. Choose demographic.
  2. Collect examples of text written by demographic.
  3. Extract features from text (words, style, grammar).
  4. Train model on data subset to predict demographic.
  5. Test model on extra dataset to determine model efficacy.
  6. Examine individual features to determine predictive power.

[Diagram: extracted features (“husband=1, dad=1, ...”) fed to a model that predicts MALE]
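
A minimal sketch of steps 3 to 6 with scikit-learn, using invented toy data; a real system would train on thousands of labeled authors:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 2: toy labeled examples (invented for illustration).
texts = [
    "my husband and my kids are at home",
    "my wife said the game is on tonight",
    "omg lol my mom is the best",
    "me and the dudes are watching wrestling",
]
labels = ["FEMALE", "MALE", "FEMALE", "MALE"]

# Step 3: extract bag-of-words features.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

# Steps 4-5: train on one half, test on the held-out half.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.5, stratify=labels, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Step 6: positive weights push predictions toward MALE, negative toward FEMALE.
for word, weight in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    print(word, round(weight, 2))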

23 of 53

What could go wrong?

  1. Demographic labels are poorly defined (e.g. “gender” inferred from ambiguous names).
  2. Text examples are not spontaneous or socially oriented (e.g. using female and male authors of scientific papers).
  3. Extracted features are too sparse (“I am a man” is predictive but unlikely to appear in test data).
  4. Extracted features are too domain-specific to generalize (“RT” => female tells us little about speech style).

24 of 53

Practical questions

  1. What aspects of identity are harder for humans to detect?
  2. What aspects of identity are harder for computers to detect?
  3. What can language not tell us about speaker identity?

25 of 53

Inferring relationships

  • Language is used to negotiate relationships among friends, family, and colleagues (Tannen 1987).
  • People often plan their speech to build social rapport and avoid conflict.
    • Politeness strategies
    • Rhetorical style
    • Power dynamics
  • A speaker’s stance toward other speakers can be inferred from their conversational behavior.

26 of 53

Inferring relationships

  • Problem: Can computers determine the status of a relationship based on the language used between speakers?
  • Applications: detecting abuse, improving relationships

28 of 53

Politeness

  • Assumption: People should speak politely to one another in order to have productive conversations and build relationships.
  • What are some characteristics of politeness? (Brown and Levinson 1987)
    • Negative: avoids infringing on the hearer’s autonomy (e.g. hedging).
    • Positive: reinforces the hearer’s self-image and expectations (e.g. a white lie).
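
A minimal sketch of counting cues for each strategy; the cue lists are illustrative toys, not a validated politeness lexicon:

```python
HEDGES = {"maybe", "perhaps", "might", "possibly"}     # negative politeness
RAPPORT = {"thanks", "please", "great", "appreciate"}  # positive politeness

def politeness_cues(text):
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return {
        "hedges": sum(t in HEDGES for t in tokens),
        "rapport": sum(t in RAPPORT for t in tokens),
    }

print(politeness_cues("Maybe we could try another approach? Thanks!"))
# {'hedges': 1, 'rapport': 1}
```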

30 of 53

Exercise: politeness online

  • “It’s stupid and wrong” → 0.87
  • “Climate change is happening and it’s not changing in our favor. If you think differently you’re an idiot.” → 0.96
  • “Screw you trump supporters” → 0.87

31 of 53

Exercise: politeness online

  • Tool: http://perspectiveapi.com (scroll down to test demo)
  • Try to break it!
  • What do you think the tool has learned about “toxic language”?
  • How might this tool improve/hurt relationship management online?
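
For programmatic access beyond the demo page, the Perspective API exposes a REST endpoint; here is a sketch using the Python requests library. You need to request your own API key, and the request shape below reflects the v1alpha1 API as documented around the time of this talk:

```python
import requests

API_KEY = "YOUR_API_KEY"  # request one from the Perspective API site
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=" + API_KEY)

def toxicity(text):
    """Return the comment's summary TOXICITY score (0 to 1)."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=body)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity("Screw you trump supporters"))  # the demo above scored this ~0.87
```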

32 of 53

Exercise: politeness online

  • What do you think the tool has learned about “toxic language”?
    • Swearing = bad.
    • Personal attacks = bad.
    • Negation = complicated.
  • How might this tool improve/hurt relationship management online?
    • Encourage self-reflection among writers.
    • Help community moderators find the worst comments faster.
    • Impose normative standards on language use.

33 of 53

Exercise: toxic language

34 of 53

Politeness application: police

  • Police officers are expected to protect and serve their constituents, which includes treating citizens with respect.
  • What can language analysis reveal about police interactions with citizens? (Voigt et al. 2017)

35 of 53

Exercise: respect

  1. [name], can I see that driver’s license again? It’s showing suspended. Is that you?
  2. Sorry to stop you. My name’s Officer [name] with the Police Department.
  3. It just says that, uh, you’ve fixed it. No problem. Thank you very much, sir.
  4. All right, my man. Do me a favor. Just keep your hands on the steering wheel real quick.
  5. There you go, ma’am. Drive safe, please.
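
A minimal sketch in the spirit of Voigt et al.'s feature analysis: count a few respect markers (formal titles, apologies, reassurance) per utterance. The cue lists are illustrative, not their actual feature set:

```python
# Illustrative cue lists, not Voigt et al.'s actual feature set.
FORMAL_TITLES = {"sir", "ma'am"}
APOLOGIES = {"sorry"}
REASSURANCE = {"no problem", "there you go", "drive safe"}

def respect_cues(utterance):
    lower = utterance.lower()
    tokens = [t.strip(".,!?") for t in lower.split()]
    return {
        "formal_title": sum(t in FORMAL_TITLES for t in tokens),
        "apology": sum(t in APOLOGIES for t in tokens),
        "reassurance": sum(phrase in lower for phrase in REASSURANCE),
    }

print(respect_cues("There you go, ma'am. Drive safe, please."))
# {'formal_title': 1, 'apology': 0, 'reassurance': 2}
```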

37 of 53

Practical questions

  1. What would a politeness system infer about your relationships online?
  2. Which features are likely to produce false positives/negatives in a politeness analysis system?
  3. Would you use a writing helper that points out examples of politeness?

38 of 53

What can language not tell us?

  • Lots of interaction is embodied rather than spoken.
  • Shared context may not be detectable, e.g. vague phrases that carry meaning based on prior world knowledge (“that place”).

39 of 53

Language and society

  • Small-scale interactions between people shape large-scale patterns in language (Labov 1963).
  • Example: repeat interactions among people in a specific geographic area lead to dialect differences.

40 of 53

Society: mapping variation

  • People speak differently depending on their area of origin (Trudgill 1974): Southerners say “y’all”, Californians say “hella”, Northerners say “wicked.”
  • These differences often mark historical patterns (migration), geography (isolation), and cultural regions.
  • How can computers aggregate data at scale to find language patterns across space?

41 of 53

Society: mapping variation

  • Exercise: https://www.nytimes.com/interactive/2014/upshot/dialect-quiz-map.html
  • Take the quiz, compare with others.
    • Is it correct?
    • What words did you not expect to find?
    • Would you add/remove any words from the list?

42 of 53

Society: mapping variation

[Figure: Ian’s map]

43 of 53

Society: how to find variables?

  • Typical approach: have linguists read data and identify possible variables manually, verify with interviews.
  • Quantitative approach: collect lots of geotagged data and find words that are highly predictive of specific regions.
  • Example: geographic topic model.
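
A minimal sketch of the quantitative approach using pointwise mutual information rather than a full geographic topic model; the counts are toy data invented for illustration:

```python
from collections import Counter
from math import log

# Toy (region, word) observations standing in for geotagged posts.
observations = [
    ("South", "y'all"), ("South", "y'all"), ("South", "hella"),
    ("California", "hella"), ("California", "hella"), ("California", "y'all"),
]

pair_counts = Counter(observations)
region_counts = Counter(region for region, _ in observations)
word_counts = Counter(word for _, word in observations)
total = len(observations)

def pmi(region, word):
    """Pointwise mutual information: how over-represented is word in region?"""
    joint = pair_counts[(region, word)] / total
    return log(joint / ((region_counts[region] / total) * (word_counts[word] / total)))

print(round(pmi("South", "y'all"), 2))  # positive: over-represented in the South
print(round(pmi("South", "hella"), 2))  # negative: under-represented
```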

47 of 53

Society: search trends

  • The medical information that people search for is often related to real-life health problems, like the flu.
  • How accurately can we predict health outcomes with the language that people use to search?
  • Google Flu Trends: https://www.google.com/publicdata/explore?ds=z3bsqef7ki44ac_
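
A minimal sketch of the underlying idea: fit a linear model from weekly query frequency to reported flu rates. The numbers are invented; the real Google Flu Trends model used many queries and a log-odds transform:

```python
import numpy as np

# Toy weekly data: flu-related query frequency and reported flu rate.
query_freq = np.array([0.2, 0.5, 0.9, 1.4, 1.1, 0.6])
flu_rate = np.array([1.0, 2.1, 3.8, 5.9, 4.7, 2.4])

# Least-squares fit of flu_rate ≈ a * query_freq + b.
a, b = np.polyfit(query_freq, flu_rate, deg=1)
print(f"predicted flu rate when query_freq = 1.0: {a * 1.0 + b:.1f}")
```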

48 of 53

Practical questions

  • Do you believe the output of the dialect survey? How well does it match your intuition of how people talk?
  • Should a doctor trust a system that predicts flu outbreaks based on search queries? What about tweets?
  • How should a computer system like machine translation handle new words?

50 of 53

Recap

  • Language reveals demographic information about speakers.
  • Language reveals insight about relationships.
  • Language reveals large-scale trends in society.

“The common misconception [is] that language has primarily to do with words and what they mean. It doesn’t. It has primarily to do with people and what they mean.” (Clark and Schober, 1992)

51 of 53

Disclaimer

  • Heavy focus on social norms in the United States and on Standard American English.
  • Cultural norms may vary!

52 of 53

Want more info?
