1 of 41

Additional explorations in text analysis

Kevin Lanning

SICSS – South Florida

2 of 41

Overview: Three projects

Using language to identify scholarly communities (brief)
Personality and ego development (not brief)
Word use in news transcripts (brief)

Papers are in Zotero and links to R code are (mostly) in the papers - or just ask me.

3 of 41

Language of scholarly communities

Code in Google drive

4 of 41

Language of scholarly communities

Code in Google drive

5 of 41

Personality and ego development

Background
Study 1: Examined words and LIWC categories characteristic of each ego level in sentence completions (Nature Human Behaviour, 2018)
Study 2: Expanded the ego lexicon with broader dictionaries (SPSP 2019)
Study 3: Moved from words to texts: compared several ways of scoring ego level and applied these to sentence completions and blogs (SPSP 2020)

6 of 41

personality > traits

7 of 41

“maturity”

The construct(s) of ego development

Level	Cognitive	Social	Moral
Autonomous/Integrated (8-9)	Fulfillment	Interdependence	Complexity
Individualistic (7)	Development	Mutuality	Tolerance
Conscientious (6)	Achievement	Responsibility	Self-criticism
Self-Aware (5)	Adjustment	Helpfulness	Exceptions
Conformist (4)	Appearances	Loyalty	Obedience
Self-Protective (3)	Trouble	Wariness	Opportunism
Impulsive (2)	Good vs. bad	Solipsism	Urges

One other important parameter of individual differences is, like intelligence, outside of the five factors. It might roughly be called “maturity.”

We can think of maturity as a single dimension, or as the sequential waxing and waning of a series of what are typically called stages or levels. These terms are perhaps misleading; the levels are better described as an ordered series of guiding frameworks or schemas.

At the earliest stages, the core concern is with impulse. The heart of the model lies in the middle categories, where most people in Western, developed countries may be found, and in the tension between conformity and conscience. At the highest stages there are more differentiated concerns with the self.

Breadth: the construct spans cognitive (increasingly sophisticated preoccupations), social (a broadening sphere of concern), and moral facets of complexity.

8 of 41

The measure: Washington University Sentence Completion Test (WUSCT)

When a child will not join in group activities…

Impulsive (2)	… they are sick

Self-Protective (3)	… give him 2 choices, join or sit by himself

Conformist (4)	… he might be tired

Self-Aware (5)	… I wonder what is wrong

Conscientious (6)	… I wonder if he doesn't feel good about himself

Individualistic (7)	… it may be a healthy or unhealthy sign

Autonomous/Integrated (8-9)	… it's sometimes a reflection on the group, not the child.

9 of 41

Overview / goals

Compile as many scored responses to the WUSCT as possible
Elucidate the construct of ego development using language analysis of both LIWC categories and individual words
- is there evidence supporting a stage model?
- how are stages of development expressed in language?
Provide a starting point for ego level as a culturomic tool

10 of 41

The construct(s) of ego development


11 of 41

The data

Sample	Impulsive (2)	Self-Protective (3)	Conformist (4)	Self-Aware (5)	Conscientious (6)	Individualistic (7)	Autonomous/Integrated (8-9)	Total
Responses at each level
Univ I	63	184	1287	1622	1006	178	24	4364
Univ II	187	1045	6746	9933	7081	1407	159	26558
Midlife	27	120	1131	1945	1640	479	76	5418
Exemplar	141	402	1007	2167	2412	1130	334	7593
Total	418	1751	10171	15667	12139	3194	593	43933
Words coded at each level
N words	1456	7876	47969	103618	108911	39925	10655	320410
N distinct words	398	1413	3273	5493	6129	3932	2032	10670

12 of 41

Some LIWC results

13 of 41

Ego level as sequence

	2	3	4	5	6	7	8-9
Impulsive (2)		0.90	0.85	0.83	0.80	0.75	0.70
Self-Protective (3)	0.90		0.90	0.91	0.88	0.84	0.80
Conformist (4)	0.84	0.90		0.95	0.92	0.88	0.83
Self-Aware (5)	0.83	0.91	0.95		0.98	0.96	0.93
Conscientious (6)	0.79	0.87	0.91	0.98		0.99	0.97
Individualistic (7)	0.75	0.83	0.87	0.96	0.99		0.98
Autonomous/ Integrated (8-9)	0.70	0.79	0.83	0.93	0.97	0.98

Correlations between word use at different levels support a simplex model.

Correlations between word counts across all 10670 terms (above the diagonal) or the most common 1811 terms (below the diagonal).
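
One way to make the simplex claim concrete is to average the correlations by the distance between levels: in a simplex-like pattern, levels that are further apart correlate less. The sketch below is illustrative only (the analyses in the papers were done in R). It copies the upper triangle of the table above and uses hypothetical names (`upper_rows`, `mean_by_lag`).

```python
# Upper triangle of the correlation table above, one row per level,
# listing correlations with each successively higher level.
upper_rows = [
    [0.90, 0.85, 0.83, 0.80, 0.75, 0.70],  # Impulsive (2) with 3 .. 8-9
    [0.90, 0.91, 0.88, 0.84, 0.80],        # Self-Protective (3) with 4 .. 8-9
    [0.95, 0.92, 0.88, 0.83],              # Conformist (4) with 5 .. 8-9
    [0.98, 0.96, 0.93],                    # Self-Aware (5) with 6 .. 8-9
    [0.99, 0.97],                          # Conscientious (6) with 7 .. 8-9
    [0.98],                                # Individualistic (7) with 8-9
]

def mean_by_lag(rows):
    """Average correlation for each distance (lag) between levels."""
    by_lag = {}
    for row in rows:
        for offset, r in enumerate(row):
            by_lag.setdefault(offset + 1, []).append(r)
    return {lag: sum(vals) / len(vals) for lag, vals in sorted(by_lag.items())}

lag_means = mean_by_lag(upper_rows)

# Simplex-like: mean correlation never increases as levels get further apart.
is_simplex_like = all(
    lag_means[k] >= lag_means[k + 1] for k in range(1, len(lag_means))
)

for lag, m in lag_means.items():
    print(f"lag {lag}: mean r = {m:.3f}")
print("simplex-like:", is_simplex_like)
```

This is a descriptive check, not a formal test of the simplex model; individual rows can deviate slightly (for example, 0.90 then 0.91 in the Self-Protective row) even when the lag averages decline.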

14 of 41

15 of 41

16 of 41

17 of 41

18 of 41

19 of 41

20 of 41

21 of 41

22 of 41

23 of 41

Expanding the model

For each of seven levels, reduce terms into a set of homogeneous facets.
Assess cosine similarities between these facets and vectors of the lexicon (using the Common Crawl).
Weight these vectors and combine them into new measures of seven ego levels.
Combine these seven measures (Impulsive … Autonomous) into a single ego score.
Apply these scores to new corpora – here, offensive tweets and ads in the 2016 presidential campaign (considered briefly) and presidential speeches (considered at greater length).
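
The cosine-similarity step can be sketched as follows. This is a minimal illustration in Python, not the code used in the studies. The three-dimensional `embeddings` are made up for the example (real work would load pretrained vectors), and the function names and the 0.5 threshold are assumptions.

```python
import math

# Hypothetical 3-d embeddings, invented for illustration only.
embeddings = {
    "hate":     [0.90, 0.10, 0.00],
    "fight":    [0.80, 0.20, 0.10],
    "violence": [0.85, 0.15, 0.05],
    "hatred":   [0.88, 0.12, 0.02],
    "nice":     [0.10, 0.90, 0.10],
    "growth":   [0.05, 0.20, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(words):
    """Mean vector of the seed words that define a facet."""
    vecs = [embeddings[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def expand_facet(seed_words, threshold=0.5):
    """Return non-seed words whose similarity to the facet centroid
    meets the (assumed) threshold, most similar first."""
    c = centroid(seed_words)
    scored = [
        (w, cosine(c, v))
        for w, v in embeddings.items()
        if w not in seed_words
    ]
    keep = [(w, s) for w, s in scored if s >= threshold]
    return sorted(keep, key=lambda pair: -pair[1])

# Seed words taken from the Impulsive "aggression" facet.
expanded = expand_facet(["hate", "fight", "violence"])
print(expanded)
```

With these toy vectors only "hatred" joins the facet. Facet weights and the combination into level scores are left out of the sketch.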

24 of 41

Expanding the ego lexicon

level	nWords	Facet Weight	facet	Original dictionaries: words in SCTs
2	4	0.9	aggression	bothers, fight, hate, violence
2	2	0.7	banal-hyperbolic	amazing, awesome
2	6	1	banal-cool	cool, fine, liked, nice, ok, okay
2	4	0.8	prohibition	cant, nobody, nothing, rules
…	…	…	…	…

Impulsive-aggression words in expanded dictionary:

hate (.78), violence, fight, bothers, hating, hatred, dislike, fights, racism, fighting, disrespect, annoys, hates, bashing, disgusts, think, injustice, bigotry, complain, blaming, violent, hated, bullying, pisses, bullies, scares, loathe, anymore, animosity, misogyny, irks, bullshit (.50) …

25 of 41

Ego level in multiple samples

(a weak test, passed)

26 of 41

(a stronger test, failed)

27 of 41

28 of 41

29 of 41

Step 3: Reexamining ego level in words and texts (today)

Is the ego level of a text essentially the average of the ego level of its constituent words?
Can the (expanded) dictionaries be used to score ego level in other texts?
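
The first question treats a text's ego level as the mean level of its dictionary words. A minimal sketch of that averaging approach, in Python for illustration only: the `word_levels` values are invented placeholders, not the actual dictionaries, and `score_text` is a hypothetical helper.

```python
import re

# Hypothetical word -> ego level values, invented for illustration.
word_levels = {
    "hate": 2.0,
    "rules": 2.0,
    "nice": 4.0,
    "wonder": 5.0,
    "responsibility": 6.0,
    "reflection": 8.0,
}

def score_text(text, lexicon=word_levels):
    """Mean ego level of the dictionary words found in `text`.
    Returns None when no dictionary word occurs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return None
    return sum(hits) / len(hits)

print(score_text("I wonder if it is a reflection on the group"))  # mean of 5.0 and 8.0
print(score_text("zzz"))  # no dictionary words
```

Because every matched word counts equally, a long response with many mid-level words is pulled toward the middle of the scale, which is one reason simple averages can misestimate long and short responses.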

30 of 41

When ego level is computed as averages, estimates for long responses are too low…

31 of 41

…and estimates for short responses may be too high.

32 of 41

LIWC scales associated with prediction errors in sentence completions

33 of 41

Exploring a regression approach

Series 1: Original data / original dictionaries

Data: Initial sample of 45000 responses, split into 80% training/20% test
Model 1: Ego level
Model 2: From seven ego levels
Model 3: + WC
Model 4: + all LIWC
Model 5: - small/LASSO

Series 2: Original data / expanded dictionaries
Series 3 and 4: Blogs/ original and expanded dictionaries
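
The train/test logic behind the model series can be sketched as below. This is an illustrative Python sketch under loud assumptions: the data are synthetic (generated with invented coefficients), only two predictors are used (a mean dictionary score and word count), and the LIWC and LASSO steps are omitted. It shows the 80/20 split and a comparison of nested models on held-out data, nothing more.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, X):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], x)) for x in X]

def rmse(y, yhat):
    return (sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)) ** 0.5

rng = random.Random(0)

# Synthetic stand-in data: (dictionary score, word count, rated level).
# The generating coefficients are invented for the example.
data = []
for _ in range(500):
    dict_score = rng.uniform(3, 7)
    wc = rng.randint(1, 30)
    level = 0.6 * dict_score + 0.05 * wc + rng.gauss(0, 0.3)
    data.append((dict_score, wc, level))

# 80% training / 20% test split
rng.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

def evaluate(cols):
    """Fit on the training rows using the given predictor columns;
    return RMSE on the held-out test rows."""
    X_train = [[row[c] for c in cols] for row in train]
    X_test = [[row[c] for c in cols] for row in test]
    coefs = fit_ols(X_train, [row[2] for row in train])
    return rmse([row[2] for row in test], predict(coefs, X_test))

rmse_score_only = evaluate([0])   # loosely analogous to Model 1
rmse_with_wc = evaluate([0, 1])   # loosely analogous to Model 3 (+ WC)
print(rmse_score_only, rmse_with_wc)
```

In the synthetic data word count carries real signal, so adding it lowers held-out error. Whether that holds in actual sentence completions or blogs is an empirical question that the sketch does not address.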

34 of 41

Regressions: Sentence completions

Original dictionaries

Expanded dictionaries

35 of 41

Blogs

Blog Authorship Corpus (Schler et al., 2006), available in CSV format from Kaggle.
All blogs from <= 2004: 681,288 posts, 140 million words, or approximately 35 posts and 7250 words per person.
Here, a fraction (5%) of the data is examined, including ~33000 posts from ~9000 persons.
No measure of ego level - but there is age.

36 of 41

Blogs: Predicting age from sentence-completion derived models

Model	Blogs (original dictionaries)	Persons (original dictionaries)	Blogs (expanded dictionaries)	Persons (expanded dictionaries)

37 of 41

Summary

Blogs and responses to sentence completions are not the same

In the sentence completions, the more complex models outperformed simpler ones
This did not generalize to the very different blogs, with a very different “criterion”

No all-purpose tool for assessing ego level across texts is ideal. But the expanded ego dictionaries appear ok.

38 of 41

Summary of personality stuff


39 of 41

Fox and MSNBC

40 of 41

Fox > MSNBC

41 of 41

MSNBC > Fox