Automatic Text Analysis

Computational Analysis of
Political Communication

2019 University of Mannheim

Wouter van Atteveldt

Today’s program

Goal: Understanding of and practice with automatic text analysis in R

  • Morning session: acquiring, cleaning, and preprocessing data
    • Quick recap: R
    • Examples of automatic text analysis
    • Steps in automatic text analysis

  • Afternoon session:
    • Text analysis Step 1: Obtaining data
    • Text analysis Step 2: Cleaning and preprocessing data
    • Dictionary analysis
    • Sentiment analysis
    • Your project / assignment

Quick recap:

R for Text and Data Analysis

What is R?

  • Open source, multi-platform
  • Full “Turing complete” programming language
  • Text/Console based
  • Community driven:
    • User packages are "1st class citizens"
    • Most functionality comes from packages developed by “people like us”
  • Good tooling/documentation (esp. RStudio)

Why use R for text/data analysis?

  • Compared to proprietary / single task tools (SPSS, Gephi, etc)
    • R can do stats, but also text analysis, network analysis, scraping, ...
    • Learn one language, gain many options
    • Easier to combine multiple methods
    • Can import/export as needed (e.g. Excel, Gephi)
  • Compared to (e.g.) python
    • Both would be fine, both have strong use in data science
    • R more geared towards stats
    • Python more geared towards general programming and web development

Basics of R

  • Everything is an object (=variable)
    • From single numbers to whole data sets, text collections, and analysis results
  • You give every object a name
  • Every object has a type (e.g. character=text, data.frame=table, etc)
  • You can have multiple objects, load/save them, combine them, etc.
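
A minimal sketch of these ideas (object names are illustrative):

    x = 42                  # a single number (numeric)
    title = "Mannheim"      # a text (character) object
    scores = c(7, 9, 3)     # a vector of numbers
    class(scores)           # every object has a type: "numeric"
    mean(scores)            # functions operate on objects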

Most of the action is in functions

  • A function is an operation called (usually) on an object
    • E.g. mean(x), summary(x)
  • Which often returns a useful value that you then give a new name:
    • d = read_csv("data.csv")
  • And which often can take additional options
    • saveRDS(d, file="data.rds")
    • d2 = filter(d, year==2018)
  • Function arguments are positional or named, and obligatory or optional
    • saveRDS(d, file="data.rds") is the same as saveRDS(d, "data.rds")
    • paste(d, sep=" ") is the same as paste(d) because " " is the default value for sep.
  • Good practice: always name optional arguments, keep main arguments positional
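
For example, a small sketch of positional vs. named arguments (values are illustrative):

    x = c(1, 2, NA, 4)
    mean(x)                    # positional: the first argument is the object (NA here)
    mean(x, na.rm=TRUE)        # na.rm is an optional, named argument
    m = round(mean(x, na.rm=TRUE), digits=1)   # the result gets a new name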

Many functions are in packages

  • A package is a collection of third-party functions (e.g. tidyverse, quanteda)
  • Packages need to be installed before use
    • install.packages("tidyverse") # or use packages pane in RStudio
    • Automatically installs all dependencies
    • If asked to compile from source: “No” is the safest answer
    • If you get errors, try to install only the package that gives the error
    • If that doesn’t help, ask me or google the error message
  • Most packages are on CRAN (the R ‘app store’)
    • but some need to be installed from github: remotes::install_github("vanatteveldt/rcanvas")
  • After installing once, can be used in all sessions
  • Either activate, or explicitly state where a function comes from:
    • library(quanteda)
      dfm(text)
      # is (more or less) the same as:
      quanteda::dfm(text)

Data and data types

  • Every object has a ‘class’ (data type)
    • class(x)
  • Some common “primitive” classes:
    • character = text
    • numeric = numbers (and integer = whole numbers)
    • logical = TRUE/FALSE (abbreviated as T/F)
    • factor = nominal values with value labels
      Note: these are all ‘vectors’ (=columns) of values, not single values / scalars
  • Some common container classes:
    • data.frame = rectangular data frame, one type per column
    • matrix = rectangular matrix, all values same type (usually numeric)
    • list = very free data type, any value can be its own type
  • Packages can define their own classes
    • E.g. quanteda’s corpus and dfm objects
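
To illustrate, class() on some common objects:

    class("hello")                # "character"
    class(3.14)                   # "numeric"
    class(TRUE)                   # "logical"
    class(factor(c("m", "f")))    # "factor"
    d = data.frame(text=c("a", "b"), n=1:2)
    class(d)                      # "data.frame"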

Converting between data types:

  • Converting between primitive types:
    as.character, as.numeric etc.
  • Selecting a single column of a data frame (or element of a list):
    dataframe$column, e.g. d$text
  • Selecting a single value of a vector:
    vector[index], e.g. x[3] or d$text[3]
  • Optional: subsetting a matrix or data frame:
    dataframe[rows, columns], e.g. d[3, "text"] or d[3,] for whole row
  • Optional: alternative way of selecting a column or list element:
    dataframe[["column"]], e.g. d[["text"]]
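
A minimal sketch of selecting and converting (column names are illustrative):

    d = data.frame(text=c("a", "b", "c"), year=c("2016", "2017", "2018"),
                   stringsAsFactors=FALSE)   # keep text as character
    d$year = as.numeric(d$year)   # convert a column between primitive types
    d$text                        # a single column as a vector
    d$text[3]                     # a single value: "c"
    d[3, "text"]                  # subset by row and column
    d[["text"]]                   # alternative way to select a column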

The Tidyverse

Tidyverse

  • Collection of packages for data analysis and visualization
    • dplyr (“data-plyers”) for data wrangling
    • ggplot2 for visualization
    • readr for reading/writing csv-like data
    • tidyr for reshaping data
    • stringr for dealing with textual data
    • haven for reading data from SPSS and other formats
    • purrr for functional programming
    • forcats for dealing with factors
  • Advantages over base R:
    • Single lead author
    • Consistent philosophy
    • Good free/online book at https://r4ds.had.co.nz
  • (but everything can also be done in base R, feel free to mix and match as needed)
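
After installing the meta-package once, one library call attaches the core packages:

    install.packages("tidyverse")   # once; installs the packages listed above
    library(tidyverse)              # attaches dplyr, ggplot2, readr, tidyr, stringr, ...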

Tidyverse: basic idea

  • Each step in a data processing ‘pipeline’ is a single function
    • Read in data, select some columns, filter on specific value, compute something, plot = 5 functions
  • Every function has the same “signature”
    • filter(data, criteria), select(data, columns), mutate(data, values)
  • You can directly refer to columns, use named arguments to create new columns
    • filter(data, value > 10)   # instead of data[data$value > 10, ]
    • mutate(data, new_column=old_column + 10), rename(data, new_column=old_column)
  • Functions never change your data, but return a new copy
    • → need to assign to same or new variable to keep changes
    • d2 = mutate(d, age = as.numeric(age))
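
A minimal sketch of this copy-on-change behaviour (data is illustrative):

    library(tidyverse)
    d = tibble(name=c("anna", "bob"), age=c("34", "29"), value=c(12, 8))
    filter(d, value > 10)                 # returns a new tibble; d itself is unchanged
    d2 = mutate(d, age=as.numeric(age))   # assign the result to keep the change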

Tibbles

  • Main primitive of the tidyverse
  • Rectangular data frame like in SPSS / Excel
    • Data is stored in columns
    • Each column has a name and a data type
  • Very similar to base R data.frame
    • No row.names
    • Can convert with as_tibble and as.data.frame
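
Converting back and forth is a one-liner:

    library(tibble)
    d = tibble(id=1:3, text=c("a", "b", "c"))
    df = as.data.frame(d)   # tibble to base R data.frame
    d2 = as_tibble(df)      # and back again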

Dplyr functions

  • (see handout “tidyverse basics” and “summarizing data”, R4DS chapter 5)
  • select to select (and optionally rename) some columns
  • rename to rename some columns, but keep all columns
  • filter to select some rows based on a criterion
  • arrange to sort the data
  • mutate to create new columns or change the value of a column
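
Each verb in one line, on a small hypothetical data set:

    library(tidyverse)
    d = tibble(id=1:4, year=c(2016, 2017, 2017, 2018), score=c(3, 8, 5, 9))
    select(d, id, score)          # keep (and reorder) some columns
    rename(d, doc_id=id)          # rename, keep all columns
    filter(d, year == 2017)       # keep rows matching a criterion
    arrange(d, desc(score))       # sort, highest score first
    mutate(d, score10=score*10)   # add or change a column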

Reading and Writing data

  • Base R: readRDS and saveRDS
    • Internal format for R, preserves all attributes
    • Best for saving data for use in R only
    • (preferred over save/load for almost all cases)
  • Package readr: read_csv and write_csv
    • Best for simple tabular data, import/export to excel
    • Note: very similar to read.csv / write.csv, but faster and deal with text columns better
  • Package haven: read_sav, write_sav, read_dta, etc.
    • For communicating with SPSS, Stata etc.
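
A quick sketch of all three options, assuming a data frame d (file names are illustrative):

    saveRDS(d, "data.rds")        # R-only; preserves all attributes
    d = readRDS("data.rds")
    library(readr)
    write_csv(d, "data.csv")      # tabular; opens in Excel etc.
    d = read_csv("data.csv")
    library(haven)
    d = read_sav("data.sav")      # SPSS; write_sav() goes the other way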

Tidyverse Pipelines

  • Very often a logical process is a series of small steps
  • Each step is x = function(x, …)
  • This can also be written as a pipeline of functions:

    d = read_csv("file.csv")
    d = filter(d, gender == "M")
    d = mutate(d, age = as.numeric(age))
    # is the same as:
    d = read_csv("file.csv") %>% filter(gender == "M") %>% mutate(age = as.numeric(age))
  • More formally, x %>% f(y) ←→ f(x, y)
  • Note that the first argument (d) no longer needs to be given!
  • You are free to group related function calls, or not

Computing summary statistics

  • Two step process: group_by(...) and summarize(x=function(...))
  • Result: one row per unique group, columns for groups and summaries
  • Function should summarize multiple values for each group
    • e.g. sum(x), mean(x), n()
  • Often combined in a pipeline:

    d %>% group_by(gender) %>% summarize(age=mean(age), age_sd=sd(age))

Adding summaries to single cases

  • Alternative: group_by(...) and mutate(x=function(...))
  • Result: one row per unique case, new columns for summaries, repeated per case
  • Can be useful to e.g. select highest case, compute proportions, etc.
  • Can mix summary functions and per-case functions in the mutate

    d %>% group_by(gender) %>%
      mutate(max_height=max(height), rel_height=height/max_height)

Visualizing with ggplot

Visualizing with ggplot

  • Widely used library for visualization
  • Many different chart types, can combine different charts
  • Sensible defaults, but everything can be customized
  • Allows ‘theming’ to easily alter appearance for e.g. article or presentation
  • Note: meant for ‘static’ plots, but can be made dynamic. See e.g. http://www.rebeccabarter.com/blog/2017-04-20-interactive/

GGplot philosophy

  • “The Grammar of Graphics”
  • Every plot consists of
    • Data
    • Geometrical elements (e.g. bars, lines, dots)
    • Aesthetic mappings from data to elements (e.g. color=gender, x=income)
    • Other elements (axes, legends, etc.)
  • Basic ggplot syntax:
    ggplot(data) + geom_point(mapping=aes(x=income, y=college)) +
    <more geoms> + <theming and customization>
  • Mappings (and data) can be shared between geoms, or supplied for each separately
  • See help pages for geom_* functions to read about aesthetics
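
A minimal runnable sketch using R's built-in mtcars data:

    library(ggplot2)
    ggplot(mtcars, aes(x=wt, y=mpg, color=factor(cyl))) +    # data + mappings
      geom_point() +                                         # geometrical element
      labs(x="Weight", y="Miles per gallon", color="Cylinders") +
      theme_minimal()                                        # theming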

More information / Help

The Assignment

Written Assignment

  • Pose a social scientific question that can be answered with automatic text analysis
  • Gather, read, and clean the needed data
    • Available from me: NY Times on 2016 US election
    • Available online: Guardian API; Amazon reviews; many many more.
  • Perform one or more automatic text analyses
  • Analyse and/or discuss the validity of the method
  • Visualize and/or describe the results
  • Discuss the outcomes in the light of the research question
  • Discuss the limitations and possible improvements

Paper Structure

  • Intro (~1 page): RQ and short theoretical motivation
  • Methods (~2 pages): Data, analysis, validity
  • Results (~2 pages): Exploration, visualization, tests (if applicable)
  • Conclusion (~1 page): Answer, Discussion, Limitations

Submit PDF plus R code (and any material) that reproduces all used figures

And remember: This is not your doctoral thesis :). Simple is good.

Process

  • Today: individual meetings to brainstorm about possible RQ
  • Tuesday - Thursday: make sure you have the data and an analysis plan!
  • Friday: apply new techniques, troubleshooting where needed

Motivational:
Some Applications of Text Analysis

Steps in
Automatic Text Analysis

Steps in Automatic Text Analysis

(van Atteveldt, W., Welbers, K., Van der Velden, M.A.C.G (2019) Studying Political Decision Making With Automatic Text Analysis, Oxford Encyclopedia of Political Decision Making.)

Symbols and meaning

Symbols and meaning

  • Text/language consists of symbols
  • Symbols by themselves do not have meaning
  • Text attains meaning when interpreted
    • (in its context)
  • Text/Content analysis is “a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use” (Krippendorff 2004)
  • Main challenge in Automatic Text Analysis:
    Bridge the gap from symbols to meaningful interpretation

Types of Automatic Analysis

  • Rule-based analyses (e.g. dictionaries)
    • Assigns meaning by researcher specifying “if word X occurs, the text means Y”
  • Supervised machine learning
    • Train a model on coded training examples
    • Generalizes meaning in human coding of training material
    • “Text X is like other texts that were negative, so X is probably negative”
  • Unsupervised machine learning
    • Find clusters of words that co-occur
    • Meaning is assigned afterwards by researcher interpretation
    • “These words form a pattern, which I think means X”

(e.g. Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8-23.)

Dictionary / keyword analysis

  • Set of terms per concept
    • (wildcards, synonyms, conditions, proximity)
  • Advantages:
    • Technically easy
    • Transparent
    • Few resources needed
  • Disadvantages:
    • Low validity for non-trivial concepts (sentiment, frames, specific topics)
    • Difficult to create/maintain large dictionaries
    • Can encode biases
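
A minimal quanteda sketch (the texts and dictionary terms are illustrative):

    library(quanteda)
    texts = c("The economy is growing", "New taxes and budget cuts announced")
    dict = dictionary(list(economy=c("econom*", "tax*", "budget*")))   # with wildcards
    d = dfm(tokens(texts))
    dfm_lookup(d, dict)   # counts dictionary hits per document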

Unsupervised Machine Learning: topic modeling

  • Cluster documents / words automatically
  • Most used technique: LDA (Latent Dirichlet Allocation)
    • Creates ‘topics’ of similar documents and terms
    • Mixture model: multiple topics per document, word
    • Many variations exist
  • Advantages:
    • Almost no input / resources needed
    • Good option for exploration
  • Disadvantages:
    • Difficult to determine validity
    • Low control over outcome
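
A minimal LDA sketch, assuming the quanteda and topicmodels packages (data_corpus_inaugural ships with quanteda):

    library(quanteda)
    library(topicmodels)
    dtm = data_corpus_inaugural %>%
      tokens(remove_punct=TRUE) %>%
      tokens_remove(stopwords("en")) %>%
      dfm() %>%
      convert(to="topicmodels")   # convert the dfm for the topicmodels package
    m = LDA(dtm, k=5)             # the number of topics k is a researcher choice
    terms(m, 10)                  # top 10 terms per topic, for interpretation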

Supervised Machine Learning

  • Train a statistical model
    • Linking input (text) to output (topic, sentiment)
    • Based on (many) manually coded example documents
    • Many different algorithms exist (Naive Bayes, Support Vector Machines, Neural Networks, …)
  • Advantages:
    • Often very good accuracy
    • Fine control over outcome
    • Validation built into methodology
  • Disadvantages:
    • ‘Black box’: no understanding of process
    • Relatively large training sets needed (>1000s of documents)
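
A minimal Naive Bayes sketch with textmodel_nb (in recent quanteda versions this lives in the separate quanteda.textmodels package); the toy training data is obviously illustrative, real applications need thousands of coded documents:

    library(quanteda)
    library(quanteda.textmodels)
    train = c("great and moving film", "terrible boring film",
              "really great plot", "awful script and acting")
    labels = c("pos", "neg", "pos", "neg")
    m = textmodel_nb(dfm(tokens(train)), y=labels)
    predict(m, newdata=dfm(tokens("a great film")),
            force=TRUE)   # force=TRUE matches features to the training dfm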

Tools for automatic text analysis

  • Dictionary analysis:
    • Many existing off-the-shelf solutions
    • Trivial to run with quanteda, or other tools in e.g. R, Python, etc.
  • Supervised/Unsupervised Machine learning
    • Good software support in R, Python, etc.
    • Off the shelf solutions exist with pretrained models
      (e.g. for Sentiment analysis) but should be validated

Practical:

Gathering and Cleaning Data
