1 of 48

Introduction to

Computational Text Analysis

Digital Integration Teaching Initiative (DITI)

2 of 48

Workshop Agenda

  • Introduction to key terms and concepts in computational text analysis (CTA).
  • Discussion of CTA’s applications and uses in research.
  • Introduction to web-based text analysis tools.
    • Word Counter, Word Trees, Voyant, Lexos

For more information, please see: https://bit.ly/handout-text-resources

3 of 48

What is Computational Text Analysis?

4 of 48

Computational Text Analysis

Computational text analysis refers to the array of methods used to “read” texts with a computer. It is similar to statistical analysis, but the data are texts (words) rather than numbers.

Text analysis:

  • Involves a computer drawing out patterns in a text and a researcher interpreting those patterns.
  • Includes methods such as word count frequency, keywords in context, computational modeling (with machine learning), and sentiment analysis.
  • Is conducted using web-based tools or coding languages like Python and R.
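Word count frequency, the simplest method listed above, can be sketched in a few lines of Python (one of the languages mentioned). This is a minimal illustration; the sample sentence and punctuation-stripping are assumptions for demonstration only:

```python
from collections import Counter

# Minimal word count frequency sketch, using an illustrative sample text
text = "We the people of the United States, in order to form a more perfect union"

# Lowercase each token and strip surrounding punctuation before counting
tokens = [w.strip(".,;:!?").lower() for w in text.split()]

counts = Counter(tokens)
print(counts.most_common(3))  # "the" appears twice; most other words once
```

A researcher would then interpret the resulting counts, which is the pattern-plus-interpretation workflow described above.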

5 of 48

Why Computational Text Analysis?

Computational text analysis can help us analyze very large amounts of data, identify keywords, and discover patterns in texts. Using text analysis, researchers may find surprising results that they would not have discovered from traditional methods alone.

For example: "Gendered Language in Teacher Reviews" by Ben Schmidt shows stark differences in the ways that male and female professors are reviewed on "Rate My Professor."

6 of 48

Language Used in Climate News

Word Cloud of TV News on “Global warming.” Terms like “believe” and “threat” appear frequently with “global warming” in TV news coverage since 2009.

7 of 48

Climate News: Discussion

Go to the Television Explorer. Search “global warming,” “climate crisis,” and “greenhouse effect.”

  • What do you notice about the TV coverage of these terms over time? What is surprising?
  • How do you think political values affect climate language?
  • How might this language shape policies?

8 of 48

Gendered Language

Go to bit.ly/schmidt-gender and try a few queries.

For example:

  • Smart
  • Ditzy
  • Unprofessional
  • Nice

How do you think Schmidt determined gender for this tool?

9 of 48

Key Terms

  • Corpus (plural: corpora): A collection of texts used for analysis and research purposes.
  • Stop words: Words that appear frequently in a language, like pronouns, prepositions, and basic verbs. These are often removed for computational analysis. Some English stop words include a, the, she, he, I, me, us, of, is, would, could, should, etc.
  • Word Count Frequency: Counting the total times a word appears in a text/corpus or the percentage of how often it appears.
  • nGram: A contiguous sequence of n items in a text. A bigram (2 contiguous words) could be ‘United States,’ while a trigram (3 words) could be ‘yes we can.’
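These key terms can be tied together in a short Python sketch: tokens, stop word removal, word count frequency, and bigrams. The tiny stop-word list here is illustrative, not a standard list:

```python
from collections import Counter

# Illustrative (non-standard) stop-word list
STOPWORDS = {"a", "the", "of", "in", "to", "we", "i", "is"}

text = "yes we can and yes we will"
tokens = text.lower().split()

# Word count frequency after removing stop words
content = [t for t in tokens if t not in STOPWORDS]
print(Counter(content).most_common(2))

# Bigrams (n = 2): every pair of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))
print(Counter(bigrams).most_common(1))  # ('yes', 'we') appears twice
```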

10 of 48

Corpus Building

Questions to consider as you begin your research:

  • What are my research questions, and why am I creating a corpus?
  • What am I asking my corpus to do?
  • What text(s) should form my corpus to answer my research questions?
  • How should I organize my corpus to streamline my research processes and save time?
  • For more on building a corpus, see this handout.

11 of 48

Text Analysis in Non-English Languages

Below are some resources that may be helpful if you are considering using non-English language texts:

12 of 48

Our Corpus

For our corpus, we will work with a set of State of the Union addresses from 1990 to 2019.

Download these files.

The easiest way to work with these files is to choose "Download all" and open them with a plain-text editor (TextEdit on Mac, Notepad on Windows). Mac users should be able to click on the zip file to expand it; Windows users will need to right-click and choose "Extract all."

13 of 48

Initial Corpus Analysis

Open any one of the texts from the sample corpus:

What can you observe about the text? How long is it? What kinds of language does it use? What kinds of analysis might you do with a text like this?

Scan through a few more: do they seem largely similar? What do you think might be different?

14 of 48

Exploratory Tools:

Word Counter and Word Trees

15 of 48

Word Counter

  • https://databasic.io/en/wordcounter/
  • A user-friendly basic word counting tool.
  • Allows you to count words, bigrams, and trigrams in plain text files and to download spreadsheets with your results.
  • The max file upload is 10MB.
  • The default is to lowercase all words and apply stopwords, but you can change those settings.
  • For more information, please see: https://bit.ly/handout-data-basics-suite

16 of 48

Word Counter Example

This is a word cloud, used to get a sense of the most frequently used words in a document. Words used more often appear larger than those used less often.

What seems significant in the most frequent terms from Clinton’s 2000 State of the Union Address?

17 of 48

“Tokenizing” Text

Why do you think that “000” is one of the most common words in Clinton’s 2000 SotU address? Open the .txt file and search for “000” to check your guess.

Before words can be counted, they must be “tokenized” or divided into components that programs can treat as distinct segments. Different programs will have different standards for tokenization—this one uses both white spaces and punctuation marks (such as commas) to separate words into tokens. What are some limitations of this approach?
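The difference between tokenization standards can be sketched in Python. The example figure “$30,000” is assumed for illustration; splitting on punctuation as well as whitespace is what produces tokens like “000”:

```python
import re

text = "a surplus of $30,000 this year"

# Strategy 1: split on whitespace only
ws_tokens = text.split()

# Strategy 2: split on whitespace AND punctuation (as WordCounter does)
punct_tokens = re.findall(r"[^\W_]+", text)

print(ws_tokens)     # ['a', 'surplus', 'of', '$30,000', 'this', 'year']
print(punct_tokens)  # ['a', 'surplus', 'of', '30', '000', 'this', 'year']
```

Strategy 2 breaks “30,000” into “30” and “000”, which is one limitation of punctuation-based tokenization for numbers, contractions, and hyphenated words.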

18 of 48

Data Preparation

Go back to the upload/paste screen for WordCounter and uncheck the "ignore stopwords" and "ignore cases" options, then count the words again.

What happened? Why do you think the default is to ignore stopwords and remove differences between upper/lowercase words?

Can you think of any limitations to this approach?

19 of 48

Bigrams and Trigrams

In addition to single words, it is also useful to consider bigrams and trigrams. Why do you think the phrase “I ask you” appears so often in the 2000 State of the Union Address? What about “we should”?

20 of 48

Word Tree

  • https://www.jasondavies.com/wordtree/
  • A word tree depicts multiple parallel sequences of words.
  • This is a good way to see patterns in word usage, based on words that appear before and after a term or terms of interest.
  • There are some size restrictions with this tool: texts of fewer than 1 million words should work.
  • Upload your text, enter a keyword or phrase to search, then try reversing the tree.
  • It’s often useful to search for frequent terms identified by WordCounter.

21 of 48

Word Tree Example

22 of 48

Tools for Corpus Exploration: Voyant

23 of 48

Voyant

Voyant makes it possible to perform analyses on one or multiple files in many ways, including word counts, nGrams (n=number of words), word frequency distributions, word trends across documents, and concordances.

https://voyant-tools.org/

For more information, see: https://bit.ly/handout-voyant-intro

24 of 48

Voyant: Upload

Click on Upload and navigate to the folder with the text documents you wish to analyze.

Alternatively, insert URLs or full text into the textbox.

Click here for help and advanced options

25 of 48

Voyant: Dashboard

Results:

After you upload your corpus, you will see the default results page with multiple panes:

  • A word cloud
  • Reader section
  • Trends
  • Document Summary
  • Word Contexts

These boxes can all be changed!

26 of 48

Voyant: Changing Displayed Results

Hover on the right top corner of a pane, and buttons will appear. Select the panes button and choose a new option from the dropdown menu. For example, we might want to try out the "Collocates" tool instead of the word cloud. Click on the ‘?’ to learn more about how the tool works.

27 of 48

Voyant: Tools for Further Exploration

  • Voyant’s Getting Started guide
  • Voyant’s List of Tools, showing all the features possible with Voyant including descriptions of each
  • Some useful tools to explore:
    • MicroSearch
    • Topics
    • Correlations
    • Collocates Graph

28 of 48

Tools for Corpus Exploration: Lexos

29 of 48

Lexos

Lexos provides a step-by-step guide for text uploading, preparation, and analysis.

  • Upload: upload your .txt file
  • Manage: select the files you want to prepare and analyze
  • Prepare: prepare your text for analysis
  • Visualize: create visualizations of patterns across your corpus or in single texts
  • Analyze: analyze your text

http://lexos.wheatoncollege.edu/upload

For more information, please see: https://bit.ly/handout-Lexos-intro

30 of 48

Lexos: Upload

Click Browse and select your entire text (or drag file into the “Drag Files Here” area). It can be easy to miss when the upload is done—click “Manage” to double check that the text file is there.

31 of 48

Lexos: Manage

Make sure the document you want to use is selected (blue = selected, gray = not selected)

32 of 48

Lexos: Prepare (Scrub Case and Punctuation)

Lexos demonstrates some more advanced options you have for preparing your corpus. By “scrubbing,” you are transforming the texts in your corpus and making choices that will impact your results. Here are some possibilities:

  • Make Lowercase: make all your letters lowercase. Even though you know “A” and “a” are the same letter, the computer treats these as two separate characters. Lowercasing removes this distinction.
  • Remove Punctuation: remove punctuation, which may influence your results.
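These two scrubbing steps can be sketched in Python. This is a minimal version assuming plain ASCII punctuation; Lexos handles more cases:

```python
import string

def scrub(text: str) -> str:
    text = text.lower()  # Make Lowercase: "A" and "a" become the same token
    # Remove Punctuation: delete every ASCII punctuation character
    return text.translate(str.maketrans("", "", string.punctuation))

print(scrub("America's future, I believe, is bright!"))
# americas future i believe is bright
```

Note the trade-off: “America's” becomes “americas”, so scrubbing choices really do change your results.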

33 of 48

Lexos: Prepare (Scrub Words)

You can also stem words and remove certain words. Here are some possibilities:

  • Stop/Keep Words: remove a list of words. Usually these would be stopwords. With WordCounter, you had to use the stopwords list the tool provided—now, you can choose your own.
  • Lemmas: standardize words to their stem. For example, you can stem all forms of the verb talk (talking, talked, talks, etc.) to “talk.”
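Both steps can be sketched in Python. The stop-word list and the deliberately naive suffix-stripping “stemmer” below are illustrative assumptions; real tools like Lexos use proper stemming algorithms:

```python
# Illustrative stop-word list (yours can be customized in Lexos)
STOPWORDS = {"the", "of", "and", "to", "a"}

def naive_stem(word: str) -> str:
    # Strip a few common suffixes; a real stemmer is far more careful
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["talking", "talked", "talks", "the", "talk"]
kept = [naive_stem(t) for t in tokens if t not in STOPWORDS]
print(kept)  # ['talk', 'talk', 'talk', 'talk']
```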

34 of 48

Lexos: Removing Stopwords

Get a list of English stopwords here: https://gist.github.com/sebleier/554280. Copy and paste the stopwords (hit "raw", then select all and copy) into the “Stop/Keep Words” box, then select “Stop.”

35 of 48

Lexos: Applying your Preparations

Once you have made decisions about your preparations, click “Apply” and wait a few moments. Because the program goes through each document and completes all the processes you selected, it needs some time. Then, you will see the final results of your preparation! You can also download your new corpus.

BEFORE PREP

AFTER PREP

36 of 48

Lexos: Analyze > Top Words (1/2)

The top words tool lets you compare word usage between individual documents and your corpus as a whole. If you want to make more specific comparisons, you can also assign “classes” to subsets of documents on the “Manage” screen.

  • Words with high positive scores are used more often in each document, relative to the rest of the corpus.
  • Words with high negative scores are used less often.

Hit the “Generate” button to see the top words for your texts.
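The idea behind these scores can be sketched in Python. This uses an assumed, simplified score (a word’s share of one document minus its share of the whole corpus), not Lexos’s exact formula; the two tiny documents are illustrative:

```python
from collections import Counter

docs = {
    "doc1": "jobs jobs economy growth".split(),
    "doc2": "war security war troops".split(),
}

# Pool every document into one corpus-wide frequency table
corpus = [w for words in docs.values() for w in words]
corpus_freq = Counter(corpus)
total = len(corpus)

for name, words in docs.items():
    doc_freq = Counter(words)
    # Positive score: used more in this document than in the corpus overall
    scores = {w: doc_freq[w] / len(words) - corpus_freq[w] / total
              for w in doc_freq}
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

Each document’s most distinctive word (“jobs” for doc1, “war” for doc2) gets the highest positive score, which is the intuition behind Top Words.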

37 of 48

Lexos: Analyze > Top Words (2/2)

38 of 48

Lexos: Analyze > Dendrogram

The dendrogram demonstrates similarity between the different documents. Dendrograms require at least two documents to compare. Dendrograms “cluster” texts to draw out similarities:

  • The greater the distance between texts, the less similar they are.
  • The smaller the distance between texts, the more similar they are.
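The notion of “distance” between texts can be sketched in Python. This assumes cosine distance between word-count vectors, one common choice among the several metrics clustering tools offer; the sample phrases are illustrative:

```python
import math
from collections import Counter

def cosine_distance(a: str, b: str) -> float:
    # Turn each text into a word-count vector, then compare angles
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return 1 - dot / norm

print(cosine_distance("tax cuts for families", "tax cuts for workers"))  # small: similar texts
print(cosine_distance("tax cuts for families", "war on terror abroad"))  # 1.0: no shared words
```

A dendrogram repeatedly merges the closest texts under a distance measure like this, so texts that share vocabulary end up clustered together.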

39 of 48

Lexos: Analyze > Dendrogram Example

40 of 48

Lexos: Save or Reset Your Results

Lexos allows you to save your results as a Lexos file. If you do this, you can re-upload the Lexos file any time to access your cleaned-up corpus as well as the different analyses you’ve done. You can also download modified text files from the “Manage” page—and you can even use those downloaded text files with other tools!

You can also save individual visualizations as images (PNGs).

Finally, if you want to start over, you can “Reset” your Lexos dashboard.

41 of 48

Your Turn!

42 of 48

Your Turn! Voyant and Lexos

Use the sample text or texts of your choice and begin practicing web-browser text analysis. Try uploading text to Voyant or Lexos and explore their features!

  • What interesting or surprising results came up? How might you interpret those results?
  • What other kinds of documents would be useful to compare?
  • How could you use these tools for your research? Which features do you think will be useful in your analysis?
  • How might text analysis complement other research methods?
  • Between Voyant and Lexos, which tool do you prefer and why?

43 of 48

For Further Exploration

44 of 48

Further Exploration: Topic Modeling

Topic modeling is a machine learning method that uses word co-occurrence within documents to identify "topics," or clusters of related terms. This is a topic model based on the Greater Boston Priority Climate Action Plan. In the visualization, topic 3 is selected.
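The co-occurrence counting that topic models start from can be sketched in Python. Real topic models (e.g., LDA) do far more than this, and the three mini-documents are invented for illustration:

```python
from collections import Counter
from itertools import combinations

docs = [
    "emissions transit buses transit",
    "flooding seawall coastal flooding",
    "emissions buses electric transit",
]

# Count how often each pair of distinct words shares a document
pairs = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc.split())), 2):
        pairs[(a, b)] += 1

print(pairs.most_common(3))  # the transit-related words co-occur in two documents
```

Words that keep co-occurring (here the transit cluster vs. the flooding cluster) are exactly the kinds of groupings a topic model surfaces as “topics.”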

45 of 48

Further Exploration: Sentiment Analysis

Sentiment analysis uses dictionaries and sometimes machine learning to assign sentiment scores (e.g., positive and negative) to documents. You can try this out with the "Drag and Drop Sentiment Analysis" tool.
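The dictionary-based approach can be sketched in Python. The word lists here are tiny illustrative assumptions; real tools use large, validated sentiment lexicons:

```python
# Illustrative mini-lexicons (real lexicons contain thousands of scored words)
POSITIVE = {"good", "great", "strong", "hope", "bright"}
NEGATIVE = {"bad", "crisis", "threat", "fear", "weak"}

def sentiment_score(text: str) -> int:
    # +1 for each positive word, -1 for each negative word
    score = 0
    for word in text.lower().split():
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

print(sentiment_score("the future is bright and our hope is strong"))  # 3
print(sentiment_score("a crisis that spreads fear"))                   # -2
```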

46 of 48

Data Privacy

  • It’s important to pay attention to data privacy when using digital resources.
  • At its simplest, data privacy is a person’s ability to control which of their personal information is shared and with whom.
  • To help you make informed decisions about interacting with digital tools in ways that honor your boundaries around your data and/or personal information, the DITI has prepared a handout on Data Privacy.

47 of 48

Resources

48 of 48

Thank you!

Developed by Cara Marta Messina, Juniper Johnson, and Jeff Sternberg