1 of 52

Introduction to

Computational Text Analysis

Digital Integration Teaching Initiative (DITI)

2 of 52

Workshop Agenda

  • Introduction to key terms and concepts in computational text analysis (CTA).
  • Discussion of CTA’s applications and uses in research.
  • Introduction to web-based text analysis tools.
    • Word Counter, Word Trees, Voyant, Lexos

For more information, please see: https://bit.ly/handout-text-resources

3 of 52

What is Computational Text Analysis?

4 of 52

Computational Text Analysis

Computational text analysis refers to the array of methods used to “read” texts with a computer. It is similar to statistical analysis, but the data is texts (words) instead of numbers.

Text analysis:

  • Involves a computer drawing out patterns in a text and a researcher interpreting those patterns.
  • Includes methods such as word count frequency, keywords in context, computational modeling (with machine learning), and sentiment analysis.
  • Is conducted using web-based tools or coding languages like Python and R.
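As a small illustration of the first method listed above, word count frequency can be sketched in a few lines of Python (a minimal sketch; the function name and sample sentence are invented for illustration):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

# A made-up example sentence, just to show the output shape.
print(word_frequencies("We will build, and we will build together."))
# → [('we', 2), ('will', 2), ('build', 2), ('and', 1), ('together', 1)]
```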

5 of 52

Why Computational Text Analysis?

Computational text analysis can help us analyze very large amounts of data, identify keywords, and discover patterns in texts. Using text analysis, researchers may find surprising results that they would not have discovered from traditional methods alone.

For example: "Gendered Language in Teacher Reviews" by Ben Schmidt shows stark differences in the ways that male and female professors are reviewed on "Rate My Professor."

6 of 52

Language Used in Climate News

Word Cloud of TV News on “Global warming.” Terms like “believe” and “threat” appear frequently with “global warming” in TV news coverage since 2009.

7 of 52

Climate News: Discussion

Go to the Television Explorer. Search “global warming,” “climate crisis,” and “greenhouse effect.”

  • What do you notice about the TV coverage of these terms over time? What is surprising?
  • How do you think political values affect climate language?
  • How might this language shape policies?

8 of 52

Gendered Language

Go to bit.ly/schmidt-gender and try a few queries.

For example:

  • Smart
  • Ditzy
  • Unprofessional
  • Nice

How do you think Schmidt determined gender for this tool?


9 of 52

Key Terms

  • Corpus (plural: corpora): A collection of texts used for analysis and research purposes.
  • Stopwords: Words that appear frequently in a language, like pronouns, prepositions, and basic verbs. These are often removed before computational analysis. Some English stopwords include a, the, she, he, I, me, us, of, is, would, could, should, etc.
  • Word Count Frequency: The total number of times a word appears in a text/corpus, or the percentage of all words it accounts for.
  • nGram: A contiguous sequence of n items in a text. A bigram (2 consecutive words) could be ‘United States,’ while a trigram (3 words) could be ‘yes we can.’
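The nGram definition above can be sketched as a sliding window over a list of words (a minimal illustration, not any particular tool's implementation):

```python
def ngrams(text, n):
    """Slide a window of n words across the text and join each window."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("yes we can do this", 3))
# → ['yes we can', 'we can do', 'can do this']
```

Calling `ngrams(text, 2)` yields the bigrams of the same text.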

10 of 52

Corpus Building

Questions to consider as you begin your research:

  • What are my research questions, and why am I creating a corpus?
  • What am I asking my corpus to do?
  • What text(s) should form my corpus to answer my research questions?
  • How should I organize my corpus to streamline my research processes and save time?
  • For more on building a corpus, see this handout.

11 of 52

Text Analysis in non-English Languages

Below are some helpful resources on analyzing non-English language texts:

12 of 52

Our Corpus

For our corpus, we will work with a set of State of the Union addresses from 1990 to 2019.

Download these files.

The easiest way to work with these files is to choose "Download all" and open them with a plain-text editor (TextEdit on Mac, Notepad on Windows). Mac users should be able to click on the zip file to expand it; Windows users will need to right-click and choose "Extract all."

13 of 52

Initial Corpus Analysis

Open any one of the texts from the sample corpus:

What can you observe about the text? How long is it? What kinds of language does it use? What kinds of analysis might you do with a text like this?

Scan through a few more: do they seem largely similar? What do you think might be different?

14 of 52

Exploratory Tools:

Word Counter and Word Trees

15 of 52

Word Counter

  • https://databasic.io/en/wordcounter/
  • A user-friendly basic word counting tool
  • Allows you to count words, bigrams, and trigrams in plain text files and to download spreadsheets with your results
  • The maximum file upload size is 10 MB
  • The default is to lowercase all words and apply stopwords, but you can change those settings
  • For more information, please see: https://bit.ly/handout-data-basics-suite

16 of 52

Word Counter Example

This is a word cloud, used to get a sense of the most frequently used words in a document. Words used more often appear larger than those used less often.

What seems significant in the most frequent terms from Clinton’s 2000 State of the Union Address?

17 of 52

“Tokenizing” Text

Why do you think that “000” is one of the most common words in Clinton’s 2000 SotU address? Open the .txt file and search for “000” to check your guess.

Before words can be counted, they must be “tokenized” or divided into components that programs can treat as distinct segments. Different programs will have different standards for tokenization—this one uses both white spaces and punctuation marks (such as commas) to separate words into tokens. What are some limitations of this approach?
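A tokenizer that splits on both whitespace and punctuation, as described above, can be sketched like this (a simplified illustration, not WordCounter's actual code). Note how it turns a large dollar figure into repeated "000" tokens:

```python
import re

def tokenize(text):
    """Split on any run of non-word characters (spaces, commas, '$', ...)."""
    return [t for t in re.split(r"\W+", text) if t]

print(tokenize("a surplus of $1,000,000,000"))
# → ['a', 'surplus', 'of', '1', '000', '000', '000']
```

The same approach also splits contractions ("don't" becomes "don" and "t"), one of the limitations worth discussing.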

18 of 52

Data Preparation

Go back to the upload/paste screen for WordCounter and uncheck the "ignore stopwords" and "ignore cases" options, then count the words again.

What happened? Why do you think the default is to ignore stopwords and remove differences between upper/lowercase words?

Can you think of any limitations to this approach?
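To see what those two defaults do, here is a rough Python equivalent of the toggles (the stopword list here is a tiny invented sample, not WordCounter's actual list):

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "we", "i"}  # tiny invented sample

def count_words(text, lowercase=True, remove_stopwords=True):
    """Count words, optionally lowercasing and dropping stopwords first."""
    words = text.split()
    if lowercase:
        words = [w.lower() for w in words]
    if remove_stopwords:
        words = [w for w in words if w.lower() not in STOPWORDS]
    return Counter(words)

print(count_words("The state of the union"))  # only content words remain
print(count_words("The state of the union",
                  lowercase=False, remove_stopwords=False))
# "The" and "the" are now counted as two different words
```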

19 of 52

Bigrams and Trigrams

In addition to single words, it is also useful to consider bigrams and trigrams. Why do you think the phrase “I ask you” appears so often in the 2000 State of the Union Address? What about “we should”?

20 of 52

Word Tree

  • https://www.jasondavies.com/wordtree/
  • A word tree depicts multiple parallel sequences of words.
  • This is a good way to see patterns in word usage, based on words that appear before and after a term or terms of interest.
  • There are some restrictions in size with this tool: fewer than 1 million words should work.
  • Upload your text, enter a keyword or phrase to search, then try reversing the tree.
  • It’s often useful to search frequent terms identified by WordCounter

21 of 52

Word Tree Example

22 of 52

Tools for Corpus Exploration: Voyant

23 of 52

Voyant

Voyant makes it possible to perform analyses on one or multiple files in many ways, including word counts, nGrams (n = number of words), word frequency distributions, word trends across documents, and concordances. Voyant can also be used to analyze non-English texts.

https://voyant-tools.org/

For more information, see: https://bit.ly/handout-voyant-intro

24 of 52

Voyant: Upload

Click on Upload and navigate to the folder with the text documents you wish to analyze.

Alternatively, insert URLs or full text into the textbox.

Click here for help and advanced options

25 of 52

Voyant: Dashboard

Results:

After you upload your corpus, you will see the default results page with multiple panes:

  • A word cloud
  • Reader section
  • Trends
  • Document Summary
  • Word Contexts

These boxes can all be changed!

26 of 52

Voyant: Changing Displayed Results

Hover on the right top corner of a pane, and buttons will appear. Select the panes button and choose a new option from the dropdown menu. For example, we might want to try out the "Collocates" tool instead of the word cloud. In this tool, words are colored by category. Click on the ‘?’ to learn more about how the tool works.

27 of 52

Voyant: Custom Stopwords

To add custom stopwords, select the ‘Options’ button in the upper right of the word cloud then select ‘Stopwords’ > ‘Edit List’. Add your custom words then select ‘Save’ then ‘Confirm’.

28 of 52

Voyant: non-English Stopwords

If your texts are not in English, you can manually select the stopword language by selecting Options > Stopwords then clicking the drop-down and selecting the language of your texts.

29 of 52

Voyant: Tools for Further Exploration

  • Voyant’s Getting Started guide
  • Voyant’s List of Tools, showing all the features possible with Voyant including descriptions of each
  • Some useful tools to explore:
    • MicroSearch
    • Topics
    • Correlations
    • Collocates Graph

30 of 52

Tools for Corpus Exploration: Lexos

31 of 52

Lexos

Lexos provides a step-by-step guide for text uploading, preparation, and analysis.

  • Upload: upload your .txt file
  • Manage: select the files you want to prepare and analyze
  • Prepare: prepare your text for analysis
  • Visualize: create visualizations of patterns across your corpus or in single texts
  • Analyze: analyze your text

http://lexos.wheatoncollege.edu/upload

For more information, please see: https://bit.ly/handout-Lexos-intro

32 of 52

Lexos: Upload

Click Browse and select your entire text (or drag file into the “Drag Files Here” area). It can be easy to miss when the upload is done—click “Manage” to double check that the text file is there.

33 of 52

Lexos: Manage

Make sure the document you want to use is selected (blue = selected, gray = not selected)

34 of 52

Lexos: Prepare (Scrub Case and Punctuation)

Lexos demonstrates some more advanced options you have for preparing your corpus. By “scrubbing,” you are transforming the texts in your corpus and making choices that will impact your results. Here are some possibilities:

  • Make Lowercase: make all your letters lowercase. Even though you know “A” and “a” are the same letter, the computer treats these as two separate characters. Lowercasing removes this distinction.
  • Remove Punctuation: remove punctuation, which may influence your results.

35 of 52

Lexos: Prepare (Scrub Words)

You can also stem words and remove certain words. Here are some possibilities:

  • Stop/Keep Words: remove a list of words. Usually these would be stopwords. With WordCounter, you had to use the stopwords list the tool provided—now, you can choose your own.
  • Lemmas: standardize words to their root form. For example, you can reduce all forms of the verb talk (talking, talked, talks, etc.) to “talk”
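Lexos's lemma handling is more sophisticated, but the basic idea of collapsing talking/talked/talks into "talk" can be sketched with a toy suffix-stripper (illustration only, not Lexos's algorithm):

```python
def crude_stem(word, suffixes=("ing", "ed", "s")):
    """Strip one common suffix, a toy stand-in for real stemming/lemmatization."""
    for suffix in suffixes:
        # Guard so short words like "sing" or "red" are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["talking", "talked", "talks", "talk"]])
# → ['talk', 'talk', 'talk', 'talk']
```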

36 of 52

Lexos: Removing Stopwords

Get a list of English stopwords here: https://gist.github.com/sebleier/554280. Copy and paste the stopwords (hit "raw", then select all and copy) into the “Stop/Keep Words” box then select “Stop”

37 of 52

Lexos: Applying your Preparations

Once you have made decisions about your preparations, click “Apply” and wait a few minutes. Because the program is going through each document and completing all the processes you selected, it needs some time. Then, you will see the final results of your preparation! You can also download your new corpus.

BEFORE PREP

AFTER PREP

38 of 52

Lexos: Visualize > Rolling Window

Rolling window allows you to look at word trends across one document. To use a rolling window, first select a single text in the "Manage" screen, then:

  1. Go to “Visualize > Rolling Window” and type in a search term you want to visualize. You can also search multiple terms by clicking “String” and separating words with a comma (climate, action)
  2. Choose a Window size (the number of words each “window” contains). For shorter documents, a window of 300 or 500 words works well. For larger documents, you may want to make your window larger. Play around with the window size until you get a visualization that makes sense.
  3. Click “Generate”
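The idea behind the rolling window, counting a term's frequency inside each successive chunk of words, can be sketched like this (a simplification of what Lexos computes; the step size is an assumption):

```python
def rolling_frequency(words, term, window=300, step=100):
    """Frequency ratio of `term` inside each successive window of words."""
    ratios = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        chunk = words[start:start + window]
        ratios.append(chunk.count(term) / len(chunk))
    return ratios

# Usage sketch (hypothetical filename):
# words = open("sotu_2014.txt").read().lower().split()
# rolling_frequency(words, "america")
```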

39 of 52

Lexos: Visualize > Rolling Window Results

Using Obama’s 2014 State of the Union Address and searching for the word ‘America’ with a window of 300, we can get an idea of how this word is used in the document.

The x-axis in this graph is the number of words, characters, or lines in the document. The y-axis is the word frequency ratio.

40 of 52

Lexos: Analyze > Top Words (1/2)

The top words tool lets you compare word usage between individual documents and your corpus as a whole. If you want to make more specific comparisons, you can also assign “classes” to subsets of texts in the “Manage” screen.

  • Words with high positive scores are used more often in each document, relative to the rest of the corpus.
  • Words with high negative scores are used less often.

Hit the “Generate” button to see the top words for your texts.
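Lexos uses its own statistic for these scores; the underlying idea, comparing a word's share of one document against its share of the whole corpus, can be sketched with a simple proportion difference (an assumption for illustration, not Lexos's exact formula):

```python
from collections import Counter

def top_words(doc_words, corpus_words, top_n=3):
    """Score words by document share minus corpus share; positive = overused here."""
    doc, corpus = Counter(doc_words), Counter(corpus_words)
    scores = {w: doc[w] / len(doc_words) - corpus[w] / len(corpus_words)
              for w in doc}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```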

41 of 52

Lexos: Analyze > Top Words (2/2)

42 of 52

Lexos: Analyze > Dendrogram

The dendrogram demonstrates similarity between the different documents. Dendrograms require at least two documents to compare. Dendrograms “cluster” texts to draw out similarities:

  • The greater the distance between texts, the less similar they are.
  • The smaller the distance between texts, the more similar they are.
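"Distance" here is computed between the documents' word-count vectors. One common choice is cosine distance (Lexos offers several distance metrics; this sketch uses cosine purely for illustration):

```python
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    """1 - cosine similarity of the two texts' word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (norm_a * norm_b)

# Identical texts → distance 0; texts with no shared words → distance 1.
```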

43 of 52

Lexos: Analyze > Dendrogram Example

44 of 52

Lexos: Save or Reset Your Results

Lexos allows you to save your results as a Lexos file. If you do this, you can re-upload the Lexos file any time to access your cleaned-up corpus as well as the different analyses you’ve done. You can also download modified text files from the “Manage” page—and you can even use those downloaded text files with other tools!

You can also save individual visualizations as images (PNGs).

Finally, if you want to start over, you can “Reset” your Lexos dashboard.

45 of 52

Your Turn!

46 of 52

Your Turn! Voyant and Lexos

Use the sample text or texts of your choice and begin practicing web-browser text analysis. Try uploading text to Voyant or Lexos and explore their features!

  • What interesting or surprising results came up? How might you interpret those results?
  • What other kinds of documents would be useful to compare?
  • How could you use these tools for your research? Which features do you think will be useful in your analysis?
  • How might text analysis complement other research methods?
  • Between Voyant and Lexos, which tool do you prefer and why?

47 of 52

For Further Exploration

48 of 52

Further Exploration: Topic Modeling

Topic modeling is a machine learning method that uses word co-occurrence within documents to identify "topics," or clusters of related terms. This is a topic model based on the Greater Boston Priority Climate Action Plan. In the visualization, topic 3 is selected.

Topic model code generation assisted by ChatGPT and Gemini
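Real topic models (such as LDA) involve much more machinery, but the raw signal they build on, word co-occurrence within documents, is easy to compute directly (a toy sketch with invented documents):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each pair of words appears in the same document."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

docs = ["climate action plan", "climate action now", "budget plan"]
counts = cooccurrence_counts(docs)
print(counts[("action", "climate")])  # → 2: the pair co-occurs in two documents
```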

49 of 52

Further Exploration: Sentiment Analysis

Sentiment analysis uses dictionaries and sometimes machine learning to assign sentiment scores (e.g., positive and negative) to documents. You can try this out with the "Drag and Drop Sentiment Analysis" tool.
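A minimal dictionary-based scorer looks like this (the lexicon is a toy invention; real tools use large, weighted lexicons and handle features like negation):

```python
# Toy lexicon: word → sentiment weight (invented for illustration).
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment_score(text):
    """Sum the lexicon weights of the words; positive totals suggest positive tone."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

print(sentiment_score("a great plan with no bad surprises"))  # → 1
```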

50 of 52

Data Privacy

  • It’s important to pay attention to data privacy when using digital resources.
  • At its simplest, data privacy is a person’s ability to control which of their personal information is shared and with whom.
  • To help you make informed decisions about interacting with digital tools in ways that honor your boundaries around your data and personal information, the DITI has prepared a handout on Data Privacy.

51 of 52

Resources

52 of 52

Thank you!

Developed by Cara Marta Messina, Juniper Johnson, and Jeff Sternberg