Introduction to Computational Text Analysis
Digital Integration Teaching Initiative (DITI)
Workshop Agenda
For more information, please see: https://bit.ly/handout-text-resources
What is Computational Text Analysis?
Computational Text Analysis
Computational text analysis refers to the array of methods used to “read” texts with a computer. It resembles statistical analysis, but the data are texts (words) rather than numbers.
Text analysis:
Why Computational Text Analysis?
Computational text analysis can help us analyze very large amounts of data, identify keywords, and discover patterns in texts. Using text analysis, researchers may find surprising results that they would not have discovered from traditional methods alone.
For example: "Gendered Language in Teacher Reviews" by Ben Schmidt shows stark differences in the ways that male and female professors are reviewed on "Rate My Professor."
Language Used in Climate News
Word Cloud of TV News on “Global warming.” Terms like “believe” and “threat” appear frequently with “global warming” in TV news coverage since 2009.
Climate News: Discussion
Go to the Television Explorer. Search “global warming,” “climate crisis,” and “greenhouse effect.”
Gendered Language
Go to bit.ly/schmidt-gender and try a few queries.
For example:
How do you think Schmidt determined gender for this tool?
Key Terms
Corpus Building
Questions to consider as you begin your research:
Text Analysis in non-English Languages
Below are some resources that may be helpful if you are considering using non-English language texts:
Our Corpus
For our corpus, we will work with a set of State of the Union addresses from 1990 to 2019.
The easiest way to work with these files is to choose "Download all" and open them with a plain-text editor (TextEdit on Mac, Notepad on Windows). Mac users should be able to click on the zip file to expand it; Windows users will need to right-click and choose "Extract all."
Initial Corpus Analysis
Open any one of the texts from the sample corpus:
What can you observe about the text? How long is it? What kinds of language does it use? What kinds of analysis might you do with a text like this?
Scan through a few more: do they seem largely similar? What do you think might be different?
Exploratory Tools:
Word Counter and Word Trees
Word Counter
Word Counter Example
This is a word cloud, used to get a sense of the most frequently used words in a document. Words used more often appear larger than those used less often.
What seems significant in the most frequent terms from Clinton’s 2000 State of the Union Address?
“Tokenizing” Text
Why do you think that “000” is one of the most common words in Clinton’s 2000 SotU address? Open the .txt file and search for “000” to check your guess.
Before words can be counted, they must be “tokenized” or divided into components that programs can treat as distinct segments. Different programs will have different standards for tokenization—this one uses both white spaces and punctuation marks (such as commas) to separate words into tokens. What are some limitations of this approach?
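As a rough illustration of tokenizing on white space and punctuation, here is a minimal Python sketch. It is not WordCounter's actual code: the `tokenize` function and the sample sentence are invented for this example, and real tools may handle apostrophes, hyphens, and numbers differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase tokens on white space and punctuation."""
    # Assumption for this sketch: any run of characters that is not a letter,
    # digit, or apostrophe counts as a separator (so commas split numbers).
    return [tok for tok in re.split(r"[^A-Za-z0-9']+", text.lower()) if tok]

# Invented sample sentence, not a quotation from any address.
sample = "The budget grew by 2,000,000 dollars, not 3,000,000."
tokens = tokenize(sample)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(3))
```

Because commas act as separators here, a figure such as "2,000,000" becomes three tokens ("2", "000", "000"), which is one way a token like "000" can climb to the top of a frequency list.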
Data Preparation
Go back to the upload/paste screen for WordCounter and uncheck the "ignore stopwords" and "ignore cases" options, then count the words again.
What happened? Why do you think the default is to ignore stopwords and remove differences between upper/lowercase words?
Can you think of any limitations to this approach?
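To see what these two options do, consider the following small Python sketch. The stopword list, the `count_words` function, and the sample sentence are all hypothetical and much simpler than what WordCounter uses; the point is only to show how the counts change when the options are switched on or off.

```python
from collections import Counter

# A tiny, invented stopword list; real lists contain well over a hundred words.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "we"}

def count_words(words, ignore_case=True, ignore_stopwords=True):
    """Count words, optionally lowercasing them and dropping stopwords."""
    if ignore_case:
        words = [w.lower() for w in words]
    if ignore_stopwords:
        # Compare in lowercase so "The" is treated as the stopword "the".
        words = [w for w in words if w.lower() not in STOPWORDS]
    return Counter(words)

# Invented sample sentence.
words = "The state of the Union is strong and the People of the Union agree".split()

print(count_words(words).most_common(3))
print(count_words(words, ignore_case=False, ignore_stopwords=False).most_common(3))
```

With both options off, "the" dominates the counts and "The" is tallied separately from "the"; with both on, content words such as "union" rise to the top.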
Bigrams and Trigrams
In addition to single words, it is also useful to consider bigrams and trigrams. Why do you think the phrase “I ask you” appears so often in the 2000 State of the Union Address? What about “we should”?
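Counting bigrams and trigrams works much like counting single words, except that a window of two or three tokens slides across the text. A minimal Python sketch follows; the `ngrams` helper and the sample sentence are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample text, already lowercased and split into tokens.
tokens = "i ask you to act and i ask you to lead".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(2))
```

Repeated rhetorical formulas show up as high-count n-grams even when the individual words are too common to stand out on their own.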
Word Tree
Word Tree Example
Tools for Corpus Exploration: Voyant
Voyant
Voyant makes it possible to analyze one or more files in many ways, including word counts, n-grams (where n is the number of words in a sequence), word frequency distributions, word trends across documents, and concordances.
For more information, see: https://bit.ly/handout-voyant-intro
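A concordance lists each occurrence of a keyword along with the words around it. The Python sketch below shows the general idea only; it is not how Voyant builds its concordances, and the `concordance` function, the `window` parameter, and the sample text are invented for this example.

```python
def concordance(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Brackets mark the keyword; strip() tidies matches at the edges.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text.
tokens = "we should invest in schools and we should invest in science".split()

for line in concordance(tokens, "invest", window=2):
    print(line)
```

Reading the keyword in context like this can show whether a frequent word is used in the same way each time it appears.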
Voyant: Upload
Click on Upload and navigate to the folder with the text documents you wish to analyze.
Alternatively, insert URLs or full text into the textbox.
Click here for help and advanced options
Voyant: Dashboard
Results:
After you upload your corpus, you will see the default results page with multiple panes:
These boxes can all be changed!
Voyant: Changing Displayed Results
Hover over the top-right corner of a pane, and buttons will appear. Select the panes button and choose a new option from the dropdown menu. For example, we might want to try out the "Collocates" tool instead of the word cloud. Click on the ‘?’ to learn more about how the tool works.
Voyant: Tools for Further Exploration
Tools for Corpus Exploration: Lexos
Lexos
Lexos provides a step-by-step guide for text uploading, preparation, and analysis.
http://lexos.wheatoncollege.edu/upload
For more information, please see: https://bit.ly/handout-Lexos-intro
Lexos: Upload
Click Browse and select your entire text (or drag the file into the “Drag Files Here” area). It is easy to miss when the upload finishes, so click “Manage” to double-check that the text file is there.
Lexos: Manage
Make sure the document you want to use is selected (blue = selected, gray = not selected).
Lexos: Prepare (Scrub Case and Punctuation)
Lexos demonstrates some more advanced options you have for preparing your corpus. By “scrubbing,” you are transforming the texts in your corpus and making choices that will impact your results. Here are some possibilities:
Lexos: Prepare (Scrub Words)
You can also stem words and remove certain words. Here are some possibilities:
Lexos: Removing Stopwords
Get a list of English stopwords here: https://gist.github.com/sebleier/554280. Copy the stopwords (hit "raw", then select all and copy), paste them into the “Stop/Keep Words” box, and then select “Stop.”
Lexos: Applying your Preparations
Once you have made decisions about your preparations, click “Apply” and wait a few minutes. Because the program is going through each document and completing all the processes you selected, it needs some time. Then, you will see the final results of your preparation! You can also download your new corpus.
BEFORE PREP
AFTER PREP
Lexos: Analyze > Top Words (1/2)
The top words tool lets you compare word usage between individual documents and your corpus as a whole. If you want to make more specific comparisons, you can also assign “classes” to subsets of documents on the “Manage” screen.
Hit the “Generate” button to see the top words for your texts.
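One simple way to think about comparing a document with the whole corpus is to ask which words make up a larger share of the document than of the corpus. The Python sketch below uses a plain difference in relative frequency. This is an assumption made for illustration and should not be read as the statistic Lexos computes. The function names and the two tiny documents are invented.

```python
from collections import Counter

def relative_freqs(tokens):
    """Map each word to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def distinctive_words(doc_tokens, corpus_tokens, top_n=3):
    """Rank a document's words by how much more frequent they are than in the corpus."""
    doc = relative_freqs(doc_tokens)
    corpus = relative_freqs(corpus_tokens)
    diffs = {word: doc[word] - corpus.get(word, 0.0) for word in doc}
    # Sort by largest difference first; break ties alphabetically.
    return sorted(diffs, key=lambda w: (-diffs[w], w))[:top_n]

# Two invented mini-documents; the corpus is simply both combined.
doc_a = "jobs jobs economy growth".split()
doc_b = "peace security peace economy".split()
corpus = doc_a + doc_b

print(distinctive_words(doc_a, corpus, top_n=2))
print(distinctive_words(doc_b, corpus, top_n=2))
```

A word like "economy," which appears at the same rate in the document and the corpus, does not rank as distinctive even though it is frequent.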
Lexos: Analyze > Top Words (2/2)
Lexos: Analyze > Dendrogram
The dendrogram shows how similar the different documents are to one another. Dendrograms require at least two documents to compare. They “cluster” texts to draw out similarities.
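Clustering starts from a measure of how similar each pair of documents is. The Python sketch below computes cosine similarity between word-count vectors and finds the closest pair, which is the first merge a bottom-up clustering would make. The choice of cosine similarity, the function name, and the three tiny documents are assumptions for illustration; Lexos offers its own distance and linkage settings.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two word-count dictionaries (0 = no overlap)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Three invented mini-documents.
docs = {
    "doc_a": "jobs economy jobs growth",
    "doc_b": "economy growth jobs taxes",
    "doc_c": "peace security treaty peace",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

# The most similar pair is the first one a bottom-up clustering would join.
closest = max(
    combinations(sorted(vectors), 2),
    key=lambda pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]]),
)
print(closest)
```

A full dendrogram repeats this step, merging the closest documents or groups until everything is joined into a single tree.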
Lexos: Analyze > Dendrogram Example
Lexos: Save or Reset Your Results
Lexos allows you to save your results as a Lexos file. If you do this, you can re-upload the Lexos file any time to access your cleaned-up corpus as well as the different analyses you’ve done. You can also download modified text files from the “Manage” page—and you can even use those downloaded text files with other tools!
You can also save individual visualizations as images (PNGs).
Finally, if you want to start over, you can “Reset” your Lexos dashboard.
Your Turn!
Your Turn! Voyant and Lexos
Use the sample text or texts of your choice and begin practicing web-browser text analysis. Try uploading text to Voyant or Lexos and explore their features!
For Further Exploration
Further Exploration: Topic Modeling
Topic modeling is a machine learning method that uses word co-occurrence within documents to identify "topics," or clusters of related terms. This is a topic model based on the Greater Boston Priority Climate Action Plan. In the visualization, topic 3 is selected.
Further Exploration: Sentiment Analysis
Sentiment analysis uses dictionaries and sometimes machine learning to assign sentiment scores (e.g., positive and negative) to documents. You can try this out with the "Drag and Drop Sentiment Analysis" tool.
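The dictionary-based approach can be sketched in a few lines of Python. The two word lists and the `sentiment_score` function below are invented for illustration and are far smaller than any real sentiment dictionary; this is not the method used by the "Drag and Drop Sentiment Analysis" tool.

```python
# Tiny, invented sentiment dictionaries; real ones contain thousands of entries.
POSITIVE = {"good", "great", "strong", "hope"}
NEGATIVE = {"bad", "weak", "threat", "crisis"}

def sentiment_score(tokens):
    """Return (positive word count) minus (negative word count)."""
    positive = sum(1 for tok in tokens if tok in POSITIVE)
    negative = sum(1 for tok in tokens if tok in NEGATIVE)
    return positive - negative

# Invented sample sentences.
print(sentiment_score("the economy is strong and we have great hope".split()))
print(sentiment_score("a weak response to the crisis".split()))
print(sentiment_score("this is not good".split()))
```

The last sentence scores as positive because the sketch ignores "not," which illustrates a common limitation of simple dictionary methods: they miss negation, sarcasm, and context.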
Data Privacy
Resources
DITI handouts on building a corpus and more links and resources for text analysis
DITI handout on troubleshooting text analysis
NULab list of resources for text analysis
Programming Historian tutorials
“Data-Sitters’ Club” tutorials
Library subject guides on text mining and analysis: guide on getting started, guide on vendor policies
Thank you!
—Developed by Cara Marta Messina, Juniper Johnson, and Jeff Sternberg