Topic Modeling
now what???
Zoe Borovsky, Ph.D.
Librarian for Digital Research and Scholarship
Adjunct Professor of Scandinavian
Prepared for UCLA DH 201
October 22, 2012
let's write a document
20% Topic 1 (Digitization)
20% Topic 2
10% Topic 3
50% Topic 4
TEI, OCR, scanner
edition, reading, interpretation
words, topics, documents
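Here is a minimal sketch of "writing a document" the way a topic model imagines it: pick a topic for each word according to the percentages above, then pick a word from that topic. The pairing of the three word lists with Topics 1 to 3 is an assumption, and the words for Topic 4 are hypothetical placeholders, since the slide does not list any.

```python
import random

# Topic proportions taken from the slide.
topic_mixture = {"Topic 1": 0.20, "Topic 2": 0.20, "Topic 3": 0.10, "Topic 4": 0.50}

# Word lists for Topics 1-3 come from the slide (the pairing is assumed).
# Topic 4's words are hypothetical placeholders.
topic_words = {
    "Topic 1": ["TEI", "OCR", "scanner"],
    "Topic 2": ["edition", "reading", "interpretation"],
    "Topic 3": ["words", "topics", "documents"],
    "Topic 4": ["placeholder_a", "placeholder_b", "placeholder_c"],
}

def write_document(n_words, seed=0):
    """Generate n_words by sampling a topic, then a word from that topic."""
    rng = random.Random(seed)
    names = list(topic_mixture)
    weights = [topic_mixture[name] for name in names]
    document = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=weights, k=1)[0]
        document.append(rng.choice(topic_words[topic]))
    return document

doc = write_document(20)
print(" ".join(doc))
```

Topic modeling runs this story backwards: it starts from the finished documents and estimates the topics and their proportions.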
You should see this:
List of documents
Topic
Contribution
Doc: Text Encoding (Chapter 17)
What's the main topic for this chapter?
How much does that topic contribute to this chapter?
What are the words that cluster in this topic?
(Need to consult topic words)
What's a good name for this topic?
If I'm interested in text encoding--what other chapters in this book should I read?
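The first, second, and last questions above can be sketched in a few lines, assuming the spreadsheet has been read into (filename, topic, contribution) rows. The filenames, topic numbers, and values here are hypothetical, as is the 0.10 cutoff for "worth reading".

```python
# Hypothetical rows in the shape of the spreadsheet: filename, topic, contribution.
rows = [
    ("chapter17.txt", 20, 0.45),
    ("chapter17.txt", 3, 0.20),
    ("chapter17.txt", 7, 0.10),
    ("chapter16.txt", 20, 0.30),
    ("chapter16.txt", 5, 0.25),
    ("chapter02.txt", 3, 0.50),
    ("chapter02.txt", 20, 0.05),
]

def main_topic(rows, filename):
    """Return (topic, contribution) for the largest topic in one document."""
    doc_rows = [r for r in rows if r[0] == filename]
    _, topic, contribution = max(doc_rows, key=lambda r: r[2])
    return topic, contribution

def other_documents(rows, topic, exclude, min_contribution=0.10):
    """List other documents where the topic contributes at least the cutoff."""
    hits = [(f, c) for f, t, c in rows
            if t == topic and f != exclude and c >= min_contribution]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

topic, contribution = main_topic(rows, "chapter17.txt")
print(topic, contribution)
print(other_documents(rows, topic, exclude="chapter17.txt"))
```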
Visualizing a topic model -- one document
New questions
Is there another chapter that has the same distribution of topics?
Are there some chapters that have a unique mixture of topics?
Which topics seem to contribute the most to this collection of documents?
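One possible way to approach these three questions is to treat each chapter as a list of topic contributions and compare the lists. The sketch below uses cosine similarity, which is only one of several reasonable measures, and the chapter names and numbers are hypothetical.

```python
import math

# Hypothetical topic contributions per chapter (one number per topic).
doc_topics = {
    "Chapter One": [0.50, 0.30, 0.20],
    "Chapter Two": [0.48, 0.32, 0.20],
    "Chapter Three": [0.05, 0.05, 0.90],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the same mixture, 0.0 means nothing shared."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(name):
    """Return (other chapter, similarity) for the closest other chapter."""
    others = [(other, cosine(doc_topics[name], vec))
              for other, vec in doc_topics.items() if other != name]
    return max(others, key=lambda pair: pair[1])

def most_unique():
    """The chapter whose closest neighbour is the least similar."""
    return min(doc_topics, key=lambda name: most_similar(name)[1])

def topic_totals():
    """Sum each topic's contribution across all chapters."""
    n_topics = len(next(iter(doc_topics.values())))
    return [sum(vec[i] for vec in doc_topics.values()) for i in range(n_topics)]

print(most_similar("Chapter One"))
print(most_unique())
print(topic_totals())
```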
What is a topic?
It's a cluster of words -- ones that tend to occur together in documents about the same subject.
Topic 20: field, textual, autopoietic, marked, dimension, system, fields, markup, dementian, forms
We asked Mallet to give us the top 10 words for each topic.
Try searching the text for the word "field" -- which documents have the most instances of that word?
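The search can be sketched as a simple word count per document. The three documents below are hypothetical stand-ins for the chapter files, and the tokenizer is a deliberately crude one (lowercase letters only).

```python
import re
from collections import Counter

# Hypothetical documents standing in for the chapter files.
documents = {
    "doc_a.txt": "The textual field is a marked field. Fields of markup overlap.",
    "doc_b.txt": "A scanner and OCR software digitize the page.",
    "doc_c.txt": "Each field in the database record is indexed.",
}

def count_word(text, word):
    """Count exact matches of word after lowercasing and splitting on non-letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)[word]

counts = {name: count_word(text, "field") for name, text in documents.items()}
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for name, n in ranked:
    print(name, n)
```

Note that the count matches the string only: "fields" is not counted, and the database sense of "field" in doc_c.txt is counted even though it means something different.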
Why use topic modeling?
Because we're interested in meaning -- not just strings of characters.
How many meanings are there for the word "field"? A topic model would place "field" in the agricultural sense -- as in crop, grain, agriculture -- in a different topic.
We "train" a model on a group of sample texts, then apply the topic model to new texts as a way of categorizing them according to our model.
So, if we had hundreds of DH papers...
And we wanted to find the ones about text markup and encoding, we'd use this "model", but run it on our new corpus.
Then we'd look for documents that fit the model of documents that are about text encoding.
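As a rough illustration of "looking for documents that fit", the sketch below scores each new paper by the share of its words that appear in Topic 20's top 10 words. This is a crude stand-in and not how Mallet infers topics for new documents; the two papers are hypothetical.

```python
import re

# Top 10 words for Topic 20, copied from the slide.
topic_20 = {"field", "textual", "autopoietic", "marked", "dimension",
            "system", "fields", "markup", "dementian", "forms"}

# Hypothetical new papers.
new_corpus = {
    "paper_1.txt": "Markup forms a textual system with marked fields.",
    "paper_2.txt": "We surveyed museum visitors about gallery lighting.",
}

def topic_share(text, topic_words):
    """Fraction of a text's words that belong to the topic's top words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in topic_words) / len(tokens)

scores = {name: topic_share(text, topic_20) for name, text in new_corpus.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

A trained model does more than match words: it weighs every word against every topic at once, which is why the same word can land in different topics depending on its neighbours.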
Visualizing/Analyzing the corpus
What are the ways we can visualize the results?
Is there another chapter that has the same distribution of topics?
Are there some chapters that have a unique mixture of topics?
Which topics seem to contribute the most to this collection of documents?
More graphs
With this much data, you need visualizations to INTERPRET data.
Graphs are typically laid out on x and y axes, based on data arranged in rows and columns.
            | Topic One | Topic Two |
Chapter One | 0.123     | 0.234     |
Chapter Two | 0         | 0.345     |
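A table like this one can be built from (chapter, topic, contribution) rows. The sketch below assumes that a missing chapter and topic pair means a contribution of 0, which is how the Chapter Two and Topic One cell is filled in.

```python
# Rows matching the example table; Chapter Two has no row for Topic One.
long_rows = [
    ("Chapter One", "Topic One", 0.123),
    ("Chapter One", "Topic Two", 0.234),
    ("Chapter Two", "Topic Two", 0.345),
]

def pivot(rows):
    """Turn (chapter, topic, value) rows into a chapters-by-topics grid."""
    chapters, topics = [], []
    for chapter, topic, _ in rows:
        if chapter not in chapters:
            chapters.append(chapter)
        if topic not in topics:
            topics.append(topic)
    values = {(chapter, topic): value for chapter, topic, value in rows}
    # Assumption: a pair that never appears contributes 0.
    table = [[values.get((c, t), 0) for t in topics] for c in chapters]
    return chapters, topics, table

chapters, topics, table = pivot(long_rows)
print(" | ".join([""] + topics))
for chapter, row in zip(chapters, table):
    print(" | ".join([chapter] + [str(value) for value in row]))
```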
More graphs, cont.
Make a copy of the original spreadsheet.
Copy three columns at a time to a new spreadsheet.
In the copy of the original, delete the Topic & Contribution columns you just copied.
Move the (next) Topic & Contribution columns next to the Filename column, then repeat.
Filename (Chapter) | Topic | Contribution |
| | |
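The copy-and-paste steps above can also be done in one pass. This sketch assumes each spreadsheet row is a filename followed by repeating Topic and Contribution pairs; the filenames and numbers are hypothetical.

```python
# Hypothetical spreadsheet rows: filename, then Topic/Contribution pairs.
wide_rows = [
    ["chapter01.txt", 4, 0.50, 1, 0.30, 2, 0.20],
    ["chapter02.txt", 2, 0.60, 4, 0.40],
]

def to_long(wide_rows):
    """Produce one (filename, topic, contribution) row per pair."""
    long_rows = []
    for row in wide_rows:
        filename, pairs = row[0], row[1:]
        # Step through the pairs two cells at a time; a trailing unpaired cell is ignored.
        for i in range(0, len(pairs) - 1, 2):
            long_rows.append((filename, pairs[i], pairs[i + 1]))
    return long_rows

long_rows = to_long(wide_rows)
print("Filename (Chapter) | Topic | Contribution |")
for filename, topic, contribution in long_rows:
    print(f"{filename} | {topic} | {contribution} |")
```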
More graphs, cont.
Motion chart
https://docs.google.com/spreadsheet/ccc?key=0Ap-nABJyLNeMdFRaaHdrRGp6aVRfUmhzNTlodlNMZnc#gid=4
Network diagram (Fusion table): shows how chapters relate to topics and which topics are shared.
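The data behind a network diagram is just a list of chapter-to-topic links. A minimal sketch, assuming a link is drawn only when a topic contributes at least 0.10 to a chapter; that cutoff and the sample rows are hypothetical, not settings from the Fusion table.

```python
from collections import defaultdict

# Hypothetical (chapter, topic, contribution) rows.
rows = [
    ("Chapter One", "Topic One", 0.40),
    ("Chapter One", "Topic Two", 0.05),
    ("Chapter Two", "Topic One", 0.30),
    ("Chapter Two", "Topic Three", 0.50),
    ("Chapter Three", "Topic Three", 0.02),
]

def build_edges(rows, threshold=0.10):
    """Keep a chapter-topic link only when the contribution reaches the cutoff."""
    return [(chapter, topic) for chapter, topic, weight in rows if weight >= threshold]

def shared_topics(edges):
    """Return topics linked to more than one chapter, with those chapters."""
    chapters_by_topic = defaultdict(list)
    for chapter, topic in edges:
        chapters_by_topic[topic].append(chapter)
    return {t: cs for t, cs in chapters_by_topic.items() if len(cs) > 1}

edges = build_edges(rows)
print(edges)
print(shared_topics(edges))
```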