1 of 14

Topic Modeling

now what???

Zoe Borovsky, Ph.D.

Librarian for Digital Research and Scholarship

Adjunct Professor of Scandinavian

zoe@library.ucla.edu

Prepared for UCLA DH 201

October 22, 2012

2 of 14

let's write a document

20% Topic 1 (Digitization)

20% Topic 2

10% Topic 3

50% Topic 4

TEI, OCR, scanner

edition, reading, interpretation
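The idea on this slide can be sketched in a few lines of Python: a document is "written" by repeatedly picking a topic according to the percentages, then picking a word from that topic. This is only an illustration. The word lists for Topics 2 and 3 are placeholders, and attaching "edition, reading, interpretation" to Topic 4 is an assumption the slide does not confirm.

```python
import random

# Topic proportions from the slide.
mixture = {"Topic 1": 0.20, "Topic 2": 0.20, "Topic 3": 0.10, "Topic 4": 0.50}

# Word lists per topic. Only the Topic 1 words are tied to a topic on the slide;
# the Topic 4 assignment is an assumption, and Topics 2 and 3 are placeholders.
topic_words = {
    "Topic 1": ["TEI", "OCR", "scanner"],
    "Topic 2": ["t2_word_a", "t2_word_b"],
    "Topic 3": ["t3_word_a", "t3_word_b"],
    "Topic 4": ["edition", "reading", "interpretation"],
}

def write_document(n_words, rng):
    """Pick a topic by its weight, then a word from that topic, n_words times."""
    topics = list(mixture)
    weights = [mixture[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=weights)[0]
        words.append(rng.choice(topic_words[topic]))
    return words

document = write_document(50, random.Random(0))
print(" ".join(document))
```

Topic modeling runs this story in reverse: it starts from the words in the documents and infers the topics and their proportions.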

3 of 14

words, topics, documents

4 of 14

You should see this:

List of documents

Topic

Contribution

5 of 14

Doc: Text Encoding (Chapter 17)

What's the main topic for this chapter?

How much does that topic contribute to this chapter?

What are the words that cluster in this topic?

(Need to consult topic words)

What's a good name for this topic?

If I'm interested in text encoding--what other chapters in this book should I read?
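The first and last questions above can be sketched in code once the Topic and Contribution columns are loaded. The chapter names (other than Chapter 17), topic numbers, and values below are made up for illustration, and the 0.2 cutoff is an arbitrary assumption.

```python
# Hypothetical data: chapter -> {topic number: contribution}.
corpus = {
    "Text Encoding (Chapter 17)": {20: 0.41, 7: 0.22, 3: 0.05},
    "Chapter A": {20: 0.30, 3: 0.35},
    "Chapter B": {7: 0.55, 3: 0.10},
}

def main_topic(topics):
    """Return (topic, contribution) for the topic that contributes most."""
    return max(topics.items(), key=lambda item: item[1])

def chapters_about(topic, corpus, threshold=0.2):
    """List chapters where `topic` contributes at least `threshold`, strongest first."""
    hits = [(name, topics.get(topic, 0.0)) for name, topics in corpus.items()]
    hits = [(name, share) for name, share in hits if share >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

topic, share = main_topic(corpus["Text Encoding (Chapter 17)"])
print(topic, share)
print(chapters_about(topic, corpus))
```

Naming the topic still takes a human reader: the code can find the topic number, but only the topic words suggest what to call it.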

6 of 14

Visualizing a topic model -- one document

7 of 14

New questions

Is there another chapter that has the same distribution of topics?

Are there some chapters that have a unique mixture of topics?

Which topics seem to contribute the most to this collection of documents?
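One possible way to approach these questions, sketched with made-up numbers: treat each chapter as a list of topic contributions, compare chapters with cosine similarity, and add up each topic's contributions across the collection. Cosine similarity is one choice among several, not something the slides prescribe, and the chapter names here are hypothetical.

```python
import math

# Hypothetical contributions of three topics to three chapters.
doc_topics = {
    "chapter_a": [0.6, 0.3, 0.1],
    "chapter_b": [0.6, 0.3, 0.1],
    "chapter_c": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the same mixture of topics."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(name, table):
    """Return (other chapter, similarity) for the closest match to `name`."""
    others = [(other, cosine(table[name], vec))
              for other, vec in table.items() if other != name]
    return max(others, key=lambda pair: pair[1])

def most_unique(table):
    """Return the chapter whose best match is the weakest."""
    return min(table, key=lambda name: most_similar(name, table)[1])

def topic_totals(table):
    """Sum each topic's contribution over all chapters."""
    n_topics = len(next(iter(table.values())))
    return [sum(vec[i] for vec in table.values()) for i in range(n_topics)]

print(most_similar("chapter_a", doc_topics))
print(most_unique(doc_topics))
print(topic_totals(doc_topics))
```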

8 of 14

What is a topic?

It's a cluster of words -- ones that tend to occur in documents that are about a specific topic.

Topic 20: field, textual, autopoietic, marked, dimension, system, fields, markup, dementian, forms

We asked Mallet to give us the top 10 words for each topic.

Try searching the text for the word "field" -- which documents contain the most instances of that word?
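A minimal sketch of that search, assuming the documents are already loaded as plain-text strings (the two sample texts are invented). It counts exact matches only, so "fields" is not counted as "field".

```python
import re

def word_rate(text, word):
    """Return (count, share of all words) for exact matches of `word`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    count = tokens.count(word)
    return count, (count / len(tokens) if tokens else 0.0)

# Hypothetical documents standing in for the book chapters.
docs = {
    "doc_a": "The field of markup marks a textual field.",
    "doc_b": "Scanners and OCR produce page images.",
}

# Rank documents by how often "field" occurs, highest first.
ranked = sorted(docs, key=lambda name: word_rate(docs[name], "field")[0], reverse=True)
for name in ranked:
    print(name, word_rate(docs[name], "field"))
```

Reporting the share as well as the raw count keeps long chapters from winning on length alone.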

9 of 14

Why use topic modeling?

Because we're interested in meaning -- not just strings of characters.

How many meanings are there for the word "field"? Topic modeling would list "field" -- as in crop, grain, agriculture -- in another topic.

We "train" a topic model on a group of sample texts, then apply that model to new texts as a way of categorizing them according to our model.

10 of 14

So, if we had hundreds of DH papers...

And we wanted to find the ones about text markup and encoding, we'd use this "model", but run it on our new corpus.

Then we'd look for documents that fit the model of documents that are about text encoding.

11 of 14

Visualizing/Analyzing the corpus

What are the ways we can visualize the results?

Is there another chapter that has the same distribution of topics?

Are there some chapters that have a unique mixture of topics?

Which topics seem to contribute the most to this collection of documents?

12 of 14

More graphs

With this much data, you need visualizations to INTERPRET data.

Graphs are typically arranged along x and y axes, based on data arranged in rows and columns.

              Topic One    Topic Two
Chapter One   0.123        0.234
Chapter Two   0            0.345
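As a rough illustration of how rows and columns become a graph, the sketch below draws the small table from this slide as plain-text bars, one group per chapter. The bar width of 40 characters is an arbitrary choice; a spreadsheet chart does the same job with the same row-and-column layout.

```python
# The example table from this slide: rows are chapters, columns are topics.
table = {
    "Chapter One": {"Topic One": 0.123, "Topic Two": 0.234},
    "Chapter Two": {"Topic One": 0.0, "Topic Two": 0.345},
}

def text_bars(table, width=40):
    """Draw one '#' bar per cell; a contribution of 1.0 would fill `width` characters."""
    lines = []
    for chapter, topics in table.items():
        lines.append(chapter)
        for topic, value in topics.items():
            bar = "#" * round(value * width)
            lines.append(f"  {topic:<10} {bar} {value:.3f}")
    return "\n".join(lines)

print(text_bars(table))
```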

13 of 14

More graphs, cont.

Make a copy of the original spreadsheet.

Copy three columns at a time to a new spreadsheet.

Delete the copied Topic and Contribution columns from the copy.

Move the next Topic and Contribution columns next to the Filename column.

Repeat.

Filename (Chapter)    Topic    Contribution
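The copy-and-paste steps can also be sketched as a short script. It assumes each spreadsheet row is a filename followed by repeating Topic and Contribution pairs, as the steps above imply; the filenames and values are made up, and a real export may carry extra columns that would need to be dropped first.

```python
import csv
import io

# Hypothetical "wide" rows: filename, then topic/contribution pairs.
wide = "ch01.txt,20,0.41,7,0.22\nch02.txt,7,0.63,20,0.10\n"

def wide_to_long(rows):
    """Turn each row into one (filename, topic, contribution) row per pair."""
    long_rows = []
    for row in rows:
        filename, pairs = row[0], row[1:]
        # Step through the pairs two columns at a time.
        for i in range(0, len(pairs) - 1, 2):
            long_rows.append((filename, int(pairs[i]), float(pairs[i + 1])))
    return long_rows

long_rows = wide_to_long(csv.reader(io.StringIO(wide)))
for row in long_rows:
    print(row)
```

The result has the three columns shown above, with one row for every chapter and topic pair.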

14 of 14

More graphs, cont.