WordSeer is a research project at UC Berkeley's Computer Science Division and School of Information. It's a web-based text analysis and sensemaking environment for humanists and social scientists.
Let's unpack that:
If you're collaborating with us, you'll know your WordSeer URL -- it'll look like wordseer.berkeley.edu/<something>
When you open up WordSeer for the first time, it should show you something like this:
Figure 2: What you see when you open up WordSeer Shakespeare.
It may seem like a lot of information, but it breaks down into simple components. You're seeing a single panel with a
The next section explains these terms.
WordSeer has a few different types of components that fit together in different ways to help you analyze your texts.
The most important idea in WordSeer is the slice. A slice is very simple: it's just a set of sentences you want to analyze. WordSeer can make completely arbitrary slices, but it's more useful to think of slices as the results of searches and filters.
Examples of slices are
A slice is an abstract concept. It's just a set of sentences you want to analyze. In WordSeer, data takes concrete visual representations through Panels that contain visualizations (this chapter) and overviews (this chapter).
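To make the idea of a slice concrete, here is a minimal Python sketch. It is not WordSeer's own code: the `Sentence` class, the `make_slice` function, and the four-sentence collection are all hypothetical and exist only to show a slice as "the sentences left over after a search plus some filters".

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    text: str
    play: str
    act: int
    speaker: str


# A tiny hand-made collection, for illustration only.
COLLECTION = [
    Sentence("Good night, sweet prince.", "Hamlet", 5, "Horatio"),
    Sentence("The rest is silence.", "Hamlet", 5, "Hamlet"),
    Sentence("Frailty, thy name is woman!", "Hamlet", 1, "Hamlet"),
    Sentence("Beware the ides of March.", "Julius Caesar", 1, "Soothsayer"),
]


def tokenize(text):
    """Lower-case the text and split it into words (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def make_slice(sentences, word=None, **metadata):
    """Keep sentences that contain `word` (if given) and match every metadata value."""
    result = []
    for s in sentences:
        if word is not None and word.lower() not in tokenize(s.text):
            continue
        if all(getattr(s, key) == value for key, value in metadata.items()):
            result.append(s)
    return result


# The slice: sentences containing "good" in Act 5 of Hamlet.
good_in_act5 = make_slice(COLLECTION, word="good", play="Hamlet", act=5)
# The slice: all sentences in Hamlet.
all_hamlet = make_slice(COLLECTION, play="Hamlet")
```

Whatever a panel displays, it starts from a list like `good_in_act5` or `all_hamlet`.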
Figure 3: A panel showing the List of Search Results visualization for the slice: sentences containing the word "good" in Act 5 of Hamlet.
Figure 5: A panel showing the Document Viewer for the slice: all sentences in Julius Caesar.
Panels are like windows on your desktop. They display information derived from a slice. I say "derived from" because the information can be something as simple as a list of the sentences in the slice, as in Figures 3 and 4, or a document, as in Figure 5, or something as complex as a word tree of all the sentences matching a given term, as in Figure 2.
Figure 6(b) The components of the panel showing the List of Search Results for the slice: all sentences in Hamlet.
WordSeer supports opening many panels side by side (as many as you want, but after two or three, most screens get crowded).
Multiple panels allow you to make side-by-side comparisons of different slices and different visualizations.
For example, in Figure 6(c), I have two panels open, comparing Hamlet's speeches in Act 1 and Act 5. The lists of phrases, nouns, verbs, and adjectives in the two panels are different. Each panel shows the overviews for its slice. In the left-hand panel, I have Hamlet's speeches in Act 1, and in the right-hand panel, I have Hamlet's speeches in Act 5.
Whether you're looking at the whole collection, or just a list of sentences in a smaller slice, overviews give you a sense of how the sentences are distributed across several useful types of categories.
Overviews also double as filters, which allow you to select just the subset of sentences that match each category.
WordSeer has three types of filters/overviews:
These are meaningful categories created from annotations in the input.
For example, WordSeer's Shakespeare instance uses the Internet Shakespeare Editions. These are XML files annotated with Title, Act, Scene, Line and Speaker values for each speech, so WordSeer extracts these annotations and makes them available as filter categories.
Figure 7(a): Metadata categories for the slice: all sentences in Hamlet
Overviews double up as filters. For example, clicking on "Act 2, Scene 2" in Figure 7(a) above narrows the slice in the panel. It makes a new, more specific slice, i.e. all sentences in Act 2, scene 2 of Hamlet. The panel's metadata categories change to reflect the smaller slice, with the result shown below:
Figure 7(b): The resulting metadata categories after clicking on "Act 2, Scene 2" in 7(a). Clicking makes a new, more specific slice: Act 2, Scene 2 of Hamlet. In this slice (which contains 300 sentences) the overview shows that Hamlet says 149 sentences, Polonius 70, the King 23, and so on.
The above example deals with categorical metadata. But what about data types that are more naturally expressed as continuous ranges, such as time? For these data types, the metadata pane shows a different type of overview-filter: a distribution chart.
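The way a metadata overview doubles as a filter can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `overview` and `narrow` functions and the four records are made up, and WordSeer's internals may work quite differently.

```python
from collections import Counter

# Hypothetical records: one dict of metadata per sentence.
sentences = [
    {"act": 2, "scene": 2, "speaker": "Hamlet"},
    {"act": 2, "scene": 2, "speaker": "Hamlet"},
    {"act": 2, "scene": 2, "speaker": "Polonius"},
    {"act": 3, "scene": 1, "speaker": "Ophelia"},
]


def overview(slice_, category):
    """Count how many sentences in the slice fall under each value of a category."""
    return Counter(s[category] for s in slice_)


def narrow(slice_, **values):
    """Clicking a category value: keep only the sentences that match it."""
    return [s for s in slice_ if all(s[k] == v for k, v in values.items())]


# "Click" on Act 2, Scene 2, then recompute the speaker overview for the new slice.
act2_scene2 = narrow(sentences, act=2, scene=2)
speaker_counts = overview(act2_scene2, "speaker")
```

The point of the sketch is the order of operations: narrowing produces a new, smaller slice, and every overview is then recomputed from that slice.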
In our Shakespeare collection, the only numerical data type we have is "line" for the line number within the scene. This is what the overview-filter looks like:
If I want to look at just lines in a particular range, I can drag the handles and click the "filter" button:
In other collections, these sliders might be used to select date ranges or other more meaningful spans.
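For numerical metadata, the two halves of the overview-filter (the distribution chart and the range selection) can be sketched like this in Python. The line numbers, the bin width, and the function names `distribution` and `in_range` are assumptions I made up for the example.

```python
from collections import Counter

# Hypothetical "line" values for the sentences in a slice.
line_numbers = [3, 12, 18, 47, 51, 52, 96, 140]


def distribution(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter((v // bin_width) * bin_width for v in values)


def in_range(values, low, high):
    """Dragging the handles and clicking "filter": keep values in [low, high]."""
    return [v for v in values if low <= v <= high]


histogram = distribution(line_numbers, bin_width=50)
selected = in_range(line_numbers, 10, 60)
```

The same two steps would apply to dates, provided they are stored as something that can be ordered and binned.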
WordSeer gives you a rough overview of the general complexity of the sentences in the slice by computing the sentence length (number of words) and average word length (average number of characters per word) for each sentence.
These are hidden by default because not every scholar is interested in them, but you can see them by clicking the "Language Features" checkbox:
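As a rough guide to what the two language features measure, here is a small Python sketch. The tokenization rule (runs of letters and apostrophes count as words) is my assumption; WordSeer may split words differently and so report slightly different numbers.

```python
import re


def language_features(sentence):
    """Return (sentence length in words, average characters per word)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    sentence_length = len(words)
    if sentence_length == 0:
        return 0, 0.0
    average_word_length = sum(len(w) for w in words) / sentence_length
    return sentence_length, average_word_length


length, avg_word_length = language_features("The rest is silence.")
# length is 4; avg_word_length is (3 + 4 + 2 + 7) / 4 = 4.0
```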
In WordSeer, "phrases" are sequences of two or more words. Every panel in WordSeer shows you an overview of the most frequent phrases in the slice. For example, if we zoom in on Figure 6 -- the panel showing the list of sentences in Hamlet, we see the most frequent phrases in Hamlet.
Figure 8(a) The most frequent 2-word phrases in Hamlet that don't contain stop words.
There are two options:
Figure 8(b): Clicking the checkbox in Figure 8(a) produces this list, the most frequent 2-word phrases, including phrases that contain stop words.
This frequent phrases overview doubles as a filter. For example, if you want to see all 13 occurrences of "good night" in Hamlet, you can click the table row for "good night", producing Figure 9.
Figure 9: Adding a filter for the phrase "good night" to the slice: all sentences in Hamlet creates a new, smaller slice of just 13 sentences. All the other filters in the panel change to reflect the new slice.
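The frequent-phrases overview can be approximated with a short Python sketch. The stop-word list, the example sentences, and the `frequent_phrases` function are illustrative assumptions, and this sketch lets phrases run across punctuation, which WordSeer may or may not do.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "my", "is"}


def frequent_phrases(sentences, n=2, include_stop_words=False):
    """Count every run of n consecutive words, optionally skipping stop words."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(words) - n + 1):
            phrase = words[i:i + n]
            if not include_stop_words and any(w in STOP_WORDS for w in phrase):
                continue
            counts[" ".join(phrase)] += 1
    return counts


sentences = [
    "Good night, sweet prince.",
    "Good night, ladies, good night.",
    "Give me the light.",
]
without_stop = frequent_phrases(sentences)
with_stop = frequent_phrases(sentences, include_stop_words=True)
```

Here `without_stop` plays the role of the default list, and `with_stop` the list you get after ticking the checkbox.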
WordSeer uses a computational linguistics technology called part-of-speech tagging to automatically categorize words into their parts of speech.
As part of the overview of a slice, WordSeer calculates the most frequent nouns, verbs, and adjectives in the slice and displays them in a list. Figure 10 zooms in on the lists in Figure 6. The lists show the most frequent nouns, verbs, and adjectives for all sentences in Hamlet.
There is an option: "group by stem". A stem is a common root from which different word forms are derived. For example, enabling this option would group together "read", "reading", and "reads" under the single label "read", and show the added-up count for all of them.
Just like the list of phrases, these word lists double as filters. For example, clicking on the word "lord" in the Nouns list above in Figure 10 would filter the "Hamlet" slice to just the sentences containing "lord" in "Hamlet".
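To show what "group by stem" does to the counts, here is a minimal Python sketch. The `naive_stem` function is a toy suffix-stripper I wrote for the example, not the stemmer WordSeer uses, and the counts are invented.

```python
from collections import Counter


def naive_stem(word):
    """Toy stemmer (an assumption): strip a trailing "ing" or "s" from longer words."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def group_by_stem(word_counts):
    """Add up the counts of all word forms that share a stem."""
    grouped = Counter()
    for word, count in word_counts.items():
        grouped[naive_stem(word)] += count
    return grouped


# Invented counts for illustration.
counts = Counter({"read": 5, "reading": 3, "reads": 2, "lord": 7})
grouped = group_by_stem(counts)
# "read", "reading" and "reads" collapse into one row with a count of 10.
```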
In the previous section, we saw how filters can be used to select sentences for analysis. In this section, we'll look at how we can select sentences with WordSeer's search capabilities.
Figure 11(a): WordSeer's search box. Pressing "Go" will show a list of all the sentences that contain the word "heaven".
WordSeer has a pretty complicated search box, but for simple searches, you can ignore most of it.
WordSeer has two different kinds of search modes: keyword and grammatical. Keyword search works by matching keywords in the text. You type in words or phrases, and get sentences that match those words or phrases.
To perform a keyword search in WordSeer, type words into the search box, and leave the grammatical relation set to anywhere in the text.
There are many different ways of doing keyword searches:
WordSeer's other search mode, grammatical search, is more complex. It allows you to search over grammatical relationships between words. These relationships are things like "verb subject", "verb object", "adjective modifier", etc. Grammatical search allows you to ask questions like "what are all the adjectives that apply to the word 'man'?" and "what are all the verbs that 'Hamlet' is the agent of?"
The full list of grammatical relationships supported by WordSeer is described in detail here, in the Stanford Dependencies Manual. It explains all the different kinds of relationships available in WordSeer and gives examples of them in sentences.
Grammatical searches are more complex because there are three pieces of information in a grammatical relationship.
Why aren't 2 and 3 interchangeable? Consider the two sentences "Look at the poster display, it's really nice", and "Look at the display poster, it's really nice". In both cases, there's a noun compound relationship between "poster" and "display". However, the different word orders give the compounds slightly different meanings. In the first, "poster display" is a display of posters, and it's the display that is really nice. In the second, "display poster" is a poster for display, and it's the poster that is really nice. Computational linguistics technology represents the two relationships as noun_compound(display, poster) and noun_compound(poster, display).
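One way to picture why the order matters is to treat each grammatical relationship as an ordered triple. The Python sketch below is only an illustration: the `Relation` class, its field names `first` and `second`, and the `matches` function are hypothetical, and I'm assuming the two noun_compound examples above correspond to the two sentences in the order given.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    name: str    # the kind of relationship
    first: str   # the word in the first slot
    second: str  # the word in the second slot


poster_display = Relation("noun_compound", "display", "poster")
display_poster = Relation("noun_compound", "poster", "display")


def matches(relation, name, first=None, second=None):
    """Test a relation against a query; None acts like a blank box and matches any word."""
    return (
        relation.name == name
        and first in (None, relation.first)
        and second in (None, relation.second)
    )


# Same relationship name and same two words, but the triples are not equal.
same = poster_display == display_poster  # False
```

A query that fills in only one slot, such as `matches(r, "noun_compound", first="display")`, accepts the first triple and rejects the second.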
You can activate grammatical search mode using the drop-down menu in the top search bar. Selecting any relationship other than "anywhere in the text" will perform a grammatical search with that relationship. Another search box will appear to the right of the relations menu, so you can specify both words.
Just like keyword search, there are a few ways to do a grammatical search. Here _____ means that you leave the box blank
The Grammatical Search Bar Charts visualization was developed specifically for grammatical search queries. It's like List of Search Results except augmented with bar charts of how many words match the grammatical relationship. Below, Figure 11(c) shows how this visualization can be used to investigate descriptions of facial attributes.
Searching for the "face, eyes, hair [described as] _______" with the Grammatical Search Bar Charts visualization.
The bar charts show how often each of the words appears in a "described as" relationship, as well as the words that describe them. In 11(c), for "face, eyes, hair described as ______", the chart shows that "eyes" is the most commonly described feature, at 83 times.
The list of matching sentences is shown below the chart, with the matching words highlighted. The charts are also interactive. Clicking on a word filters the list of sentences to match that word, as shown below in 11(d).
Figure 11(d): I can filter the list of sentences to just the adjectives "sweet", "heavenly" and "fair" by clicking on the chart on the left-hand side. I click on the bars for "sweet", "heavenly" and "fair" (dark blue), which filters the list of sentences to only those in which "face", "eyes", or "hair" are described as "sweet", "heavenly" or "fair".
The list of search results is the simplest visualization WordSeer offers. It's exactly what it sounds like: a list of the sentences in a slice.
You can either search for a term using the search box, or open up a new blank panel using the "All Sentences" button:
Here is the list of search results for "speaker: Ophelia" "act: 3":
You can sort by line and scene, and filter by any of the words, phrases or metadata categories (collapsed on the side).
Clicking on any of these sentences opens up the sentence in the document viewer:
Clicking on a sentence in the sentence list opens it up in the document viewer and highlights it.
This is another simple visualization that shows documents that correspond to a slice. You can open it up from the top bar by clicking "All Documents"
Or search directly using the search box.
For example, here are all the documents in which "rome" is mentioned, sorted by how many times it's mentioned.
Double-clicking on a document in this listing opens it up in the Document Viewer.
This "visualization" shows the contents of a document. Like the other visualizations, it responds to filters.
For example, here is the text of Romeo and Juliet, filtered to Romeo's lines:
The Word Tree visualization was developed in 2008 by Martin Wattenberg and Fernanda Viegas. It allows for quick overviews and exploration of the contexts in which a word occurs.
You can make a new word tree panel by clicking on the Word Tree button in the top bar:
But if you already have a word in mind, you can create the word tree for it directly from the search bar:
If you don't specify a search term WordSeer will just use the most frequent word in the collection (or slice, if you add filters).
In Shakespeare, this is what you get if you open up a word tree with no search: a Word Tree of the word "good", which is the most frequent content word in the collection (excluding stop words such as "the" and "a").
Figure 12: A word tree with no search terms. In the Shakespeare collection, this produces a tree of the word "good".
Clicking on a branch in the word tree expands it to just that context. The contexts are arranged in order from most to least frequent.
Figure 12(b) Clicking on the branch "the" filters the tree to just the sentences containing "the good". More context is now visible. You can now see that "the good duke", "the good queen", and "the good gods" are common constructions.
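The first level of a word tree is essentially a frequency-sorted list of the words that sit next to the root word. Here's a simplified Python sketch of that one step. It is not the Word Tree algorithm itself, and the `branches` function and the example sentences are my own assumptions.

```python
import re
from collections import Counter


def branches(sentences, root, side="right"):
    """List the words immediately left or right of `root`, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(words):
            if word != root:
                continue
            j = i + 1 if side == "right" else i - 1
            if 0 <= j < len(words):
                counts[words[j]] += 1
    return counts.most_common()


# Invented sentences, for illustration only.
sentences = [
    "The good duke is here.",
    "The good queen spoke.",
    "A good man.",
    "Be good.",
]
left = branches(sentences, "good", side="left")
right = branches(sentences, "good", side="right")
```

Expanding a branch amounts to keeping only the sentences that contain that branch (for example "the good") and repeating the same count one word further out.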
The branches in grey are individual sentences. Hovering over a sentence branch shows a popup containing more information about the sentence, and the option to open up the document viewer for the play at that sentence:
Figure 12(c) Hovering over an individual sentence (grey text) shows a popup with the full sentence and any associated metadata. Here, we see that this particular sentence is from "The Life of King Henry the Eighth Act 1, Scene 1, line 51" and is said by Norfolk.
Adding filters will change the visualization to reflect the most frequent word in the filtered set.
For example, if we filter to just the speaker "Romeo",
we'll get a word tree of the most frequent word said by Romeo, which, appropriately enough, happens to be "love":
Figure 12(c): Word Trees -- just like overviews -- respond to filters. When we filter to "speaker: Romeo", we get a word tree of the most frequent word in those sentences, which happens to be "love".
If we specify a search term, the Word Tree's word is fixed to that term, even if it isn't the most frequent one in the slice.
For example, Figure 12(c) shows that the most frequent word in Romeo's speeches is "love" -- but what if we want to investigate "love" in other plays?
We can start by typing in "love" in the search bar and selecting the "Word Tree" visualization:
This fixes "love" as the center term in the Word Tree:
Figure 12(d) The word tree for "love" across all of Shakespeare's plays.
We can now apply other filters. For example, 12(c) shows that "love" occurs 130 times in "Two Gentlemen of Verona". To investigate further, we can click to filter to just that play, and explore the tree. Figure 12(e) below shows the tree for "love" in the play, with the "in love" branch expanded.
Figure 12(e): The word tree for "love" in "Two Gentlemen of Verona", with the "in love" branch currently expanded to show all 18 sentences.
If, while exploring a word tree branch, you want to make a new word tree centered around that branch, use the Word Menu.
Word Frequency charts are great for comparing the frequencies of words across different categories. With them, you can answer questions like "How does Gertrude's involvement in the events of Hamlet change over the course of the play?" and "Which are the characters, plays, and scenes that mention 'love' the most?"
You can make a new word frequency graph panel by clicking on the top menu bar:
Or, if you have a search in mind, by typing it in to the search bar:
Sometimes, we're interested in how a particular category relates to other categories. Word Frequency graphs can help investigate such relationships. For example, how are lines by Gertrude (one category of sentences) spread out over the different acts of Hamlet (another category of sentences)?
To examine Gertrude's involvement in the play, we make a new Word Frequencies graph and filter it to just "speaker: Gertrude". Her pattern of involvement is immediately clear. It rises in Acts 1 and 2, peaks in Act 3, and falls in Acts 4 and 5.
Figure 13(a) Frequency graphs for Gertrude's lines across different metadata categories.
Scrolling down, we can see that her activity is concentrated in Act 3, Scene 4, and Act 4, Scenes 5 and 7.
The "Normalize" option switches between showing raw counts and showing percentages. For example, above, we see that 2% of the lines in Act 3, Scene 4 are by Gertrude.
The same chart without normalization looks slightly different, because the scenes are all of different lengths. Act 4, Scene 7 is a shorter scene, so even though there are just 4 sentences, Gertrude is proportionally more involved than in Act 5, Scene 1 and Act 5, Scene 2, which have 7 counts. Sometimes normalization is the right choice, but other times it may not make sense.
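The arithmetic behind "Normalize" is simple enough to sketch in Python. The scene names and numbers below are invented to show the effect (they are not the real Hamlet counts), and `normalize` is a hypothetical helper, not a WordSeer function.

```python
def normalize(match_counts, total_counts):
    """Convert raw match counts to percentages of each category's total sentences."""
    return {
        category: 100.0 * match_counts.get(category, 0) / total
        for category, total in total_counts.items()
        if total
    }


# Invented numbers: a short scene with few matches, a long scene with more.
matches = {"short scene": 4, "long scene": 7}
totals = {"short scene": 40, "long scene": 350}
percent = normalize(matches, totals)
# By raw count the long scene wins (7 vs 4); by percentage the short scene does.
```

This is why the normalized and raw charts can rank the same scenes differently.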
Word Frequency Graphs are interactive. Clicking on a bar, or selecting a range (in the case of continuous values) filters the other graphs to show only matches in that range.
For example, if we wanted to repeat the analysis of Gertrude's involvement in the play for every character in Hamlet, we could take advantage of this filtering function.
First, we could create a word frequency plot for play:"Hamlet" and hide all but the "speaker" and "act_title" graphs:
Then, by clicking the speakers one by one, we see their involvement reflected in the top graph. For example, clicking on "Horatio" reveals his pattern:
Figure 13(b) Horatio's involvement in the play.
Figure 13(c) Filtering works the other way too, for example, in Act 4, the King and Polonius have many lines.
Word Frequency graphs also do what they're named for: show word frequencies across categories. You can search for multiple words, and either stack the charts or group them:
For example, Figure 13(d) shows the incidence of the death-related words "dead", "kill", and "die" across all of Shakespeare's plays, split by act. Perhaps unsurprisingly, the later in the plays we progress, the more frequent these words become.
Figure 13(d) The incidence of the words "dead", "kill", and "die" across all of Shakespeare's plays, split by act. Perhaps unsurprisingly, the later in the plays we progress, the more frequent these death-related words become.
Figure 13(e): Filtering to just "Romeo and Juliet" and un-clicking the "Stacked" checkbox reveals the pattern for just that play.
One of the most powerful ways to get around WordSeer is the Word Menu. When you see something interesting in a visualization, the word menu often gives you a way to follow up on that thought by creating a new visualization, adding something to a group, or exploring related ideas.
Figure 14(a) The word menu for "father"
The search options in the word menu allow you to click on a word and search for it in a new visualization. More importantly though, they show you the different relations in which that word appears.
Figure 14(b): The Word Menu for "father" showing the different ways in which "father" is used, and the number of times each one appears in the collection. In this example, we can see the predominant ways in which father is described by examining the "adjectival modifier" search option. Fathers in Shakespeare are "good", "dear", "noble", "ghostly", "royal" and "sweet".
Clicking on any of these options does a grammatical search for that relationship.
WordSeer also calculates co-occurrence relationships between words, which you can access through the "Related Words" option:
This option pops up two different windows, one of words that occur nearby, and another of words that occur in similar contexts.
The "Nearby Words" option displays words that occur in the same sentences as the clicked-on word. These words are sensitive to the slice:
For example, if we search for "father" and look at the nearby words, those are filtered to the words that occur near "son" in sentences that contain "father".
Clicking any word in this list brings up the option to see where the words co-occur. Clicking on the "show co-occurrences" button above would bring up a List of Search Results pane showing sentences where "son" and "daughter" co-occur.
If your collection is small enough, WordSeer also computes words that occur in similar contexts. When we click Related Words for "son" in Shakespeare, we get the following popup in addition to the nearby words:
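Since "Nearby Words" is defined as words that occur in the same sentences as the clicked-on word, the idea can be sketched in Python as below. The sentences and the functions `nearby_words` and `co_occurrences` are assumptions for illustration; WordSeer may weight or rank its results differently.

```python
import re
from collections import Counter


def words_in(sentence):
    """The set of distinct lower-cased words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def nearby_words(sentences, target):
    """Count, per word, how many sentences it shares with `target`."""
    counts = Counter()
    for sentence in sentences:
        words = words_in(sentence)
        if target in words:
            counts.update(words - {target})
    return counts


def co_occurrences(sentences, first, second):
    """The sentences in which both words appear."""
    return [s for s in sentences if {first, second} <= words_in(s)]


# Invented sentences, for illustration only.
sentences = [
    "My father and his son.",
    "The son loves his daughter.",
    "A father has a son.",
]
near_father = nearby_words(sentences, "father")
both = co_occurrences(sentences, "son", "daughter")
```

Because the counts are computed from whatever sentences are passed in, running the same function on a smaller slice gives slice-specific nearby words.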
Words appear in a lot of different places in WordSeer -- lists of frequent words, lists of nearby words, in document views, in sentence popups, and in the list of sentences. If a word turns blue when you hover over it, clicking or right-clicking on it will make a word menu.
The only exceptions are the word trees and the overviews, in which clicks have a different meaning. The lists of most frequent nouns, verbs, and adjectives in a slice double as filters: clicking a word filters the slice to just the sentences containing that word. In the word tree, clicking a branch filters the word tree to just sentences matching that branch. In both these places, to make a word menu, you have to right-click instead.
When we put a collection into WordSeer, it automatically extracts any annotations in the text and turns them into metadata filters. While these are useful, they're not always enough. Sometimes, we want to define our own units of analysis.
For example, consider the question, "How does the treatment of love in Shakespeare vary between the comedies and tragedies"? Here, "comedies" and "tragedies" are units of analysis that don't come pre-defined.
Another example of such a question is "What are the different characteristics of speeches by male and female speakers?" Here, our units of analysis are "speeches by male speakers" and "speeches by female speakers" -- we don't have those as pre-defined categories either.
Yet another is, "How do concepts of emotion correlate with mentions of people in power -- how often do emotions like 'anger', 'sadness', 'joy', 'hate' correlate with different kinds of people in power?" This is more complex. We want to look at "the sentences mentioning different types of people in power" and correlate them with "sentences mentioning different types of emotion".
In this section, I'll show how WordSeer's Document Sets, Sentence Sets, and Word Sets features can help conduct exactly these types of analyses.
Sets aren't just for comparison -- once you make a set, it persists, so you don't lose it. You can use sets to collect interesting things to look at, or to make conceptual groupings for your own understanding.
Document sets are most useful when you're looking at gathering certain types of documents together. You make document sets by searching and filtering in the Documents with Matches view. To make a document set, just select some documents, and click "Add to Group".
For example, if we wanted to follow up on the question of "How does the treatment of love vary between the comedies and tragedies", we could do that in the following way. First, collect all the comedies:
Then, add them to a group by clicking the "Add to group" button:
Name the new group "comedies". This creates a new group: "comedies". We can do the same for Tragedies, and the Document Sets overview now shows two sets, "comedies" and "tragedies"
We can now use these sets as filters, because they appear in the metadata overview:
So, now if we wanted to examine the treatment of "love" across the two sets of documents, we could do a word frequencies comparison. The word frequencies chart automatically uses the new document set categories.
Figure 15(a) Comparison of the normalized word frequencies of "in love" over the document sets "comedies" and "tragedies". We see that about 0.4% of the sentences in the comedies mention "in love", whereas less than half that, around 0.1% of the sentences in the tragedies do the same.
You can put sentences into sets from the List of Search Results and Grammatical Search Bar Charts views. Click the checkboxes next to the sentences you want, and then add them to the set.
For example, let's collect speeches by female speakers in "The Merchant of Venice" into a sentence set. First, narrow down the list of sentences to just that play.
Use the auto-suggest box to quickly select "Merchant of Venice".
Here, I've just finished adding the 240 sentences spoken by Portia to the set, and I'm about to add 50 by Nerissa:
After adding all the women's sentences, I get 338 sentences. After doing the same for the men, and ignoring characters with less than 5 sentences, I get
Now I can begin comparing them. I open up two panes, and look at the word frequencies across the acts for the two sets:
The lists of frequent words in the two panels are all slightly different from each other, and the characters' patterns of involvement in the play are also very different.
Word Sets are just collections of words. When they act as filters, though, they match all the sentences that contain a word.
Word sets can be used as search terms in the search box. Instead of typing in a long list of words, you can just use the word set instead.
The search will match any of the words in the set.
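The "match any of the words" behaviour of a word set can be sketched in Python like this. The `ROYALS` set, the sentences, and the `matches_word_set` function are illustrative assumptions rather than anything taken from WordSeer.

```python
import re

# A hypothetical word set.
ROYALS = {"king", "queen", "prince", "lord"}


def matches_word_set(sentence, word_set):
    """True if the sentence contains at least one word from the set."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return not words.isdisjoint(word_set)


# Invented sentences, for illustration only.
sentences = [
    "Good night, sweet prince.",
    "The rest is silence.",
    "Long live the king!",
]
royal_sentences = [s for s in sentences if matches_word_set(s, ROYALS)]
```

Used as a filter, the set keeps the first and third sentences; used as a search term, it stands in for typing "king", "queen", "prince" and "lord" one after another.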
There are two ways to make and edit word sets in WordSeer, through the Word Menu, and through the Word Sets panel.
As explained in the Word Menu section, clicking or right-clicking on a word almost anywhere in WordSeer opens up the Word Menu.
The word menu has options both to add the word to a word set (you can add it to a new one if you don't have existing ones) and to edit existing word sets:
If you make a new set, it'll automatically be named after the word, and you can add more words to it using the word menu:
Here, I'm adding "king" to the "lord" word set.
If you want to add words directly, you can use the "Edit word set" option:
This brings up the word set in a free-floating window:
Double click anywhere in the window to start editing. I'm going to add some more royal words.
Press "OK" to save, or "Cancel" to discard your edits.
I'm also going to rename this set to "royals". You can do this by clicking on the title:
You can also make and manage your Word Sets with the "Word Sets" overview:
Clicking "New" creates a new set, and "Delete" deletes the selected set. Double click to open the word set up in a window, or to rename it. Here, I'm creating a new set and naming it "god/supernatural":
Right click or double click on an entry to rename it. Double clicking opens it up in a window, for editing.
In this set, I put "god, almighty, heaven, and spirit".
I can now compare how some emotion-related words co-occur with the two categories
For this, I simply do two searches in the word frequency graph and compare the split across categories. As expected, the comedies have more happiness and the tragedies have more anger, but the comparison between royals and supernaturals is interesting:
It appears, in fact, that the "royals" words are much less associated with the happy search (orange) than the "god/supernatural" words. Only 0.61% of the royal sentences have "happy" words, whereas twice as many (proportionally speaking) of the "god/supernatural" sentences have "happy" words.
You can save and export everything you see in WordSeer into images and data files for use in other places. This includes all the graphs, visualizations, tables of data, and lists of words and sentences.
Every single WordSeer table has a tiny "Save" button on its top left corner:
For image-based visualizations, such as the Grammatical Search Bar Charts, Word Trees, and Word Frequencies, click on the save button at the top of the panel to generate download links to each of the visualizations as an image.
If you want to save the image in a filtered state, just click the save button after performing your operations -- the images generated always reflect the current state of the chart.
WordSeer saves your history onto your computer, so don't worry if you accidentally close the page. When you open it up again, your history will be available to you from the pane on the left. Just click a row to open up the panel again. Your history won't be available if you use a different computer or a different username though.