Short Training Guide for ConText
Jana Diesner, Sandra Franco, Ming Jiang, and Chieh-Li Chin
This guide provides a brief introduction to analyzing text data in ConText (http://context.ischool.illinois.edu/).
What is ConText?
We developed ConText to facilitate the construction of network data from natural language text using different techniques. This process requires several Natural Language Processing (NLP) techniques, which are also available in ConText.
Data
For illustrative purposes, we use free and open sample text data about the Enron case, which can be downloaded from the ConText webpage[1]. The dataset is from the Wikipedia article about the collapse of Enron[2], which we partitioned into several short text files.
ConText basic principles
- Each analysis process is organized into data, settings, results.
- Step 1: Data: provided by the user. ConText does not currently facilitate the collection of any data.
- Step 2: Settings: Any choices that need to be made.
- Step 3: Results: Typically a screen output, link to output directory on disk, and option to rerun analyses with modified settings.
- Input accepted: .txt files.
- Results are not overwritten.
- Modified data become new data sets.
Organization of Manual
This manual is organized from general to more specific analysis, and from common to more specialized techniques.
- Summarization - gain a quick understanding and overview of the data
- Text Pre-processing - prepare the data for further analysis
- Network Construction - extract network data from text data
- Codebook Construction - a facilitating step for the previous point
Summarization
Topic Modeling
What is topic modeling?
Topic modeling identifies the main themes represented in a text corpus. Each theme is represented as a vector of words, which are sorted based on their strength of association with a topic. This technique is probabilistic; i.e. the results might differ from run to run.
How is it done?
Topic Modeling is implemented based on Latent Dirichlet Allocation (LDA); we leverage Mallet for this purpose (http://mallet.cs.umass.edu/topics.php).
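For readers who want to see the underlying idea in code, here is a minimal sketch (not ConText’s implementation, which calls Mallet): it uses scikit-learn’s LDA on a few toy documents to show how the number of topics, words per topic, and iterations interact. The library choice and the toy documents are assumptions made for illustration only.

```python
# Minimal LDA sketch (illustrative only; ConText itself calls Mallet).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "enron stock trading energy company earnings",
    "andersen accounting audit document shredding",
    "dynegy merger deal collapse bankruptcy",
]

vectorizer = CountVectorizer()           # build the document-term matrix
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,                      # "Number of Topics"
    max_iter=100,                        # roughly "Number of Iterations"
    random_state=0,                      # fix a seed for repeatable results
).fit(dtm)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:5]]   # "Number of Words per Topic"
    print(f"Topic {k}: {', '.join(top)}")
```

Because LDA is probabilistic, fixing a random seed (as above) makes a run repeatable; ConText’s results may still differ from run to run.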

- Step 1: Click “Browse” to select input text data and click “Next”.

- Step 2: Configuration option, “Topics”
- “Number of Topics”: Start with a low number, e.g., 5 to 10, for small or very homogeneous datasets. For larger or more topically diverse datasets, a higher number, e.g., 20, is suggested.
- “Number of Words per Topic”: the number of words per topic to output. They are sorted from highest to lowest average fit.
- “Number of Iterations”: the number of sampling iterations the algorithm runs to find representative words for each topic.
- Step 2: Configuration option, “Stopword List”
- ConText offers Mallet’s stop word list for English. It is located in your installation folder at “~/ConText-public/data/Stoplists/stop.txt”. These words will be excluded from consideration. You can also define and use your own stop word list.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.

- The next interface provides four options that can be selected.

- When selecting “Open top words list”, the interface will show a table with as many topics as the user selected, the average fit of each topic to the data (topics are automatically sorted by decreasing fit), and the words most representative of each topic. Topics are not labeled; defining a label and interpreting a topic is up to the user (Chang et al., 2009).

- When selecting “Open topic”, the interface will show the distribution of topics over documents. For example, in the document “accounthing_other” (see the screenshot below), Topic 1, which is represented by the words “stock, enron, company, trading, energy, companies and earnings” (see the screenshot above), accounts for the largest share, followed by Topic 5 and Topic 2; this means the file mainly discusses trading and the finances of the Enron company. The value of “Fit to Topic 3” is zero, which indicates that the file does not mention anything related to the Dynegy Corporation.

- “Locate output directory” opens the output folder directly. “Re-run..” runs the routine again.

Sentiment Analysis
What is sentiment analysis?
Sentiment analysis is a technique that identifies the emotional extent of text data, and identifies which terms, phrases or text portions match which sentiment categories. Common categories include positive, negative and neutral sentiment (Shanahan et al., 2006).
How is it done?
In ConText, a widely used, predefined sentiment lexicon maps terms in the text to the categories positive, negative, and neutral (Wilson et al., 2005). Users can modify this lexicon (see, for example, Diesner & Evans, 2015), or use their own lexicon with categories of their choice. The predefined lexicon disambiguates terms based on part of speech (e.g., the word “cost” can be neutral as a verb but negative as a noun).
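The following minimal sketch illustrates the general idea of lexicon-based tagging with part-of-speech disambiguation; the lexicon entries, function name, and data structures are made up for illustration and are not the Wilson et al. (2005) lexicon or ConText’s implementation.

```python
# Toy lexicon-based sentiment tagging with POS disambiguation (illustrative only).
from collections import Counter

# Hypothetical lexicon: (word, part of speech) -> sentiment category.
LEXICON = {
    ("cost", "noun"): "negative",
    ("cost", "verb"): "neutral",
    ("profit", "noun"): "positive",
}

def tag_sentiment(tagged_tokens):
    """Count sentiment-encoded (word, POS) pairs found in already-tagged tokens."""
    counts = Counter()
    for word, pos in tagged_tokens:
        sentiment = LEXICON.get((word.lower(), pos))
        if sentiment is not None:
            counts[(word.lower(), pos, sentiment)] += 1
    return counts

# Example: tokens that have already been tagged with their part of speech.
print(tag_sentiment([("cost", "noun"), ("cost", "verb"), ("profit", "noun")]))
```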

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: “Configuration interface”: click “Browse” to select a sentiment dictionary (see explanation above). Click “Next”.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- Click “Open tabular result”: the interface will show a table with the sentiment-encoded words, their part of speech, sentiment, and cumulative frequency in the selected corpus. The table can be sorted by any column by clicking on the column header; clicking “Frequency”, for example, sorts the sentiment words by their occurrence frequency.

Corpus Statistics
What are corpus statistics?
An overview of (weighted) word frequencies and of the distribution of words across texts.
How is it done?
An exercise in counting.

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: Configuration option, “Results include POS”: if checked, words are disambiguated based on their part of speech.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- “Open tabular result”: table with the following:
- “Term”: each word that appears in the text
- “Frequency”: cumulative frequency of the word across the text
- “TF*IDF”: natural log of term frequency times inverse document frequency; low for noise terms, high for terms that are highly descriptive of the dataset (a small computation sketch follows at the end of this section)
- “Ratio of texts”: percentage of texts that the word occurs in
- “Part of speech”: grammatical function per word
The screenshot below shows the result with “Results include POS” checked; the word “worth” appears 5 times as JJ (adjective) and 2 times as IN (preposition or subordinating conjunction).

The screenshot below shows the result without “Results include POS” checked; the word “worth” appears a total of 7 times.
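For readers who want to see how such a table can be computed, here is a minimal sketch of one common reading of “natural log of term frequency times inverse document frequency”; the exact weighting ConText applies may differ, so treat the formula below as an assumption.

```python
# One common TF*IDF variant (illustrative; ConText's exact formula may differ).
import math
from collections import Counter

docs = [
    "enron stock trading stock".split(),
    "enron accounting audit".split(),
    "dynegy merger".split(),
]

n_docs = len(docs)
doc_freq = Counter()                 # number of documents each term occurs in
for doc in docs:
    doc_freq.update(set(doc))

term_freq = Counter()                # cumulative "Frequency" across the corpus
for doc in docs:
    term_freq.update(doc)

for term, tf in term_freq.most_common():
    idf = math.log(n_docs / doc_freq[term])   # inverse document frequency
    tfidf = math.log(1 + tf) * idf            # log-scaled term frequency times IDF
    ratio = doc_freq[term] / n_docs           # "Ratio of texts"
    print(f"{term}\t{tf}\t{tfidf:.3f}\t{ratio:.0%}")
```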

Visualization
What is visualized?
The most buzzword-laden technique in ConText. It displays the outcome of topic modeling (one cluster per topic) and sentiment analysis (color coded: red = negative, green = positive, blue = neutral, black = no sentiment encoded).

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: Configuration options:
- “Topic Modeling Properties”: consult the section on topic modeling
- “Sentiment Analysis”: consult the section on sentiment analysis
- “Cloud Visual Properties”:
- “single”: a word cloud (wordle) of the topic modeling results only
- “clustered”: each cloud represents one topic. This is a D3 file, which the user can configure as they please.
- “Width”, “Height” and “Font Size”: these parameters set the size of the word cloud and the font size of the words displayed in the clustered cloud (recommended: Width 1200, Height 1200, Font Size 12).

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- Click “Open Word Cloud”: the interface will show the visualization results; click “Locate output directory” to open the output folder directly; “Re-run..” runs the routine again.
- Wordle (single word cloud): the size of a word indicates its importance within the text, computed by frequency. Large words such as “Enron” and “company” occur more frequently in the text than smaller words like “Dynegy” and “trading”. The color of the words is just for visual appeal.

- Clustered: word color represents sentiment polarity (green = positive, red = negative, blue = neutral). The words within each box represent one topic. In addition, the more prevalent a topic, the more central and the yellower its box. From the screenshot below, we can see that the two most prominent topics are a) the finances of Enron and b) the cooperation between Enron and Andersen.
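A comparable single (frequency-based) word cloud can also be generated outside ConText with the third-party Python package wordcloud; the sketch below is an analogy for illustration, not ConText’s own rendering, and the input file name is hypothetical.

```python
# Frequency-based word cloud, analogous to the "single" option (illustrative only).
from wordcloud import WordCloud      # third-party package: pip install wordcloud

# Hypothetical input file; any plain-text file from the corpus would do.
text = open("sample.txt", encoding="utf-8").read()

cloud = WordCloud(
    width=1200,                      # recommended width
    height=1200,                     # recommended height
    background_color="white",
).generate(text)                     # word size is driven by word frequency

cloud.to_file("word_cloud.png")      # write the rendered cloud to disk
```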

Text Pre-processing
Parts of Speech (POS) Tagging
What does it do?
Identify the POS per word.
How does it work?
Based on a probabilistic model (http://nlp.stanford.edu/software/tagger.shtml).
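As an illustration of what a probabilistic POS tagger produces, the sketch below uses NLTK’s Penn Treebank tagger as an assumed stand-in; ConText itself ships the Stanford tagger.

```python
# POS tagging sketch with NLTK (an illustrative stand-in for the Stanford tagger).
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model

tokens = "Enron stock was worth far less than reported .".split()
print(nltk.pos_tag(tokens))
# e.g. [('Enron', 'NNP'), ('stock', 'NN'), ('was', 'VBD'), ('worth', 'JJ'), ...]
```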

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: Configuration option, select the language of the input data and click “Next”.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- “Open tabular result”: the interface will show a table with each word, its POS tag, and its cumulative frequency across the corpus. The result shown in the screenshot below is sorted by “Frequency”.

Remove stop words
What is it?
This routine eliminates function words like “the” and “a” from text data. As the Corpus Statistics computed before stop word removal show (see the screenshot above), the most common words are often stop words.
How does it work?
A predefined delete list is applied to the text data.
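A minimal sketch of this filter, assuming a one-item-per-line stop word file and covering both the “Drop” and “Drop and Insert Placeholder” modes described in Step 2 below, could look like this (illustrative only, not ConText’s code):

```python
# Stop word removal as a simple filter against a delete list (illustrative only).

def load_stoplist(path):
    """Read a stop word list with one item per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stoplist, placeholder=None):
    """Drop stop words, or substitute a placeholder when one is given."""
    kept = []
    for token in tokens:
        if token.lower() in stoplist:
            if placeholder is not None:          # "Drop and Insert Placeholder"
                kept.append(placeholder)
        else:                                    # "Drop"
            kept.append(token)
    return kept

# Example with a tiny in-memory stop list instead of stop.txt.
print(remove_stopwords("the collapse of Enron".split(), {"the", "of"}))
```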

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: Configuration option, click “Browse” to select the txt file that lists the stop words to be eliminated. The default stop word list is located at “~/ConText-public/data/Stoplists/stop.txt”. The list can be modified, and users can also point to their own list. The list needs to be a text file with one stop list item per line.
- “Drop”: simply removes the stop word.
- “Drop and Insert Placeholder”: replaces each stop word with a placeholder.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- “Open modified data”: the interface will show the text of each file after removing stop words.
- The data after stop word removal are also stored and available as input for further analysis (e.g., Corpus Statistics, Bigram Detection etc.).
Stemming
What is it?
This routine converts every word to its base form (stem or lemma).
How does it work?
Based on Stanford NLP for stemming and lemmatization (http://nlp.stanford.edu/software/corenlp.shtml).
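The effect can be illustrated with NLTK’s WordNet lemmatizer as an assumed substitute for Stanford CoreNLP (a sketch, not ConText’s implementation):

```python
# Lemmatization sketch with NLTK's WordNet lemmatizer (illustrative substitute).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)             # lexical database used by the lemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("made", pos="v"))     # -> make
print(lemmatizer.lemmatize("booking", pos="v"))  # -> book
```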

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: Click “Browse” to select the place to store output data and click “Run”.
- Click “Open modified data”: the interface will show the text of each file after stemming.

- Click “Locate output directory” to open the output folder directly; “Re-run..” runs the routine again. The result is shown below, where we can see that the words “made” and “booking” in the initial text are stemmed to “make” and “book” in the processed data.
Bigram Detection
What is it?
An overview of bigram (i.e., two adjacent words) frequencies across the text.
How does it work?
An exercise in counting.
(Recommended: run this routine after removing stop words.)
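Since bigram detection is literally counting adjacent word pairs, the computation can be sketched as follows (illustrative only; the toy documents are made up):

```python
# Count adjacent word pairs (bigrams) across a corpus (illustrative only).
from collections import Counter

docs = [
    "enron stock price enron stock price".split(),
    "arthur andersen audit".split(),
]

bigrams = Counter()
for tokens in docs:
    bigrams.update(zip(tokens, tokens[1:]))      # adjacent pairs within each document

for (w1, w2), freq in bigrams.most_common(5):
    print(f"{w1} {w2}\t{freq}")
```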

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: Click “Browse” to select the place to store output data and click “Run”.
- Select any option to view the results or re-run the routine.


- Click “Locate output directory” to open the output folder directly.
- Click “Re-run..” to run the routine again.
Network Analysis
Network Construction from Entity Types
What is it?
Builds a network of named entities (https://en.wikipedia.org/wiki/Named-entity_recognition), such as organizations, persons, and locations, across the text based on entity type.
How does it work?

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: In the interface “Entity Typed Based Network: Configuration”, select “Aggregation”, enter a “Distance” (i.e., the maximum number of words between two recognized entities), select a “Unit of Analysis”, and check the “Entity Types” for the kinds of relationships you want to extract from the initial data. The detailed settings are shown below; a sketch of the window-based co-occurrence logic appears at the end of this section. Finally, click “Next”.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- Select any option to view the results or re-run the routine.

- Click “Open tabular result”: the interface will show the edge table, in which each row represents one edge with a source node (label), a target node (label), and an edge weight (i.e., the occurrence frequency of the edge within the text).

- Click “Open modified data”: the interface will show the initial text of each file with entity types marked for the recognized words/phrases.

- Click “Locate output directory” to open the output folder directly. In the output folder, the file “EntityNetwork.graphml” contains the entity-based network, which can be opened with Gephi (http://gephi.github.io/).

- Click “Re-run..” to run the routine again.
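As mentioned in Step 2 above, the window-based co-occurrence logic behind the “Distance” setting can be sketched as follows; the entity positions and the distance value are made up for illustration, and this is not ConText’s actual implementation.

```python
# Window-based co-occurrence: link two recognized entities if they appear
# within "Distance" words of each other (illustrative only).
from collections import Counter

# Hypothetical entity recognition output: (token position, entity label).
entities = [(0, "Enron"), (4, "Dynegy"), (6, "Houston"), (20, "Arthur Andersen")]
DISTANCE = 5                                     # maximum number of words between two entities

edges = Counter()
for i, (pos_a, ent_a) in enumerate(entities):
    for pos_b, ent_b in entities[i + 1:]:
        if pos_b - pos_a <= DISTANCE:            # within the co-occurrence window
            edges[(ent_a, ent_b)] += 1           # edge weight = co-occurrence count

for (source, target), weight in edges.items():
    print(f"{source} -> {target}\t{weight}")
```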
Network Construction from Codebook
What is it?
Builds a network across the text based on the words/phrases listed in the codebook.
How does it work?

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: In the interface “Co-Occurrence based Network: Configuration”, click “Browse” to select the “codebook.csv” file; select “Network Type”, “Aggregation”, “Distance” (i.e., the maximum number of words between two identified words/phrases listed in the codebook), and “Unit of Analysis” as in the following screenshot. Finally, click “Next”.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- Select any option to view the results or re-run the routine.

- Click “Open tabular result”: the interface will show the edge table, in which each row represents one edge with a source node, a target node, and an edge weight (i.e., the occurrence frequency of the edge within the text).

- Click “Open modified data”: the interface will show the same results as “Codebook Application”.

- Click “Locate output directory” to open the output folder directly. In the output folder, the file “CorpusNetwork.gexf” contains the codebook-based network, which can be opened with Gephi (http://gephi.github.io/).

- Click “Re-run..” to run the routine again.
Codebook Construction
Entity Detection
What is it?
Extracts words/phrases that represent named entities.
How does it work?

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: In the interface of “Configuration”, click “Next”.
- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- Select any option to view the results or re-run the routine.

- Click “Open tabular result”: the interface will show a list of words/phrases with the corresponding entity type as well as the occurrence frequency.

- Click “Locate output directory” to open the output folder directly.
- Click “Re-run..” to run the routine again.
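An analogous entity extraction step can be illustrated with spaCy as an assumed substitute for ConText’s built-in entity detection; the model name and the example sentence are assumptions made for illustration.

```python
# Named-entity extraction sketch with spaCy (illustrative substitute).
from collections import Counter
import spacy

# Small English model; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Enron hired Arthur Andersen as its auditor in Houston.")

counts = Counter((ent.text, ent.label_) for ent in doc.ents)
for (phrase, label), freq in counts.items():
    print(f"{phrase}\t{label}\t{freq}")          # word/phrase, entity type, frequency
```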
Codebook Application
What is it?
Replaces key words/phrases with uniform and complete expressions across the text.
How does it work?
Detects words/phrases listed in the prepared codebook and replaces them with the terms you set in the codebook.

- Create the codebook “codebook.csv” based on the output of “Entity Detection”, “Corpus Statistics” and “Bigram Detection”. The file “codebook.csv” contains three columns: column A lists the initial words/phrases from the input data, column B lists the revised words/phrases, and column C lists the corresponding entity type (a sketch of how such a codebook can be applied follows after this list).
- Entity types: person, organization, location, resource, task, event, information, time, attribute, and date.
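As mentioned above, here is a minimal sketch of how such a three-column codebook could be applied; the CSV handling and the case-insensitive replacement strategy are assumptions, not ConText’s exact normalization logic.

```python
# Apply a three-column codebook: replace initial words/phrases (column A)
# with their revised forms (column B). Illustrative only.
import csv
import re

def load_codebook(path):
    """Read codebook rows as (initial phrase, revised phrase, entity type)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1], row[2]) for row in csv.reader(f) if row]

def apply_codebook(text, codebook):
    """Replace each initial phrase with its revised form, longest phrases first."""
    for initial, revised, _entity_type in sorted(codebook, key=lambda r: -len(r[0])):
        text = re.sub(re.escape(initial), revised, text, flags=re.IGNORECASE)
    return text

# Example with an in-memory codebook instead of codebook.csv.
codebook = [("Arthur Andersen", "arthur_andersen", "organization"),
            ("Enron Corp", "enron", "organization")]
print(apply_codebook("Enron Corp hired Arthur Andersen.", codebook))
```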

- Step 1: Click “Browse” to select input text data and click “Next”.
- Step 2: In the interface “Codebook Application: Configuration”, click “Browse” to select the codebook file, choose a “Method Type” (e.g., Normalization) and an “Insertion Mode” (e.g., Positive Filter with Placeholder), and click “Next”.

- Step 3: Click “Browse” to select the place to store output data and click “Run”.
- Select any option to view the results or re-run the routine.

- Click “Open modified data”: the interface will show each input file with the identified words/phrases replaced by the terms set in the codebook.

- Click “Locate output directory” to open the output folder directly.
- Click “Re-run..” to run the routine again.
References
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288-296).
Shanahan, J. G., Qu, Y., & Wiebe, J. (Eds.). (2006). Computing attitude and affect in text: theory and applications (Vol. 20). Dordrecht, The Netherlands: Springer.
Wilson, T., Wiebe, J., & Hoffmann, P. (2005, October). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 347-354). Association for Computational Linguistics.
Diesner, J., & Evans, C. (2015). Little bad concerns: Using sentiment analysis to assess structural balance in communication networks. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), short paper. Paris, France.
Copyright © 2016 University of Illinois Board of Trustees. All rights reserved. Written by Jana Diesner, Sandra Franco, Ming Jiang, and Chieh-Li Chin at Jana Diesner’s Lab, GSLIS/ iSchool, UIUC. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
[1] Download the dataset from http://context.lis.illinois.edu/Enron_Wikipedia_Data.zip. This dataset is licensed under the Creative Commons Attribution-Share-Alike License 3.0. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
[2] https://en.wikipedia.org/wiki/Enron_scandal