Short Training Guide for ConText

Jana Diesner, Sandra Franco, Ming Jiang, and Chieh-Li Chin

This guide provides a brief introduction to analyzing text data in ConText (http://context.lis.illinois.edu/).

  1. What is ConText?

We developed ConText to facilitate the construction of network data from natural language text data using a range of techniques. This process requires several Natural Language Processing (NLP) techniques, which are also available in ConText.

  1. Data

For illustrative purposes, we use free and open sample text data about the Enron case, which can be downloaded from the ConText webpage[1]. The dataset is from the Wikipedia article about the collapse of Enron[2], which we partitioned into several short text files.

  1. ConText basic principles

  1. Organization of Manual

This manual is organized from general to more specific analyses, and from common to more specialized techniques.

  1. Summarization

  1. Topic Modeling

What is topic modeling?

Topic modeling identifies the main themes represented in a text corpus. Each theme (topic) is represented as a vector of words, which are sorted by their strength of association with that topic. This technique is probabilistic; i.e., the results might differ from run to run.

How is it done?

Topic modeling in ConText is based on Latent Dirichlet Allocation (LDA); we use Mallet for this purpose (http://mallet.cs.umass.edu/topics.php).
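To make the idea concrete, here is a minimal sketch of LDA trained with collapsed Gibbs sampling. It is not Mallet's implementation and not the code that ConText runs; the function name `lda_gibbs`, the hyperparameter values, and the toy documents are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative sketch, not Mallet).

    docs is a list of token lists. Returns one list per topic holding
    (word, count) pairs sorted from strongest to weakest association.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]            # topic counts per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                          # tokens assigned per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to
                # P(topic | document) * P(word | topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    topics = []
    for k in range(num_topics):
        ranked = sorted(topic_word[k].items(), key=lambda kv: (-kv[1], kv[0]))
        topics.append([(w, c) for w, c in ranked if c > 0])
    return topics

# Hypothetical toy corpus; real input would be the tokenized text files.
toy_docs = [
    ["enron", "stock", "price"],
    ["enron", "audit", "fraud"],
    ["stock", "price", "market"],
]
toy_topics = lda_gibbs(toy_docs, num_topics=2, iterations=20, seed=1)
```

Calling `lda_gibbs` with a different `seed` can yield different topics, which illustrates why results might differ from run to run.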

  1. Sentiment Analysis

What is sentiment analysis?

Sentiment analysis is a technique that identifies the degree of emotion expressed in text data and determines which terms, phrases, or text portions match which sentiment categories. Common categories include positive, negative, and neutral sentiment (Shanahan et al., 2006).

How is it done?

In ConText, a widely used, predefined sentiment lexicon maps text terms to the categories positive, negative, and neutral (Wilson et al., 2005). Users can modify this lexicon (see, for example, Diesner & Evans, 2015) or use their own lexicon, including categories of their choice. The predefined lexicon disambiguates terms based on part of speech (e.g., the word “cost” can be neutral as a verb but negative as a noun).
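The lookup described above can be sketched as a dictionary keyed by word and part of speech. The lexicon entries, the Penn Treebank style tags, and the function names below are assumptions for illustration; they are not taken from the Wilson et al. lexicon or from ConText itself.

```python
from collections import Counter

# Hypothetical mini-lexicon keyed by (word, POS tag).
LEXICON = {
    ("cost", "NN"): "negative",   # "cost" as a noun
    ("cost", "VB"): "neutral",    # "cost" as a verb
    ("fraud", "NN"): "negative",
    ("success", "NN"): "positive",
}

def tag_sentiment(tagged_tokens, lexicon=LEXICON):
    """Return (word, pos, sentiment) triples; None marks words without an entry."""
    return [(w, p, lexicon.get((w.lower(), p))) for w, p in tagged_tokens]

def count_categories(labelled):
    """Count how many tokens fall into each sentiment category."""
    return Counter(s for _, _, s in labelled if s is not None)

# Hypothetical POS-tagged input.
tagged = [("The", "DT"), ("cost", "NN"), ("will", "MD"),
          ("cost", "VB"), ("fraud", "NN")]
labelled = tag_sentiment(tagged)
summary = count_categories(labelled)
```

Because the key includes the POS tag, the two occurrences of “cost” receive different categories, while words without a lexicon entry stay unlabelled.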

  1. Corpus Statistics

What are corpus statistics?

Corpus statistics give an overview of (weighted) word frequencies and of the distribution of words across texts.

How is it done?
An exercise in counting.
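As a rough sketch of that counting, the snippet below tallies how often each word occurs in total and in how many texts it appears. The tokenization rule and the function name `corpus_statistics` are assumptions, and no weighting is applied because the weighting scheme is not specified here.

```python
import re
from collections import Counter

def corpus_statistics(texts):
    """Return (term_freq, doc_freq) for a list of raw text strings."""
    term_freq = Counter()  # total occurrences of each word
    doc_freq = Counter()   # number of texts that contain each word
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    return term_freq, doc_freq

# Hypothetical two-text corpus.
term_freq, doc_freq = corpus_statistics([
    "Enron was worth billions.",
    "The stock was worth nothing.",
])
```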

 

The screenshot below shows the result with “Results include POS” checked: the word “worth” appears 5 times as JJ (adjective) and 2 times as IN (preposition or subordinating conjunction).

The screenshot below shows the result with “Results include POS” unchecked: the word “worth” appears a total of 7 times.
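The difference between the two settings can be sketched as counting (word, POS) pairs versus counting words alone. The function name `count_words` and the way the tagged tokens are represented are assumptions; the numbers mirror the “worth” example.

```python
from collections import Counter

def count_words(tagged_tokens, include_pos):
    """Count (word, POS) pairs, or words only when include_pos is False."""
    if include_pos:
        return Counter(tagged_tokens)
    return Counter(word for word, _ in tagged_tokens)

# "worth" tagged 5 times as an adjective and 2 times as a preposition.
tagged_worth = [("worth", "JJ")] * 5 + [("worth", "IN")] * 2

with_pos = count_words(tagged_worth, include_pos=True)
without_pos = count_words(tagged_worth, include_pos=False)
```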

  1. Visualization

What is visualized?

The most buzzword-laden technique in ConText. It displays the outcome of topic modeling (each topic forms one cluster) and sentiment analysis (color coded: red = negative, green = positive, blue = neutral, black = no sentiment encoded).

  1. Text Pre-processing

  1. Parts of Speech (POS) Tagging

What does it do?
It identifies the part of speech (POS) of each word.

How does it work?

Tagging is based on a probabilistic model, the Stanford POS Tagger (http://nlp.stanford.edu/software/tagger.shtml).
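The Stanford tagger is far more sophisticated than anything that fits in a few lines. To convey the basic idea of choosing the most probable tag, the sketch below swaps in a much simpler technique, a most-frequent-tag baseline: it counts how often each word carries each tag in hand-tagged training data and then assigns the tag seen most often. The training sentences, function names, and the default tag for unknown words are assumptions.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word; unknown words fall back to the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Hypothetical hand-tagged training data.
training = [
    [("the", "DT"), ("company", "NN"), ("was", "VBD"), ("worth", "JJ")],
    [("worth", "JJ"), ("the", "DT"), ("risk", "NN")],
    [("worth", "IN"), ("it", "PRP")],
]
pos_model = train_baseline_tagger(training)
tagged_sample = tag_words(["The", "fund"], pos_model)
```

Unlike this baseline, a full tagger also uses the surrounding words, which is how the same word can receive different tags in different sentences.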

  1. Remove stop words  

What is it?

This routine eliminates function words such as “the” and “a” from text data. As the Corpus Statistics computed before removing stop words show (see the screenshot above), the most common words are often stop words.

How does it work?
A predefined delete list is applied to the text data.
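Applying a delete list can be sketched in a few lines. The stop words listed here and the function name `remove_stop_words` are assumptions; they do not reproduce the list that ships with ConText.

```python
import re

# Hypothetical delete list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of text that are not on the delete list."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.lower() not in stop_words]

kept = remove_stop_words("The collapse of Enron was a scandal")
```

The comparison is done on the lowercased token so that “The” and “the” are treated alike, while the surviving tokens keep their original casing.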

 

  1. Stemming

What is it?

This routine reduces every word to its stem, i.e., its base form without inflectional endings.

How does it work?

Stemming and lemmatization are based on Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml).
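To illustrate what reducing a word to its base form means, here is a deliberately naive suffix-stripping sketch. It is not the algorithm used by Stanford CoreNLP; the suffix rules, the minimum stem length, and the function name `naive_stem` are assumptions chosen only to keep the example short.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def naive_stem(word, min_stem=3):
    """Strip one common English suffix; leave short or unmatched words alone."""
    w = word.lower()
    if w.endswith("ss"):  # keep words such as "loss" intact
        return w
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)] + replacement
    return w

stems = [naive_stem(w) for w in ["companies", "trading", "shares", "loss"]]
```

A stem produced this way need not be a dictionary word (for example, “trading” becomes “trad”), whereas lemmatization returns a dictionary form.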

  1. Bigram Detection

What is it?

Bigram detection gives an overview of the frequencies of bigrams (i.e., pairs of adjacent words) across the text.

How does it work?

An exercise in counting.

(Recommendation: run this routine after removing stop words.)
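A minimal sketch of bigram counting follows. It assumes each text has already been tokenized (and, as recommended, stripped of stop words); the function name `count_bigrams` and the sample tokens are hypothetical.

```python
from collections import Counter

def count_bigrams(tokenized_texts):
    """Count adjacent word pairs, never pairing words across two texts."""
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(zip(tokens, tokens[1:]))
    return counts

bigram_counts = count_bigrams([
    ["enron", "stock", "fell"],
    ["enron", "stock", "rose"],
])
```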

  1. Network Analysis

  1. Network Construction from Entity Types

What is it?

This routine builds a network of named entities (https://en.wikipedia.org/wiki/Named-entity_recognition), such as organizations, persons, and locations, across the text based on entity type.

How does it work?

  1. Network Construction from Codebook

What is it?

This routine builds a network across the text from the words and phrases defined in the codebook.

How does it work?
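One common way to build such a network, shown here as an assumption rather than as ConText's documented procedure, is to link two codebook terms whenever they occur in the same sentence and to use the number of shared sentences as the link weight. The sentence splitting rule, the function name `cooccurrence_network`, and the sample text are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_network(text, codebook_terms):
    """Return weighted edges between terms that share a sentence."""
    edges = Counter()
    for sentence in re.split(r"[.!?]+", text):
        lowered = sentence.lower()
        present = sorted(
            term for term in codebook_terms
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        )
        # Every pair of terms found in this sentence gets one more link.
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

sample_text = ("Enron hired Arthur Andersen. Arthur Andersen audited Enron. "
               "Kenneth Lay led Enron.")
network = cooccurrence_network(
    sample_text, ["Enron", "Arthur Andersen", "Kenneth Lay"]
)
```

Each key of `network` is an undirected link between two terms (stored in alphabetical order) and each value is its weight.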

  1. Codebook Construction

  1. Entity Detection

What is it?

This routine extracts words and phrases that represent named entities.

How does it work?

  1. Codebook Application

What is it?

This routine replaces key words and phrases with uniform, complete expressions across the text.

How does it work?

ConText detects the words and phrases listed in the prepared codebook and replaces them with the terms that you set in the codebook.
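The detect-and-replace step can be sketched with a regular expression. The codebook entries, the underscore-joined target terms, and the function name `apply_codebook` are assumptions for illustration and do not reflect ConText's codebook file format.

```python
import re

def apply_codebook(text, codebook):
    """Replace each codebook phrase in text with its target term."""
    if not codebook:
        return text
    # Try longer phrases first so "Ken Lay" wins over the shorter "Lay".
    phrases = sorted(codebook, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )
    return pattern.sub(lambda match: codebook[match.group(0)], text)

# Hypothetical codebook: phrase found in the text -> uniform term.
sample_codebook = {
    "Kenneth Lay": "Kenneth_Lay",
    "Ken Lay": "Kenneth_Lay",
    "Lay": "Kenneth_Lay",
    "Andersen": "Arthur_Andersen",
}
normalized = apply_codebook("Ken Lay met Andersen. Lay resigned.", sample_codebook)
```

Matching is case-sensitive in this sketch, so the verb “lay” would not be mistaken for the surname.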

  1. Reference

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288-296).

Shanahan, J. G., Qu, Y., & Wiebe, J. (Eds.). (2006). Computing attitude and affect in text: theory and applications (Vol. 20). Dordrecht, The Netherlands: Springer.

Wilson, T., Wiebe, J., & Hoffmann, P. (2005, October). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 347-354). Association for Computational Linguistics.

Diesner, J., & Evans, C. (2015). Little bad concerns: Using sentiment analysis to assess structural balance in communication networks. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), short paper. Paris, France.

Copyright © 2016 University of Illinois Board of Trustees. All rights reserved. Written by Jana Diesner, Sandra Franco, Ming Jiang, and Chieh-Li Chin at Jana Diesner’s Lab, GSLIS/ iSchool, UIUC. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.


[1] Download the dataset from http://context.lis.illinois.edu/Enron_Wikipedia_Data.zip. This dataset is licensed under the Creative Commons Attribution-Share-Alike License 3.0. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

[2] https://en.wikipedia.org/wiki/Enron_scandal