Content analysis with open source tools

In the following, I describe how I use open source tools to perform different kinds of content analysis on text. In my particular case, I used Ubuntu 11.10 to analyze interview transcripts, but the same approach works with other Linux distributions, and interview transcripts are just one of many types of text that can be analyzed with these tools. I also used the extensible editor Emacs to conduct these analyses, and all the shell commands were carried out in Emacs shell-mode. The same things could be done in any Linux terminal, but the advantage of shell-mode in Emacs is that it is very easy to edit the commands and keep working with the output, since everything is in a text buffer.
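
If you want to work the same way, shell-mode is started from within Emacs with the standard command:

M-x shell RET

Everything below can then be typed at the prompt in that buffer, and both the commands and their output can be edited, searched and saved like any other text.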

In what follows, I assume that the interview transcripts I want to analyze are saved in a plain text file entitled “Transcripts.txt”.

Making a concordance

When analyzing text, it can be useful to make a concordance, or index, of all the words contained in the text. Word frequency analysis is the simplest kind of content analysis. The Linux Cookbook contains the following recipe for creating a simple concordance of the words in a text[1]:

$ tr ' ' ' RET
> ' < Transcripts.txt | sort | uniq -c RET

To explain the conventions used in the example above: the $ is not to be typed; it simply represents the command prompt when you are logged into a Unix/Linux terminal as a regular user. RET indicates that the “Enter” key should be pressed, and the > at the start of the second line is the shell's continuation prompt, not something you type. By pressing Enter inside the second pair of quotes you tell tr to translate every space into a newline, so that each word in the text ends up on its own line before it is sorted and counted. The result of this command looks something like this (when applying the recipe to John 3:16-17 in the KJV as an example text):

      2 For

      2 God

      1 Son

      1 Son,

      1 be

      1 begotten

      1 believeth

      2 but

      1 condemn

      1 everlasting

      1 gave

      1 have

      1 he

      2 him

      2 his

      1 in

      1 into

      1 life.

      1 loved

      1 might

      2 not

      1 only

      1 perish,

      1 saved.

      1 sent

      1 should

      1 so

      3 that

      4 the

      1 through

      1 to

      1 whosoever

      2 world

      1 world,

      1 world;
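
If typing a newline in the middle of a command feels awkward, the same recipe can, at least with the GNU version of tr that ships with Ubuntu, be written on a single line by using the escape sequence \n for the newline character:

$ tr ' ' '\n' < Transcripts.txt | sort | uniq -c

The output should be identical; the embedded newline is simply how the Linux Cookbook presents the recipe.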

I have developed this recipe somewhat further:

$ tr ' ' '
> ' < Transcripts.txt | sort -f | uniq -c -i | sort -r

Using “sort -f” instead of “sort” tells the sort command to fold lower case into upper case, in other words to ignore case when sorting. In order to also ignore case when counting the results, I write “uniq -c -i” instead of “uniq -c”. In the last pipe, “sort -r” reverses the sort, so that the lines with the highest counts come first rather than being sorted by the first letter of each word. (This works because uniq -c right-aligns the counts; if you want to sort the counts numerically in an explicit way, you can use “sort -rn” instead.) Applying this revised recipe to the same paragraph of text gives the following result:

      4 the

      3 that

      2 world

      2 not

      2 his

      2 him

      2 but

      2 God

      2 For

      1 world;

      1 world,

      1 whosoever

      1 to

      1 through

      1 so

      1 should

      1 sent

      1 saved.

      1 perish,

      1 only

      1 might

      1 loved

      1 life.

      1 into

      1 in

      1 he

      1 have

      1 gave

      1 everlasting

      1 condemn

      1 believeth

      1 begotten

      1 be

      1 Son,

      1 Son

The only problem with this output is that “world”, “world,” and “world;” are counted as different words. To avoid this, we can add a little sed one-liner to the pipeline as follows (here the sample text from John is saved in the file John.txt):

$ tr ' ' '
> ' < John.txt | sed 's/[^a-zA-Z0-9]//g' | sort -f | uniq -c -i | sort -r

The sed expression deletes every character that is not alphanumeric, so the punctuation is stripped from the words, and the output now looks as follows:

      4 world

      4 the

      3 that

      2 not

      2 his

      2 him

      2 but

      2 Son

      2 God

      2 For

      1 whosoever

      1 to

      1 through

      1 so

      1 should

      1 sent

      1 saved

      1 perish

      1 only

      1 might

      1 loved

      1 life

      1 into

      1 in

      1 he

      1 have

      1 gave

      1 everlasting

      1 condemn

      1 believeth

      1 begotten

      1 be
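
As an aside, once the pipeline produces a clean frequency list like this, it is easy to bolt on further stages. As a small sketch, assuming the GNU coreutils found on Ubuntu, the following variant sorts the counts numerically and shows only the ten most frequent words:

$ tr ' ' '
> ' < John.txt | sed 's/[^a-zA-Z0-9]//g' | sort -f | uniq -c -i | sort -rn | head -10

Here “sort -rn” sorts the counts as numbers in descending order, and “head -10” keeps only the top of the list, which is often all you need for a quick overview of a long transcript.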

A similar result can be reached by using tr itself (the Unix translate command) to remove all punctuation characters instead of sed. The recipe would then be:

$ tr ' ' '
> ' < John.txt | tr -d '[:punct:]' | sort -f | uniq -c -i | sort -r

The resulting concordance looks exactly the same. The benefit of using tr instead of sed here is that it works better when analyzing texts in languages with letters outside a-z: the sed expression above deletes everything that is not an English letter or digit, while tr -d '[:punct:]' only deletes punctuation (so in Norwegian, for example, the additional letters æ, ø and å survive).
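
A small illustration of the difference, using an invented Norwegian word rather than anything from my transcripts (the exact behaviour assumes the GNU versions of tr and sed):

$ echo 'kjærlighet,' | tr -d '[:punct:]'
kjærlighet

$ echo 'kjærlighet,' | sed 's/[^a-zA-Z0-9]//g'
kjrlighet

The tr version strips only the comma, while the sed version also throws away the Norwegian letters.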

Key word in context

Another basic method is to analyze the occurrence of certain key words in their context. Grep is a good friend in such a circumstance. If, for instance, you want to find out more about the context of the key word “world” in the entire third chapter of the gospel of John (say you saved the text in a text file entitled John3.txt), you could do the following:

$ cat John3.txt | grep -i world

Using “grep -i” tells grep to ignore case. The output would be:

16 For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.

17 For God sent not his Son into the world to condemn the world; but that the world through him might be saved.

19 And this is the condemnation, that light is come into the world, and men loved darkness rather than light, because their deeds were evil.

By piping the results through sed, you can easily do some tricks, for example changing the key word to all caps (or something else):

$ cat John3.txt | grep -i world | sed 's/world/WORLD/g'

The output would then be:

16 For God so loved the WORLD, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.

17 For God sent not his Son into the WORLD to condemn the WORLD; but that the WORLD through him might be saved.

19 And this is the condemnation, that light is come into the WORLD, and men loved darkness rather than light, because their deeds were evil.

You can also adjust the sed string to put pipes around the key word, like this:

$ cat John3.txt | grep -i world | sed 's/world/\| world \|/g'

16 For God so loved the | world |, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.

17 For God sent not his Son into the | world | to condemn the | world |; but that the | world | through him might be saved.

19 And this is the condemnation, that light is come into the | world |, and men loved darkness rather than light, because their deeds were evil.

The possibilities are many if you are willing to learn a little sed :-)
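
One example: the substitution above only matches a lower-case “world”, even though grep -i also finds lines where the word is capitalized. With the GNU sed that ships with Ubuntu, you can make the substitution case-insensitive too, and use & to stand for whatever was actually matched, so the original capitalization is kept inside the markers:

$ cat John3.txt | grep -i world | sed 's/world/| & |/gI'

The I flag makes the match case-insensitive and & is replaced by the matched text, so both “world” and “World” would be wrapped in pipes. (The I flag is a GNU extension, so it may not work with other sed implementations.)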

If you want to search for more than one key word at once, you can use egrep:

$ egrep -w 'word1|word2' /path/to/file

You can also list more than two words in a similar manner, and you can use the command together with cat as above. The options from regular grep normally work fine with egrep, so if, for instance, you want to search for the words “definition” and “formula” in your interview transcripts, and you want grep to display two lines of context before and after each match, you use the following command:

$ egrep -w -i -C 2  'definition|formula' Transcripts.txt
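
On current systems egrep is simply a shorthand for grep -E (extended regular expressions), so the same search can also be written like this:

$ grep -E -w -i -C 2 'definition|formula' Transcripts.txt

In both cases the output is every line containing one of the key words, together with the two lines before and after it, which gives a quick key-word-in-context view of the transcripts.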


[1] See: http://dsl.org/cookbook/cookbook_16.html