INFO 630/CS 674 Lecture Notes: Deeper Thought


Scribes: Vladimir Barash, Stephen Purpura, Shaomei Wu

Sep-22-2007


Our finger exercise demonstrated that the choice of normalization function affects how documents are ranked according to their length. But there is a fundamental difference between our finger exercise and the TREC data sets: the finger exercise has a very limited vocabulary. In addition, neither our finger exercise example nor the TREC corpora are degraded by misspellings or nonsense terms, such as those that might result from a partially successful OCR (optical character recognition) process. In such a process, the term ‘cat’ may appear as ‘cats’, ‘cal’, or ‘eat’.


Question:

Why might vocabulary diversity have an impact on the normalization function? Such vocabulary diversity might be caused by a significant presence of nonsense terms (misspellings, etc.).


Answer: We’ve already noted in the lecture that more non-zero term frequencies cause the retrieval process to be biased towards long documents. But we didn’t examine the impact of extremes in vocabulary diversity on the effectiveness of the normalization function.


The finger exercise used 5 terms, while the TREC data is supposed to have a more “representative” vocabulary. Even the TREC corpora are a special case of real-world data, because they are supposed to be error-free. Real-world vocabulary introduces two common problems for NLP researchers: synonymy and polysemy.


Unlike our finger exercise, real-world corpora contain synonyms, such as “kitty” and “cat”. Given a query {“cat”}, a document that uses the term “kitty” instead of “cat” has a lower term frequency for “cat”. The information retrieval process can therefore rank documents that contain “kitty” lower than other documents, including some that do not contain “cat” at all, even though an expert reviewer would consider “kitty” a synonym for “cat” and on-topic.


In addition, terms have a range of meanings based on the context. In American political documents, the word “choice” sometimes refers to the discussion of reproductive rights and other times it refers to an election. So, the query {“choice”, “abortion”} might rank documents about elections higher than documents about reproductive rights if “choice” is not disambiguated.


In addition to vocabulary diversity caused by synonyms and polysemes, [SBM ‘96] mentions another reason that vocabulary diversity may exist – corpora may be degraded with nonsense terms and misspellings such as terms which might appear from a partially successful OCR (optical character recognition) process.


Now we’ll examine an information retrieval example which is affected by these types of vocabulary diversity.


For the purposes of this exercise, we will use the following ‘idf’ function (note that this is different from the idf function used in the finger exercise but the same as the idf function used in the notes):

$$idf[j] = \frac{N}{n_j}$$

where N is the size of the corpus (the total number of documents in the corpus) and n_j is the number of documents that contain term j.
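
As a quick numerical check, here is a minimal Python sketch that assumes the idf is the plain ratio N / n_j, a form consistent with the squared tf-idf values tabulated later in these notes. The function name `idf` is illustrative only and does not come from any library.

```python
def idf(corpus_size: int, doc_freq: int) -> float:
    """Inverse document frequency as the plain ratio N / n_j (assumed form)."""
    if doc_freq <= 0:
        raise ValueError("a term must occur in at least one document")
    return corpus_size / doc_freq


N = 5  # the example corpus in these notes has five documents

# A term found in fewer documents receives a larger idf.
for n_j in (1, 2, 3, 5):
    print(f"n_j = {n_j}: idf = {idf(N, n_j):.4f}, idf squared = {idf(N, n_j) ** 2:.2f}")
```

With a term frequency of 1, a term that occurs in a single document therefore contributes a squared tf-idf of 25, and a term that occurs in two documents contributes 6.25.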


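Given our match function for each document d:

$$match(q, d) = \sum_{j=1}^{t} \frac{q[j] \cdot tf_d[j] \cdot idf[j]}{norm(d)},$$

where t is the number of terms in a given query and norm(d) is the normalization factor for document d. Given q as a binary query vector, the function can be simplified to:

$$match(q, d) = \frac{1}{norm(d)} \sum_{j=1}^{t} tf_d[j] \cdot idf[j].$$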


Increasing the number of synonyms in the vocabulary tends to lower tf_d[j], because occurrences of one concept are spread over several terms. Similarly, when polysemes are present, they artificially increase tf_d[j], because occurrences with unrelated meanings are counted together. When tf_d[j] is deflated or inflated relative to the real match against the user’s information goal, the normalization function can play havoc with the retrieval probabilities.


Consider the following example documents in a 5 document corpus:

Documents:

Doc#1:
Cats are cool, soft, fuzzy and bouncy!

Doc#2:
Dogs love to eat and run around.

Doc#3:
It was raining cats and dogs the other night... so bad that I couldn't go outside. Sometimes I would come to the window and just stare at the rain. It was very depressing, but in the morning, I felt better!

Doc#4:
It's a dog-eat-dog world out there. From puppies to big hounds, everyone struggles to survive, to avoid his superior and to beat up on his inferior. That's just how it is.

Doc#5:
Cats and dogs are two common types of household animals. There are many species of cats and dogs - from the common house cat, to the Blue Russian, from bulldog to shepherd. Both cats and dogs have been domesticated by man many thousands of years ago and are loved and cared for by many pet owners today. There are even urban legends of cat owners having statistically better health than non-cat owners - and everyone knows how useful a dog can be, for protecting the house, for instance! There are many more things to say about cats and dogs, but I think I've run out of time, so I have to go. Thank you for listening!


If you compute document weights for both the full-text vocabulary and the Porter stemmed vocabulary (with stop word removal), you will find very different match weightings. (For more information on Porter stemming, see http://tartarus.org/~martin/PorterStemmer/)
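
The full-vocabulary half of such an experiment can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the processing actually used for these notes: the tokenizer (lowercase, then keep runs of letters, so hyphens and apostrophes split tokens) and the idf form N / n_j are both assumptions, and the names `tokenize` and `squared_tfidf` are illustrative.

```python
import re
from collections import Counter

DOCS = {
    1: "Cats are cool, soft, fuzzy and bouncy!",
    2: "Dogs love to eat and run around.",
    3: ("It was raining cats and dogs the other night... so bad that I "
        "couldn't go outside. Sometimes I would come to the window and "
        "just stare at the rain. It was very depressing, but in the "
        "morning, I felt better!"),
    4: ("It's a dog-eat-dog world out there. From puppies to big hounds, "
        "everyone struggles to survive, to avoid his superior and to beat "
        "up on his inferior. That's just how it is."),
    5: ("Cats and dogs are two common types of household animals. There "
        "are many species of cats and dogs - from the common house cat, "
        "to the Blue Russian, from bulldog to shepherd. Both cats and "
        "dogs have been domesticated by man many thousands of years ago "
        "and are loved and cared for by many pet owners today. There are "
        "even urban legends of cat owners having statistically better "
        "health than non-cat owners - and everyone knows how useful a dog "
        "can be, for protecting the house, for instance! There are many "
        "more things to say about cats and dogs, but I think I've run out "
        "of time, so I have to go. Thank you for listening!"),
}


def tokenize(text):
    # Assumed tokenizer: no stemming, so 'cat' and 'cats' stay distinct terms.
    return re.findall(r"[a-z]+", text.lower())


TF = {doc_id: Counter(tokenize(text)) for doc_id, text in DOCS.items()}
N = len(DOCS)


def squared_tfidf(term, doc_id):
    # n_j: number of documents that contain the exact term.
    n_j = sum(1 for counts in TF.values() if counts[term] > 0)
    if n_j == 0:
        return 0.0
    return (TF[doc_id][term] * N / n_j) ** 2  # (tf * idf)^2 with idf = N / n_j


for term in ("cat", "love", "dog"):
    print(term, [round(squared_tfidf(term, d), 2) for d in sorted(DOCS)])
```

Under these assumptions, only d5 contains the exact term ‘cat’ (three times, counting ‘non-cat’) and only d2 contains the exact term ‘love’, because the plural ‘cats’ and the inflected ‘loved’ are different terms in the full vocabulary.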

Now examine the results of the query: {“cat”, “love”}


Full vocabulary

Document #   raw tfidf squared for ‘cat’   raw tfidf squared for ‘love’
1            0                             0
2            0                             25.00
3            0                             0
4            0                             0
5            225.00                        0


Porter stemmed vocabulary

Document #   raw tfidf squared for ‘cat’   raw tfidf squared for ‘love’
1            2.78                          0
2            0                             6.25
3            2.78                          0
4            0                             0
5            100.00                        6.25


Full vocabulary

Document #   L2-Norm    Pivoted TFIDF Norm
1            10.4894    23.7915
2            8.2365     23.3409
3            28.3702    27.3677
4            25.6369    26.8210
5            62.8521    34.2640


Porter stemmed vocabulary

Document #   L2-Norm    Pivoted TFIDF Norm
1            10.1379    17.4178
2            6.3465     16.6595
3            19.1848    19.2272
4            17.5000    18.8902
5            43.0200    23.9942
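
The relationship between the two norm columns can be sketched as follows. The general form of pivoted normalization is (1 - slope) × pivot + slope × old norm. The specific choices here, a slope of 0.2 and a pivot equal to the average L2-Norm over the corpus, are not stated in these notes; they are assumptions inferred from the tabulated numbers, which they reproduce to within rounding. The function name `pivoted_norms` is illustrative.

```python
def pivoted_norms(l2_norms, slope=0.2):
    """Pivoted normalization: (1 - slope) * pivot + slope * old_norm.

    Assumption: the pivot is the mean of the L2 norms and the slope is 0.2;
    both values are inferred from the tables, not given in the notes.
    """
    pivot = sum(l2_norms) / len(l2_norms)
    return [(1.0 - slope) * pivot + slope * norm for norm in l2_norms]


# L2-Norm columns copied from the two tables (documents 1 through 5).
FULL_L2 = [10.4894, 8.2365, 28.3702, 25.6369, 62.8521]
PORTER_L2 = [10.1379, 6.3465, 19.1848, 17.5000, 43.0200]

for label, norms in (("full", FULL_L2), ("porter", PORTER_L2)):
    print(label, [round(p, 4) for p in pivoted_norms(norms)])
```

Because the slope is below 1, pivoting pulls every norm toward the corpus average: short documents (d1, d2) get a larger normalization factor than under L2 normalization, and the longest document (d5) gets a smaller one.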


Full vocabulary

Document #   L2-Normed Score   Pivoted TFIDF Normed Score
1            0                 0
2            0.6071            0.2142
3            0                 0
4            0                 0
5            0.2387            0.4378

Rank (L2-Normed Score): d2 > d5 > d1, d3, d4
Rank (Pivoted TFIDF Normed Score): d5 > d2 > d1, d3, d4


Porter stemmed vocabulary

Document #   L2-Normed Score   Pivoted TFIDF Normed Score
1            0.1644            0.0957
2            0.3939            0.1501
3            0.0869            0.0867
4            0                 0
5            0.2906            0.5210

Rank (L2-Normed Score): d2 > d5 > d1 > d3 > d4
Rank (Pivoted TFIDF Normed Score): d5 > d2 > d1 > d3 > d4
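
The Porter stemmed scores and rankings above can be reproduced from the earlier tables with a short Python sketch. It assumes that a document's score is the sum of the (unsquared) tf-idf weights of the query terms divided by the document's normalization factor, which matches the tabulated values to within rounding. The names `score` and `ranking` are illustrative.

```python
import math

# Squared tf-idf weights, Porter stemmed vocabulary, copied from the table.
SQUARED_TFIDF = {
    1: {"cat": 2.78, "love": 0.0},
    2: {"cat": 0.0, "love": 6.25},
    3: {"cat": 2.78, "love": 0.0},
    4: {"cat": 0.0, "love": 0.0},
    5: {"cat": 100.0, "love": 6.25},
}
# Normalization factors, Porter stemmed vocabulary, copied from the table.
L2_NORM = {1: 10.1379, 2: 6.3465, 3: 19.1848, 4: 17.5000, 5: 43.0200}
PIVOTED_NORM = {1: 17.4178, 2: 16.6595, 3: 19.2272, 4: 18.8902, 5: 23.9942}


def score(doc_id, query, norms):
    # Assumed match function: sum of query-term tf-idf weights over the norm.
    total = sum(math.sqrt(SQUARED_TFIDF[doc_id][term]) for term in query)
    return total / norms[doc_id]


def ranking(query, norms):
    # Document ids sorted from highest to lowest score.
    return sorted(SQUARED_TFIDF, key=lambda d: score(d, query, norms), reverse=True)


QUERY = ("cat", "love")
for label, norms in (("L2", L2_NORM), ("pivoted", PIVOTED_NORM)):
    scores = {d: round(score(d, QUERY, norms), 4) for d in SQUARED_TFIDF}
    print(label, scores, ranking(QUERY, norms))
```

The numerators are identical under both normalizations, so the only thing that moves d5 ahead of d2 under pivoting is the change in the denominators.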



For both the L2-Normalization weightings and the Pivoted TFIDF weightings, the Porter stemmed vocabulary produces a better document ranking than the full vocabulary: with stemming, d1 and d3, which mention ‘cats’, receive non-zero scores. In this case, the Pivoted TFIDF scores also outperform the L2-Normalized scores for both the full vocabulary and the Porter stemmed vocabulary, placing d5 ahead of d2. In the documents, there are no synonyms for ‘cat’ or ‘love’, although ‘pet’ (as in “pet owners”) could be considered a synonym for ‘cat’, and ‘non-cat’ causes some noise.


Now consider the query: {“dog”, “love”}


Full vocabulary

Document #   raw tfidf squared for ‘dog’   raw tfidf squared for ‘love’
1            0                             0
2            0                             25.00
3            0                             0
4            25.00                         0
5            6.25                          0


Porter stemmed vocabulary

Document #   raw tfidf squared for ‘dog’   raw tfidf squared for ‘love’
1            0                             0
2            2.78                          6.25
3            2.78                          0
4            0                             0
5            69.44                         6.25


Full vocabulary

Document #   L2-Normed Score   Pivoted TFIDF Score
1            0                 0
2            0.6071            0.2142
3            0                 0
4            0.1950            0.1864
5            0.0398            0.0730

Rank (L2-Normed Score): d2 > d4 > d5 > d1, d3
Rank (Pivoted TFIDF Score): d2 > d4 > d5 > d1, d3


Porter stemmed vocabulary

Document #   L2-Normed Score   Pivoted TFIDF Score
1            0                 0
2            0.6565            0.2501
3            0.0869            0.0867
4            0                 0
5            0.2518            0.4515

Rank (L2-Normed Score): d2 > d5 > d3 > d1, d4
Rank (Pivoted TFIDF Score): d5 > d2 > d3 > d1, d4


As with the {“cat”, “love”} query, for the {“dog”, “love”} query the Pivoted TFIDF scores outperform the L2-Normalized scores on the Porter stemmed vocabulary. Within the Porter stemmed vocabulary, L2-Normalized scores prefer short documents and Pivoted TFIDF scores prefer long documents. On the full vocabulary, however, the Pivoted TFIDF scores fail to outperform the L2-Normalized scores: both produce the same ranking. Most of the reason for this is the presence or absence of the query terms in the documents. In d4, ‘dog’ has a polyseme, ‘dog-eat-dog’, which is treated as a different term under the Porter stemmed vocabulary. In the full vocabulary, however, the two instances of ‘dog’ in ‘dog-eat-dog’ cause d4 to be erroneously ranked as highly relevant and affect the normalization score.
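
The effect of ‘dog-eat-dog’ on the term counts for d4 can be illustrated with two hypothetical tokenizers. This is a sketch only: the notes do not specify how either vocabulary was tokenized, so both regular expressions and the function names `split_on_hyphens` and `keep_compounds` are assumptions chosen to mimic the two behaviors described above.

```python
import re
from collections import Counter

DOC4 = ("It's a dog-eat-dog world out there. From puppies to big hounds, "
        "everyone struggles to survive, to avoid his superior and to beat "
        "up on his inferior. That's just how it is.")


def split_on_hyphens(text):
    # Hyphens act as separators: 'dog-eat-dog' becomes 'dog', 'eat', 'dog'.
    return re.findall(r"[a-z]+", text.lower())


def keep_compounds(text):
    # Hyphenated compounds survive as single terms: 'dog-eat-dog' stays whole.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


split_counts = Counter(split_on_hyphens(DOC4))
compound_counts = Counter(keep_compounds(DOC4))

print("tf('dog') when hyphens split:", split_counts["dog"])
print("tf('dog') when compounds are kept:", compound_counts["dog"])
print("tf('dog-eat-dog') when compounds are kept:", compound_counts["dog-eat-dog"])
```

When hyphens split tokens, d4 matches the query term ‘dog’ twice even though it says nothing about dogs as animals; when the compound is kept whole, d4 does not match ‘dog’ at all.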


How much is the normalization score affected? You can think of the normalization factor as an amplifier of the ‘tf idf’ terms: documents are rewarded or punished depending on the type of normalization. The following table shows how much more the Pivoted TFIDF Normed scores are punished under the full vocabulary than under the Porter stemmed vocabulary. A larger number in the table means that the L2-Norm is larger relative to the Pivoted TFIDF Norm.


Ratio of L2-Norm/Pivoted TFIDF Norm

Document #   Full vocabulary   Porter stemmed vocabulary
1            0.4409            0.5820
2            0.3529            0.3810
3            1.0366            0.9978
4            0.9559            0.9264
5            1.8343            1.7930
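
One way to read these ratios, assuming that both scores share the same tf-idf numerator and differ only in the normalization factor in the denominator (which is how the score tables above appear to have been computed): the ratio of the norms equals the ratio of the two scores. For d5 under the full vocabulary and the query {“cat”, “love”}:

```latex
\frac{\mathrm{score}_{\mathrm{pivoted}}(d_5)}{\mathrm{score}_{L2}(d_5)}
  = \frac{\text{L2-Norm}(d_5)}{\text{Pivoted TFIDF Norm}(d_5)}
  = \frac{62.8521}{34.2640} \approx 1.8343,
\qquad
\frac{0.4378}{0.2387} \approx 1.834 .
```

Under this reading, a ratio above 1 means that the document scores higher with pivoted normalization than with L2 normalization, and a ratio below 1 means the opposite.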



Combining the effects of the vocabulary diversity (flatter or inflated ‘tf’s and different norm scores), it is easy to see how retrieval probabilities can change based on the vocabulary diversity.


References:


See http://docs.google.com/Doc?id=dcpkz9gb_42wmz5b5 for the full texts, processing instructions, and raw statistics about the text.


For a spreadsheet of the full vocabulary document matrix and statistics, see:

http://spreadsheets.google.com/pub?key=pswp60NXd6HBLztSVi-eGcw


For a spreadsheet of the Porter stemmed vocabulary document matrix and statistics, see:

http://spreadsheets.google.com/pub?key=pswp60NXd6HAkLlHrLiEtEg