1 of 30

An automated tool for detecting bias in Internet-based hypermedia

Richard Townsend (1002335)

2 of 30

?

Is this site's coverage proportionate, fair and balanced?

Can I trust this site on issue x?

What's a site said on this topic?

Is there anything I should read?

Should I approach this site about working with my business?

Has this site shifted position on this issue?

Is this site editorially independent?

What are people saying about my product?

Has this site told the truth?

Where does this site get its news from?

Does this writer like Obama?

What does this site talk about?

What's the Washington Post written about waste water treatment?

How does this site's coverage compare with others?

Does this author like the iPhone?

Will this news have an impact on my shares?

Is this story genuine?

Motivation

3 of 30

Motivation > Demo

Enter a free-form query, like

Apple bloomberg.com

Sentimentron fetches every document that matches, analyses them for sentiment indicators, and displays a result on a time-series graph

McCain

Obama...

Orange Juice

Results often match up with real-world sentiment indicators (stock prices, election polls)

4 of 30

Motivation > Demo

Key stats

Visualisation available at the document, phrase and sentence levels

Time series chart shows the number of each feature detected over time, allowing trend analysis

Context available by clicking on each data point to see the summarized articles

Subdomain breakdown

5 of 30

#

5 936 464 keywords

85 GB data

39 361 detected domains

2 111 101 HTML documents

951 360 dates

1 468 915 documents analysed

3 063 762 extracted links

Motivation > Demo > Technical

8 of 30

Motivation > Demo > Technical (pysen)

A few different ways of doing sentiment analysis:

Probabilistic (e.g. Naive Bayes)

Pattern recognition (e.g. support vector machines)

Unsupervised lexicons (e.g. SentiWordNet)

Can use a variety of features to classify documents:

Words, word pairs, words+POS tags...

Perform well, tendency to overfit

Structure of a document/sentence is lost

50% accuracy is very bad, >70% accuracy is quite good

Humans apparently disagree about 79% of the time

More advanced variants involve discourse analysis

"Who holds an opinion about what?"

Few data sets exist, probably overkill

Lots of commercial interest in this sector
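As a rough illustration of the feature types listed above (words and word pairs), the sketch below builds both from raw text and shows why a plain bag of words loses sentence structure. The function names and the tokenisation rule are assumptions made for this example; they are not taken from pysen.

```python
from collections import Counter


def word_features(text):
    """Lower-cased word tokens, stripped of surrounding punctuation."""
    tokens = (t.strip(".,!?;:\"'()") for t in text.lower().split())
    return [t for t in tokens if t]


def word_pair_features(text):
    """Adjacent word pairs (bigrams) built from the word tokens."""
    words = word_features(text)
    return list(zip(words, words[1:]))


def bag_of_words(text):
    """Unordered word counts, the usual input to a word-based classifier."""
    return Counter(word_features(text))


a = "the dog bit the man"
b = "the man bit the dog"
# Identical bags of words, so a word-only classifier cannot tell them apart.
same_bag = bag_of_words(a) == bag_of_words(b)
# Word pairs keep a little local order, so the two sentences differ again.
same_pairs = word_pair_features(a) == word_pair_features(b)
```

Here `same_bag` is true and `same_pairs` is false, which is the trade-off behind using word pairs or words plus POS tags instead of single words.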

9 of 30

Motivation > Demo > Technical (pysen)

LingPipe

Sentiment analysis built in

Punitive licensing restrictions

AlchemyAPI

Commercial, third-party service

Too expensive to use at the required scale

Rate_Sentiment

Perl

No evidence for how well it works

SAS Text Analytics

Commercial product used in financial/CRM applications

Viralheat

Restricted to 360 characters

Requires developer license

Other

Very few open-source projects

10 of 30

Motivation > Demo > Technical (pysen)

Sentiment analysis component is called Pysen

Python library, integrated into backend

Pysen is fast, scalable and as accurate as possible

Looks at sentiment in three ways: phrases, sentences, and documents

Phrases are unambiguous, scored with bag-of-words

Phrase contents compared to a database, probability of correctness determined

Parameters chosen off-line to balance precision and recall

Sentences are structured lists of unambiguous phrases

Matched against a database of pre-trained sentences

Documents are statistical summaries of phrase/sentence data

Number of positive and negative sentences and phrases fed into a decision tree classifier to determine overall label

Sentences and phrases which are likely to be incorrect are removed before each classification stage
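The phrase-to-document flow described above can be sketched roughly as follows. This is a simplified illustration, not pysen's code: the lexicon, the phrase-splitting rule, the confidence cut-off and every function name are made up for the example, the sentence-matching stage is left out, and a simple majority rule stands in for the trained decision tree.

```python
import re

# Toy stand-in for the phrase database; words and scores are invented.
LEXICON = {"loved": 1.0, "pride": 0.5, "joy": 0.5,
           "hated": -1.0, "devastation": -1.0}


def split_phrases(sentence):
    """Split a sentence into phrases at commas, semicolons and 'but'."""
    parts = re.split(r"[,;]|\bbut\b", sentence)
    return [p.strip() for p in parts if p.strip()]


def score_phrase(phrase, min_confidence=0.5):
    """Bag-of-words score: 'pos', 'neg', or None when too uncertain."""
    words = re.findall(r"[a-z']+", phrase.lower())
    total = sum(LEXICON.get(w, 0.0) for w in words)
    if abs(total) < min_confidence:
        return None  # uncertain phrases are dropped before the next stage
    return "pos" if total > 0 else "neg"


def summarise_document(text):
    """Count positive and negative phrases across all sentences."""
    counts = {"pos": 0, "neg": 0}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for phrase in split_phrases(sentence):
            label = score_phrase(phrase)
            if label:
                counts[label] += 1
    return counts


def label_document(counts):
    """Majority rule standing in for the trained decision tree."""
    if counts["pos"] > counts["neg"]:
        return "positive"
    if counts["neg"] > counts["pos"]:
        return "negative"
    return "neutral"


counts = summarise_document("I loved my dog. I loved my cat, but I hated the mess.")
label = label_document(counts)
```

In a real system the counts would be fed to a classifier trained off-line, and `min_confidence` would be one of the parameters tuned to balance precision and recall.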

11 of 30

Motivation > Demo > Technical (pysen)

"I loved my dog"

That's a phrase

"I loved my dog, but I hated how it ate the carpet."

That's a sentence

This is a document:

Opinion: Just because it's in the past, doesn't mean it doesn't hurt

A. Doglover

I loved my dog, but I hated how it ate the carpet. Every morning I came downstairs to witness the devastation it had inflicted on my pride and joy...

12 of 30

Motivation > Demo > Technical (pysen)

Very few datasets available for testing

Everything's focussed on Twitter and user reactions

Getting humans to agree on the content of a news article is difficult

MPQA corpus is not primarily designed for sentiment analysis

FIRST project is developing a corpus of financial news blogs, but it's not available

Tried to create a realistic training corpus, didn't succeed

In the end, opted to use Bo Pang / Lillian Lee's movie review dataset for training

Above the phrase-level, pysen looks at the relationship between sentiment objects

Movie review / sentence polarity datasets are interesting because they are:

A mix of subjective and objective information

Quite large

Available

Widely used

13 of 30

Motivation > Demo > Technical (pysen)

Theoretical performance (self evaluation)

Feature   | Accuracy | Precision (positive) | Precision (negative) | Recall (positive) | Recall (negative)
Phrases   | 70%      | 0.69                 | 0.70                 | 0.30              | 0.56
Sentences | 70%      | 0.68                 | 0.73                 | 0.19              | 0.07
Documents | 67%      | 0.70                 | 0.63                 | 0.703             | 0.639

Theoretical performance (10-fold cross evaluation)

Feature   | Accuracy | Precision (positive) | Precision (negative) | Recall (positive) | Recall (negative)
Sentences | 58%      | 0.56                 | 0.59                 | 0.12              | 0.08
Documents | 57%      | 0.57                 | 0.57                 | 0.59              | 0.55
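For readers unfamiliar with the column headings, the sketch below shows the standard definitions of accuracy, per-class precision and per-class recall, plus one simple way to produce 10 folds. It illustrates the metrics in general and makes no claim about how the figures above were computed; the function names and the `'pos'`/`'neg'` label strings are assumptions for the example.

```python
def evaluate(gold, predicted):
    """Accuracy plus per-class precision and recall for 'pos'/'neg' labels."""
    pairs = list(zip(gold, predicted))
    result = {"accuracy": sum(g == p for g, p in pairs) / len(pairs)}
    for label in ("pos", "neg"):
        true_pos = sum(g == label and p == label for g, p in pairs)
        n_predicted = sum(p == label for _, p in pairs)
        n_gold = sum(g == label for g, _ in pairs)
        # Precision: of everything labelled `label`, how much was right?
        result["precision_" + label] = true_pos / n_predicted if n_predicted else 0.0
        # Recall: of everything that truly is `label`, how much was found?
        result["recall_" + label] = true_pos / n_gold if n_gold else 0.0
    return result


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


scores = evaluate(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
```

In cross-evaluation, each fold's test items are scored by a model trained only on that fold's training items, and the per-fold scores are then averaged.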

14 of 30

Motivation > Demo > Technical > Performance

Issue | Resolution
What do you compare? | Assume the body of opinion on the web influences people's decisions; the volume of positive/negative features published on a given day affects an index
Missing information for some dates | Assume these values are the same as those in prior days
Missing date ranges | Post-processing to insert missing dates
Insufficient data | Use the CommonCrawl URL index to trawl sites which can provide it
Biased/unreliable authors | Use a range of sites to provide the broadest feature set possible
Irrelevant content | If necessary, restrict output to business-focussed sites
Sites anticipate business changes | Apply date adjustment (lag) to the sentiment data to account for this
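Two of the resolutions above (carrying prior values forward over missing dates, and applying a lag) are simple enough to sketch. The code below is a minimal illustration under the assumption that a daily series is stored as a `{date: value}` dictionary; the function names are hypothetical and not part of Sentimentron.

```python
from datetime import date, timedelta


def fill_missing_dates(series):
    """Return a gap-free {date: value} dict, carrying the last seen value forward."""
    if not series:
        return {}
    filled = {}
    day, last_day = min(series), max(series)
    value = series[day]
    while day <= last_day:
        value = series.get(day, value)  # reuse the prior day's value if absent
        filled[day] = value
        day += timedelta(days=1)
    return filled


def apply_lag(series, lag_days):
    """Shift every observation forward in time by lag_days."""
    shift = timedelta(days=lag_days)
    return {day + shift: value for day, value in series.items()}


sentiment = {date(2013, 1, 1): 3, date(2013, 1, 4): 5}
daily = fill_missing_dates(sentiment)   # 1 Jan to 4 Jan, no gaps
lagged = apply_lag(daily, 2)            # the same values, two days later
```

Filling before lagging keeps the two series aligned day by day, which matters once they are compared against a daily index.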

15 of 30

Motivation > Demo > Technical > Performance (Daring Fireball)

[Chart: web sentiment and daily change in stock price]

16 of 30

Motivation > Demo > Technical > Performance (Daring Fireball)

Trend's much clearer with a 10-day moving average

[Chart: web sentiment and daily change in stock price, 10-day moving average]

17 of 30

Motivation > Demo > Technical > Performance (Daring Fireball)

Trend's clearer still with a 30-day moving average

[Chart: web sentiment and daily change in stock price, 30-day moving average]

18 of 30

Motivation > Demo > Technical > Performance (Daring Fireball)

Can reach a correlation of ρ=0.7 by tuning the lag and smoothing
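A minimal sketch of how smoothing and lag enter a correlation figure like this one. Pearson's coefficient is assumed here, since the slides do not say which correlation measure was used, and the function names and list-based series are illustrative only.

```python
from math import sqrt


def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2:
        return 0.0  # not enough points to correlate
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant series; treated as no correlation
    return cov / sqrt(var_x * var_y)


def lagged_correlation(sentiment, prices, lag, window):
    """Correlate smoothed sentiment on day t with smoothed prices on day t + lag."""
    s = moving_average(sentiment, window)
    p = moving_average(prices, window)
    if lag > 0:
        s, p = s[:-lag], p[lag:]
    return pearson(s, p)


rho = lagged_correlation([1, 2, 3, 4, 5], [0, 1, 2, 3, 4], lag=1, window=1)
```

Trying many `lag` and `window` values and reporting only the best one tends to inflate the result, so a figure found this way is better treated as a hint than as a measurement.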

19 of 30

Motivation > Demo > Technical > Performance (Daring Fireball)

Be wary of coincidence!

Correlation between things isn't proof - it's an indication

Apple Inc.'s stock price consistently grew between 2007 and 2012, meaning the daily change in stock price is generally positive

This particular blog generally posts items which support Apple

"Someone said to me the other day that I'd picked a good time to start writing about Apple professionally, [...] I don't take any credit for it. [...] Given how successful Apple's been since 2002, I think it's unsurprising that if you look at everything I've written about them and say "what is positive, what is negative?", that most of it is, yes, overwhelmingly positive."

-- John Gruber [Author of Daring Fireball]

This example demonstrates the pitfalls of assessing performance

Does demonstrate that, in practice, pysen is pessimistic

20 of 30

Motivation > Demo > Technical > Performance (MSFT)

Phrases, Daily change in Microsoft stock price between 22/Jan/2007 - 18/Aug/2008

21 of 30

Motivation > Demo > Technical > Performance (GOOG)

Phrases, Daily change in Google stock price between 7/Mar/2007 - 18/Jul/2008

22 of 30

Motivation > Demo > Technical > Performance (Obama)

Smoothing (days) | CA   | FL    | GA   | IL   | MI   | NC    | NJ    | NY    | OH   | PA   | TX    | VA
0                | 0.00 | 0.04  | 0.05 | 0.02 | 0.07 | -0.02 | -0.02 | 0.01  | 0.01 | 0.04 | -0.02 | 0.07
5                | 0.06 | 0.07  | 0.10 | 0.05 | 0.07 | -0.04 | -0.01 | -0.04 | 0.07 | 0.01 | -0.03 | 0.14
10               | 0.11 | 0.03  | 0.13 | 0.05 | 0.12 | -0.05 | 0.00  | -0.02 | 0.05 | 0.02 | -0.08 | 0.15
15               | 0.17 | -0.01 | 0.16 | 0.08 | 0.12 | -0.05 | 0.02  | 0.03  | 0.04 | 0.03 | -0.14 | 0.17
20               | 0.19 | 0.02  | 0.10 | 0.06 | 0.12 | -0.06 | 0.03  | 0.08  | 0.06 | 0.07 | -0.16 | 0.15
25               | 0.24 | 0.07  | 0.07 | 0.06 | 0.09 | -0.06 | 0.05  | 0.14  | 0.09 | 0.14 | -0.23 | 0.12
30               | 0.28 | 0.07  | 0.05 | 0.07 | 0.04 | -0.05 | 0.06  | 0.09  | 0.09 | 0.17 | -0.37 | 0.05

North Carolina had a particularly close result, inconclusive polling data...

But Ohio and New Jersey weren't particularly close

23 of 30

Motivation > Demo > Technical > Performance (McCain)

Smoothing (days) | CA    | FL    | GA    | IL    | MI    | NC    | NJ    | NY    | OH    | PA    | TX    | VA
0                | 0.06  | -0.05 | -0.01 | -0.03 | -0.06 | 0.00  | -0.06 | 0.04  | -0.05 | -0.05 | -0.03 | -0.02
5                | 0.13  | -0.09 | 0.05  | -0.04 | -0.10 | -0.06 | -0.14 | 0.07  | -0.07 | -0.12 | -0.05 | -0.03
10               | 0.10  | -0.12 | 0.08  | -0.08 | -0.13 | -0.04 | -0.16 | 0.09  | -0.02 | -0.14 | -0.08 | -0.04
15               | 0.07  | -0.14 | 0.11  | -0.13 | -0.12 | -0.04 | -0.32 | 0.07  | 0.03  | -0.16 | -0.07 | -0.07
20               | 0.03  | -0.18 | 0.15  | -0.13 | -0.06 | -0.02 | -0.41 | 0.00  | 0.03  | -0.15 | -0.10 | -0.05
25               | -0.02 | -0.21 | 0.17  | -0.15 | -0.02 | -0.03 | -0.46 | -0.07 | 0.01  | -0.20 | -0.03 | -0.02
30               | -0.08 | -0.20 | 0.19  | -0.19 | 0.04  | -0.04 | -0.45 | -0.04 | 0.01  | -0.23 | 0.09  | 0.03

Interesting to note that Texas, where McCain comfortably won, correlates poorly

24 of 30

Motivation > Demo > Technical > Performance

Phrases seem to do better than any other feature

More information available

Theoretical performance is reasonable, given pysen's purpose

Designed to fare well on datasets it hasn't been trained on

Real-world performance is tricky to evaluate

Internet sources are biased, have to take this into account

Seems to be able to detect longer-term trends

Need more data!

25 of 30

Motivation > Demo > Technical > Performance > Project Management

Milestone | Plan | Reality
Article storage | 17/10/2012 | 08/02/2013, with improvements since (114 days late)
RSS/Atom Crawling | 24/10/2012 | Dropped
HTML Crawling | 30/10/2012 | 18/11/2012 (25 days late)
HTML parsing and combination | 26/11/2012 | Dropped
Keyword extraction | 05/12/2012 | 03/02/2013, with improvements since (60 days late)
Sentiment analysis | 07/01/2013 | 25/01/2013 (18 days late)
Web Interface - Site Framework | 14/01/2013 | 20/11/2012 (55 days early)
Article Retrieval | 17/01/2013 | 24/02/2013 (7 days late)
Web Interface - Site Details | 02/02/2013 | 18/02/2013 (16 days late)
Web Interface - Site Coverage | 11/02/2013 | 18/02/2013 (7 days late)
Article Similarity | 18/02/2013 | Dropped
Web Interface - Article Details | 20/02/2013 | 17/02/2013 (3 days early)
Web Interface - Article Lists | 20/02/2013 | Dropped

26 of 30

Motivation > Demo > Technical > Performance > Project Management

Plans change

HTML/RSS/Atom crawling dropped in favour of the CommonCrawl

The number of articles means computing similarity measures would be very hard

Lists of articles replaced with contextual user interface to reduce bandwidth

Lots of things missing from the project specification (but included in the progress report)

Headline extraction

Content extraction

Date extraction (!)

API

Budget

Sentiment analysis took a long time to get right

Specification was vague about what information would be shown

Didn't quite achieve 80% accuracy on Pang and Lee's dataset

Despite this, project did finish on time and didn't cost too much

27 of 30

Motivation > Demo > Technical > Performance > Project Management > Future

Better approach to sentiment analysis

Incorporate user feedback; Train on other datasets

Potentially open-source the sentiment analyser; seek improvements from other applications

Smarter keyword extraction and storage

Extracts lots of keywords which aren't needed

Handling keywords causes database contention

Better architecture

CommonCrawl Foundation announced a URL index in January

Now possible to selectively process individual sites

Precomputing data is very expensive, can't change the sentiment analyser once in production

Better use of links

Could be useful to study how influences towards a topic spread through websites

Interesting uses of the API

Doesn't have to communicate with just the website

Alternate applications (e.g. Reddit client) in development

Query engine improvements

Formally-defined query grammar

Better performance

28 of 30

Motivation > Demo > Technical > Performance > Project Management > Future

Content extraction

Using shallow text features at present

Could combine this with document structure analysis across an entire site to achieve better results

Under-explored in the literature

Traceability

Essential for users to know

Why an article appears in the results

How a particular result was achieved

Mitigated by better visualisation:

Visual demonstration of pysen's process

Presentation of probability information

Archive.org links are not quite the same

29 of 30

Motivation > Demo > Technical > Performance > Project Management > Future > Conclusion

Sentimentron is arguably something completely new!

Unique test bed for sentiment analysis technology

Makes use of new web technologies (HTML5, Canvas, AJAX)

Lots of opportunities for extension and study

Vast link database

Keyword indexing

Document retrieval

Better sentiment analyser

Construction of distributed systems

Semantic web

Five interesting end products

Sentimentron website: exercise in visualisation

API: useful for alternative apps

pysen: extensible and reusable sentiment analyser

pydate: probabilistic date extraction

Data set: really big collection of real-world sentiment data

Thanks for listening, any questions?

30 of 30

References / sources

"79% of humans disagree"

Ogneva, M. http://mashable.com/2010/04/19/sentiment-analysis/, retrieved 2012-12-13.

Esuli, Andrea, and Fabrizio Sebastiani. "SentiWordNet: A publicly available lexical resource for opinion mining." Proceedings of LREC, 2006, pp. 417–422.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002.

Pang, Bo, and Lillian Lee. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2004.

McDonald, Ryan, et al. "Structured models for fine-to-coarse sentiment analysis." Annual Meeting-Association For Computational Linguistics. Vol. 45. No. 1. 2007.

Presidential poll data

http://electoral-vote.com/evp2013/Info/2008-pres-polls.csv