An automated tool for detecting bias in Internet-based hypermedia
Richard Townsend (1002335)
Is this site's coverage proportionate, fair and balanced?
Can I trust this site on issue x?
What has this site said on this topic?
Is there anything I should read?
Should I approach this site about working with my business?
Has this site shifted position on this issue?
Is this site editorially independent?
What are people saying about my product?
Has this site told the truth?
Where does this site get its news from?
Does this writer like Obama?
What does this site talk about?
What's the Washington Post written about waste water treatment?
How does this site's coverage compare with others?
Does this author like the iPhone?
Will this news have an impact on my shares?
Is this story genuine?
Motivation
Motivation > Demo
Enter a free-form query, like
Apple bloomberg.com
Sentimentron fetches every matching document, analyses each for sentiment indicators, and displays the results on a time-series graph
McCain
Obama...
Orange Juice
Results often match up with real-world sentiment indicators (stock prices, election polls)
Motivation > Demo
Key stats
Visualisation available at the document, phrase and sentence levels
Time series chart shows the number of each feature detected over time, allowing trend analysis
Context available by clicking on each data point to see the summarized articles
Subdomain breakdown
5 936 464 keywords
85 GB data
39 361 detected domains
2 111 101 HTML documents
951 360 dates
1 468 915 documents analysed
3 063 762 extracted links
Motivation > Demo > Technical
Motivation > Demo > Technical (pysen)
A couple of different ways of doing sentiment analysis (one is sketched at the end of this slide):
Probabilistic (e.g. Naive Bayes)
Pattern recognition (e.g. support vector machines)
Unsupervised lexicons (e.g. SentiWordNet)
Can use a variety of features to classify documents:
Words, word pairs, words+POS tags...
These perform well, but have a tendency to overfit
Structure of a document/sentence is lost
50% accuracy is very bad, >70% accuracy is quite good
Human annotators apparently only agree about 79% of the time
More advanced variants involve discourse analysis
"Who holds an opinion about what?"
Few data sets exist, probably overkill
Lots of commercial interest in this sector
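To make the bag-of-words idea above concrete, here is a minimal sketch of the probabilistic (Naive Bayes) approach using scikit-learn. The training documents and labels are invented; this is illustration only, not pysen's code.
```python
# Minimal sketch of a probabilistic (Naive Bayes) sentiment classifier over
# bag-of-words features, using scikit-learn. The documents and labels are
# invented; this is illustration only, not pysen's implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "I loved my dog",
    "I hated how it ate the carpet",
    "a wonderful, moving film",
    "a tedious and forgettable film",
]
train_labels = ["pos", "neg", "pos", "neg"]

# CountVectorizer turns each document into word counts; ngram_range=(1, 2)
# adds word pairs to the feature set.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["I loved the film"]))   # expected: ['pos']
```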
Motivation > Demo > Technical (pysen)
LingPipe
Sentiment analysis built-in
Punitive licensing restrictions
AlchemyAPI
Commercial, third-party service
Too expensive to use at the required scale
Rate_Sentiment
Perl
No evidence for how well it works
SAS Text Analytics
Commercial product used in financial/CRM applications
Viralheat
Restricted to 360 characters
Requires developer license
Other
Very few open-source projects
Motivation > Demo > Technical (pysen)
Sentiment analysis component is called Pysen
Python library, integrated into backend
Pysen is fast, scalable and as accurate as possible
Looks at sentiment in three ways: phrases, sentences, and documents
Phrases are unambiguous, scored with bag-of-words
Phrase contents compared to a database, probability of correctness determined
Parameters chosen off-line to balance precision and recall
Sentences are structured lists of unambiguous phrases
Matched against a database of pre-trained sentences
Documents are statistical summaries of phrase/sentence data
Number of positive and negative sentences and phrases fed into a decision tree classifier to determine overall label
Sentences and phrases which are likely to be incorrect are removed before each classification stage
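A minimal sketch of the document-level stage described above, assuming the summary features are simple counts of positive/negative phrases and sentences. The counts, feature layout and classifier settings are invented for illustration; pysen's real interface is not shown.
```python
# Sketch of the document-level stage: summary counts of positive/negative
# phrases and sentences feed a decision tree that assigns the overall label.
# The counts below are invented; this is not pysen's actual code.
from sklearn.tree import DecisionTreeClassifier

# One row per training document:
# [positive phrases, negative phrases, positive sentences, negative sentences]
X_train = [
    [12, 2, 5, 1],   # mostly positive indicators
    [1, 9, 0, 4],    # mostly negative indicators
    [6, 5, 2, 2],    # mixed, leaning positive
    [2, 8, 1, 3],    # mixed, leaning negative
]
y_train = ["pos", "neg", "pos", "neg"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify a new document from its phrase/sentence summary.
print(clf.predict([[7, 3, 3, 1]]))   # likely ['pos']
```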
Motivation > Demo > Technical (pysen)
"I loved my dog"
That's a phrase
"I loved my dog, but I hated how it ate the carpet."
That's a sentence
This is a document:
Opinion: Just because it's in the past, doesn't mean it doesn't hurt
A. Doglover
I loved my dog, but I hated how it ate the carpet. Every morning I came downstairs to witness the devastation it had inflicted on my pride and joy...
Motivation > Demo > Technical (pysen)
Very few datasets available for testing
Everything's focussed on Twitter and user reactions
Getting humans to agree on the content of a news article is difficult
The MPQA corpus is not primarily designed for sentiment analysis
FIRST project is developing a corpus of financial news blogs, but it's not available
Tried to create a realistic training corpus, didn't succeed
In the end, opted to use Bo Pang / Lillian Lee's movie review dataset for training
Above the phrase-level, pysen looks at the relationship between sentiment objects
Movie review / sentence polarity datasets are interesting because they:
Combine subjective and objective information
Are quite large
Are available
Are widely used
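For reference, Pang and Lee's movie review (polarity) dataset is also distributed with NLTK as the movie_reviews corpus; a minimal loading sketch (not pysen's training code) follows.
```python
# Minimal sketch of loading Pang and Lee's movie review (polarity) dataset
# via NLTK's movie_reviews corpus. For reference only; not pysen's code.
import nltk
from nltk.corpus import movie_reviews

nltk.download("movie_reviews")          # fetches the corpus on first use

# (document text, label) pairs; the corpus is split into 'pos' and 'neg'.
docs = [(movie_reviews.raw(fid), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]

print(len(docs))                        # 2000 reviews, 1000 per class
```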
Motivation > Demo > Technical (pysen)
Theoretical performance (self evaluation)
Feature | Accuracy | Precision (positive) | Precision (negative) | Recall (positive) | Recall (negative) |
Phrases | 70% | 0.69 | 0.70 | 0.30 | 0.56 |
Sentences | 70% | 0.68 | 0.73 | 0.19 | 0.07 |
Documents | 67% | 0.70 | 0.63 | 0.70 | 0.64 |
Theoretical performance (10-fold cross evaluation)
Feature | Accuracy | Precision (positive) | Precision (negative) | Recall (positive) | Recall (negative) |
Sentences | 58% | 0.56 | 0.59 | 0.12 | 0.08 |
Documents | 57% | 0.57 | 0.57 | 0.59 | 0.55 |
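The 10-fold figures above can in principle be reproduced with scikit-learn's cross-validation helpers. The sketch below uses a stand-in bag-of-words Naive Bayes pipeline over the movie review corpus, so its numbers will not match pysen's.
```python
# Sketch of a 10-fold cross evaluation reporting accuracy and per-class
# precision/recall, as in the second table. The classifier is a stand-in
# bag-of-words Naive Bayes pipeline, not pysen's own models.
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nltk.download("movie_reviews")
texts = [movie_reviews.raw(fid) for fid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]

scoring = {
    "accuracy": "accuracy",
    "precision_pos": make_scorer(precision_score, pos_label="pos"),
    "precision_neg": make_scorer(precision_score, pos_label="neg"),
    "recall_pos": make_scorer(recall_score, pos_label="pos"),
    "recall_neg": make_scorer(recall_score, pos_label="neg"),
}

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_validate(model, texts, labels, cv=10, scoring=scoring)

for name in scoring:
    print(name, round(scores["test_" + name].mean(), 2))
```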
Motivation > Demo > Technical > Performance
Issue | Resolution |
What do you compare? | Assume that the body of opinion on the web influences people's decisions, and that the volume of positive/negative features published on a given day affects an index |
Missing information for some dates | Assume these values are the same as those in prior days |
Missing date ranges | Post-processing to insert missing dates |
Insufficient data | Use the CommonCrawl URL index to trawl sites which can provide it |
Biased/unreliable authors | Use a range of sites to provide the broadest feature set possible |
Irrelevant content | If necessary, restrict output to business-focussed sites |
Sites anticipate business changes | Apply date adjustment (lag) to the sentiment data to account for this |
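A sketch of the missing-date handling and lag adjustment described in the table, using pandas. The series, column names and the 3-day lag are invented for illustration.
```python
# Sketch of the date handling above: reindex the per-day sentiment counts
# over the full date range, assume missing values equal the prior day's,
# and lag the series before comparing it with a market index. The series,
# column names and 3-day lag are invented.
import pandas as pd

daily = pd.DataFrame(
    {"positive": [14, 9, 11], "negative": [3, 7, 2]},
    index=pd.to_datetime(["2008-01-02", "2008-01-03", "2008-01-07"]),
)

# Post-process to insert the missing dates, then carry prior values forward.
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_range).ffill()

# Net sentiment, shifted so day t's sentiment is compared with the index's
# change on day t + 3 (sites are assumed to anticipate business changes).
daily["net"] = daily["positive"] - daily["negative"]
daily["net_lagged"] = daily["net"].shift(3)

print(daily)
```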
Motivation > Demo > Technical > Performance (Daring Fireball)
[Chart: web sentiment vs. daily change in stock price]
Motivation > Demo > Technical > Performance (Daring Fireball)
Trend's much clearer with a 10-day moving average
[Chart: web sentiment vs. daily change in stock price, 10-day moving average]
Motivation > Demo > Technical > Performance (Daring Fireball)
Trend's clearer still with a 30-day moving average
[Chart: web sentiment vs. daily change in stock price, 30-day moving average]
Motivation > Demo > Technical > Performance (Daring Fireball)
A correlation of ρ=0.7 can be obtained by adjusting the lag and smoothing parameters
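The smoothing and lag search behind these figures can be sketched with pandas. The two series below are random stand-ins, so the resulting correlation will not be 0.7; with real per-day sentiment counts and price changes the same loop applies.
```python
# Sketch of the smoothing/lag search: apply a moving average to the daily
# sentiment series, shift it by a candidate lag, and take the Pearson
# correlation with the daily change in stock price. Both series here are
# random stand-ins for the real data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2007-01-01", periods=400, freq="D")
sentiment = pd.Series(rng.normal(size=400), index=dates)
price_change = pd.Series(rng.normal(size=400), index=dates)

best = (0.0, None, None)
for window in (1, 10, 30):           # moving-average widths, in days
    smoothed = sentiment.rolling(window).mean()
    for lag in range(0, 31, 5):      # candidate lags, in days
        rho = smoothed.shift(lag).corr(price_change)
        if abs(rho) > abs(best[0]):
            best = (rho, window, lag)

print("best correlation %.2f at window=%s, lag=%s days" % best)
```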
Motivation > Demo > Technical > Performance (Daring Fireball)
Be wary of coincidence!
Correlation between things isn't proof - it's an indication
Apple Inc.'s stock price grew consistently between 2007 and 2012, meaning the daily change in stock price was generally positive
This particular blog generally posts items which support Apple
"Someone said to me the other day that I'd picked a good time to start writing about Apple professionally, [...] I don't take any credit for it. [...] Given how successful Apple's been since 2002, I think it's unsurprising that if you look at everything I've written about them and say "what is positive, what is negative?", that most of it is, yes, overwhelmingly positive."
-- John Gruber [Author of Daring Fireball]
This example demonstrates the pitfalls of assessing performance
It does demonstrate that, in practice, pysen is pessimistic
Motivation > Demo > Technical > Performance (MSFT)
[Chart: phrases vs. daily change in Microsoft stock price, 22/Jan/2007 to 18/Aug/2008]
Motivation > Demo > Technical > Performance (GOOG)
[Chart: phrases vs. daily change in Google stock price, 7/Mar/2007 to 18/Jul/2008]
Motivation > Demo > Technical > Performance (Obama)
Smoothing (days) | CA | FL | GA | IL | MI | NC | NJ | NY | OH | PA | TX | VA |
0 | 0.00 | 0.04 | 0.05 | 0.02 | 0.07 | -0.02 | -0.02 | 0.01 | 0.01 | 0.04 | -0.02 | 0.07 |
5 | 0.06 | 0.07 | 0.10 | 0.05 | 0.07 | -0.04 | -0.01 | -0.04 | 0.07 | 0.01 | -0.03 | 0.14 |
10 | 0.11 | 0.03 | 0.13 | 0.05 | 0.12 | -0.05 | 0.00 | -0.02 | 0.05 | 0.02 | -0.08 | 0.15 |
15 | 0.17 | -0.01 | 0.16 | 0.08 | 0.12 | -0.05 | 0.02 | 0.03 | 0.04 | 0.03 | -0.14 | 0.17 |
20 | 0.19 | 0.02 | 0.10 | 0.06 | 0.12 | -0.06 | 0.03 | 0.08 | 0.06 | 0.07 | -0.16 | 0.15 |
25 | 0.24 | 0.07 | 0.07 | 0.06 | 0.09 | -0.06 | 0.05 | 0.14 | 0.09 | 0.14 | -0.23 | 0.12 |
30 | 0.28 | 0.07 | 0.05 | 0.07 | 0.04 | -0.05 | 0.06 | 0.09 | 0.09 | 0.17 | -0.37 | 0.05 |
North Carolina had a particularly close result and inconclusive polling data...
But Ohio and New Jersey weren't particularly close
Motivation > Demo > Technical > Performance (McCain)
Smoothing (days) | CA | FL | GA | IL | MI | NC | NJ | NY | OH | PA | TX | VA |
0 | 0.06 | -0.05 | -0.01 | -0.03 | -0.06 | 0.00 | -0.06 | 0.04 | -0.05 | -0.05 | -0.03 | -0.02 |
5 | 0.13 | -0.09 | 0.05 | -0.04 | -0.10 | -0.06 | -0.14 | 0.07 | -0.07 | -0.12 | -0.05 | -0.03 |
10 | 0.10 | -0.12 | 0.08 | -0.08 | -0.13 | -0.04 | -0.16 | 0.09 | -0.02 | -0.14 | -0.08 | -0.04 |
15 | 0.07 | -0.14 | 0.11 | -0.13 | -0.12 | -0.04 | -0.32 | 0.07 | 0.03 | -0.16 | -0.07 | -0.07 |
20 | 0.03 | -0.18 | 0.15 | -0.13 | -0.06 | -0.02 | -0.41 | 0.00 | 0.03 | -0.15 | -0.10 | -0.05 |
25 | -0.02 | -0.21 | 0.17 | -0.15 | -0.02 | -0.03 | -0.46 | -0.07 | 0.01 | -0.20 | -0.03 | -0.02 |
30 | -0.08 | -0.20 | 0.19 | -0.19 | 0.04 | -0.04 | -0.45 | -0.04 | 0.01 | -0.23 | 0.09 | 0.03 |
Interesting to note that Texas, where McCain comfortably won, correlates poorly
Motivation > Demo > Technical > Performance
Phrases seem to do better than any other feature
More information available
Theoretical performance is reasonable, given pysen's purpose
Designed to fare well on datasets it hasn't been trained on
Real-world performance is tricky to evaluate
Internet sources are biased, have to take this into account
Seems to be able to detect longer-term trends
Need more data!
Motivation > Demo > Technical > Performance > Project Management
Milestone | Plan | Reality |
Article storage | 17/10/2012 | 08/02/2013, with improvements since (114 days late)
RSS/Atom Crawling | 24/10/2012 | Dropped |
HTML Crawling | 30/10/2012 | 18/11/2012 (25 days late) |
HTML parsing and combination | 26/11/2012 | Dropped |
Keyword extraction | 05/12/2012 | 03/02/2013, with improvements since (60 days late) |
Sentiment analysis | 07/01/2013 | 25/01/2013 (18 days late) |
Web Interface - Site Framework | 14/01/2013 | 20/11/2012 (55 days early) |
Article Retrieval | 17/01/2013 | 24/02/2013 (7 days late) |
Web Interface - Site Details | 02/02/2013 | 18/02/2013 (16 days late) |
Web Interface - Site Coverage | 11/02/2013 | 18/02/2013 (7 days late) |
Article Similarity | 18/02/2013 | Dropped |
Web Interface - Article Details | 20/02/2013 | 17/02/2013 (3 days early) |
Web Interface - Article Lists | 20/02/2013 | Dropped |
Motivation > Demo > Technical > Performance > Project Management
Plans change
HTML/RSS/Atom crawling dropped in favour of the CommonCrawl
The number of articles means computing similarity measures would be very hard
Lists of articles replaced with contextual user interface to reduce bandwidth
Lots of things missing from the project specification (but included in the progress report)
Headline extraction
Content extraction
Date extraction (!)
API
Budget
Sentiment analysis took a long time to get right
Specification was vague about what information would be shown
Didn't quite achieve 80% accuracy on Pang and Lee's dataset
Despite this, the project finished on time and didn't cost too much
Motivation > Demo > Technical > Performance > Project Management > Future
Better approach to sentiment analysis
Incorporate user feedback; Train on other datasets
Potentially open-source the sentiment analyser; seek improvements from other applications
Smarter keyword extraction and storage
Extracts lots of keywords which aren't needed
Handling keywords causes database contention
Better architecture
CommonCrawl Foundation announced a URL index in January
Now possible to selectively process individual sites
Precomputing data is very expensive; the sentiment analyser can't be changed once in production
Better use of links
Could be useful to study how influence on a topic spreads across websites
Interesting uses of the API
Doesn't have to communicate with just the website
Alternate applications (e.g. Reddit client) in development
Query engine improvements
Formally-defined query grammar
Better performance
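One possible shape for such a grammar, handling queries like the "Apple bloomberg.com" example from the demo: tokens that look like domains become site restrictions, everything else is a keyword. This is a sketch of a possible direction, not Sentimentron's actual query syntax.
```python
# Sketch of a small, formally-defined query grammar for free-form queries
# such as "Apple bloomberg.com": tokens that look like domains become site
# restrictions, everything else is a keyword. A possible direction only,
# not the query syntax Sentimentron actually implements.
import re
from typing import NamedTuple

DOMAIN = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$", re.IGNORECASE)

class Query(NamedTuple):
    keywords: tuple
    sites: tuple

def parse_query(text: str) -> Query:
    keywords, sites = [], []
    for token in text.split():
        (sites if DOMAIN.match(token) else keywords).append(token)
    return Query(tuple(keywords), tuple(sites))

print(parse_query("Apple bloomberg.com"))
# -> Query(keywords=('Apple',), sites=('bloomberg.com',))
```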
Motivation > Demo > Technical > Performance > Project Management > Future
Content extraction
Using shallow text features at present (a rough sketch follows at the end of this slide)
Could combine this with document structure analysis across an entire site to achieve better results
Under-explored in the literature
Traceability
Essential for users to know
Why an article appears in the results
How a particular result was achieved
Mitigated by better visualisation:
Visual demonstration of pysen's process
Presentation of probability information
Archive.org links are not quite the same
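As flagged under "Content extraction" above, a rough sketch of a shallow-text-feature heuristic: score each block by how much of its text sits inside links and keep the low link-density blocks. The thresholds and helper function are hypothetical, not the extractor the project actually uses.
```python
# Rough sketch of a shallow-text-feature heuristic for content extraction:
# keep blocks with enough text and a low proportion of link text. The
# thresholds are hypothetical and nested blocks are scored independently;
# this is not the extractor the project actually uses.
from bs4 import BeautifulSoup

def content_blocks(html, max_link_density=0.3, min_chars=80):
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for block in soup.find_all(["p", "div"]):
        text = block.get_text(" ", strip=True)
        if len(text) < min_chars:
            continue                      # too short to be body content
        link_text = sum(len(a.get_text(" ", strip=True))
                        for a in block.find_all("a"))
        if link_text / len(text) <= max_link_density:
            kept.append(text)             # low link density: likely content
    return kept

page = ("<div><p><a href='/'>Home</a> | <a href='/news'>News</a></p>"
        "<p>" + "Genuine article text about the topic. " * 5 + "</p></div>")
print(content_blocks(page))               # short navigation block is not kept
```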
Motivation > Demo > Technical > Performance > Project Management > Future > Conclusion
Sentimentron is arguably something completely new!
Unique test bed for sentiment analysis technology
Makes use of new web technologies (HTML5, Canvas, AJAX)
Lots of opportunities for extension and study
Vast link database
Keyword indexing
Document retrieval
Better sentiment analyser
Construction of distributed systems
Semantic web
Five interesting end products
Sentimentron website: exercise in visualisation
API: useful for alternative apps
pysen: extensible and reusable sentiment analyser
pydate: probabilistic date extraction
Data set: really big collection of real-world sentiment data
Thanks for listening, any questions?
References / sources
"79% of humans disagree"
Ogneva, M,http://mashable.com/2010/04/19/sentiment-analysis/, retrieved 2012-12-13.
Esuli, Andrea, and Fabrizio Sebastiani. "SentiWordNet: A publicly available lexical resource for opinion mining." Proceedings of LREC, 2006, pp. 417–422.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002.
Pang, Bo, and Lillian Lee. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2004.
McDonald, Ryan, et al. "Structured models for fine-to-coarse sentiment analysis." Annual Meeting-Association For Computational Linguistics. Vol. 45. No. 1. 2007.
Presidential poll data
http://electoral-vote.com/evp2013/Info/2008-pres-polls.csv