Text Mining Projects
Fall 2019
Professionalism is important in public presentations, so please use the “would I be happy for my parents to read this in the newspaper” test when uploading content. Humor is great; abusive language or disparaging groups of people is firmly not acceptable.
Example slide: Title
Explanation
Format your slide however makes the most sense for your project; this is a suggestion.
Results (graphics)
Using background images or fonts is great, but please don’t change the presentation theme, as this changes it for everyone.
Your name
Plotting Baseball Fan Sentiments Over the Season
What can we learn about how baseball fans are feeling from their Reddit comments?
Using NLTK sentiment analysis, the answer remains unclear without further data processing.
Max Dietrich
Tracking Subreddit Sentiments Over Time
Can we see changes in the tone of a community over time? Can we find out what caused those changes? By Eamon O’Brien
Comparing pinkbike.com comment sentiment to reaction rate - Shawn
10/17/19:
Shared Words with the Olin College Wikipedia
Do Wikipedia pages for related subjects have more shared words than unrelated subjects?
*According to my very limited data set, and dependent on which words you consider relevant
Katie Thai-Tang
Comparing word frequency of Wikipedia articles
Short Answer: Yes!*
How Use of Pronouns in Media changes over Time
Data was taken from ~50 random(ish) books from Project Gutenberg and used to compare the difference in frequency of use of different pronouns and how they changed over time.
It is easy to tell that masculine pronouns are far more prevalent, but there seem to be no trends over time.
This could be a result of a lack of samples, a lack of more modern books where the change would be more noticeable, or the randomness in the selection of books.
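A minimal sketch of how the pronoun comparison could be computed for one book. The pronoun groups and the `pronoun_rates` helper are assumptions for illustration; the slide does not list the exact words or code that were used.

```python
import re
from collections import Counter

# Hypothetical pronoun groups; the slide does not say which words were counted.
PRONOUN_GROUPS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
}


def pronoun_rates(text):
    """Return each pronoun group's share of all word tokens in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return {group: 0.0 for group in PRONOUN_GROUPS}
    return {
        group: sum(counts[p] for p in pronouns) / total
        for group, pronouns in PRONOUN_GROUPS.items()
    }


sample = "He said that she took his book, and he thanked her."
rates = pronoun_rates(sample)
print(rates)
```

Running this once per book and pairing each result with the book's publication year would give the points for a frequency-over-time plot.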
Results (graphics)
Cameron “Cali” Wierzbanowski
Predicting the Outcome of a Ranked Game of
Florian Michael-Schwarzinger
50.33 %
49.67 %
Is it possible to accurately predict the outcome of a match by only looking at the champions in the match and what side of the map each team’s base is on?
Map Side Win Rate
Champion Win Rate
43.95 %
52.38 %
47.88 %
52.64 %
49.15 %
48.09 %
50.44 %
55.56 %
48.18 %
44.03 %
Answer: YES! (with a 50.44 % success rate)
(so no, not really)
Food Recommender 9000
Build an application that recommends ten similar foods given one food.
Technologies used:
Sample Output
Live presentation line
Add your slide BEFORE this one if you’d like to share your work with the class in a ~one minute presentation.
Add your slide AFTER this one if you’d rather not present (people can just read your slide instead).
The 2020 Presidential Election Candidates’ Tweets
It is already hard enough to keep up with this upcoming presidential election. To top it off, there are a lot of presidential candidates. Thankfully, Twitter makes it a lot easier! Since all of the presidential candidates (including the president) are very active on this platform, we can use scikit-learn to look at their past ~3200 tweets and see which words are most associated with each of them and with the election as a whole!
As a quick overview, the words that the presidential candidates are associated with are to the right. The size of each word corresponds to the number of candidates associated with it.
Navi Boyalakuntla
Spells and Themes - Harry Potter
Analysing the changing frequency of spells in Harry Potter by book.
Overall, spell usage goes up as the books go along. The peaks for most spells, however, are split pretty evenly among the third, fourth, and fifth books. This makes sense: the third book was the only one with a competent Defense Against the Dark Arts teacher, the fourth had a competition that required Harry to learn a lot of new magic, and in the fifth, Harry and friends formed Dumbledore's Army and taught others a lot of newer spells.
Maya Al-Ahmad
Generating New Podcast Transcripts
Overview of goal: Using archived transcript data, generate a new Critical Role podcast transcript using Markov text synthesis.
Areas of refinement: creating more parameters to scrub generated sentences and reduce formatting errors, and limiting sentences so they make more grammatical sense
Potential uses: generating a transcript for any TV show, podcast, or movie
Limitations: cannot take into consideration factors that aren't directly within the parameters set
Generated text examples:
LAURA:Your grandma? When her foot. You see now, the meme, when Orthax and unnecessary." What's going to.
LIAM: Pterodactyl. That was bet. Ships on the extent of dragon just a death sentence. Okay, you.
TALIESIN: Attack. uh-huh! Yep! There are just being like, "Oh well, but they carved into it might have a few.
TRAVIS: Tower, above the guards: Oi! What's going to get prepared to me, showing up in.
LAURA:(loudly laughs) You glide over, can cut. Can’t jump up with a long foot.
By: Christian Quijano
Markov the Inventor
Goal: Use Markov text synthesis on thousands of patents to generate made up inventions
Database: US Patent and Trademark Office bulk data storage system
Analysis: Markov text synthesis
Results: Very fun, but not very useful. This was an interesting exercise in sentence structure and included fascinating vocabulary usage, but has not invented the next big thing... yet!
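For readers new to the technique, here is a minimal sketch of first-order Markov text synthesis. The function names and the tiny corpus are made up for illustration; this is not the project's actual code, and a real run would use patent text instead.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=20, rng=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = rng or random.Random()
    word = start
    output = [word]
    # Stop at the length limit or when the current word has no known follower.
    while len(output) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


# Made-up stand-in corpus.
corpus = "a device for holding a cup and a method for holding a device"
chain = build_chain(corpus)
print(generate(chain, "a", max_words=8, rng=random.Random(0)))
```

Because each word depends only on the one before it, the output is locally plausible but has no long-range structure, which matches the "fun, but not very useful" results.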
Nathan Weil
Who is behind Wikipedia articles?
Have you ever been curious whether specific Wikipedia documents are written by the same person or the same group of people?
This project aimed to measure the similarity between a document and all the documents it links to. The cosine similarity between each pair of documents was computed from their word frequencies.
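A minimal sketch of cosine similarity over word frequencies, using only the standard library. The tokenization rule and helper names are assumptions; the project may have cleaned the text or weighted words differently.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = word_counts("Olin College is a college of engineering")
doc_b = word_counts("Engineering students attend the college")
print(cosine_similarity(doc_a, doc_b))
```

A score near 1 means the two documents use words in similar proportions; a score of 0 means they share no words at all.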
Since Wikipedia documents are written by hundreds of people, it is hard to find out whether they are written by the same people. However, the project discovered that specific topics have multiple documents clustered together with high similarities.
SeungU Lyu
Carpe per Diem
Analyzing the frequency of Carpediem mailing list emails per day.
This program finds dates or times by searching for specific patterns, such as “11/15/19” or “Oct 15”, in the Carpediem mailing list archives. It then uses dictionaries to count the number of emails per day.
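A minimal sketch of the pattern search and dictionary count described above. The regular expression, the `count_dates` helper, and the sample lines are assumptions; the real archive format is not shown on the slide.

```python
import re
from collections import defaultdict

# Matches dates like "11/15/19" or "Oct 15". Assumed patterns, based only on
# the two examples given on the slide.
DATE_PATTERN = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{2}\b"
    r"|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}\b"
)


def count_dates(lines):
    """Count how many times each date string appears across the lines."""
    counts = defaultdict(int)
    for line in lines:
        for match in DATE_PATTERN.findall(line):
            counts[match] += 1
    return dict(counts)


# Made-up sample lines standing in for the mailing list archive.
archive = [
    "Sent: 11/15/19 pizza in the dining hall",
    "Sent: Oct 15 movie night",
    "Sent: 11/15/19 board games",
]
print(count_dates(archive))
```

Note that this sketch keys on the raw matched string, so the same day written in two formats is counted separately; normalizing dates first would fix that.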
Future adaptations of this program would include noting the timestamp and date of each email, as well as searching the email text for event descriptions and times/locations described in the email. The goal of this future work would be to output a daily schedule of events including their descriptions.
Gail Romer
Sentiment Analysis of Shakespeare’s Sonnets
Analyzing the overall sentiment of Shakespeare’s sonnets and how the sentiment changes depending on who the sonnet is addressed to.
The majority of Shakespeare’s sonnets have a positive sentiment, with an average compound sentiment of 0.54. The average compound sentiment for the Fair Youth sonnets was 0.60 and for the Dark Lady sonnets it was 0.32, so the Fair Youth sonnets, which were addressed to a young man, were overall more positive than the Dark Lady sonnets, which were addressed to a woman.
Results (graphics)
Kyle Bertram
Example slide: YouTube Communities
Explanation
This project aimed to use analysis of YouTube comments as a metric to measure the health of the communities surrounding various YouTube channels. To measure the health of a community, several aspects of the comments were measured: the volume of comments (a large volume of comments often signifies discussion, which is a hallmark of a healthy community), the likes applied to the comments (this signifies the general agreement of the community, where higher values tend to be correlated with healthier communities), and the positive/negative sentiment of the comments (negative sentiments tend to describe a community with many negative feelings, whereas positive sentiments signify the opposite and tend to occur more frequently in healthy communities).
Results (graphics)
Timothy Novak
How do Different News Sources Treat News?
Short answer: it’s unclear.
In terms of expressed sentiment, it seems that all news sources have an overall neutral tone.
Something that I found interesting from analyzing word frequency was the use of “Trump” (more informal) versus “President” (more formal). The more liberal and the more conservative sites preferred “Trump” to “President,” whereas the ever-judicious NPR favored “President” over “Trump.”
My dataset was pretty limited (only 10 articles per news source), which may have caused the inconclusiveness of my findings, but it seems that there are few trends and patterns that can be observed from this form of analysis of news sources.
Annie Tor
Overall Sentiment of News Providers
Odalys Benitez
Random Text Generator Using Facebook Chat History
Jinfay Justin Yuan
Some generated results:
Cringe:
"'and i just feel like talking'", "'Me n mike'", "'Have been waiting'", "'Since 1:39'", "'Omg'",
???:
"'the cootie levels been flaring'"
The Nerd:
"'Anyways u should check out this one calculation'", "'?'", "'i know!'", "'its insane'", "'ok'", "'so
Spam detector: how well does it work?
This project was intended to find differences in word choice, language, and style between spam and non-spam (ham) SMS messages. Objectives included creating a spam generator and a spam detector.
In analyzing the dataset, I used methods including a word frequency histogram, sentiment analysis, and a text classifier. I also tried to generate some random spam messages without using Markov analysis.
With the text classifier, I was able to create a program that takes an SMS text and classifies it as “spam” or “ham.” It generally works pretty well, with an accuracy rate of more than 94%. Here are some examples to the right:
spam_detector("Welcome to UK-mobile-date this msg is FREE giving you free calling to 08719839835. Future mgs billed at 150p daily. To cancel send go stop to 89123")
“It's a spam!”
spam_detector("I love mini project 3!!!!")
“It's a ham!”
spam_detector("CONGRATULATIONS! You have won a GRAND PRIZE FOR 3 millio dollars!!!!!Log on to www.claimmyprize.com to claim this money!")
“It's a spam!”
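The slide does not say which classifier was used, so the following is only a sketch of one common choice, a naive Bayes classifier with add-one smoothing, written with the standard library. The `train` and `classify` helpers and the four training messages are made up for illustration and are far too small to reach the accuracy reported above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(examples):
    """Build per-label word counts and message counts from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = set().union(*word_counts.values())
    total_msgs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_msgs)
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training set, not the project's SMS dataset.
training = [
    ("win a free prize now", "spam"),
    ("free cash claim your prize", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at the lab tonight", "ham"),
]
word_counts, label_counts = train(training)
print(classify("claim your free prize", word_counts, label_counts))
```

With a real labeled dataset, holding out part of the messages for testing gives an accuracy figure comparable to the one quoted on the slide.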
Vincent Mu
Generic Keyword Lyrics in Popular EDM Songs
Purpose
Findings (215 songs’ lyrics scraped from Genius.com)
Afraz Padamsee