1 of 23

Text Mining Projects

Fall 2019

Professionalism is important in public presentations, so please use the “would I be happy for my parents to read this in the newspaper” test when uploading content. Humor is great; abusive language or disparaging groups of people is firmly not acceptable.

2 of 23

Example slide: Title

Explanation

Format your slide however makes the most sense for your project; this is a suggestion.

Results (graphics)

Using background images or fonts is great, but please don’t change the presentation theme, as this changes it for everyone

Your name

3 of 23

Plotting Baseball Fan Sentiments Over the Season

What can we learn about how baseball fans are feeling from their Reddit comments?

Using NLTK sentiment analysis, the answer remains unclear without further data processing.

Max Dietrich

4 of 23

Tracking Subreddit Sentiments Over Time

Can we see changes in the tone of a community over time? Can we find out what caused those changes? By Eamon O’Brien

5 of 23

Comparing pinkbike.com comment sentiment to reaction rate - Shawn

Plotted comments with more than 30 responses

Data taken from 30 most recent posts

Opportunity for future exploration

See how well the sentiment of the post aligns with the sentiment of the comments
See how trends might vary with other samples

10/17/19:

6 of 23

Shared Words with the Olin College Wikipedia

Do Wikipedia pages for related subjects have more shared words than unrelated subjects?

*according to my very limited data set and dependent on which words you consider relevant

Katie Thai-Tang

Comparing word frequency of Wikipedia articles

Short Answer: Yes!*

7 of 23

How Use of Pronouns in Media changes over Time

Data was taken from ~50 random(ish) books from Project Gutenberg and used to compare the difference in frequency of use of different pronouns and how they changed over time.

It is easy to tell that masculine pronouns are far more prevalent, but there seem to be no trends over time.

This could be a result of a lack of samples, a lack of more modern books where the change would be more noticeable, or the randomness in the selection of books.

Results (graphics)

Cameron “Cali” Wierzbanowski

8 of 23

Predicting the Outcome of a Ranked Game of

Florian Michael-Schwarzinger

50.33 %

49.67 %

Is it possible to accurately predict the outcome of a match by only looking at the champions in the match and what side of the map each team’s base is on?

Map Side Win Rate

Champion Win Rate

43.95 %

52.38 %

47.88 %

52.64 %

49.15 %

48.09 %

50.44 %

55.56 %

48.18 %

44.03 %

Answer:

YES! (with a 50.44 % success rate)

(so no, not really)

9 of 23

Food Recommender 9000

Build an application that recommends ten similar foods given one food.

Technologies used:

Wikipedia API/library
NLTK TFIDF
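As a rough illustration of how a TF-IDF recommender like this might work, here is a minimal pure-Python sketch. It is not the project's code: it skips the Wikipedia download and NLTK entirely, and the names `tfidf_vectors`, `cosine`, and `recommend` are hypothetical. It assumes `docs` maps each food name to a block of descriptive text (for example, an article summary fetched separately).

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a deliberate simplification).
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """docs: dict of name -> text. Returns dict of name -> {term: tf-idf weight}."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())  # each document counts once per term
    vectors = {}
    for name, c in counts.items():
        total = sum(c.values()) or 1
        vectors[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(food, docs, k=10):
    """Return up to k other foods whose text is most similar to `food`."""
    vectors = tfidf_vectors(docs)
    target = vectors[food]
    scored = [(cosine(target, v), name) for name, v in vectors.items() if name != food]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

Calling `recommend("pizza", docs)` would return the ten closest foods in `docs`; words that appear in every document get a weight of zero, so only distinctive vocabulary drives the ranking.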

Sample Output

10 of 23

Live presentation line

Add your slide BEFORE this one if you’d like to share your work with the class in a ~one minute presentation.

Add your slide AFTER this one if you’d rather not present (people can just read your slide instead).

11 of 23

The 2020 Presidential Election Candidates’ Tweets

It is already hard enough to keep up with the upcoming presidential election, and on top of that, there are a lot of presidential candidates. Thankfully, Twitter makes it a lot easier! Since all of the presidential candidates (including the president) are very active on the platform, we can use scikit-learn to look at their past ~3200 tweets and see which words are most associated with each of them and with the election as a whole!

As a quick overview, the words that the presidential candidates are associated with are to the right. The size of each word corresponds to the number of candidates who are associated with it.

Navi Boyalakuntla

12 of 23

Spells and Themes - Harry Potter

Analysing the changing frequency of spells in Harry Potter by book.

Overall spell usage goes up as the books go along. The peak for most spells, however, is split pretty evenly among the third, fourth, and fifth books. This makes sense: the third book was the only one with a competent Defense Against the Dark Arts teacher, the fourth had a competition that required Harry to learn a lot of new magic, and in the fifth, Harry and friends formed Dumbledore's Army and taught others a lot of newer spells.

Maya Al-Ahmad

13 of 23

Generating New Podcast Transcripts

Overview of goal: Using archived transcript data, generate a new Critical Role podcast transcript using Markov text synthesis.

Areas of refinement: adding more parameters to scrub generated sentences of formatting errors, and constraining sentences so they make more grammatical sense

Potential uses: generation of any transcript for a TV show, podcast, or movie

Limitations: cannot take into consideration factors that aren't directly within the parameters set
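For readers unfamiliar with Markov text synthesis, the following is a minimal sketch of the general technique, not the code used for this project. The function names `build_chain` and `generate`, the `order` and `seed` parameters, and the whitespace tokenization are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was only seen at the very end of the text
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)
```

Because each next word depends only on the previous `order` words, the output is locally plausible but has no memory of sentence structure, which is consistent with the limitation noted above. Raising `order` tends to give more grammatical output at the cost of copying longer stretches of the source.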

Generated text examples:

LAURA:Your grandma? When her foot. You see now, the meme, when Orthax and unnecessary." What's going to.

LIAM: Pterodactyl. That was bet. Ships on the extent of dragon just a death sentence. Okay, you.

TALIESIN: Attack. uh-huh! Yep! There are just being like, "Oh well, but they carved into it might have a few.

TRAVIS: Tower, above the guards: Oi! What's going to get prepared to me, showing up in.

LAURA:(loudly laughs) You glide over, can cut. Can’t jump up with a long foot.

By: Christian Quijano

14 of 23

Markov the Inventor

Goal: Use Markov text synthesis on thousands of patents to generate made-up inventions

Database: US Patent and Trademark Office bulk data storage system

Analysis: Markov text synthesis

Results: Very fun, but not very useful. This was an interesting exercise in sentence structure and included fascinating vocabulary usage, but has not invented the next big thing... yet!

"The nose assembly. A centrifuge with a minor amount of suction."
"After the model aircraft is in its active region and the other side of the core member is provided such that the meat chunks of higher fat content of the tube deforms the tubing wall to pivot out of service."

Nathan Weil

15 of 23

Who is behind Wikipedia articles?

Have you ever been curious whether specific Wikipedia documents are written by the same person or the same group of people?

This project aimed to find similarities between a document and all of the documents it links to. Word frequencies were counted for each document, and the cosine similarity between the document and each linked document was computed from those counts.
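A minimal sketch of cosine similarity over raw word counts is shown here for illustration. It assumes the article text has already been downloaded as plain strings; the names `word_counts` and `cosine_similarity` and the simple regex tokenizer are hypothetical, not taken from the project.

```python
import math
import re
from collections import Counter


def word_counts(text):
    # Lowercase the text and count runs of letters/apostrophes as words.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

A score near 1 means the two documents use words in nearly the same proportions; a score of 0 means they share no words at all. Since common words like "the" dominate raw counts, filtering them out first would likely make the scores more informative.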

Since Wikipedia documents are written by hundreds of people, it is hard to find out whether they are written by the same people. However, the project discovered that specific topics have multiple documents clustered together with high similarities.

SeungU Lyu

16 of 23

Carpe per Diem

Analyzing the frequency of Carpediem mailing list emails per day.

This program finds dates or times by searching for the specific patterns, such as “11/15/19” or “Oct 15”, in the Carpediem mailing list archives. It then uses dictionaries to count the number of emails per day.
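The pattern search and dictionary count described above could look roughly like the following sketch. The regular expression and the name `count_dates` are assumptions for illustration; the real archive format may need a different pattern.

```python
import re

# Matches numeric dates like "11/15/19" and month-day forms like "Oct 15".
# This pattern is an illustrative guess, not the one used in the project.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2})\b"
)


def count_dates(text):
    """Return a dictionary mapping each date string found to its number of occurrences."""
    counts = {}
    for match in DATE_PATTERN.finditer(text):
        date = match.group(0)
        counts[date] = counts.get(date, 0) + 1
    return counts
```

Note that this counts date strings exactly as written, so "10/15/19" and "Oct 15" would be tallied separately even if they refer to the same day; normalizing them to a single format would be a natural next step.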

Future adaptations of this program would include noting the timestamp and date of each email, as well as searching the email text for event descriptions and times/locations described in the email. The goal of this future work would be to output a daily schedule of events including their descriptions.

Gail Romer

17 of 23

Sentiment Analysis of Shakespeare’s Sonnets

Analyzing the overall sentiment of Shakespeare’s sonnets and how the sentiment changes depending on who the sonnet is addressed to.

The majority of Shakespeare’s sonnets have a positive sentiment, with an average compound sentiment of 0.54. The average compound sentiment was 0.60 for the Fair Youth sonnets and 0.32 for the Dark Lady sonnets, so the Fair Youth sonnets, which were addressed to a young man, were overall more positive than the Dark Lady sonnets, which were addressed to a woman.

Results (graphics)

Kyle Bertram

18 of 23

YouTube Communities

Explanation

This project aimed to use analysis of YouTube comments as a metric for the health of the communities surrounding various YouTube channels. To measure the health of a community, several aspects of the comments were measured: the volume of comments (a large volume of comments often signifies discussion, which is a hallmark of a healthy community), the likes applied to the comments (this signifies the general agreement of the community, where higher values tend to be correlated with healthier communities), and the positive/negative sentiment of the comments (negative sentiments tend to describe a community with many negative feelings, whereas positive sentiments signify the opposite and tend to occur more frequently in healthy communities).

Results (graphics)

Timothy Novak

19 of 23

How do Different News Sources Treat News?

Short answer: it’s unclear.

In terms of expressed sentiment, it seems that all news sources have an overall neutral tone.

Something that I found interesting from analyzing word frequency was the use of “Trump” (more informal) versus “President” (more formal). The more liberal and the more conservative sites preferred “Trump” to “President,” whereas the ever-judicious NPR favored “President” over “Trump.”

My dataset was pretty limited (only 10 articles per news source), which may have caused the inconclusiveness of my findings, but it seems that there are few trends and patterns that can be observed from this form of analysis of news sources.

Annie Tor

20 of 23

Overall Sentiment of News Providers

Obtained the 200 most recent articles from different news providers
Ran the NLTK sentiment analyzer on each article
Derived compound values and averaged them to find the average sentiment of each news provider
Added standard deviation bars to show how widely each provider's articles ranged between negative and positive
The New York Times had the most positive overall compound value
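The averaging and standard deviation steps above can be sketched as follows. This assumes the per-article compound scores have already been computed by the sentiment analyzer and collected into lists; the function name `summarize` and the input layout are hypothetical.

```python
from statistics import mean, stdev


def summarize(scores_by_provider):
    """Map each provider to (mean compound score, sample standard deviation)."""
    summary = {}
    for provider, scores in scores_by_provider.items():
        # stdev needs at least two values; fall back to 0.0 for a single article.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[provider] = (mean(scores), spread)
    return summary
```

The mean gives the height of each provider's bar, and the standard deviation gives the length of its error bar, so a provider with a neutral mean but a large spread publishes both strongly positive and strongly negative articles.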

Odalys Benitez

21 of 23

Random Text Generator Using Facebook Chat History

Downloaded my entire FB chat history (10 years' worth) and parsed it to find the messages I wrote
Conducted word frequency analysis:

Most common word: “I” used 27573 times
Best friend: “Nalin” at 1067 times
16 different iterations of the word “nice”

Used Markov Analysis of text messages to try and generate sentences → (not so perfect)

Jinfay Justin Yuan

Some generated results:

Cringe:

"'and i just feel like talking'", "'Me n mike'", "'Have been waiting'", "'Since 1:39'", "'Omg'",

???:

"'the cootie levels been flaring'"

The Nerd:

"'Anyways u should check out this one calculation'", "'?'", "'i know!'", "'its insane'", "'ok'", "'so

22 of 23

Spam detector: how well does it work?

This project was intended to find differences in words, language, and style between spam and non-spam (ham) SMS messages. Objectives included creating a spam generator and a spam detector.

In analyzing the dataset, I used methods including a word frequency histogram, sentiment analysis, and a text classifier. I also tried to generate some random spam messages without using Markov analysis.

With the text classifier, I was able to create a program that takes an SMS text and classifies it as “spam” or “ham.” It generally works pretty well, with an accuracy rate of more than 94%. Here are some examples to the right:

spam_detector("Welcome to UK-mobile-date this msg is FREE giving you free calling to 08719839835. Future mgs billed at 150p daily. To cancel send go stop to 89123")

“It's a spam!”

spam_detector("I love mini project 3!!!!")

“It's a ham!”

spam_detector("CONGRATULATIONS! You have won a GRAND PRIZE FOR 3 millio dollars!!!!!Log on to www.claimmyprize.com to claim this money!")

“It's a spam!”
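The slide does not say which kind of text classifier was used. As one common possibility, here is a minimal naive Bayes sketch in pure Python. The names `train` and `classify_sms` are hypothetical, the tokenizer is a simple whitespace split, and the sketch assumes the training data contains at least one message of each label.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(messages):
    """messages: list of (label, text) pairs, with label "spam" or "ham"."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify_sms(text, word_counts, label_counts):
    """Return the label with the higher naive Bayes log-probability."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_msgs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        # Start from the prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_msgs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the probability.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Accuracy would be measured by training on part of a labeled SMS dataset and checking predictions against the held-out remainder.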

Vincent Mu

23 of 23

Generic Keyword Lyrics in Popular EDM Songs

Purpose

To see if popular EDM songs share similar lyrical topics that are relatively generic and widely applicable, with keywords onto which listeners can project themselves while focusing on the track’s instrumental.

Findings (215 songs’ lyrics scraped from Genius.com)

Highest frequency: Pronouns such as “I” and “we”
Most words in the top 100 are meaningless/non-descriptive words such as “on”, “if”, “like”
Most used ‘unique’ words were generic and generally ‘feel-good’: e.g. “love”, “yeah”, “light”
Overall, the top phrases were pretty generic and centered in some way around the ‘self’
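A word frequency ranking like the one behind these findings can be sketched in a few lines. This assumes the lyrics have already been scraped into a list of strings; the name `top_words` and the optional `stopwords` parameter are hypothetical additions for illustration.

```python
import re
from collections import Counter


def top_words(lyrics, n=10, stopwords=frozenset()):
    """Return the n most frequent words across all songs as (word, count) pairs."""
    words = re.findall(r"[a-z']+", " ".join(lyrics).lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)
```

Passing a set such as `{"on", "if", "like"}` as `stopwords` filters out the non-descriptive words, which makes it easier to see the more distinctive vocabulary underneath.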

Afraz Padamsee