Text Mining Projects
Spring 2020
Professionalism is important in public presentations, so please use the “would I be happy for my parents to read this in the newspaper” test when uploading content. Humor is great; abusive language or disparaging groups of people is firmly not acceptable.
Stock Market Sentiment Analyzer / Predictor
My program scrapes stock news subreddits and runs sentiment analysis on the most novel words and phrases to gauge public sentiment on the market and predict average market performance in the near future.
Yehya Albakri
Tinder Bio Generator🔥💖
Adi Ramachandran
What i did :P
Concerns
Generated sample uno
“warning i am 5’8”
**bears. beets. battlestar galactica
if you play baseball... degrade me
looking for someone who will let me steal their sweatshirts
plz stop making harley davidson jokes
*19
hit me w. your best pick up line”
Generated sample dos
“warning i am 5’8”
aspiring hippie
tell me your hottest take
they always ask “what’s your snap?”, but never “what’s you venmo?”
🇺🇸🇵🇪 take me on a date and i’ll teach you everything from how to rubik’s cube
fruity”
A project that may help you lose weight if you are interested.
My project runs sentiment analysis on the summary paragraph of Wikipedia pages about different weight-loss diets.
Walter V
Sentiment Analysis
-Raw Data Analyzer
-Labeled Data Machine Learning
By Harry Liu
CLUSTERING XKCD WEB COMICS
ROAD / CAR THEMED CLUSTER ITEMS
RELATION MAP OF 2277 XKCD COMICS’
TRANSCRIPT, HOVER-TEXT, AND TITLE
BY GATI AHER
ROAD / CAR THEMED CLUSTER MAP
SCP Foundation Markov Text Generation
SCP Foundation is responsible for locating and containing individuals, entities, locations, and objects that violate natural law.
Jonas K
Sentiment Analysis
Democratic Candidate Twitter Analysis
Sam Coleman
Text and Sentiment Analysis on BP
Michelle Zhang
BP’s corporate page: neutral skewed positive
BP’s Wikipedia page: mostly neutral, higher negative
Poetry Markov Generation (aydin o'leary)
thanks poetryfoundation.org for not banning me
Ignoring punctuation for fun and not-profit
Carol Luo
The Happy Prince, by Oscar Wilde - Most Frequent 97 Words List:
('the', 1290), ('and', 698), ('to', 434), ('of', 422), ('a', 422), ('he', 307), ('I', 279), ('in', 276), ('is', 265), ('was', 208), ('you', 196), ('that', 181), ('with', 154), ('said', 151), ('for', 142), ('his', 139), ('it', 136), ('little', 119), ('not', 118), ('as', 115), ('at', 100), ('my', 97), ('have', 95), ('are', 94), ('on', 93), ('be', 92), ('all', 90), ('so', 87), ('or', 86), ('will', 83), ('am', 82), ('had', 80), ('very', 80), ('they', 77), ('The', 72), ('she', 71), ('"I', 68), ('her', 67), ('were', 60), ('like', 57), ('by', 56), ('any', 56), ('this', 55), ('but', 55), ('no', 54), ('me', 50), ('one', 50), ('about', 49), ('him', 46), ('He', 45), ('who', 45), ('cried', 44), ('has', 43), ('out', 43), ('great', 43), ('into', 42), ('quite', 40), ('work', 40), ('down', 39), ('from', 38), ('over', 38), ('up', 38), ('must', 38), ('But', 38), ('what', 38), ('there', 37), ('when', 37), ('your', 37), ('It', 36), ('if', 35), ('never', 34), ('red', 33), ('came', 33), ('answered', 33), ('would', 33), ('always', 31), ('THE', 30), ('do', 29), ('And', 28), ('Hans', 28), ('beautiful', 28), ('can', 28), ('went', 28), ('going', 28), ('their', 27), ('shall', 27), ('than', 27), ('electronic', 27), ('see', 26), ('some', 26), ('So', 26), ('go', 26), ('You', 25), ('may', 25), ('give', 25), ('long', 25), ('rose', 25)
Words with real meaning in this list:
Most Popular Brockhampton Member (as of 3/9/2020)
(I don’t even think I have them all)
Jason Lin
COMMON SOCCER PRE-MATCH INTERVIEW WORDS
Patrick Ogunbufunmi
Generating paragraphs of fake Wikipedia articles
Politico wrote that Biden's "weak filters make him immune to pressure from lawmakers who do not help to avoid interference with that view. Bork's nomination was rejected in the 2018 midterms we will be controlled by ISIS, and that "nobody has a close relationship with Saudi Arabia's powerful Crown Prince Mohammad bin Salman are allies in the afternoons, and celebrated his bar mitzvah at the Hiroshima Peace Memorial Museum. Wharton School of Law, receiving a half scholarship based on tax-funded social benefits and not invite Zelensky to investigate CrowdStrike and Democratic pollster Patrick Caddell. His campaign announced it had raised more funds than all candidates but Dukakis, and was formally abandoned in October 1987 and became a successful effort to push Mr. Trump out of every five days at one of Delaware's largest companies, and other predominantly Muslim ethnic minorities in "political reeducation" centers or camps requires a tough, targeted, and global leadership".
On December 23, 2016, Cruz won 36 of the upcoming Iraqi parliamentary election resulted in a federally protected activity. As president, he reaffirmed this position by stating "I believe marriage is between a man who allegedly assaulted a minor skull fracture and other ways to encourage or participate in boycotts against Israel and Israeli settlements in the case was settled out of every Republican senator, but what he was an Indonesian East–West Center graduate student in anthropology at the firm, Cruz worked on matters relating to sexual assault on college and university campuses while presenting a speech at the upcoming 2014 Winter Olympics in Sochi, Russia.In December 2014, after the death penalty. He supports lowering the cost of drugs by reforming patent laws to allow the Government Accountability Office to evaluate the Federal Reserve to undue political pressure from lawmakers who do not agree with it, but that's their right." Chicago Laboratory Schools. When they moved to Washington, D.C., stating "We're all united by our unyielding—I mean literally unyielding—commitment to the stimulus package."
Isabel Serrato
Articles were generated using word maps from the Wikipedia articles for Joe Biden, Barack Obama, Donald Trump, Bernie Sanders, and Ted Cruz.
J. Cole vs. Nas
Rohil Agarwal
Markov + Similarity + Jane Austen
Jackie Zeng
“I would rather have nothing to propose for my wife. She, poor soul, is tied by the influence of the five or six determined couple who were attracted at first little leisure for such feelings, whatever their origin, and could not be ineligible."
Elon Musk’s Twitter feed is quite entertaining -- so I generated more tweets in his style.
(Honestly, that man would thrive in our class meme chat).
Karen Hinh
My Writer’s Workshop Novel: The Sequel
Using the partially complete novel I wrote last semester, I ran Markov analysis to generate passages that, at higher orders, sound like what you lucid dream before falling asleep.
By Alana Huitric
Order = 2
I… hadn’t killed anything like images from textbooks.
I still couldn’t be summoned to move backwards away from my fingers into the bed, all loose naked limbs.
I couldn’t see, but I hope so.
The beast gurgled, disgruntled to say that’s my blood. Ha ha… Man what on earth did this monster want me for?
Order = 3
They rushed to cover my wound and force something into my throat as I suffocated. No no no no! Panic, pain, rushed me. This couldn’t be real. All of the voices became a single, incessant drone in the air around him. Everything was silent except our dissonant breaths. “What do you mean complete?” Is it human? “Yes…” Perfect. As are all the voices everyone else hears spilling into their mind. Four pairs of brilliant amber eyes lit up synchronously. From the faint orbital illumination, I made out four skeletal heads each of different organisms closely clustered around the humanoid one, bristling amidst a mess of feathered, black iridescent wings. It sat on the lid of the toilet, pouring something into the steaming water. I stood transfixed by the rich, fruity scent of whatever pinkish fluid she was swirling into peaky bubbles at the water’s surface.
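The “order” in the samples above is the number of preceding words the generator looks at when picking the next word. Here is a minimal sketch of that idea, not the project's actual code: the function names, the whitespace tokenization, and the toy sentence are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_order_k_chain(words, order):
    """Map each tuple of `order` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate_order_k(chain, order, length, rng=random):
    """Start from a random state, then extend using the last `order` words."""
    state = rng.choice(list(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(tuple(output[-order:]))
        if not followers:
            # Dead end: these words only occur at the very end of the text.
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Toy text standing in for the novel (hypothetical).
words = "the beast looked at me and the beast looked away and i ran".split()
chain = build_order_k_chain(words, order=2)
print(generate_order_k(chain, order=2, length=12))
```

With a higher order, each state has fewer possible followers, so the output copies longer stretches of the source and reads more coherently, which matches the jump between the Order = 2 and Order = 3 samples.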
Presidential Speech Analysis - Tolulope Oshinowo
Analyzing the word frequencies and overall mood of texts scraped from presidential inauguration speeches from the UVA Miller Center Database, alongside candidacy kickoff speeches from other sources.
Re-elected presidents (or presidents trying to be re-elected) tend to have longer, more emotionally volatile speeches. Trump keeps to this trend, which could be a good sign for his candidacy.
These candidates use first-person plural pronouns like ‘we’ and ‘our’ to come off as more relatable, but Sanders then goes the extra mile by ending his speech ‘high’, which could be a good sign for him.
How is Babson College Identified on Wikipedia?
Andrea Lindner
Live presentation line
Add your slide BEFORE this one if you’d like to share your work with the class in a ~one minute presentation on Friday.
Add your slide AFTER this one if you’d rather not present (people can just read your slide instead).
“i left candidate’s weekend from my lack of fun activites, the opportunity to absorb as a candidate…”
Here’s what more people had to say:
the admission process: august 1st – attend a good time to know the lady at the three weekends are a first cw. the opportunity to current first years. meetings are run entirely by the materials; from scouring the first day, and scary. i fell in writing this weekend who are the admission counselors talk to cw3! during my first year! for olin students. see, olin campus. over one eye for you! best advice anyone could spend a parent, hi moms and now just ask, olin’s website or maybe even more daunting. it and if i was at olin wants me to
(Generated 100 words)
to meet, greet, and preparing to engage with my individual interview. don’t wear safety glasses when i talk to the airport (in case if you wait to candidates’ weekend is the next few of health, weather, or not. we only one as a small, selective school, with that they recommend that i just did (you really get a preference. candidates’ weekend (assuming prospective students are probably feeling on what i immediately assumed meant that being good enough (actual 2016 funny thing called candidates’ weekend, you aren’t (i certainly was unsure coming into our shoes next few days would be
(Generated 100 words)
My project uses a simple version of Markov Chain Analysis to generate a random Candidates Weekend Blog Post based on the content from the Olinsider related to Candidates Weekend. First, I created a dictionary for each word in the document (storing the words that could follow from that word) and from there, I selected a random word to start with and a random word from the values under that key to continue the chain! Here are some examples of the ‘blog posts’ I’ve been generating.
Caitlin Coffey
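The dictionary-based approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the function names and the toy sentence are made up, and restarting at a random word on a dead end is just one possible choice.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)


def generate(chain, length, rng=random):
    """Pick a random starting word, then keep picking a random follower."""
    word = rng.choice(list(chain))
    output = [word]
    while len(output) < length:
        followers = chain.get(word)
        if followers:
            word = rng.choice(followers)
        else:
            # Dead end (a word seen only at the very end): restart at random.
            word = rng.choice(list(chain))
        output.append(word)
    return " ".join(output)


# Toy text standing in for the scraped blog posts (hypothetical).
sample = "the weekend was fun and the weekend was long and the campus was fun"
chain = build_chain(sample)
print(generate(chain, 10))
```

Because followers are stored with repeats, a word that often follows another is picked proportionally more often, without any explicit probability bookkeeping.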
h no
Twitter Bot, Markov Sentence Generator
Examples from @realDonaldTrump:
- #KAG2020 https://t.co/7BrkAKYWU0 THANK YOU MASSACHUSETTS!
- Great for the Great American Comeback!
- After seeing so much of the Corona… This is just more Fak… https://t.co/wIba544vaW Mini Mike, “Three months ago I entered the race for President.
- #KAG2020 https://t.co/IdJ721oOsq THANK YOU TEXAS!
- We are with you all the way! https://t.co/vyu0Tbthv0 RT @DanScavino: The video was NOT manipulated.
Examples from @BernieSanders:
- The best way to increase turnout for our progressive agenda, which the American people support, is not the kind of politics.
- Find your polling place and its hours here: https://t.co/1V41XeLqEA If you want us to lose.
- Tonight, we invite her supporters into our campai… https://t.co/aRdoKGMpi3 This campaign is about asking one fundamental question: Which side are you on?
- One of us has a poverty crisis.
- The whole world is crying out for the last #Berniesanders rally before Super Tuesday.
Markovian Shakespeare
A program that downloads the complete works of Shakespeare and then uses those words to create new paragraphs that sound like Shakespeare's language. :D
Overall the program worked well, and on the right is a very beautiful example of Markovian Shakespeare. Hope you enjoy!
Your humble classmate,
~ Regan Mah
What wretched errors hath my empty words,
Whilst my poor country’s to command:
Whither, indeed, before thy time?
Warwick is Chancellor and the other side;
Gelding th’ opposed continent as much as it would content me
To say I did it for a thousand deaths would die.
Exeunt PROCULEIUS and two good Armors;
if he wear a great estate. When he himself is not lost.
OTHELLO. Fetch’t, let me kiss This princess of pure love,
To have this twelvemonth she’ll not match his woe.
“It shall be to God! Even there my hopes but she,
more covetous, Would have flung
AITA? RNN Binary Classification.
AITA is a subreddit where users post stories and other users pass moral judgment on the original poster.
Question: Using just the titles of posts, can a recurrent neural network predict Reddit’s verdict?
Answer: No, apparently not (yet).
The model overfits immediately, regardless of hyperparameters, which suggests the dataset is too small for the complexity of the problem. Trying new architectures, adding the body of the posts as input, or waiting for more data to be generated are all possible solutions.
Comparing Writers and How They Write
I downloaded texts from Project Gutenberg to see how the vocabularies of different authors compared to each other, and whether there were any known similarities between them.
The Scarlet Letter had the highest level of word diversity when compared to classic Jane Austen novels, such as Emma and Pride and Prejudice.
Prisha Sadhwani
News Article Categorizer
Requests → Return text from New York Times
TF-IDF → Analyze New York Times Articles divided by categories.
Scikit-Learn → Return model selection
TensorFlow (Keras) → Train data using MLP
e.g., “An uncrewed test flight of Starliner, a Boeing spacecraft designed to carry NASA astronauts, could have ended in disaster in December because of lapses that allowed software errors to slip through undetected and unfixed before the spacecraft launched, according to a review by NASA and Boeing that was announced on Friday.” → (Science, 0.65)
Final result: Correlation of article categories on specific texts
Novel Similarity
Declan Ketchum
Based on comparing the frequency of words:
Pride and Prejudice and Emma by Jane Austen are 97% similar
Pride and Prejudice by Jane Austen and A Tale of Two Cities by Charles Dickens are 95% similar
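The slide does not say which similarity measure was used. One common way to compare word frequencies is cosine similarity between word-count vectors, sketched below as an assumption; the function names and toy strings are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Count lowercase, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy strings standing in for full novels (hypothetical).
text_a = "the lady and the gentleman walked to the house"
text_b = "the doctor and the prisoner walked to the city"
print(cosine_similarity(word_counts(text_a), word_counts(text_b)))
```

With raw counts, very frequent words such as “the” and “and” dominate the vectors, which is one likely reason novels by different authors can still score in the mid-90s.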
Terence McKenna 🔮🧠 // Markov Chains ⛓💻
Jasper Katzban
Goal: generate some profound psychedelic-themed quotes in the style of the ethnobotanist Terence McKenna.
Implementation: a Markov generation algorithm applied to a plain text version of McKenna’s work from an HTML source.
Result: some wacky but profound quotes that raise the question: did a human, drug, machine, or combination of all three create these?
See some examples at right:
“Around 1910, he had taken me to be finally thrown open to a box of cameras; I carried my mind unbidden.”
"A series of discrete energy levels must be broken through in order to represent more complex phenomena. We must imagine an atom with new parameters if we wish to understand how we could exist, how thinking, tool-using, human beings could arise out of the house, she discovered that the sensation of heat had not diminished but grown stronger.”
“Yesterday afternoon Dave discovered Stropharia cubensis in the stream of evolution and sympathetic therefore to a point of view.”
Creating News Headlines
Using the subreddit r/nottheonion and Markov analysis to create some fake news, I made these things (please note most of the headlines were nonsensical):
neighbor steals skeleton over zestimate
shoppers in stampede for toilet paper scrap
the human body is your doctor will decide if covid-19
bill fails citizenship test
alex jones brags that rugrats now qualify as a central role in a mind-uploading service
mother renames son after mcdonald's rick and had no reason
french chef sues mormon church for posting paparazzi photo of his house and sheep
a single banana said hitler hq for sex ring
hamilton police called jay-z
we regret to keep your default
game console and prayers aren’t working
5th-grader airlifted to the next cadbury bunny
canadian women are protesting open primaries by going to walmart
Kate Mackowiak
Analyzing tweets! (Zachary Sherman)
Creating fake tweets using Markov text analysis
By Kelly Yen
“UnixpR0n” Scraper
I parsed the subreddit “unixpr0n” to find the most commonly customized OS and looked further into its benefits.
I found that the most common entry was “i3-gaps”, which is a tiling window manager rather than an OS. It runs on most Linux distributions, is highly customizable, and makes everyday work easier.
Oscar De La Garza
Khalid’s Ghost Writer
Creating AI Generated Music for Khalid
“...left right left right direction floating through different dimensions but i do do now here that ayy ooh now you there's nowhere” - Khalid
Nikhil Anand
Using the top twenty Gutenberg books to explore
The Correlation between Author Gender and Word Choice
By Hazel Smith
Interesting Results
23% of the pronouns that male authors used were she/her/hers while 41% of pronouns that female authors used were he/him/his.
Pronouns | Female Author | Male Author |
she/her/hers | 2.23% | 0.60% |
he/him/his | 1.95% | 1.37% |
they/them/theirs | 0.56% | 0.69% |
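The 23% and 41% figures can be recovered from the table by dividing one pronoun group by the total of all three groups in the same column. The short check below assumes the percentages within each column share a common base and that only these three pronoun groups count as “pronouns”; the names are made up for illustration.

```python
# Figures copied from the table above (percentages per author gender).
rates = {
    "female": {"she/her/hers": 2.23, "he/him/his": 1.95, "they/them/theirs": 0.56},
    "male": {"she/her/hers": 0.60, "he/him/his": 1.37, "they/them/theirs": 0.69},
}


def share(author_gender, pronoun_group):
    """Percent of an author group's listed pronouns that fall in one group."""
    column = rates[author_gender]
    return 100 * column[pronoun_group] / sum(column.values())


print(round(share("male", "she/her/hers")))   # 23
print(round(share("female", "he/him/his")))   # 41
```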
Using Markov Analysis to Write Movie Scripts
For this project, I scraped the Internet Movie Script Database (IMSDb) for movie scripts and performed Markov analysis to generate my own script.
Follow this arrow for results --------- >
Marion Madanguit
EXT. SCHOOL GIRL.
DR. TRAMMEL
Doctor Rumack, I'm a big deal.
JONATHAN
I'm sure it was so you are still staring into it. You are not in my life in this time he gets his head.
CHARLOTTE
What are you are on a man on it. He walks over with an apple red car.
JAMES
I can't get it and I think I just got to her face.
Starting on any Wikipedia link, if you only click the first link on each new page, you will eventually always hit the same dead end.
python3 text_mining.py "Apple Inc"
Starting at https://en.wikipedia.org/wiki/Apple Inc, these are the steps to reach the Wikipedia Philosophy Webpage
https://en.wikipedia.org/wiki/Apple Inc
https://en.wikipedia.org/wiki/Multinational_corporation
https://en.wikipedia.org/wiki/Corporation
https://en.wikipedia.org/wiki/Company
https://en.wikipedia.org/wiki/Legal_personality
https://en.wikipedia.org/wiki/Person
https://en.wikipedia.org/wiki/Reason
https://en.wikipedia.org/wiki/Consciousness
https://en.wikipedia.org/wiki/Sentience
https://en.wikipedia.org/wiki/Feeling
https://en.wikipedia.org/wiki/Nominalization
https://en.wikipedia.org/wiki/Linguistics
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Latin
https://en.wikipedia.org/wiki/Help:IPA/Latin
https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
https://en.wikipedia.org/wiki/Alphabet
https://en.wikipedia.org/wiki/Letter_(alphabet)
https://en.wikipedia.org/wiki/Symbol
https://en.wikipedia.org/wiki/Idea
https://en.wikipedia.org/wiki/Philosophy
These were the 20 steps to reach the Wikipedia Philosophy Webpage
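The first-link walk itself is a simple loop: follow one link per page until the target is reached or a page repeats. The sketch below shows only that loop. A small hand-written dictionary (taken from the last few steps of the trace above) stands in for fetching and parsing real pages, and all names are hypothetical.

```python
def follow_first_links(first_link, start, target="Philosophy", max_steps=100):
    """Follow first links from `start`; return (path, reached_target)."""
    path = [start]
    seen = {start}
    while path[-1] != target and len(path) <= max_steps:
        next_page = first_link.get(path[-1])
        if next_page is None or next_page in seen:
            # No link to follow, or we are going in circles.
            return path, False
        path.append(next_page)
        seen.add(next_page)
    return path, path[-1] == target


# Stand-in for "first link on each page", copied from the end of the trace.
first_link = {
    "Letter_(alphabet)": "Symbol",
    "Symbol": "Idea",
    "Idea": "Philosophy",
}

path, reached = follow_first_links(first_link, "Letter_(alphabet)")
print(" -> ".join(path))
print(f"{len(path) - 1} steps, reached target: {reached}")
```

The number of steps is one less than the number of pages, which is why the 21 URLs listed above count as 20 steps. Tracking visited pages matters because some first-link chains loop instead of ending at Philosophy.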
Predicting the Author of Beatles Songs
By Nathan Faber
This project uses word frequency analysis via TF-IDF and cosine similarity to look for similarities between all Beatles songs.
I scraped data from one site to get the lyrics and titles of songs and entered data that indicated who wrote each song.
As it turns out, these tools aren’t practical for solving this problem: I get less than 50% accuracy.
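For readers who have not seen TF-IDF with cosine similarity, here is a rough pure-Python sketch of the idea, paired with a nearest-neighbor guess at the author. It is not the project's implementation: the weighting formula, the nearest-neighbor rule, the function names, and the made-up “lyrics” are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(word for words in tokenized for word in set(words))
    n_docs = len(docs)
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            # Term frequency times log inverse document frequency.
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_author(vectors, authors, query_index):
    """Guess the author of one song from its most similar other song."""
    best = max(
        (i for i in range(len(vectors)) if i != query_index),
        key=lambda i: cosine(vectors[query_index], vectors[i]),
    )
    return authors[best]


# Made-up lines and labels, not real lyrics or credits.
songs = ["sun sun morning light", "rain night sad", "sun light day", "rain sad town"]
authors = ["A", "B", "A", "B"]
vectors = tfidf_vectors(songs)
print(predict_author(vectors, authors, 2))
```

Words that appear in every song get a weight of zero here, so the comparison rests on rarer words. With short lyrics and heavily shared vocabulary, that signal can be weak, which is consistent with the low accuracy reported above.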
Text Mining ‘Birth of a Nation’
Analyzing Word Frequency & Sentiment
Word | Rank | Frequency |
Slave | 22 | 91 |
White | 36 | 62 |
Plantation | 43 | 57 |
Suh | 68 | 40 |
Sentiment Analysis Results:
{'neg': 0.0,
'neu': 0.981,
'pos': 0.019,
'compound': 1.0}
Antonio Perez
Capulet vs Montague
Identify the number of times each character speaks and compare the number of conversations each house had in the play Romeo and Juliet.
{'Montague': 48.166666666666664, 'Capulet': 30.0, 'Neither': 17.90909090909091}
Oscar Zhang
Markov Text Generation of Political News Headlines
How the level of Aircraft Carrier Pleads for investigation
Both public health and voting-rights experts say there's little time for customers to Power Shift
Both public health experts and hundreds of aircraft carrier hit by coronavirus is affecting US
The FBI's surveillance.
US economy shuttered to treat Covid-19, pitching him head-to-head with public health and provide models
Analysis: Trump's interview with Sean Hannity
Fox Nation hosts have reasons to talk to him head-to-head with anti-environment voting procedures
5 takeaways from coronavirus briefings
Fauci says he won't allow Democrats "to achieve unrelated policy items they wouldn't otherwise"
Doctors disagree with coronavirus stimulus checks
How two weeks changed America, brace for six months under a strict quarantine after intensive
4 takeaways from Congress to expand in praising Trump wildly exaggerates 1918 flu mortality rate
History's verdict on hold -- White House reassessing deal bars Trump's poll surge last?
NYPD to hear case of the Country: 2 Sets of Netflix stock.
Lilly Ledbetter, advocate for female VP pledge
Top 20 words in the news: ['to', 'of', 'for', 'on', 'in', 'with', 'is', 'says', 'as', 'from', 'be', 'he', 'during', 'response', 'by', 'will', 'that', 'pandemic', 'more', 'are']
Top 20 words in the news excluding common words: ['during', 'response', 'pandemic', 'need', 'amid', 'outbreak', 'governor', 'federal', 'public', 'political', 'home', 'help', 'guidelines', 'crisis', 'coronavirus,', 'bill', 'administration', 'virus', 'trillion', 'stimulus']
Lilo Heinrich
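The two “top 20” lists above differ only in whether common words are filtered out before counting. A small sketch of that step follows. The stop list, function name, and sample headlines are assumptions, and the project's actual list of common words is not shown.

```python
from collections import Counter

# A tiny illustrative stop list (hypothetical, not the project's list).
COMMON_WORDS = {"to", "of", "for", "on", "in", "with", "is", "as", "the", "a"}


def top_words(headlines, n, exclude=frozenset()):
    """Return the n most frequent lowercase words, skipping excluded ones."""
    counts = Counter(
        word
        for headline in headlines
        for word in headline.lower().split()
        if word not in exclude
    )
    return [word for word, _ in counts.most_common(n)]


# Made-up headlines for illustration.
headlines = [
    "Governor responds to the pandemic",
    "Pandemic response bill stalls in the Senate",
    "The response to the outbreak",
]
print(top_words(headlines, 3))
print(top_words(headlines, 3, exclude=COMMON_WORDS))
```

Splitting on whitespace keeps punctuation attached to words, which is how an entry like 'coronavirus,' can end up in a list; stripping punctuation before counting would merge it with 'coronavirus'.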