1 of 45

Text Mining Projects

Spring 2020

Professionalism is important in public presentations, so please use the “would I be happy for my parents to read this in the newspaper” test when uploading content. Humor is great; abusive language or disparaging groups of people is firmly not acceptable.

2 of 45

Stock Market Sentiment Analyzer/Predictor

My program uses stock news subreddits, running sentiment analysis on the most novel words and phrases to gauge public sentiment on the market and predict average market performance in the near future.

Yehya Albakri

3 of 45

Tinder Bio Generator🔥💖

Adi Ramachandran

What i did :P

  • Stealing Scraping Tinder Bios with a shady API
  • Cleaning this colloquial, emoji filled, ‘murky’ text dataset
  • Some grafz
  • Bypassing good old Markov and generating Tinder Bios via training final stage (fine-tuning) of GPT-2 124M NLP model

Concerns

  • Hella Ethical issues
  • Some serious Sampling Bias :/
  • Shitty documentation
  • No docstrings, anywhere
  • Abusing a state of the art NLP model for unethical, unworthy reasons

Generated sample uno

“warning i am 5’8”

**bears. beets. battlestar galactica

if you play baseball... degrade me

looking for someone who will let me steal their sweatshirts

plz stop making harley davidson jokes

*19

hit me w. your best pick up line”

Generated sample dos

“warning i am 5’8”

aspiring hippie

tell me your hottest take

they always ask “what’s your snap?”, but never “what’s you venmo?”

🇺🇸🇵🇪 take me on a date and i’ll teach you everything from how to rubik’s cube

fruity”

4 of 45

A project that may help you lose weight if you are interested.

My project runs sentiment analysis on the summary paragraph of the Wikipedia page for each of several weight-loss diets.

Walter V

5 of 45

Sentiment Analysis

-Raw Data Analyzer

-Labeled Data Machine Learning

By Harry Liu

6 of 45

  • METRIC MULTI-DIMENSIONAL SCALING (MDS)
  • COSINE SIMILARITIES
  • TERM-FREQUENCY X INVERSE DOCUMENT-FREQUENCY (TF_IDF)

CLUSTERING XKCD WEB COMICS

ROAD / CAR THEMED CLUSTER ITEMS

RELATION MAP OF 2277 XKCD COMICS’

TRANSCRIPT, HOVER-TEXT, AND TITLE

BY GATI AHER

ROAD / CAR THEMED CLUSTER MAP

7 of 45

SCP Foundation Markov Text Generation

SCP Foundation is responsible for locating and containing individuals, entities, locations, and objects that violate natural law.

  • An individualized serial number of foundation was taken to deleterious mutations. Attempted escapes. However staff member species arrived nearly four 24.
  • Indeed just no stop breathing. Subject was assembled like to dr. Andrews how █████ later determined that the contents have seen.
  • Response to choke on the entry documents redacted dollars in site██. Following species in an SCP-282 specimens should be made. Further.

Jonas K

8 of 45

Sentiment Analysis

  1. Reporting on the coronavirus from a reputable business journal, here the Wall Street Journal
  2. A comparison to the plot of price fluctuations in the U.S. and China stock markets

9 of 45

Democratic Candidate Twitter Analysis

Sam Coleman

10 of 45

Text and Sentiment Analysis on BP

Michelle Zhang

BP’s corporate page: neutral skewed positive

BP’s Wikipedia page: mostly neutral, higher negative

11 of 45

Poetry Markov Generation (aydin o'leary)

thanks poetryfoundation.org for not banning me

Ignoring punctuation for fun and not-profit

12 of 45

Carol Luo

The Happy Prince, by Oscar Wilde - Most Frequent 97 Words List:

('the', 1290), ('and', 698), ('to', 434), ('of', 422), ('a', 422), ('he', 307), ('I', 279), ('in', 276), ('is', 265), ('was', 208), ('you', 196), ('that', 181), ('with', 154), ('said', 151), ('for', 142), ('his', 139), ('it', 136), ('little', 119), ('not', 118), ('as', 115), ('at', 100), ('my', 97), ('have', 95), ('are', 94), ('on', 93), ('be', 92), ('all', 90), ('so', 87), ('or', 86), ('will', 83), ('am', 82), ('had', 80), ('very', 80), ('they', 77), ('The', 72), ('she', 71), ('"I', 68), ('her', 67), ('were', 60), ('like', 57), ('by', 56), ('any', 56), ('this', 55), ('but', 55), ('no', 54), , ('me', 50), ('one', 50), ('about', 49), ('him', 46), ('He', 45), ('who', 45), ('cried', 44), ('has', 43), ('out', 43), ('great', 43), ('into', 42), ('quite', 40), ('work', 40), ('down', 39), ('from', 38), ('over', 38), ('up', 38), ('must', 38), ('But', 38), ('what', 38), ('there', 37), ('when', 37), ('your', 37), ('It', 36), ('if', 35), ('never', 34), ('red', 33), ('came', 33), ('answered', 33), ('would', 33), ('always', 31), ('THE', 30), ('do', 29), ('And', 28), ('Hans', 28), ('beautiful', 28), ('can', 28), ('went', 28), ('going', 28), ('their', 27), ('shall', 27), ('than', 27), ('electronic', 27), ('see', 26), ('some', 26), ('So', 26), ('go', 26), ('You', 25), ('may', 25), ('give', 25), ('long', 25), ('rose', 25)

Words with real meaning in this list:

13 of 45

Most Popular Brockhampton Member (as of 3/9/2020)

(I don’t even think I have them all)

Jason Lin

14 of 45

COMMON SOCCER PRE-MATCH INTERVIEW WORDS

Patrick Ogunbufunmi

15 of 45

Generating paragraphs of fake Wikipedia articles

Politico wrote that Biden's "weak filters make him immune to pressure from lawmakers who do not help to avoid interference with that view. Bork's nomination was rejected in the 2018 midterms we will be controlled by ISIS, and that "nobody has a close relationship with Saudi Arabia's powerful Crown Prince Mohammad bin Salman are allies in the afternoons, and celebrated his bar mitzvah at the Hiroshima Peace Memorial Museum. Wharton School of Law, receiving a half scholarship based on tax-funded social benefits and not invite Zelensky to investigate CrowdStrike and Democratic pollster Patrick Caddell. His campaign announced it had raised more funds than all candidates but Dukakis, and was formally abandoned in October 1987 and became a successful effort to push Mr. Trump out of every five days at one of Delaware's largest companies, and other predominantly Muslim ethnic minorities in "political reeducation" centers or camps requires a tough, targeted, and global leadership".

On December 23, 2016, Cruz won 36 of the upcoming Iraqi parliamentary election resulted in a federally protected activity. As president, he reaffirmed this position by stating "I believe marriage is between a man who allegedly assaulted a minor skull fracture and other ways to encourage or participate in boycotts against Israel and Israeli settlements in the case was settled out of every Republican senator, but what he was an Indonesian East–West Center graduate student in anthropology at the firm, Cruz worked on matters relating to sexual assault on college and university campuses while presenting a speech at the upcoming 2014 Winter Olympics in Sochi, Russia.In December 2014, after the death penalty. He supports lowering the cost of drugs by reforming patent laws to allow the Government Accountability Office to evaluate the Federal Reserve to undue political pressure from lawmakers who do not agree with it, but that's their right." Chicago Laboratory Schools. When they moved to Washington, D.C., stating "We're all united by our unyielding—I mean literally unyielding—commitment to the stimulus package."

Isabel Serrato

Articles were generated using word maps from the Wikipedia articles for Joe Biden, Barack Obama, Donald Trump, Bernie Sanders, and Ted Cruz.

16 of 45

J. Cole vs. Nas

  • J. Cole’s music is significantly more positive than Nas’s
    • J. Cole has popular positive songs like “Crooked Smile,” while Nas has popular songs like “Life’s a B***h”
  • J. Cole only has 66.1% of Nas’s vocabulary, which is very similar to what I found from a third-party source that ranks rappers by vocabulary
  • Nas should think twice before handing J. Cole his crown as king of the rap game

Rohil Agarwal

17 of 45

Markov + Similarity + Jane Austen

Jackie Zeng

“I would rather have nothing to propose for my wife. She, poor soul, is tied by the influence of the five or six determined couple who were attracted at first little leisure for such feelings, whatever their origin, and could not be ineligible."

18 of 45

Elon Musk’s twitter feed is quite entertaining -- so I generated more tweets in his style.

(Honestly, that man would thrive in our class meme chat).

  • “facilities are a mega rave cave under the blade @EvaFoxU”
  • "maybe 5 or 6 years ago when you could wonder around find great one"
  • "last night @mrebuzz He studied the house. It’s too good! @nichegamer Choc milk is also next-level."

Karen Hinh

19 of 45

My Writer’s Workshop Novel: The Sequel

Using the partially complete novel I wrote last semester, I ran Markov analysis to generate passages that, at higher orders, sound like what you lucid dream before falling asleep.

By Alana Huitric

Order = 2

I… hadn’t killed anything like images from textbooks.

I still couldn’t be summoned to move backwards away from my fingers into the bed, all loose naked limbs.

I couldn’t see, but I hope so.

The beast gurgled, disgruntled to say that’s my blood. Ha ha… Man what on earth did this monster want me for?

Order = 3

They rushed to cover my wound and force something into my throat as I suffocated. No no no no! Panic, pain, rushed me. This couldn’t be real. All of the voices became a single, incessant drone in the air around him. Everything was silent except our dissonant breaths. “What do you mean complete?” Is it human? “Yes…” Perfect. As are all the voices everyone else hears spilling into their mind. Four pairs of brilliant amber eyes lit up synchronously. From the faint orbital illumination, I made out four skeletal heads each of different organisms closely clustered around the humanoid one, bristling amidst a mess of feathered, black iridescent wings. It sat on the lid of the toilet, pouring something into the steaming water. I stood transfixed by the rich, fruity scent of whatever pinkish fluid she was swirling into peaky bubbles at the water’s surface.

20 of 45

Presidential Speech Analysis -Tolulope Oshinowo

Analyzing the word frequencies and overall mood of texts scraped from presidential inauguration speeches from the UVA Miller Center Database, alongside candidacy kickoff speeches from other sources.

Re-elected presidents (or presidents trying to be re-elected) tend to have longer, more emotionally volatile speeches. Trump keeps to this trend, which could be a good sign for his candidacy.

These candidates use first-person plural pronouns like ‘we’ and ‘our’ to come off as more relatable, but Sanders then goes the extra mile by ending his speech ‘high’, which could be a good sign for him.

21 of 45

How is Babson College Identified on Wikipedia?

Andrea Lindner

22 of 45

Live presentation line

Add your slide BEFORE this one if you’d like to share your work with the class in a ~one minute presentation on Friday.

Add your slide AFTER this one if you’d rather not present (people can just read your slide instead).

23 of 45

“i left candidate’s weekend from my lack of fun activites, the opportunity to absorb as a candidate…”

Here’s what more people had to say:

the admission process: august 1st – attend a good time to know the lady at the three weekends are a first cw. the opportunity to current first years. meetings are run entirely by the materials; from scouring the first day, and scary. i fell in writing this weekend who are the admission counselors talk to cw3! during my first year! for olin students. see, olin campus. over one eye for you! best advice anyone could spend a parent, hi moms and now just ask, olin’s website or maybe even more daunting. it and if i was at olin wants me to

(Generated 100 words)

to meet, greet, and preparing to engage with my individual interview. don’t wear safety glasses when i talk to the airport (in case if you wait to candidates’ weekend is the next few of health, weather, or not. we only one as a small, selective school, with that they recommend that i just did (you really get a preference. candidates’ weekend (assuming prospective students are probably feeling on what i immediately assumed meant that being good enough (actual 2016 funny thing called candidates’ weekend, you aren’t (i certainly was unsure coming into our shoes next few days would be

(Generated 100 words)

My project uses a simple version of Markov Chain Analysis to generate a random Candidates Weekend Blog Post based on the content from the Olinsider related to Candidates Weekend. First, I created a dictionary for each word in the document (storing the words that could follow from that word) and from there, I selected a random word to start with and a random word from the values under that key to continue the chain! Here are some examples of the ‘blog posts’ I’ve been generating.
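The dictionary-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `build_chain` and `generate` and the sample text are made up for the example.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def generate(chain, length, seed=None):
    """Start at a random key, then keep picking a random follower."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    output = [word]
    # Stop early if the current word never had a follower in the source.
    while len(output) < length and word in chain:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)

# Placeholder text standing in for the scraped blog posts.
sample = "the cat sat on the mat and the cat ran"
chain = build_chain(sample)
print(generate(chain, 8, seed=1))
```

Storing followers in a list (with repeats) means frequent pairs are picked more often, which is what makes the output loosely resemble the source.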

Caitlin Coffey

24 of 45

Twitter Bot, Markov Sentence Generator

Examples from @realDonaldTrump:

- #KAG2020 https://t.co/7BrkAKYWU0 THANK YOU MASSACHUSETTS!

- Great for the Great American Comeback!

- After seeing so much of the Corona… This is just more Fak… https://t.co/wIba544vaW Mini Mike, “Three months ago I entered the race for President.

- #KAG2020 https://t.co/IdJ721oOsq THANK YOU TEXAS!

- We are with you all the way! https://t.co/vyu0Tbthv0 RT @DanScavino: The video was NOT manipulated.

Examples from @BernieSanders:

- The best way to increase turnout for our progressive agenda, which the American people support, is not the kind of politics.

- Find your polling place and its hours here: https://t.co/1V41XeLqEA If you want us to lose.

- Tonight, we invite her supporters into our campai… https://t.co/aRdoKGMpi3 This campaign is about asking one fundamental question: Which side are you on?

- One of us has a poverty crisis.

- The whole world is crying out for the last #Berniesanders rally before Super Tuesday.

25 of 45

Markovian Shakespeare

A program that downloads the complete works of Shakespeare and then uses those words to create new paragraphs that sound like Shakespeare's language. :D

Overall the program worked well, and on the right is a very beautiful example of Markovian Shakespeare. Hope you enjoy!

Your humble classmate,

~ Regan Mah

What wretched errors hath my empty words,

Whilst my poor country’s to command:

Whither, indeed, before thy time?

Warwick is Chancellor and the other side;

Gelding th’ opposed continent as much as it would content me

To say I did it for a thousand deaths would die.

Exeunt PROCULEIUS and two good Armors;

if he wear a great estate. When he himself is not lost.

OTHELLO. Fetch’t, let me kiss This princess of pure love,

To have this twelvemonth she’ll not match his woe.

“It shall be to God! Even there my hopes but she,

more covetous, Would have flung

26 of 45

AITA? RNN Binary Classification.

AITA is a subreddit where users post stories and other users pass moral judgment on the original poster.

Question: Using just the titles of posts, can a recurrent neural network predict Reddit’s verdict?

Answer: No, apparently not. (yet)

The model overfits immediately, regardless of hyperparameters, which suggests the dataset is too small for the complexity of the problem. Trying new architectures, adding the body of the posts as input, or waiting for more data to be generated are all possible solutions.

27 of 45

Comparing Writers and How They Write

I downloaded texts from Project Gutenberg to see how different authors' vocabularies compare and whether there are any notable similarities between them.

The Scarlet Letter had the highest level of word diversity when compared to classic Jane Austen novels, such as Emma and Pride and Prejudice.
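The slide does not say how word diversity was measured. One common choice, assumed here, is the type-token ratio: unique words divided by total words. The function name and the `sample_size` parameter are hypothetical.

```python
import re

def type_token_ratio(text, sample_size=None):
    """Unique words divided by total words (an assumed diversity measure)."""
    words = re.findall(r"[a-z']+", text.lower())
    if sample_size is not None:
        # Compare equal-length samples: longer texts repeat more words,
        # which pushes the ratio down regardless of the author's style.
        words = words[:sample_size]
    if not words:
        return 0.0
    return len(set(words)) / len(words)

print(type_token_ratio("the cat and the dog"))  # 4 unique / 5 total = 0.8
```

Passing the same `sample_size` for every novel keeps the comparison from simply rewarding the shortest book.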

Prisha Sadhwani

28 of 45

News Article Categorizer

Requests → Return text from New York Times

TF-IDF → Analyze New York Times Articles divided by categories.

Scikit-Learn → Return model selection

Tensorflow (Keras) → Train data using MLP

e.g. “An uncrewed test flight of Starliner, a Boeing spacecraft designed to carry NASA astronauts, could have ended in disaster in December because of lapses that allowed software errors to slip through undetected and unfixed before the spacecraft launched, according to a review by NASA and Boeing that was announced on Friday.” → (Science, 0.65)

Final result: Correlation of article categories on specific texts

29 of 45

Novel Similarity

Declan Ketchum

Based on comparing the frequency of words:

Pride and Prejudice and Emma by Jane Austen are 97% similar

Pride and Prejudice by Jane Austen and A Tale of Two Cities by Charles Dickens are 95% similar
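The slide does not name the similarity measure. A plausible one, assumed here, is cosine similarity between raw word-count vectors. The helper names below are illustrative.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy strings standing in for full novels.
print(cosine_similarity(word_counts("a b"), word_counts("a c")))
```

With raw counts, very common words such as "the" and "and" dominate the vectors, which may be why two novels by different authors can still score above 90%.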

30 of 45

Terence McKenna 🔮🧠 // Markov Chains ⛓💻

Jasper Katzban

Goal: generate some profound psychedelic-themed quotes in the style of the ethnobotanist Terence McKenna.

Implementation: a Markov generation algorithm applied to a plain text version of McKenna’s work from an HTML source.

Result: some wacky but profound quotes that raise the question: did a human, drug, machine, or combination of all three create these?

See some examples at right:

“Around 1910, he had taken me to be finally thrown open to a box of cameras; I carried my mind unbidden.”

"A series of discrete energy levels must be broken through in order to represent more complex phenomena. We must imagine an atom with new parameters if we wish to understand how we could exist, how thinking, tool-using, human beings could arise out of the house, she discovered that the sensation of heat had not diminished but grown stronger.”

“Yesterday afternoon Dave discovered Stropharia cubensis in the stream of evolution and sympathetic therefore to a point of view.”

31 of 45

Creating News Headlines

Using the subreddit r/nottheonion and Markov analysis to create some fake news, I made these headlines (please note most of them were nonsensical):

neighbor steals skeleton over zestimate

shoppers in stampede for toilet paper scrap

the human body is your doctor will decide if covid-19

bill fails citizenship test

alex jones brags that rugrats now qualify as a central role in a mind-uploading service

mother renames son after mcdonald's rick and had no reason

french chef sues mormon church for posting paparazzi photo of his house and sheep

a single banana said hitler hq for sex ring

hamilton police called jay-z

we regret to keep your default

game console and prayers aren’t working

5th-grader airlifted to the next cadbury bunny

canadian women are protesting open primaries by going to walmart

Kate Mackowiak

32 of 45

Analyzing tweets! (Zachary Sherman)

  • I analyzed my own most recent 199 tweets.
  • I learned that:
    • Since joining Twitter in October 2015, I’ve liked 45,621 tweets (that’s ~29 tweets per day!)
      • I need to get out into the world occasionally…
    • I tweet mostly helper/filler words like “I”, “the”, and “to”.
    • 23.9% of the words in my tweets are considered “positive” by VADER, while 5.2% are considered “negative” and 70.9% are considered “neutral”.

33 of 45

Creating fake tweets using Markov text analysis

  • Exploratory question:
    • How can artificially generated social media posts impact society’s political and social opinions?
  • Process:
    • Scraped the last ~1000 tweets from Trump
    • Removed all links, and used Markov text analysis to generate fake tweets
  • Results had a lot of grammatical and syntactical issues, but these may be due to the nature of the input data rather than the algorithm
  • Some noteworthy results:
    • A lot of Democrat dropouts tonight, very low political I.Q.
    • Nancy Pelosi, their leader, dumb as a campaign tool.
    • The House Democrats were unable to get into the U.S.

By Kelly Yen

34 of 45

“UnixpR0n” Scraper

I parsed the subreddit “unixpr0n” to find the most commonly customized desktop setup and looked further into its benefits.

I found that the most common setup was “i3-gaps”, a tiling window manager that runs on almost any Linux distribution, is highly customizable, and is easy to work in.

Oscar De La Garza

35 of 45

Khalid’s Ghost Writer

Creating AI Generated Music for Khalid

“...left right left right direction floating through different dimensions but i do do now here that ayy ooh now you there's nowhere” - Khalid

Nikhil Anand

36 of 45

Using the top twenty gutenberg books to explore

The Correlation between Author Gender and Word Choice

By Hazel Smith

Interesting Results

23% of the pronouns that male authors used were she/her/hers while 41% of pronouns that female authors used were he/him/his.

                    Female Author   Male Author
she/her/hers        2.23%           0.60%
he/him/his          1.95%           1.37%
they/them/theirs    0.56%           0.69%
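The 23% and 41% figures can be reproduced from the table values, assuming the table gives each pronoun group as a percentage of all words and the headline figures are shares within these three pronoun groups. That reading is an assumption, though the arithmetic matches. The names `rates` and `pronoun_share` are illustrative.

```python
# Percent of all words, copied from the table (rows by author gender).
rates = {
    "female": {"she/her/hers": 2.23, "he/him/his": 1.95, "they/them/theirs": 0.56},
    "male": {"she/her/hers": 0.60, "he/him/his": 1.37, "they/them/theirs": 0.69},
}

def pronoun_share(author_gender, pronoun_group):
    """Percent of an author group's counted pronouns in one pronoun group."""
    row = rates[author_gender]
    return 100 * row[pronoun_group] / sum(row.values())

print(round(pronoun_share("male", "she/her/hers")))   # 23
print(round(pronoun_share("female", "he/him/his")))   # 41
```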

37 of 45

Using Markov Analysis to Write Movie Scripts

For this project, I scraped the Internet Movie Script Database (IMSDb) for movie scripts and performed Markov analysis to generate my own script.

Follow this arrow for results --------- >

Marion Madanguit

EXT. SCHOOL GIRL.

DR. TRAMMEL

Doctor Rumack, I'm a big deal.

JONATHAN

I'm sure it was so you are still staring into it. You are not in my life in this time he gets his head.

CHARLOTTE

What are you are on a man on it. He walks over with an apple red car.

JAMES

I can't get it and I think I just got to her face.

38 of 45

Starting on any Wikipedia link, if you only click the first link on each new page, you will eventually always hit the same dead end.

41 of 45

python3 text_mining.py "Apple Inc"

Starting at https://en.wikipedia.org/wiki/Apple Inc, these are the steps to reach the Wikipedia Philosophy Webpage

https://en.wikipedia.org/wiki/Apple Inc

https://en.wikipedia.org/wiki/Multinational_corporation

https://en.wikipedia.org/wiki/Corporation

https://en.wikipedia.org/wiki/Company

https://en.wikipedia.org/wiki/Legal_personality

https://en.wikipedia.org/wiki/Person

https://en.wikipedia.org/wiki/Reason

https://en.wikipedia.org/wiki/Consciousness

https://en.wikipedia.org/wiki/Sentience

https://en.wikipedia.org/wiki/Feeling

https://en.wikipedia.org/wiki/Nominalization

https://en.wikipedia.org/wiki/Linguistics

https://en.wikipedia.org/wiki/Science

https://en.wikipedia.org/wiki/Latin

https://en.wikipedia.org/wiki/Help:IPA/Latin

https://en.wikipedia.org/wiki/International_Phonetic_Alphabet

https://en.wikipedia.org/wiki/Alphabet

https://en.wikipedia.org/wiki/Letter_(alphabet)

https://en.wikipedia.org/wiki/Symbol

https://en.wikipedia.org/wiki/Idea

https://en.wikipedia.org/wiki/Philosophy

These were the 20 steps to reach the Wikipedia Philosophy Webpage
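The walk itself can be sketched independently of the scraping. In this hedged example, `first_link` is a stand-in for whatever function fetches a page and returns its first link (here just a dictionary lookup), and `follow_first_links` is a hypothetical name, not the one used in text_mining.py.

```python
def follow_first_links(start, first_link, target="Philosophy", max_steps=100):
    """Follow first links from `start` until `target`, a dead end, or a loop."""
    path = [start]
    seen = {start}
    while path[-1] != target and len(path) <= max_steps:
        nxt = first_link(path[-1])
        if nxt is None or nxt in seen:
            break  # no link on the page, or we are going in circles
        path.append(nxt)
        seen.add(nxt)
    return path

# Toy link table copied from the tail of the path printed above.
toy_links = {"Letter_(alphabet)": "Symbol", "Symbol": "Idea", "Idea": "Philosophy"}
path = follow_first_links("Letter_(alphabet)", toy_links.get)
print(len(path) - 1, "steps:", " -> ".join(path))
```

Tracking visited pages matters because some first-link chains loop back on themselves instead of reaching the target.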

42 of 45

Predicting the Author of Beatles Songs

By Nathan Faber

This project employs word frequency analysis via TF-IDF and cosine similarity to look for similarities between all Beatles songs.

I scraped data from one site to get the lyrics and titles of songs and entered data that indicated who wrote each song.

As it turns out, these tools aren’t practical for solving this problem: I have less than 50% accuracy.
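For reference, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It uses one common TF-IDF variant (term frequency times log of inverse document frequency), and the project may have used a library with a different formula. The three "songs" are made-up placeholder lines, not real lyrics.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

songs = [
    "sun shines on the river",
    "the river runs to the sea",
    "guitar strings and drum beats",
]
vecs = tfidf_vectors(songs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A word that appears in every document gets weight zero, so the score depends on rarer shared words. Song lyrics are short, which leaves few such words to match on.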

43 of 45

Text Mining ‘Birth of a Nation’

Analyzing Word Frequency & Sentiment

Name          Rank    Frequency
Slave         22      91
White         36      62
Plantation    43      57
Suh           68      40

Sentiment Analysis Results:

{'neg': 0.0, 'neu': 0.981, 'pos': 0.019, 'compound': 1.0}

  • How does the silent nature of the film skew results?
  • Why does sentiment analysis return predominantly neutral results?
  • What other analysis can better capture the negative representation of African-American populations in this film?

Antonio Perez

44 of 45

Capulet vs Montague

Identify the number of times each character speaks and compare the number of conversations each house had in the play Romeo and Juliet.

{'Montague': 48.166666666666664, 'Capulet': 30.0, 'Neither': 17.90909090909091}

Oscar Zhang

45 of 45

Markov Text Generation of Political News Headlines

  • Scrape HTML headlines from CNN, Fox, Politico, WaPo, and NYTimes; use Markov chains to generate fake news
  • Chains longer than one word don’t have many diverging options due to dataset size
  • Varying degrees of believability/readability (Markov chains are not an ideal way to implement this)
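The point about longer chains can be checked by counting how many states have more than one possible next word. This is a rough sketch with made-up placeholder headlines. `build_chain` and `branching_share` are illustrative names.

```python
from collections import defaultdict

def build_chain(headlines, order):
    """Map each `order`-word state to the set of words seen after it."""
    chain = defaultdict(set)
    for headline in headlines:
        words = headline.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].add(words[i + order])
    return chain

def branching_share(chain):
    """Fraction of states with more than one possible next word."""
    if not chain:
        return 0.0
    return sum(len(nxt) > 1 for nxt in chain.values()) / len(chain)

# Placeholder headlines, not scraped data.
headlines = [
    "senate passes the bill",
    "senate passes the budget",
    "house passes a bill",
]
print(branching_share(build_chain(headlines, 1)))  # 0.4
print(branching_share(build_chain(headlines, 2)))  # 0.25
```

On a small dataset the share of branching states drops as the order goes up, so higher-order output tends to reproduce source headlines nearly verbatim.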

How the level of Aircraft Carrier Pleads for investigation

Both public health and voting-rights experts say there's little time for customers to Power Shift

Both public health experts and hundreds of aircraft carrier hit by coronavirus is affecting US

The FBI's surveillance.

US economy shuttered to treat Covid-19, pitching him head-to-head with public health and provide models

Analysis: Trump's interview with Sean Hannity

Fox Nation hosts have reasons to talk to him head-to-head with anti-environment voting procedures

5 takeaways from coronavirus briefings

Fauci says he won't allow Democrats "to achieve unrelated policy items they wouldn't otherwise"

Doctors disagree with coronavirus stimulus checks

How two weeks changed America, brace for six months under a strict quarantine after intensive

4 takeaways from Congress to expand in praising Trump wildly exaggerates 1918 flu mortality rate

History's verdict on hold -- White House reassessing deal bars Trump's poll surge last?

NYPD to hear case of the Country: 2 Sets of Netflix stock.

Lilly Ledbetter, advocate for female VP pledge

Top 20 words in the news: ['to', 'of', 'for', 'on', 'in', 'with', 'is', 'says', 'as', 'from', 'be', 'he', 'during', 'response', 'by', 'will', 'that', 'pandemic', 'more', 'are']

Top 20 words in the news excluding common words: ['during', 'response', 'pandemic', 'need', 'amid', 'outbreak', 'governor', 'federal', 'public', 'political', 'home', 'help', 'guidelines', 'crisis', 'coronavirus,', 'bill', 'administration', 'virus', 'trillion', 'stimulus']
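Both lists can be produced by the same counting function with and without an exclusion set. This is a sketch under assumptions: `COMMON_WORDS` is a small hypothetical stop list, not the one used in the project, and the headlines are placeholders.

```python
import string
from collections import Counter

# Hypothetical stop list for illustration only.
COMMON_WORDS = {"to", "of", "for", "on", "in", "with", "is", "says", "as",
                "from", "be", "he", "by", "will", "that", "more", "are",
                "the", "a", "and"}

def top_words(headlines, n, exclude=frozenset()):
    """Return the n most frequent words, skipping any in `exclude`."""
    counts = Counter()
    for headline in headlines:
        for raw in headline.lower().split():
            # Strip surrounding punctuation so "virus," and "virus" match.
            word = raw.strip(string.punctuation)
            if word and word not in exclude:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]

demo = ["Virus, virus and the virus", "the bill"]
print(top_words(demo, 2))                        # ['virus', 'the']
print(top_words(demo, 2, exclude=COMMON_WORDS))  # ['virus', 'bill']
```

The entry 'coronavirus,' in the list above keeps its trailing comma, which suggests punctuation was not stripped there. This sketch strips it so both spellings are counted together.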

Lilo Heinrich