CLIMATE-RELATED TWITTER FEEDBACK ANALYSIS USING BERT
By: Yihang Hu, Qingcheng Wei, Zinan Zhang, Chengze Xie
GOAL OF THIS PROJECT
Why did we choose this project?
1. Facilitate user retention and interaction
2. Promote positive tweets
Benefits of this project
1. Facilitate user retention and interaction
Whenever a user posts a tweet, the system uses our model to predict whether the tweet will be liked.
Based on this prediction, the system ranks tweets predicted to attract many likes above others when users search for climate- or environment-related topics.
This facilitates user interaction with the platform and improves user stickiness.
GOAL OF THIS PROJECT
2. Promote positive tweets
Popular tweets are likely to cover positive and substantive environment-related topics.
With positive tweets pushed to the top, public awareness of climate change can be raised.
DATA GATHERING
1. We use the Tweepy package to query the Twitter API, providing each tweet's specific ID number.
2. We clean the table by removing all NaN values and organizing the data into texts and labels, which serve as our X and Y values.
3. We save the data to a CSV file for future reference and feed it into our pre-trained model (steps 2 and 3 are sketched after this list).
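A minimal sketch of the cleaning and saving steps, assuming the raw tweets have already been collected into a pandas DataFrame; the column names and the liked/not-liked labeling rule are placeholder assumptions, not fixed by the slides.

```python
# A sketch of the cleaning and saving steps; column names and the
# labeling rule (liked vs. not liked) are placeholder assumptions.
import pandas as pd

def clean_and_save(df: pd.DataFrame, path: str = "climate_tweets.csv") -> pd.DataFrame:
    """Drop NaN rows, build Texts (X) and Labels (Y), and save to CSV."""
    df = df.dropna(subset=["text", "like_count"])      # remove all NaN values
    df["label"] = (df["like_count"] > 0).astype(int)   # placeholder labeling rule
    out = df.rename(columns={"text": "Texts", "label": "Labels"})[["Texts", "Labels"]]
    out.to_csv(path, index=False)                      # save for future reference
    return out
```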
DATA GATHERING: Example
1. Given a tweet ID number: 1028957353322762240
2. Use the ID as the parameter for the Tweepy API request functions get_tweets() & get_liking_users()
3. We get the tweet text and like count associated with this ID (a code sketch follows this example):
Text: “An eye-opening article. This further reinforces the need to switch to a more enviroment friendly lifestyle.\n@EamonRyan thank you for sharing this!”
Label: 0
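A minimal sketch of this retrieval step using Tweepy's v2 Client; "BEARER_TOKEN" is a placeholder for a real credential.

```python
# A sketch of the retrieval step with Tweepy's v2 Client
# ("BEARER_TOKEN" is a placeholder for a real credential).
import tweepy

client = tweepy.Client(bearer_token="BEARER_TOKEN")
tweet_id = 1028957353322762240

# get_tweets() returns the text and, with public_metrics, the like count.
resp = client.get_tweets(ids=[tweet_id], tweet_fields=["public_metrics"])
tweet = resp.data[0]
print(tweet.text)                          # the tweet text
print(tweet.public_metrics["like_count"])  # the liking number

# get_liking_users() returns the users who liked the tweet.
likers = client.get_liking_users(tweet_id)
```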
DISPLAY THE DATASET
Pre-trained Model
BERT Tokenizer
With this BERT-specific tokenizer, we transform each text into a sequence of token IDs via BERT's vocabulary lookup table (sketched below).
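A minimal sketch of the tokenization step; the bert-base-uncased checkpoint, maximum length, and padding strategy are assumptions.

```python
# A sketch of the tokenization step; the checkpoint name, max_length,
# and padding strategy are assumptions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "An eye-opening article. This further reinforces the need ..."
encoded = tokenizer(
    text,
    padding="max_length",  # pad every sequence to the same length
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"])       # token IDs from BERT's vocabulary lookup table
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```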
Why choose the BERT transformer?
BERT Transformer vs. RNN/LSTM: BERT reads the whole sequence at once with bidirectional self-attention and starts from pre-trained language representations, while RNN/LSTM models process tokens one by one and capture long-range context less well.
MODEL
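A minimal sketch of the classifier, assuming a pre-trained BERT encoder with a binary classification head (BertForSequenceClassification); the exact checkpoint and head configuration are not spelled out above, so treat these as illustrative choices.

```python
# A sketch of the classifier: pre-trained BERT with a binary
# classification head (the checkpoint is an assumption).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # liked (1) vs. not liked (0)
)

enc = tokenizer("An eye-opening article. ...", return_tensors="pt",
                truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**enc).logits   # shape: (1, 2)
pred = int(logits.argmax(dim=-1))  # predicted label for the tweet
```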
FINAL OUTPUT
The figure on the right shows the average training loss in each epoch. Based on our experiments, we obtained a test accuracy of 83.82%.
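A minimal sketch of how these numbers can be computed: the average training loss per epoch and the overall test accuracy. It reuses the model from the sketch above; train_loader, test_loader, the epoch count, and the AdamW learning rate are placeholder assumptions.

```python
# A sketch of the training/evaluation loop (reuses `model` from the
# sketch above; train_loader and test_loader are assumed DataLoaders
# yielding dicts with input_ids, attention_mask, and labels).
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)  # placeholder learning rate

for epoch in range(4):  # placeholder epoch count
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        out.loss.backward()
        optimizer.step()
        total_loss += out.loss.item()
    # average training loss in this epoch, as plotted in the figure
    print(f"epoch {epoch}: avg training loss {total_loss / len(train_loader):.4f}")

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in test_loader:
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits
        correct += (logits.argmax(dim=-1) == batch["labels"]).sum().item()
        total += batch["labels"].size(0)
print(f"test accuracy: {correct / total:.2%}")  # e.g. 83.82%
```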
Difficulties while fitting the model
Each API request can retrieve only a limited number of tweets, since we only have the most basic developer access, so the downloading process is extremely long.
The tokenization step is time-consuming: after preprocessing (such as removing special characters), every original tweet has to be converted into a list of numbers for both training and testing.
The optimizer has to be chosen carefully for NLP, and because the original posts vary widely (e.g., emoji or languages other than English), error handling such as try/except has to be implemented (see the sketch below).
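A minimal sketch of the try/except guard described above; the exception class caught and the skip-on-failure policy are assumptions.

```python
# A sketch of the try/except guard: skip tweets whose retrieval or
# content handling fails (the caught exception class is an assumption).
import tweepy

def fetch_text(client: tweepy.Client, tweet_id: int):
    """Return the tweet text, or None if the request fails."""
    try:
        resp = client.get_tweets(ids=[tweet_id])
        return resp.data[0].text if resp.data else None
    except tweepy.TweepyException:
        # network, rate-limit, or permission errors: skip this tweet
        return None
```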
References
Pre-trained model acquisition:
1. https://github.com/avishreekh/Depression-detection-using-Twitter-posts
Data acquisition:
2. Climate Change Tweets Ids, GWU Libraries Dataverse (Harvard Dataverse)
Q & A