1

Topic Enhanced Word Embedding for Toxic Content Detection in Q&A Sites

Do Yeon Kim, Xiaohang Li, Sheng Wang, Yunying Zhuo, and Roy Ka-Wei Lee

SNAA’19

Good afternoon everyone. I am Roy, originally from Singapore Management University, but I just joined the University of Saskatchewan 10 days ago. This work was actually done by a group of undergraduate students from SMU, but due to the long flight and the school term, they were unable to present it themselves, so they nominated me to present this work instead.

2

Background

  • Users are increasingly using Q&A websites for information exchange

Users are increasingly using Q&A websites for information exchange. Has anyone here used Quora before? If you have not, and you are bored traveling or waiting and just want something to read to kill time, I highly recommend reading some Quora questions and answers. They can be pretty interesting, although I have to caution you not to believe everything that is posted there since, well, it is part of the Internet and it is user-generated content.

3

Background

  • Users are increasingly using Q&A websites for information exchange

  • But not all users play well...

While Quora can be fun to read and sometimes helps answer our questions, like many community-based websites, not all users actually play well... Take for example these actual posts that I have extracted from Quora. I am not sure if they are still around; if they are, you can search for them. Sometimes people ask questions that seem more like statements, which can be offensive. For example, "Are the BTS members gay?" BTS is a famous Korean boy band, and if you are a fan of BTS, you will find it offensive. At least I think my students found this offensive; that's why they used this example for this deck of slides.

4

Toxic Content in Q&A Site

  • Quora’s challenge to detect insincere questions1:
    • Has a non-neutral tone. Has an exaggerated tone to underscore a point about a group of people or is rhetorical and meant to imply a statement about a group of people.
    • Is disparaging or inflammatory. Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype.
    • Isn’t grounded in reality. Based on false information, or contains absurd assumptions.
    • Uses sexual content. Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers.

That brings me to Quora’s challenge of detecting such insincere questions. This was originally an interesting Kaggle challenge, and the challenge defines an insincere question as a question which has...

5

Ideation Questions

  • Do some topics (e.g. politics) inherently have more insincere questions than others?

  • If so, can we use the topical information to improve insincere question classification?

So while this group of students was working on this Kaggle challenge, we also came together to formulate some interesting research questions. We were wondering if some topics (e.g. politics) inherently have more insincere questions than others. And if so, can we leverage this topical information to improve insincere question classification?

6

Empirical Study

  • Dataset

  • Apply Latent Dirichlet allocation (LDA) to learn the topics in the questions

So we first performed an empirical study and some data exploration. This is the dataset released by Quora. If you are interested, this dataset should still be publicly available. From the data summary table, we see that this is an imbalanced dataset, such that, well, fortunately, we have many more sincere questions than insincere ones (people are still generally nice online). As we are interested in extracting topical information, we apply LDA to learn the topics in the questions.
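The topic learning step can be sketched as follows. This is a minimal illustration using scikit-learn's LDA on a made-up toy corpus of four questions; the actual study fits 140 topics on the Quora dataset, with the number of topics chosen by coherence score.

```python
# Sketch: learning question topics with LDA.
# The corpus and topic count below are illustrative assumptions;
# the study uses the Quora dataset with 140 topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

questions = [
    "how do I learn python programming",
    "what is the best way to study mathematics",
    "why do politicians lie about the economy",
    "is the government hiding the truth about the election",
]

# Bag-of-words counts, the standard input for LDA.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(questions)

# Two topics for the toy corpus; the paper selects 140 by coherence.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # shape: (n_questions, n_topics)

# Each row is a per-question topic distribution summing to ~1.
print(doc_topics.shape)  # (4, 2)
```

Comparing these per-question topic distributions, aggregated separately over sincere and insincere questions, is what produces the distribution charts on the next slides.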

7

Topics in Questions

  • Topic distributions for sincere and insincere questions (based on 140 topics, the optimal number according to coherence score)

These are the topic distributions for sincere and insincere questions. The blue chart is for sincere questions and the orange one is for insincere questions. The distribution is based on 140 topics, which is the optimal number of topics based on coherence scores. Looking at these two charts, we see some interesting patterns.

8

Topics in Questions

  • Topic distributions for sincere and insincere questions (based on 140 topics, the optimal number according to coherence score)

Sincere questions are quite uniform in their topic distributions...

Firstly, we notice that the sincere questions are quite uniform in their topic distributions. That is to say, sincere questions can come from all sorts of topics.

9

Topics in Questions

  • Topic distributions for sincere and insincere questions (based on 140 topics, the optimal number according to coherence score)

Sincere questions are quite uniform in their topic distributions...

Spikes in insincere questions...

For the insincere questions, the distribution is not so uniform. More interestingly, we observe spikes in certain topics. That means some topics do have more insincere questions than others. So what are these topics? Anyone want to make a guess?

10

Topics in Questions

  • Topic distributions for sincere and insincere questions (based on 140 topics, the optimal number according to coherence score)

Sincere questions are quite uniform in their topic distributions...

Politics

Gender

Religion & Culture

Race & Politics

Race

We examined the posts assigned to these spiked topics and manually labeled the topics. We found that these spikes cover a few topics: politics (no surprise there, since politics can get quite nasty), gender, religion and culture, and race. These are all topics in which we find more insincere questions being asked.

11

Topics in Questions

  • Keywords in top 5 topics for sincere and insincere questions

Sincere

Insincere

This table shows the top 5 popular topics for sincere and insincere questions. For sincere questions, we see more fact-based, science-based, or technical topics such as software engineering, mathematics, nutrition, and the environment as popular topics. I guess these topics are more fact-based, so you are less likely to see insincere questions in this group. Interestingly, we observed that questions on dating topics are mostly sincere too, which was a surprise to us since it is such a subjective topic. On the other hand, the insincere questions cover more controversial and sensitive topics like politics, gender, race, and religion.

12

Ideation Questions

  • Do some topics (e.g. politics) inherently have more insincere questions than others?

  • If so, can we use the topical information to improve insincere question classification?

Now that we have learned that some topics indeed inherently have more insincere questions than others, can we use this topical information to improve insincere question classification?

13

Problem Formulation

  • Given a question q, which contains a sequence of words Wq = (wq1, wq2, ..., wqN), the task is to predict if a question is insincere based on the word sequence used in the question, i.e., ƒ(Wq) → {0,1} such that:

ƒ(Wq) = 1 if sincere, 0 if insincere

Before we get into that, we first want to define the insincere question classification problem. We formulate the problem as follows: given a question q, which contains a sequence of words W, the task is to predict if the question is insincere based on the word sequence used in the question, where we define a function f such that f(W) is 1 if the question is sincere and 0 otherwise.

14

Proposed Framework

  • Existing BiLSTM model for text classification

A possible, or perhaps currently popular, way to handle text classification is using deep neural networks such as the BiLSTM model. For those of you who are not familiar with the BiLSTM model, essentially this is how it works. We first use a pre-trained word embedding such as word2vec. Again, these word embeddings are just latent vectors which we use to represent each word, such that words with similar semantics have more similar embeddings, i.e., they are closer in the embedding space. In any case, the word embedding representation of each word in the question is used as input to the BiLSTM model. The BiLSTM model, through the use of gates and recurrent neural networks, outputs a final softmax value which tells us the likelihood of whether the question is sincere or insincere.

15

Proposed Framework

  • Augment topic representations of words to pre-train word embedding as BiLSTM input

Here’s where we modify and improve the existing model. We basically enhance the word embeddings by augmenting the topic representation of each word with the word embedding! This enhanced word embedding is then used as the input to the BiLSTM model. Let’s get into more details on how this is done.

16

Topic-Enhanced Word Embedding

  • Word-Level Topic Acquisition. Obtain topic representation Ƭw of a word w

  • Augment Topic on Pre-trained Word Embedding.

Number of times word w is assigned to topic z in training

element-wise concatenation of the pre-trained word embedding xnq and topic representation Ƭnq of the words in Wq

We first acquire the word-level topic representation Ƭw of a word w using this formula. Basically, it counts the number of times word w is assigned to a particular topic z during training, normalized over all the topic assignments for this particular word w.

Next, we concatenate the topic representation of each word with its pre-trained word embedding.
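These two steps can be sketched concretely as follows. The topic assignment counts and the 4-dimensional embedding are made-up illustrative values, and the variable names are mine, not the paper's.

```python
# Sketch of the word-level topic representation described above:
# T_w[z] = (# times word w is assigned to topic z) / (total assignments of w),
# followed by concatenation with the pre-trained word embedding.
import numpy as np

# Hypothetical per-occurrence topic assignments for one word,
# collected during LDA training.
assignments = [2, 2, 0, 2, 1, 2]
num_topics = 3

# Normalized topic counts -> topic representation T_w.
counts = np.bincount(assignments, minlength=num_topics)
topic_rep = counts / counts.sum()          # [1/6, 1/6, 4/6]

# Pre-trained word embedding for the same word (toy 4-d vector).
word_emb = np.array([0.1, -0.3, 0.5, 0.2])

# Topic-enhanced embedding: concatenation of embedding and T_w.
enhanced = np.concatenate([word_emb, topic_rep])
print(enhanced.shape)  # (7,)
```

In the actual model, the topic vector would have 140 entries (one per topic) and the word embedding would be a standard pre-trained vector such as 300-d GloVe.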

17

Topic-Enhanced Word Embedding with BiLSTM

  • Rewrite the recurrent state transition step in BiLSTM by replacing the word embedding with the topic-enhanced word embedding

Finally, we rewrite the BiLSTM recurrent state transition step with our new topic-enhanced word embedding.
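A minimal sketch of this classifier in PyTorch, assuming illustrative dimensions (50-d word embeddings, 140-d topic vectors, hidden size 64) rather than the paper's actual hyperparameters:

```python
# Minimal BiLSTM classifier sketch taking topic-enhanced embeddings as
# input in place of the plain word embeddings. All dimensions here are
# illustrative assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn

class TopicBiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes=2):
        super().__init__()
        # The recurrent state transition now consumes the concatenated
        # [word embedding ; topic representation] vectors.
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                   # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.bilstm(x)        # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # final fwd + bwd states
        return torch.softmax(self.fc(h), dim=1)  # sincere vs insincere

# Toy batch: 8 questions, 20 tokens, 50-d embedding + 140-d topic vector.
model = TopicBiLSTM(input_dim=50 + 140, hidden_dim=64)
scores = model(torch.randn(8, 20, 190))
print(scores.shape)  # (8, 2)
```

The only change from the plain BiLSTM baseline is the input dimension: the LSTM sees the 190-d topic-enhanced vectors instead of 50-d word embeddings alone.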

18

Results

  • BiLSTM might have captured most of the essential features in the text classification task

We implemented the BiLSTM with different types of popular pre-trained word embeddings, namely Paragram, FastText, GloVe, and word2vec. Now for the results, it is a little anti-climactic. It turns out that the topic-enhanced word embeddings did improve over the original word embeddings with BiLSTM, but the improvement is really, really small. A possible explanation could be that the BiLSTM might have already captured most of the essential features for text classification.

19

Takeaways

  • Empirically shown that insincere questions tend to concentrate on certain topics.

  • Proposed topic-enhanced word embedding for insincere question classification

  • Future work:
    • Consider other more advanced techniques to topically enhance the word embeddings (e.g., attention techniques, etc.)
    • Extend the study to other antisocial behaviors (e.g. cyberbullying, hate speech, etc.)

To sum up, in this work, we have empirically...
