1 of 17

Clustering Text Using Attention

Lovedeep Singh

2 of 17

Agenda

Attention Mechanism

Clustering Techniques

NLP pipeline

Methodology

Results

Discussion

Summary


3 of 17

Attention Mechanisms

  • In simple terms, an attention mechanism can be thought of as an additional layer somewhere in a network architecture that gives the deep learning model extra trainable parameters, letting it refine its learning by paying attention to different parts of the input as required (see the sketch below)
  • Fig. 1 depicts the attention mechanism used in a simple feed-forward neural network. Raffel et al. [11] show that feed-forward networks with attention can solve some long-term memory problems
  • Galassi et al. [10] provide a comprehensive, structured, in-depth, and sound analysis of attention in natural language processing
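To make the idea concrete, here is a minimal sketch (not the exact layer from [11]) of additive, feed-forward-style attention over a sequence of hidden states: a learned scoring function assigns a weight to each time step, and the weighted sum becomes the context vector. The parameter names W and v are illustrative.

```python
import numpy as np

def feed_forward_attention(H, W, v):
    """Additive attention over a sequence of hidden states.

    H: (T, d) hidden states, W: (d, d) projection, v: (d,) scoring vector.
    Returns the attention weights (T,) and the weighted context vector (d,).
    """
    scores = np.tanh(H @ W) @ v           # one scalar score per time step
    scores = scores - scores.max()        # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ H                   # attention-weighted summary of the input
    return alpha, context

# toy usage with random parameters
rng = np.random.default_rng(0)
T, d = 5, 8
H = rng.normal(size=(T, d))
alpha, context = feed_forward_attention(H, rng.normal(size=(d, d)), rng.normal(size=d))
print(alpha.round(3), context.shape)   # weights sum to 1; context has shape (8,)
```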


4 of 17

Clustering Techniques

  • Most clustering techniques rely on a distance metric to partition data into clusters after representing each data point in an n-dimensional space
  • Many clustering algorithms also require the number of clusters as an input. If our experience with or knowledge of the data is limited, estimating the number of clusters can be challenging
  • Techniques are available to estimate the number of clusters, most commonly the elbow method or silhouette score analysis. Sometimes researchers also use the square root of the number of data points as an estimate (see the sketch below)
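As an illustration of those estimation techniques, the sketch below runs a silhouette analysis over a range of candidate values of k with scikit-learn and compares it with the square-root rule of thumb; the blob data is only a stand-in for real text vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in for text vectors

# silhouette analysis: pick the k with the highest average silhouette score
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# rule-of-thumb estimate: square root of the number of data points
sqrt_k = int(round(np.sqrt(len(X))))

print(f"silhouette suggests k={best_k}, sqrt heuristic suggests k={sqrt_k}")
```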


5 of 17

NLP pipeline

  • The pre-algorithm treatment applied to text, whether to obtain a vector representation or a feature set for a given piece of text, plays a significant part in the ultimate performance of the clustering technique
  • This pre-algorithm pipeline opens up many permutations and combinations, beyond the choice of the final clustering algorithm, for improving overall performance (a minimal baseline pipeline is sketched below)
  • We discuss the use of an attention mechanism in this pipeline to improve the overall performance of clustering
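For reference, a minimal "plain clustering" pipeline of this kind might look as follows: a TF-IDF vectoriser as the pre-algorithm treatment, followed by k-means as the final algorithm. The toy documents and the choice of TF-IDF are illustrative, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "this drug relieved my headache quickly",
    "severe side effects, would not recommend",
    "no improvement after two weeks of use",
    "worked well with only mild drowsiness",
]

# pre-algorithm treatment: turn raw text into fixed-length vectors
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# the final clustering step operates only on those vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```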


6 of 17

Methodology

  • We use hierarchical attention in the pre-algorithm treatment before applying any clustering technique. The intuition is taken from the work on document classification using Hierarchical Attention Networks by Yang et al. [12]
  • Each document is made up of a number of sentences, and each sentence of a number of words. We leverage the same architecture to obtain a document vector and use it as the input vector for the clustering techniques (see the sketch below)
  • Words are the smallest atomic unit in our architecture; each word is represented by a fixed-dimension embedding
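The sketch below shows the word-to-sentence-to-document composition with two levels of additive attention; it omits the GRU encoders of the full Hierarchical Attention Network [12] and uses random embeddings and parameters purely for illustration.

```python
import numpy as np

def attention_pool(H, W, v):
    """Additive attention pooling: (T, d) vectors -> one (d,) weighted summary."""
    scores = np.tanh(H @ W) @ v
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H

def document_vector(sentences, Ww, vw, Ws, vs):
    """sentences: list of (n_words_i, d) word-embedding matrices for one document."""
    sent_vecs = np.stack([attention_pool(S, Ww, vw) for S in sentences])  # word-level attention
    return attention_pool(sent_vecs, Ws, vs)                              # sentence-level attention

rng = np.random.default_rng(1)
d = 16
doc = [rng.normal(size=(n, d)) for n in (7, 4, 9)]      # a document of 3 embedded sentences
Ww, vw = rng.normal(size=(d, d)), rng.normal(size=d)    # word-level attention parameters
Ws, vs = rng.normal(size=(d, d)), rng.normal(size=d)    # sentence-level attention parameters
doc_vec = document_vector(doc, Ww, vw, Ws, vs)
print(doc_vec.shape)   # (16,) -- one fixed-length vector per document
```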


7 of 17

Hierarchical Attention Network (architecture figure, after Yang et al. [12])


8 of 17

Methodology

  • In the original classification task, the weights of the network architecture are learned during training; at test time, these weights generate meaningful document vectors just before the final softmax layer
  • Since clustering is unsupervised, it has no training phase before testing. How, then, do we obtain the weights of the different layers needed to produce a document vector before passing it to clustering? We use a classification problem to learn these weights before clustering (see the sketch below)
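The sketch below (a simplified, single-level attention classifier in PyTorch, with illustrative names and sizes rather than the paper's exact architecture) shows the structural point: the softmax head is used only for the auxiliary classification task, while the representation just before it is the document vector handed to clustering.

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    def __init__(self, emb_dim=100, doc_dim=64, num_classes=4):
        super().__init__()
        self.attn = nn.Linear(emb_dim, 1)             # scores one weight per word
        self.proj = nn.Linear(emb_dim, doc_dim)       # maps the pooled words to a document vector
        self.head = nn.Linear(doc_dim, num_classes)   # softmax head, used only during training

    def encode(self, words):                          # words: (batch, T, emb_dim)
        alpha = torch.softmax(self.attn(words), dim=1)        # attention weights over words
        return torch.tanh(self.proj((alpha * words).sum(1)))  # document vector, pre-softmax

    def forward(self, words):
        return self.head(self.encode(words))          # logits for the classification task

model = AttentionClassifier()
x = torch.randn(2, 12, 100)           # 2 toy documents of 12 embedded words
print(model(x).shape)                 # training path: (2, 4) class logits
print(model.encode(x).shape)          # clustering path: (2, 64) document vectors
```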


9 of 17

Methodology

  • The concept used to learn the weights goes as follows. During preliminary data analysis, we separate out a fraction of the complete data. We manually annotate this fraction and segregate it into different classes. We then train a classification model using a Hierarchical Attention Network on this fraction. This training lets us learn the weights and gain insight into the structure and partitioning present in the data. Finally, we save the learned model parameters and use them to obtain document vectors for the remaining data before passing those vectors to a clustering algorithm (an end-to-end sketch follows)
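A compact end-to-end sketch of this procedure is given below. The tensors stand in for already-embedded documents, the model is a single-level attention classifier for brevity, and the names (AttnClf, attention_encoder.pt) are arbitrary; the real pipeline uses the hierarchical architecture and actual manual annotations.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

torch.manual_seed(0)

# Toy stand-ins: 200 pre-embedded documents of 10 "words" each.
emb_dim, n_classes = 50, 3
X_all = torch.randn(200, 10, emb_dim)

# Step 1: separate out an annotated fraction (here 25%) with manual class labels.
X_frac, y_frac = X_all[:50], torch.randint(0, n_classes, (50,))
X_rest = X_all[50:]

# Minimal attention classifier (single-level attention, for brevity).
class AttnClf(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(emb_dim, 1)
        self.head = nn.Linear(emb_dim, n_classes)
    def encode(self, x):
        alpha = torch.softmax(self.attn(x), dim=1)
        return (alpha * x).sum(1)                     # attention-weighted document vector
    def forward(self, x):
        return self.head(self.encode(x))

# Step 2: train on the annotated fraction to learn the attention weights.
model = AttnClf()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    nn.functional.cross_entropy(model(X_frac), y_frac).backward()
    opt.step()

# Step 3: save the learned parameters, then encode the remaining data with them.
torch.save(model.state_dict(), "attention_encoder.pt")
with torch.no_grad():
    doc_vecs = model.encode(X_rest).numpy()

# Step 4: cluster the attention-derived document vectors.
labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(doc_vecs)
print(labels[:10])
```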


10 of 17

Methodology

  • The intuition behind this reflects real-world practice. Whenever we face a problem that requires bucketizing data, we usually look at and analyze the data before turning to clustering techniques. During this preliminary analysis, the data scientist typically annotates some fraction of the data manually to capture various parameters, one important parameter being the number of clusters. Since the data scientist is already performing such an analysis, a little more effort spent annotating that fraction into separate classes can be used to learn the attention weights and other model parameters of a Hierarchical Attention Network, which can then be leveraged to improve the performance of clustering algorithms


11 of 17

Results


Avg. Ev. = (homogeneity + completeness + V-measure + ARI + AMI + silhouette) / 6
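A sketch of how this average evaluation score could be computed, assuming the abbreviations map to the standard scikit-learn clustering metrics (with "var" read as the V-measure):

```python
import numpy as np
from sklearn import metrics

def average_evaluation(X, labels_true, labels_pred):
    """Mean of the six evaluation scores used in the results."""
    scores = [
        metrics.homogeneity_score(labels_true, labels_pred),           # homo
        metrics.completeness_score(labels_true, labels_pred),          # comp
        metrics.v_measure_score(labels_true, labels_pred),             # var (V-measure)
        metrics.adjusted_rand_score(labels_true, labels_pred),         # ari
        metrics.adjusted_mutual_info_score(labels_true, labels_pred),  # ami
        metrics.silhouette_score(X, labels_pred),                      # silh (needs the vectors)
    ]
    return float(np.mean(scores))
```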

12 of 17

Results


13 of 17

Results


14 of 17

Discussion

  • Some general trends are visible across the experiments. Overall, attention-based clustering performed better than plain clustering
  • Within the attention-clustering variations, for both AS and AP, the clustering performance metrics improved as the fraction of data used for attention training increased. This is intuitive: the more data used for attention training, the better the attention weights capture the signal present in the data
  • Within attention clustering, the results of the AS and AP variations were not very different, i.e., pre-trained and self-trained word embeddings gave similar results. This is because the drugs dataset contains many common English words; even though it includes some domain-specific jargon, pre-trained word embeddings were still able to capture it. In addition, the attention weights helped the clustering algorithm capture the differences almost equally well in both the AP and AS variations
  • This suggests, intuitively, that even random vector-space representations with no contextual meaning attached may yield reasonable performance, rather than completely flawed metrics, because of the attention training involved in the pre-clustering pipeline


15 of 17

Summary

It is evident that clustering using an attention mechanism does help the overall performance of the clustering algorithm. Performance improves as the fraction of data used for attention training increases. We have used Hierarchical Attention Networks in our experiments; there could be other ways to incorporate an attention mechanism into the pre-clustering pipeline, such as self-attention or the attention used in Transformers. This paper tries to shed light on less explored possibilities in the clustering field.


16 of 17

References

  • [1] Aggarwal C.C., Zhai C. (2012) A Survey of Text Clustering Algorithms. In: Aggarwal C., Zhai C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_4
  • [2] Cao, Jianping & Wang, Senzhang & Wen, Danyan & Peng, Zhaohui & Yu, Philip. (2019). Mutual Clustering on Comparative Texts via Heterogeneous Information Networks.
  • [3] C. Shi, Y. Li, J. Zhang, Y. Sun and P. S. Yu, "A survey of heterogeneous information network analysis," in IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 17-37, 1 Jan. 2017, doi: 10.1109/TKDE.2016.2598561.
  • [4] Karol Grzegorczyk. (2019). Vector representations of text data in deep learning. arXiv:1901.01695
  • [5] K. Babić, S. Martinčić-Ipšić, and A. Meštrović, “Survey of Neural Text Representation Models,” Information, vol. 11, no. 11, p. 511, Oct. 2020 [Online]. Available: http://dx.doi.org/10.3390/info11110511
  • [6] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015, pp. 1–15.
  • [7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
  • [8] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT.
  • [9] Radford, A., & Narasimhan, K. (2018). Improving Language Understanding by Generative Pre-Training.
  • [10] A. Galassi, M. Lippi and P. Torroni, "Attention in Natural Language Processing," in IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2020.3019893.
  • [11] Raffel, Colin & Ellis, Daniel. (2015). Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems.
  • [12] Yang, Zichao & Yang, Diyi & Dyer, Chris & He, Xiaodong & Smola, Alex & Hovy, Eduard. (2016). Hierarchical Attention Networks for Document Classification. 1480-1489. 10.18653/v1/N16-1174
  • [13] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125. DOI: https://doi.org/10.1145/3194658.3194677


17 of 17

Thank you

Complete code, dataset details, and results are available at https://github.com/singh-l/CTUA for further experimentation and analysis
