Abstract
Large volumes of text are generated on online and social media platforms, and much of it contains abuse, threats, and malicious content. This paper presents the use of two Natural Language Processing (NLP) models to detect whether a text is abusive, threatening, or targeting an individual or a group. The problems this paper focuses on are text classification and multi-class, multi-label text classification using deep learning models. BERT and XLNet are popular language models that have achieved state-of-the-art results on many NLP tasks. Both are pre-trained models; this paper makes use of them and fine-tunes them for the abusive language detection task.
Introduction
Offensive language identification and hate speech recognition are computational natural language processing tasks that detect abusive language in text. The text may consist of tweets, posts, and comments on various social media platforms, or of speeches given by public figures such as celebrities and politicians. The term abusive here covers threats, targeted abuse (of an individual, a group, or a community), criticism, discrimination, insults, and similar content. Individuals or groups who face large amounts of abusive content can use the proposed automated detection model to hide that type of content from their social accounts. Since social media is nowadays accessible to everyone, including children, and sensitive language is routinely posted on the internet, such a model can detect abusive content and hide it from sensitive users and children.
Vast amounts of text data are generated on social media, and training an NLP model from scratch on such a huge corpus is complex and expensive. Pre-trained models are therefore used and transfer-learned on downstream tasks such as text classification, text summarization, and question answering. BERT [4] and XLNet [5] are language models pre-trained on huge text corpora, built on the transformer architecture, and fine-tuned on other datasets for abusive content detection. These two models achieved state-of-the-art results on many NLP tasks when trained on task-specific data. We use both models in this paper.
Related Work
Many statistical models deal with text classification (whether text is abusive or not), such as Naive Bayes and Support Vector Machines (SVM). Deep learning is preferred for many NLP tasks because it works well with large amounts of data, and Recurrent Neural Networks (RNNs) [1], a special type of neural network, are used to handle the sequential nature of text. An RNN produces a final fixed-length representation of the sentence as its output context, which is fed through a softmax layer for classification, as in the sketch below. However, RNNs cannot represent longer sentences properly precisely because of this fixed-length context, and they also suffer from vanishing and exploding gradient problems.
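A minimal sketch of such an RNN classifier, with illustrative (assumed) hyper-parameters:

```python
# Minimal sketch of the RNN baseline described above (Keras).
# Vocabulary size and layer widths are illustrative assumptions.
import tensorflow as tf

VOCAB_SIZE = 20000
EMBED_DIM = 128
HIDDEN_DIM = 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # The RNN returns only its final hidden state, i.e. the fixed-length
    # sentence representation mentioned in the text.
    tf.keras.layers.SimpleRNN(HIDDEN_DIM),
    # Softmax layer over the two classes (offensive / not offensive).
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```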
Variants such as LSTM and GRU are used instead, but since they are sequential in nature they cannot be parallelised, and computation becomes hard with large amounts of training data. Then a parallel architecture called the Transformer [2] was introduced, with attention and self-attention mechanisms, and it became a game changer in the NLP field. Transformers largely replaced RNNs and LSTMs and eventually became the basic building block of modern NLP systems. The latest state-of-the-art models such as BERT and XLNet are built on the transformer architecture and the attention mechanism.
Baselines
As mentioned, the two baseline NLP models that we used and implemented are BERT and XLNet, both of which are built on the transformer architecture.
Transformer
“Attention Is All You Need” (Vaswani et al., 2017) [2] is one of the most influential papers in the field of Natural Language Processing. The Transformer model is built entirely on self-attention mechanisms, without using a sequence-aligned recurrent architecture. Further advancements such as BERT, XLNet, and OpenAI GPT are based on the transformer architecture.
The Transformer contains 6 encoder layers and 6 decoder layers. The encoder encodes the input sentence. First, positional encodings are added to the word embeddings to preserve word order; the resulting embeddings are sent into a self-attention layer, which attends to different parts of the same sentence. The self-attention layer is multi-headed so that the model can focus on different positions of the input, i.e., learn multiple representations of it. The output of the self-attention layer is passed through a feed-forward neural network, and this pattern is repeated across the 6 encoder layers. The Transformer views the encoder output as key-value pairs.
The decoder is similar to the encoder but has one additional layer, a masked multi-head attention layer, which prevents leakage of future information at inference time (the mask argument in the sketch below). The output of this layer is fed into the next self-attention layer as the query, while the keys and values come from the last encoder output; the result finally passes through a feed-forward neural network.
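At the heart of both encoder and decoder is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal sketch, with the decoder's future-token mask as an optional argument:

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., 2017).
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: tensors of shape (..., seq_len, depth)."""
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        # Masked positions (e.g. future tokens in the decoder) receive a
        # large negative score, so softmax gives them near-zero weight.
        scores += mask * -1e9
    weights = tf.nn.softmax(scores, axis=-1)  # attention distribution
    return tf.matmul(weights, v)              # weighted sum of the values
```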
BERT
Earlier transformer language models (such as GPT) are uni-directional, and the key innovation of BERT is bi-directionality: the model captures information from both the left and right context of a token, from the very first layer through all the others. BERT uses only the encoder part of the transformer. BERT is trained with masked language modelling (LM), masking out some tokens in the input and predicting them; a masked-LM example is sketched after the list below. It also performs a second task, next sentence prediction (deciding whether the second sentence actually follows the first). When training BERT, masked LM and next sentence prediction are trained together, with the goal of minimizing the combined loss of the two objectives. The pre-trained BERT model is publicly available in two sizes:
- BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
- BERT Large: 24 layers, 16 attention heads, and 340 million parameters.
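As a concrete illustration of the masked-LM objective, here is a minimal sketch using the Hugging Face transformers library and the public BERT Base checkpoint; the library choice is an assumption, since the paper does not name its tooling.

```python
# Hedged sketch: predicting a masked token with pre-trained BERT Base
# via the Hugging Face `transformers` library (assumed tooling).
import tensorflow as tf
from transformers import BertTokenizer, TFBertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("This comment is [MASK] and insulting.", return_tensors="tf")
logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Locate the masked position and take the highest-scoring vocabulary item.
mask_index = int(tf.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0][0])
predicted_id = int(tf.argmax(logits[0, mask_index]))
print(tokenizer.decode([predicted_id]))
```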
We use both pre-trained models in this paper for sentence classification. When a language model is used for another NLP task, a task-specific layer is added on top of it; in this paper that layer consists of dropout, a linear transformation, and softmax. Fine-tuning replaces the original output layer with one that recognizes the number of classes we require. The new output layer takes the lower-level features from the body of the network and maps them to the desired output classes using SGD, and the extra layer is then trained jointly with the BERT model. In this way, language models like BERT can achieve state-of-the-art results on downstream NLP tasks. A minimal fine-tuning sketch follows.
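The sketch below shows this set-up concretely, again assuming the Hugging Face transformers library: TFBertForSequenceClassification attaches a dropout + linear classification head of the kind described above to the pre-trained encoder, and the whole network is trained jointly.

```python
# Hedged fine-tuning sketch: pre-trained BERT plus a fresh two-class head
# (offensive / not offensive). Hyper-parameters are illustrative assumptions.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Encoder and new head are optimised together. The text above mentions SGD;
# Adam with a small learning rate is the more common choice in practice.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
```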
XLNet
The BERT model is not auto-regressive: when multiple words are masked, their predictions are independent of one another, and BERT also suffers from a pretrain/fine-tune discrepancy (the [MASK] token never appears at fine-tuning time). XLNet is auto-regressive and uses Transformer-XL [8] as its feature-extracting architecture; Transformer-XL adds recurrence to the transformer, which lets XLNet understand context in greater depth. XLNet basically uses two mechanisms:
- Permutation Language Modelling
- Two Stream Self Attention
Permutation language modelling is the task of predicting a token given the other tokens of the sentence in a random order: the model is trained over different permutations of the factorisation order of the words in the sentence, as the toy example below illustrates. In two-stream self-attention, each token position i in the sentence is associated with two vectors (a content stream and a query stream) at each self-attention layer; this mechanism prevents the token being predicted from seeing its own identity.
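A toy illustration of the permutation idea (plain Python, no model):

```python
# Toy illustration of permutation language modelling: for one sampled
# permutation of the factorisation order, each token is predicted from the
# tokens that precede it *in that permutation*, regardless of their actual
# positions in the sentence.
import random

tokens = ["the", "comment", "was", "abusive"]
order = list(range(len(tokens)))
random.shuffle(order)  # e.g. [2, 0, 3, 1]

for step, pos in enumerate(order):
    context = [tokens[p] for p in order[:step]]
    print(f"predict {tokens[pos]!r} at position {pos} given {context}")
```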
Pre-trained XLNet is also publicly available in two sizes (base and large). We applied it to our task in the same way as BERT, fine-tuning by adding one additional task-specific layer (see the sketch below). XLNet outperforms BERT on many NLP tasks.
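Under the same assumption of the transformers library, only the model class and checkpoint name change:

```python
# Hedged sketch: XLNet with the same kind of task-specific classification head.
from transformers import XLNetTokenizer, TFXLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = TFXLNetForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2)
# Compilation and training then proceed exactly as in the BERT sketch above.
```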
Dataset
We used two datasets for this task: the Offensive Language Identification Dataset (OLID) [9] and the Toxic Comment Classification dataset, which has six target labels. OLID contains a collection of tweets annotated using a model that encompasses three sub-tasks:
- Task A: Offensive Language Detection (the text is OFF, offensive, or NOT, not offensive).
- Task B: Categorization of Offensive Language (the text is TIN, a targeted insult, or UNT, untargeted).
- Task C: Offensive Language Target Identification (the text targets IND, an individual, GRP, a group, or OTH, others).

The other dataset contains a text or tweet labelled with six types of toxicity: toxic, severe toxic, obscene, threat, insult, and identity hate. A loading sketch follows.
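The column names below follow the public Kaggle release of the Toxic Comment Classification Challenge and are an assumption here, as is the local file path.

```python
# Hedged sketch: loading the Toxic Comment data for multi-label classification.
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene",
          "threat", "insult", "identity_hate"]

df = pd.read_csv("train.csv")        # hypothetical local path
texts = df["comment_text"].tolist()  # assumed column name
labels = df[LABELS].values           # shape (n_samples, 6), entries in {0, 1}

# Unlike OLID's single softmax over classes, this task needs six independent
# sigmoid outputs trained with binary cross-entropy (multi-label, not
# multi-class), since one comment can carry several toxicity types at once.
```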
Implementation and Results
BERT and XLNet are implemented as a TensorFlow pipeline. BERT is trained on Cloud TPUs in a Google Colaboratory notebook. Google Cloud states that 'Cloud TPU is designed to run cutting-edge machine learning models with AI services on Google Cloud. And its custom high-speed network offers over 100 petaflops of performance in a single pod, enough computational power to transform your business or create the next research breakthrough'. The pre-trained models are loaded and fine-tuned on TPUs (see the initialisation sketch below), and the model checkpoints and results are stored in Cloud Storage buckets. Thanks to the TPUs, fine-tuning on both the OLID dataset and the Toxic Comment Classification dataset took only a reasonable amount of time. Due to a lack of computational resources, we could not fine-tune XLNet Large on these datasets.
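For reference, a hedged sketch of the standard TensorFlow 2.x TPU initialisation in Colab; the paper does not show its exact set-up code, and the resolver arguments vary slightly across TF versions.

```python
# Assumed Colab TPU set-up (standard TF 2.x idiom, not the paper's own code).
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # Colab auto-detects
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Build and compile the model inside the strategy scope so its
    # variables are replicated across the TPU cores.
    ...  # e.g. the TFBertForSequenceClassification set-up shown earlier
```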
Applications
Abusive language detection can be applied in many fields. Online social media contains posts, tweets, and other content that is abusive, toxic, or harmful for children, so platforms can hide such content from users in particular age groups. YouTube, Netflix, and similar services stream billions of videos with subtitles; this method can be used to detect abusive content in them and notify users that a video contains abusive or sensitive speech.
Conclusion
This paper discussed two of the most recent and widely used state-of-the-art models in the Natural Language Processing field and their application to the abusive language detection task. Unfortunately, due to time constraints and a lack of computational power, we could not train XLNet on a few tasks. Furthermore, more thorough cleaning and preprocessing of the text data would likely have produced better results. The main contribution of this paper is to address the problem of abusive content detection using state-of-the-art pre-trained language models; the paper also discussed running such large models on Cloud TPUs.
References
- [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
- [2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
- [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
- [5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
- [6] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- [7] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.
- [8] Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- [9] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666, 2019.