Hearing and speech impairment is a disability that prevents individuals from communicating verbally with the outside world; instead, they use sign language. Deaf and mute people are usually trained in sign language and can communicate within their own communities, whereas the wider community uses spoken language and typically has no knowledge of sign language. Spoken-language users mostly need a translator to understand sign language and cannot communicate directly with deaf and mute people. Recently, deep learning techniques such as CNNs and RNNs have been used for gesture recognition in sign languages, but converting the output into text with proper spoken-language grammar and rules has not been analysed. In this study, I will use NLP techniques such as syntactic, semantic and sentiment analysis to produce grammatically correct text from the American Sign Language (ASL) text obtained by gesture recognition. The dataset used is the American Sign Language Lexicon Video Dataset.
People with hearing and speech impairments struggle to communicate with the outside world in real time. They face many challenges in integrating into society, which adversely affects their confidence and morale. Communication is an important factor: it bridges the gap between individuals and groups through the flow of information and understanding between them. It not only facilitates the sharing of information and knowledge, but also helps people develop relationships with others. Sign languages were born to enable people with hearing and speech impairments to interact with spoken-language users. Sign languages are visual in nature, conveying meaning to the outside world through finger and hand gestures. American Sign Language (ASL) is one of the most widely used sign languages in the world.
American Sign Language (ASL) is the predominant sign language of hearing- and speech-impaired communities in the U.S. and Anglophone Canada. Unfortunately, there is a communication gap between spoken-language and ASL users, as their modes of communication are completely different and a person needs to be a trained sign reader to understand ASL. ASL is also visual in nature, whereas spoken languages depend on sound, which makes them two very different languages with different ways of communicating.
ASL has its own grammar, distinct from that of spoken languages; let us take English as an example. Spoken-language users often convert ASL signs with a one-to-one word translation, but this is an inaccurate way to translate ASL. There is a common misconception that ASL's grammar is like that of English, but it is not: as mentioned earlier, ASL has its own grammar, which is unlike English or any other spoken language.
ASL translators are used around the world to mediate between ASL and spoken language. In the current technological age, numerous approaches have been developed for analysing and understanding ASL. Machine learning and artificial intelligence are the latest entrants in the field, used to recognize sign gestures and translate or interpret them into a spoken language such as English. Although ML and AI make it relatively easy to understand hearing- and speech-impaired communities, very little attention has been paid to detecting the start and end of a sentence in ASL, which is part of its grammar and rules. Sign-to-word translation by recognizing static finger/hand gestures is inaccurate, and an algorithm is required that translates whole sentences instead of individual words or letters.
How and why are grammar and linguistic structure ignored when translating ASL to a spoken language (English in this case) using ML and AI algorithms? How can sentences be separated from a stream of finger/hand gestures? Is translating a whole sentence more effective than translating word by word? These are some of the questions that will be researched in this study, with answers explored using different models/algorithms together with Natural Language Processing (NLP).
Identifying the end of a sentence is a challenge in ASL, which has no sign equivalent to the full stop in English. Even after spotting the start and end of an ASL sentence, structuring the corresponding English sentence with proper grammar will be a major task. These will be the main research questions.
The goal is to develop an algorithm/model that automatically translates sign language and generates English sentences. As mentioned in the research questions, identifying the start and end sign of a sentence in ASL will help dynamically recognize the gestures of a whole sentence. This algorithm will enable hearing- and speech-impaired communities to communicate better and help English users understand ASL more easily without a translator, restoring English grammar after translating ASL into English.
I am not very familiar with ASL; I will learn while exploring the research questions and improve my knowledge of ASL signs along the way. I will try to cover most aspects of translating ASL into English, but this study will not be able to address every aspect of ASL translation.
Several approaches and studies have addressed this problem using gesture recognition in video and images, with several different methods.
Murat Taskiran, Mehmet Killioglu and Nihan Kahraman published a study on a real-time system for recognition of American Sign Language using deep learning in 2018. They attempted to create a real-time translator for those who do not know sign language. They used Convolutional Neural Network (CNN) models for feature extraction and classification of hand and facial-expression signs. They mainly classified gestures and converted them to words/letters without dealing with the grammar and rules of the spoken language. They used the TensorFlow and Keras libraries in Python for image classification and recognition.
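To make the CNN-based classification concrete, the following is a minimal sketch of a gesture classifier in Keras. The 64x64 grayscale input, the layer sizes and the 26-class output (one per ASL alphabet sign) are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of a CNN gesture classifier in Keras (illustrative only;
# input size, layer widths and class count are assumptions, not the
# architecture used by Taskiran et al.).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sign_cnn(num_classes=26, input_shape=(64, 64, 1)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),   # low-level edge features
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),   # higher-level hand shapes
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # one score per sign
    ])

model = build_sign_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Such a model would be trained on labelled sign images and, at inference time, map each detected hand image to its most probable sign class.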
Kshitij Bantupalli and Ying Xie published a study on "American Sign Language Recognition using Deep Learning and Computer Vision". They used a Convolutional Neural Network (CNN) model for Sign Language Recognition (SLR) and a Recurrent Neural Network (RNN) model to extract features from video sequences. Here too, the authors focused mainly on recognizing gestures from video frames, converting them into spoken language inaccurately. They used the Inception CNN model, developed by Google for image recognition. While experimenting, they found the model performed inconsistently across different skin tones.
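The CNN-plus-RNN idea can be sketched as a per-frame CNN encoder whose features feed an LSTM that classifies the whole clip. The frame size, the 16-frame clips and the 100 sign classes below are illustrative assumptions; the authors used a pretrained Inception network as their CNN, not the small one shown here.

```python
# Sketch of a CNN + RNN video classifier: a small per-frame CNN wrapped in
# TimeDistributed, followed by an LSTM over the frame sequence. All sizes
# are illustrative; Bantupalli and Xie used a pretrained Inception CNN.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_video_model(num_classes=100, frames=16, frame_shape=(64, 64, 3)):
    frame_cnn = models.Sequential([
        layers.Input(shape=frame_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),    # per-frame feature vector
    ])
    return models.Sequential([
        layers.Input(shape=(frames,) + frame_shape),
        layers.TimeDistributed(frame_cnn),      # apply the CNN to every frame
        layers.LSTM(64),                        # model temporal dynamics
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_video_model()
```

The LSTM is what lets the model treat a sign as a motion over time rather than a single static pose.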
Aradhana Kar and Pinaki Sankar Chatterjee published "An Approach for Minimizing the Time Taken by Video Processing for Translating Sign Language to Simple Sentence in English" in 2015. This study differentiates ASL grammar from spoken-language grammar and explains the differences, with examples in the past, present and future tenses. The work is divided into three parts: video processing, natural language processing and text-to-speech conversion. It provides good insight into how ASL grammar works and how it can be converted to English grammar.
The ASL Lexicon Video Dataset was published by Vassilis Athitsos, Carol Neidle, Stan Sclaroff, Joan Nash, Alexandra Stefan, Quan Yuan and Ashwin Thangali (University of Texas at Arlington and Boston University). In this study, the ASL Lexicon Video Dataset will be used for training and testing CNN models for gesture recognition. It is a large public dataset containing video sequences of thousands of distinct ASL signs, together with annotations of those sequences, including start/end frames and a class label for every sign. The videos and annotations are used in sign language and gesture recognition, and can be used to build models and train classifiers for a set of ASL signs. I will use this dataset to build and train models for sign language recognition.
ANALYSIS AND DESIGN
In past studies, authors mostly focused on high performance, feature classification and recognition accuracy for hand, finger and facial-expression gestures, using deep learning and neural network techniques such as CNNs and RNNs. But one problem after recognizing signs and translating them into words was not tackled: the translated sentences still follow ASL's grammar and rules, and a word-by-word conversion does not produce proper spoken-language grammar (English, in this case). These studies therefore do not fully solve real-time translation for the sign language of deaf and mute communities, because English users cannot clearly understand the output and must guess some of the missing adjectives, verbs, etc.
I will use the following two phases to obtain the translated English sentence:
- ASL Gesture Recognition by video/image/frame
- Natural Language Processing
ASL Gesture Recognition by video/image/frame
This is the first part of the study, where I will use the ASL Lexicon Video Dataset to train models for gesture recognition using deep learning techniques such as CNNs and RNNs. It will convert ASL sign gestures into text as output. It will include gesture detection and gesture classification, with each class mapped to text. Available deep learning algorithms can be used, with a few additional steps added later for proper recognition.
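The mapping from classifier output to text can be sketched as follows: after the model labels each frame (or frame window) with a sign class, the per-frame labels must be collapsed into a gloss sequence. The gloss labels, confidence values and threshold below are purely illustrative.

```python
# Minimal sketch of turning per-frame classifier predictions into a gloss
# sequence: drop low-confidence frames, then merge consecutive repeats of
# the same label (a held sign spans many frames). Labels are illustrative.
from itertools import groupby

def frames_to_glosses(frame_preds, min_conf=0.5):
    """frame_preds: list of (label, confidence) per frame -> gloss list."""
    confident = [label for label, conf in frame_preds if conf >= min_conf]
    # consecutive identical labels belong to a single sign being held
    return [label for label, _ in groupby(confident)]

preds = [("STORE", 0.9), ("STORE", 0.8), ("I", 0.3),
         ("I", 0.7), ("GO", 0.85), ("GO", 0.9)]
print(frames_to_glosses(preds))  # ['STORE', 'I', 'GO']
```

The resulting gloss sequence ("STORE I GO") is what the NLP phase then restructures into grammatical English.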
Natural Language Processing
Natural Language Processing (NLP) is the second part, which derives a meaningful sentence from the text recognized from ASL. Pre-processing and exploratory steps such as POS tagging, word-frequency analysis and stop-word removal will be performed to clean the ASL text received from gesture recognition. NLP then deals with the structural roles of words in the sentence, parsing the language to produce a tree-like structure that captures the syntactic relationships in the raw converted text. This is syntactic analysis, and it includes part-of-speech tagging, chunking and sentence assembly.
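The pre-processing step can be sketched with standard-library Python. In the study proper, a library such as NLTK or spaCy would also supply POS tagging and parsing; the tiny stop-word list here is illustrative only.

```python
# Sketch of pre-processing on recognized gloss text: lowercasing, stop-word
# removal and word-frequency counts. The stop-word list is an illustrative
# subset; a real pipeline would use NLTK/spaCy for tagging and parsing too.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "to"}   # illustrative subset

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS]

glosses = "YESTERDAY STORE I GO STORE"
tokens = preprocess(glosses)
print(Counter(tokens))   # word frequencies, e.g. 'store' appears twice
```

These counts and cleaned tokens then feed the syntactic-analysis stage described above.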
After an ASL sentence is parsed and its syntax understood, semantic analysis comes into the picture to derive the meaning of the sentence in a context-free form. Semantic analysis alone cannot determine the context of a word. For example, in "Boy had an apple", the boy could have "eaten" the apple or "owned" it. This illustrates the context-free form of a word: it does not mean every word ends up with multiple contexts, but the chance exists.
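A toy sketch of the ambiguity above: a context-free semantic step cannot pick a sense for "had", but a context-aware rule can look at surrounding words. The sense lexicon and the rule below are entirely illustrative, not a real word-sense-disambiguation method.

```python
# Toy illustration of the "Boy had an apple" ambiguity: choose a sense for
# "had" from crude word context. Lexicon and rule are illustrative only.
FOOD_WORDS = {"apple", "sandwich", "lunch"}

def disambiguate_had(sentence):
    words = sentence.lower().split()
    if "had" not in words:
        return None
    # crude context rule: "had" near a food word -> consumption sense
    return "ate" if any(w in FOOD_WORDS for w in words) else "owned"

print(disambiguate_had("Boy had an apple"))  # 'ate'
print(disambiguate_had("Boy had a car"))     # 'owned'
```

A real system would use trained word-sense disambiguation rather than hand-written rules, but the principle of using context is the same.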
Sentiment analysis is the next step, to understand the sentiment behind each sentence. Sentiment includes emotions, opinions and context. This step reveals the purpose of the sentence and the feeling attached to it. It is very difficult to extract sentiment from a plain sentence.
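A minimal lexicon-based sketch shows the idea behind this step. Real work would use a trained tool such as NLTK's VADER; the word lists and scoring below are illustrative only.

```python
# Minimal lexicon-based sentiment sketch: count positive vs negative words.
# Word lists and scoring are illustrative; a real pipeline would use a
# trained sentiment tool (e.g. NLTK's VADER).
POSITIVE = {"happy", "good", "love", "great"}
NEGATIVE = {"sad", "bad", "hate", "terrible"}

def sentiment(sentence):
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great movie"))   # 'positive'
print(sentiment("I go to the store"))         # 'neutral'
```

Even this crude scheme shows why sentiment is hard for ASL glosses: many sentences carry no sentiment-bearing words at all and fall back to neutral.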
- Murat Taskiran, Mehmet Killioglu, Nihan Kahraman, "A Real-Time System for Recognition of American Sign Language by Using Deep Learning," 2018 41st International Conference on Telecommunications and Signal Processing (TSP), 4-6 July 2018.
- Kshitij Bantupalli, Ying Xie, "American Sign Language Recognition using Deep Learning and Computer Vision," 2018 IEEE International Conference on Big Data (Big Data), 10-13 Dec 2018.
- Aradhana Kar, Pinaki Sankar Chatterjee, "An Approach for Minimizing the Time Taken by Video Processing for Translating Sign Language to Simple Sentence in English," 2015 International Conference on Computational Intelligence and Networks, 12-13 Jan 2015.
- Vassilis Athitsos, Carol Neidle, Stan Sclaroff, Joan Nash, Alexandra Stefan, Quan Yuan, Ashwin Thangali, "The American Sign Language Lexicon Video Dataset," IEEE Workshop on Computer Vision and Pattern Recognition for Human Communicative Behaviour Analysis, June 2008.