Communication with people who have hearing and speech impairments relies primarily on sign languages, and a lack of knowledge of these languages makes such communication difficult. This project develops a system that converts user input in the form of hand sign gestures into the corresponding letters of the alphabet. Challenges in this field include extracting useful features and classifying the various signs, extracting the hand boundaries, and identifying signs that involve motion of the hand, since these require the extraction of temporal features. This project focuses on optimizing 2-D convolutional neural networks for the extraction of spatial features from hand sign images for Sign Language Recognition.
Sign language is the primary mode of communication used by people with hearing and speech impairments. Because most people do not know these languages, communicating with a person who has such impairments is difficult. This makes it very hard to share ideas and knowledge with such individuals, which eventually leads to people with hearing and speech impairments being excluded from the intellectual mainstream.
Sign language recognition (SLR) is the process of recognising sign language gestures made by a user by means of a system that can identify and classify the input hand gesture. The most common approach in SLR systems is to train a neural network on all the hand signs of a specified sign language, producing a classifier that can differentiate the input hand sign images and map each one to the corresponding text.
Chuan et al. used a low-cost 3D motion sensor, the Leap Motion sensor, to extract the direction of motion, position and velocity of the hand, and then applied k-nearest neighbour and support vector machine classifiers to these features for sign language recognition. Four separate data sets were considered; in each iteration, three data sets were used for training and the fourth was used for testing. The distance between the tips of different fingers was calculated to extract gestures that included pinch and grab.
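The evaluation scheme described above, in which each of four data sets is held out in turn, can be illustrated with a minimal sketch. The k-nearest neighbour classifier below is a plain NumPy implementation, and the synthetic three-dimensional features, the value k = 3 and all function names are assumptions made for illustration; they are not taken from the cited work.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Classify each query row by majority vote among its k nearest training rows."""
    preds = []
    for q in query_x:
        dists = np.linalg.norm(train_x - q, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def leave_one_set_out(sets, k=3):
    """sets: list of (features, labels) pairs. Returns one accuracy per held-out set."""
    accuracies = []
    for i, (test_x, test_y) in enumerate(sets):
        # Train on every set except the i-th, test on the i-th.
        train_x = np.vstack([x for j, (x, _) in enumerate(sets) if j != i])
        train_y = np.concatenate([y for j, (_, y) in enumerate(sets) if j != i])
        preds = knn_predict(train_x, train_y, test_x, k)
        accuracies.append(float(np.mean(preds == test_y)))
    return accuracies

# Synthetic stand-in for sensor features (assumption): two well-separated classes.
rng = np.random.default_rng(0)

def make_set(n=20):
    x0 = rng.normal(0.0, 0.1, (n, 3))
    x1 = rng.normal(5.0, 0.1, (n, 3))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

sets = [make_set() for _ in range(4)]
accs = leave_one_set_out(sets)
```

Averaging the four accuracies gives a single estimate that does not depend on which data set happened to be chosen for testing.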
Rao et al. proposed a CNN architecture for a selfie-based sign language recognition system. The dataset contained 5000 images of 200 signs performed by 5 sign language users in different orientations. The feature extraction part of the proposed architecture consisted of four convolutional layers, four ReLU (rectified linear unit) layers and two stochastic pooling layers, while the classification part consisted of a dense layer, a ReLU layer and one softmax layer. Of the four convolutional layers, the first two extracted low-level features and the last two extracted high-level features. The model gave better results than some commonly used algorithms such as AdaBoost and conventional ANNs.
Anantha Rao et al. proposed a selfie-based sign language recognition system. The two major implementation problems were that only one hand of the user was available for making gestures, because the other hand was holding the selfie stick, and that shaking of the selfie stick created disturbances in the background. To extract the hand sign made by the user, a Gaussian filter together with Sobel gradients and filling was used to extract the hand and head regions, and morphological subtraction was then applied to the output to obtain the hand and head contours. Since the distances between the fingers and the differences between various hand gestures in the sign language are minute, Euclidean distance and normalized Euclidean distance did not give satisfactory results, and Mahalanobis distance was used for classification.
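The difference between Euclidean and Mahalanobis distance in a minimum-distance classifier can be seen in a small sketch. It assumes a pooled covariance matrix shared by all classes and uses a hand-made two-feature data set in which one feature varies far more than the other; the data and function names are illustrative assumptions, not details of the cited system.

```python
import numpy as np

def fit_class_stats(features, labels):
    """Return per-class means and the inverse of the pooled covariance matrix."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / (len(features) - len(classes))
    return means, np.linalg.pinv(cov)

def mahalanobis(x, mean, cov_inv):
    """Distance from x to mean, scaled by the inverse covariance."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify(x, means, cov_inv):
    """Assign x to the class whose mean is nearest in Mahalanobis distance."""
    return min(means, key=lambda c: mahalanobis(x, means[c], cov_inv))

# Illustrative data (assumption): feature 0 is noisy, feature 1 is tight.
features = np.array([[10, 0.1], [-10, 0.1], [10, -0.1], [-10, -0.1],
                     [14, 1.1], [-6, 1.1], [14, 0.9], [-6, 0.9]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
means, cov_inv = fit_class_stats(features, labels)

query = np.array([3.5, 0.1])
maha_label = classify(query, means, cov_inv)
eucl_label = min(means, key=lambda c: np.linalg.norm(query - means[c]))
```

Here the query is closer to the mean of class 1 in plain Euclidean terms, but its value on the low-variance feature matches class 0, so the Mahalanobis rule assigns it to class 0. This is the kind of small but reliable difference that a variance-aware distance can exploit.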
Hema et al. proposed a system consisting of an Artificial Neural Network and a HOG (Histogram of Oriented Gradients) feature descriptor. The HOG descriptor computes the intensity gradients, or edge directions, in the input image, and sudden changes in the gradient are used to find the edges and contours of the ROI (Region of Interest). Once the ROI is extracted, it is given as input to the neural network, which generates feature vectors from it for learning and for classification of user input images. This approach eliminates the need for sensor-based systems, which commonly rely on sensor gloves or specially coloured gloves to identify the hand ROI, and makes the sign language recognition system more accessible to people.
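The core idea of a gradient-orientation descriptor can be sketched in a few lines of NumPy. The version below computes a single magnitude-weighted histogram over a whole grayscale patch; a full HOG descriptor additionally divides the image into cells and normalizes over blocks, which is omitted here. The nine-bin, 0 to 180 degree layout and the function name are assumptions for illustration.

```python
import numpy as np

def orientation_histogram(gray, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)            # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A ramp whose brightness increases from left to right: all gradients point along x.
ramp = np.tile(np.arange(8, dtype=float), (8, 1))
hist_x = orientation_histogram(ramp)      # energy concentrated in the first bin
hist_y = orientation_histogram(ramp.T)    # same ramp rotated: energy near 90 degrees
```

The two histograms differ only in which bin carries the energy, which is what makes such a descriptor sensitive to edge direction while ignoring absolute brightness.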
Lichtenauer et al. proposed Statistical Dynamic Time Warping (SDTW) for time warping, together with two novel classification techniques, Combined Discriminative Feature Detectors and Quadratic Classification on DF Fisher Mapping, which, accompanied by SDTW, performed better than conventional Hidden Markov Models. Dimensionality was reduced using Fisher mapping.
Imagawa et al. used a Kalman filter and an improved histogram backprojection method to extract the hands and face based on skin color. Motion difference images were calculated, and streak features were extracted for pattern recognition. The signer was required to wear colored gloves. One limitation was that the signer's head had to be kept still.
Islam et al. proposed applying a Multi-class Support Vector Machine (MCSVM) to the features extracted by a Convolutional Neural Network. A standard SVM can only separate two classes with the help of a hyperplane, and building a classifier for a dataset that is not linearly separable requires non-linear kernel functions; the proposed system therefore used a non-linear MCSVM with the Gaussian Radial Basis Function kernel.
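For reference, the Gaussian Radial Basis Function kernel between two feature vectors x and y is K(x, y) = exp(-gamma * ||x - y||^2). The sketch below computes the full kernel matrix between two sets of vectors; the value of gamma and the function name are assumptions for illustration, since the text does not state the settings used.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Kernel matrix K[i, j] = exp(-gamma * ||a[i] - b[j]||^2)."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
K = rbf_kernel(points, points)
```

Each entry lies between 0 and 1 and decays with distance, so nearby feature vectors are treated as similar even when no single hyperplane in the original feature space separates the classes.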
Zaki and Shaheen used kurtosis position and principal component analysis (PCA) as descriptors, motion chain code to represent hand movement, and a hidden Markov model (HMM) to classify user input images. The HMM classifier was used to test the contribution of the proposed feature combinations. When only one feature was used, PCA was the best feature, with an error rate of 13.63%; with two features, the combination of PCA and kurtosis position reduced the error rate to 11.82%, a decrease of 1.81 percentage points. With a combination of all three features, the error rate fell to 10.90%, a decrease of 2.73 percentage points relative to PCA alone.
Kurhekar et al. proposed the use of transfer learning from ResNet-34 with a Convolutional Neural Network, using the deep learning library fastai for easier data preparation and training. Transfer learning was used so that model training required less data and time, by reusing weights that had already been learned instead of learning every weight from scratch. The data was augmented using fastai, which performs some augmentations on its own, such as random rotation, zooming, lighting changes, warps and horizontal flipping. fastai also provides default values for various settings such as the learning rate and certain data transformations.
Rahim et al. demonstrated a two-channel Convolutional Neural Network in which the channels work independently on the YCbCr and SkinMask segmentations of the input image. The feature vectors generated by the two channels are then combined in a fully connected layer, and this output is used for the generation of MCSVM (Multi-class Support Vector Machine) classifiers. In the YCbCr segmentation, the Cr component of the image was extracted and a threshold was applied to it to produce a binary image. In the SkinMask segmentation, threshold values were applied and morphological processing was then used to remove noise and create the binary image. The two channels of the CNN worked on these two sets of images and extracted a feature vector for each set separately; the two vectors were then combined as input to the MCSVM.
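The Cr-based thresholding step can be sketched as follows. The conversion uses the standard full-range BT.601 (JPEG) formula for the Cr component; the threshold interval of 135 to 180 is an assumption chosen for illustration, as the values used in the cited work are not given here.

```python
import numpy as np

def cr_channel(rgb):
    """Cr (red-difference chroma) of an RGB image, full-range BT.601 convention."""
    rgb = rgb.astype(float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b

def cr_binary_mask(rgb, low=135.0, high=180.0):
    """White (255) where Cr falls inside [low, high], black (0) elsewhere.

    The interval is an illustrative assumption, not a value from the source.
    """
    cr = cr_channel(rgb)
    return ((cr >= low) & (cr <= high)).astype(np.uint8) * 255

# One-row test image: a skin-like pixel, a pure blue pixel and a mid-grey pixel.
image = np.array([[[220, 170, 140], [0, 0, 255], [128, 128, 128]]], dtype=np.uint8)
mask = cr_binary_mask(image)
```

Because Cr measures how far a pixel leans towards red independently of brightness, a single interval on this channel separates the skin-like pixel from both the blue and the grey one.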
Daroya et al. demonstrated the use of a Densely Connected Convolutional Neural Network to mitigate the vanishing gradient problem, which commonly affects deep neural networks. During backpropagation, the gradient of the cost function becomes very small by the time it reaches the initial hidden layers, so these layers learn poorly; as a result, the later hidden layers produce inaccurate outputs, and previously learned knowledge is ultimately lost. The proposed system also applied data augmentation, consisting of random rotations, shearing and zooming, to improve the quality of the dataset.
Shanta et al. presented a system that applies skin masking to the image to extract only the part containing the hand, applies the Scale Invariant Feature Transform (SIFT) to the masked hand image to extract feature descriptors, and then uses k-means clustering to group these descriptors and represent them as a histogram using the Bag of Features model. The histograms are given as input to a Convolutional Neural Network, and a model is trained to classify and identify user input hand signs.
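The Bag of Features step can be sketched as a nearest-centroid assignment followed by a count. The sketch assumes that the cluster centroids have already been obtained by k-means in an earlier step, and it uses toy two-dimensional descriptors in place of real SIFT descriptors; both the data and the function name are illustrative assumptions.

```python
import numpy as np

def bag_of_features(descriptors, centroids):
    """Normalized histogram of which centroid (visual word) each descriptor is nearest to."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

# Toy stand-ins (assumption): two visual words and four descriptors from one image.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [9.0, 9.0]])
histogram = bag_of_features(descriptors, centroids)
```

Every image is thus reduced to a vector of fixed length, equal to the number of clusters, regardless of how many keypoints were detected in it.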
Suharjito et al. proposed a 3D convolutional neural network model called i3d inception, or inflated inception, which extracts temporal features and learns the actions performed in the input video; the resulting model is then used to classify further user inputs. Because a 3D CNN learns the motion made by the signer, a model built on such a system can learn user input gestures that involve motion for real-time estimation.
Vogler and Metaxas introduced a novel Parallel Hidden Markov Model (PaHMM) that models the parallel processes independently, which addresses the problem of scaling to large-scale ASL applications. They developed a recognition algorithm for PaHMMs and showed that it runs in time polynomial in the number of states and linear in the number of parallel processes.
Ji et al. proposed sampling images from a video stream so as to operate on 2-dimensional images only and to reduce the size of the training data. The images sampled from the input video are concatenated to form the training data, which is then used as input for learning by a Convolutional Neural Network. The benefits of the proposed model are that it can be implemented with a low-resolution camera, since it uses only 2-dimensional images, and that the size of the training data is reduced by the sampling of input images. A total of six sign language actions were learned.
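One plausible reading of the sampling-and-concatenation step is sketched below: a fixed number of evenly spaced frames is taken from the clip and the frames are placed side by side to form a single 2-D image. The number of samples, the even spacing and the horizontal layout are all assumptions, since these details are not given above.

```python
import numpy as np

def sample_and_concatenate(frames, num_samples=4):
    """Pick evenly spaced frames from a clip and join them side by side."""
    idx = np.linspace(0, len(frames) - 1, num_samples).round().astype(int)
    return np.concatenate([frames[i] for i in idx], axis=1)

# Stand-in clip (assumption): 20 grayscale frames of 8x8 pixels, frame i filled with i.
frames = np.stack([np.full((8, 8), i) for i in range(20)])
montage = sample_and_concatenate(frames)
```

The 20-frame clip is reduced to one 8 by 32 image, which a 2-D CNN can consume directly while still seeing the order of the movement from left to right.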
i) Hand sign extraction from the user input image
The input image has 3 channels and is converted to 1 channel, i.e. from RGB to grayscale. A Gaussian blur with a kernel size of 7×7 is then applied to the image. Next, the background is detected using running averages: 30 frames are captured without the hand in the frame, converted into arrays and averaged, and the resulting image is taken as the background. Once the background is recognized, the hand is moved into the frame; this input frame is converted into an array, and its difference from the background array is calculated to find the parts of the image that have changed. The difference image is then thresholded so that the background becomes black and the foreground, i.e. the part of the image containing the hand, becomes white. Finally, contours are found in the thresholded image, and the contour with the largest area is selected as the hand of the user.
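The background averaging, differencing and thresholding steps described above can be sketched with NumPy alone. The sketch starts from frames that are already grayscale, omits the Gaussian blur, and stands in for the largest-contour selection with a largest connected region search. The threshold of 25 and all function names are assumptions for illustration, not the values or code used in this project.

```python
from collections import deque
import numpy as np

def average_background(frames):
    """Mean of the calibration frames captured without the hand."""
    return np.mean(np.stack([f.astype(float) for f in frames]), axis=0)

def segment_foreground(frame, background, threshold=25.0):
    """White (255) where the frame differs from the background, black (0) elsewhere."""
    diff = np.abs(frame.astype(float) - background)
    return np.where(diff > threshold, 255, 0).astype(np.uint8)

def largest_component(mask):
    """Keep only the largest 4-connected white region (stand-in for the largest contour)."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] == 0 or seen[sy, sx]:
                continue
            queue, region = deque([(sy, sx)]), []
            seen[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    out = np.zeros_like(mask)
    for y, x in best:
        out[y, x] = 255
    return out

# Synthetic example (assumption): 30 background frames, then a frame with a
# 10x10 "hand" and a 2x2 speck of noise.
calibration = [np.full((30, 30), 50 + i % 3, dtype=np.uint8) for i in range(30)]
background = average_background(calibration)
frame = np.full((30, 30), 51, dtype=np.uint8)
frame[5:15, 5:15] = 200
frame[20:22, 20:22] = 200
mask = segment_foreground(frame, background)
hand = largest_component(mask)
```

Selecting the largest region removes small blobs caused by noise or lighting changes, so that only the hand is passed on to the classifier.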
ii) Creation of dataset
The dataset was created by collecting 2400 images for each of the 26 ASL gestures from 1 signer. New signs were assigned to the letters J and Z, since their standard gestures require motion and a 2-D CNN cannot capture the correlation between the moving frames of these gestures.
iii) Optimized convolutional neural network
Our Convolutional Neural Network consists of the following layers: two convolution layers, two rectified linear unit (ReLU) layers, two max pooling layers, one dropout layer, and two dense layers, the first followed by a ReLU activation and the second by a softmax layer. The two convolution layers have kernel sizes of 2×2 and 5×5 respectively.
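To make the effect of the two kernel sizes concrete, the following sketch traces the spatial size of the feature maps through the convolution and pooling layers. The 50×50 input, the unpadded stride-1 convolutions, the 2×2 pooling windows and the ordering of convolution followed by pooling are all assumptions made for this worked example, since they are not specified above.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Output width of non-overlapping max pooling."""
    return size // window

def trace_shapes(input_size=50):
    """Spatial size after each layer; the input size is a hypothetical value."""
    shapes = []
    size = conv_out(input_size, 2)      # first convolution, 2x2 kernel
    shapes.append(("conv 2x2", size))
    size = pool_out(size)               # first max pooling
    shapes.append(("max pool", size))
    size = conv_out(size, 5)            # second convolution, 5x5 kernel
    shapes.append(("conv 5x5", size))
    size = pool_out(size)               # second max pooling
    shapes.append(("max pool", size))
    return shapes

shapes = trace_shapes()
```

Under these assumptions the feature maps shrink from 50 to 49, 24, 20 and finally 10 pixels per side before being flattened for the dense layers.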
iv) Real-time predictor
Real-time prediction is performed by a loop that is started and stopped by the user's input. While control is inside the loop, the hand sign is extracted from each input frame and the class corresponding to that hand sign is predicted by the CNN model.
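The control flow of such a loop can be sketched independently of the camera and the model. In the sketch below the frame source, the hand extractor, the predictor and the stop check are passed in as callables; these names and the stub implementations are hypothetical placeholders, not the functions used in this project.

```python
import itertools
import string

def run_predictor(frames, extract_hand, predict, should_stop):
    """Process frames until should_stop() is true; return the predicted labels."""
    predictions = []
    for frame in frames:
        if should_stop():                 # e.g. the user pressed a stop key
            break
        hand = extract_hand(frame)
        if hand is None:                  # no hand found in this frame
            continue
        predictions.append(predict(hand))
    return predictions

# Stubs standing in for the camera, the segmentation step and the CNN (assumptions).
frames = itertools.count()                # endless stream of frame numbers
calls = {"n": 0}

def stop_after_six_frames():
    calls["n"] += 1
    return calls["n"] > 6

def fake_extract(frame):
    return frame if frame % 2 == 0 else None

def fake_predict(hand):
    return string.ascii_uppercase[hand % 26]

letters = run_predictor(frames, fake_extract, fake_predict, stop_after_six_frames)
```

Keeping the loop separate from the camera and model code makes it possible to test the stopping and skipping behaviour without any hardware attached.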
We implemented a background elimination technique to extract the hand sign from the user input image. The extracted hand signs were used to train the neural network, which made prediction of user input signs possible.
- C. Chuan, E. Regina and C. Guardino, ‘American Sign Language Recognition Using Leap Motion Sensor,’ 2014 13th International Conference on Machine Learning and Applications, Detroit, MI, 2014, pp. 541-544.
- G. A. Rao, K. Syamala, P. V. V. Kishore and A. S. C. S. Sastry, ‘Deep convolutional neural networks for sign language recognition,’ 2018 Conference on Signal Processing And Communication Engineering Systems (SPACES), Vijayawada, 2018, pp. 194-197.
- G. Anantha Rao, Pvv Kishore, D. Anil Kumar, and Ascs Sastry. “Neural network classifier for continuous sign language recognition with selfie video”. Far East Journal of Electronics and Communications, 17 (1): 49, 2017.
- Hema B, Sania Anjum, Umme Hani, Vanaja P, Akshatha M, “Sign Language and Gesture Recognition for Deaf and Dumb People”, International Research Journal of Engineering and Technology (IRJET), Vol:6 Issue:3, pp. 3399-3402, 2019.
- J. F. Lichtenauer, E. A. Hendriks and M. J. T. Reinders, ‘Sign Language Recognition by Combining Statistical DTW and Independent Classification,’ in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 2040-2046, Nov. 2008.
- K. Imagawa, Shan Lu and S. Igi, ‘Color-based hands tracking system for sign language recognition,’ Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, 1998, pp. 462-467.
- M. R. Islam, U. K. Mitu, R. A. Bhuiyan and J. Shin, ‘Hand Gesture Feature Extraction Using Deep Convolutional Neural Network for Recognizing American Sign Language,’ 2018 4th International Conference on Frontiers of Signal Processing (ICFSP), Poitiers, 2018, pp. 115-119.
- Mahmoud M. Zaki and Samir I. Shaheen. 2011. Sign language recognition using a combination of new vision based features. Pattern Recogn. Lett. 32, 4 (March 2011), 572–577.
- P. Kurhekar, J. Phadtare, S. Sinha and K. P. Shirsat, ‘Real Time Sign Language Estimation System,’ 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 2019, pp. 654-658.
- Rahim, M.A.; Islam, M.R.; Shin, J., “Non-Touch Sign Word Recognition Based on Dynamic Hand Gesture Using Hybrid Segmentation and CNN Feature Fusion”, Appl. Sci. 2019, 9, 3790.
- R. Daroya, D. Peralta and P. Naval, ‘Alphabet Sign Language Image Classification Using Deep Learning,’ TENCON 2018 – 2018 IEEE Region 10 Conference, Jeju, Korea (South), 2018, pp. 0646-0650.
- S. S. Shanta, S. T. Anwar and M. R. Kabir, ‘Bangla Sign Language Detection Using SIFT and CNN,’ 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Bangalore, 2018, pp. 1-6.
- Suharjito, H. Gunawan, N. Thiracitta and A. Nugroho, ‘Sign Language Recognition Using Modified Convolutional Neural Network Model,’ 2018 Indonesian Association for Pattern Recognition International Conference (INAPR), Jakarta, Indonesia, 2018, pp. 1-5.
- Vogler, Christian & Metaxas, Dimitris. (1999). Parallel Hidden Markov Models for American Sign Language Recognition. The Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999. Vol. 1. 10.1109/ICCV.1999.791206.
- Y. Ji, S. Kim and K. Lee, ‘Sign Language Learning System with Image Sampling and Convolutional Neural Network’, 2017 First IEEE International Conference on Robotic Computing (IRC), Taichung, 2017, pp. 371-375.