Sign Language Translation Using Deep Learning



Sign language is the primary means of communication for many hearing-impaired people. Hearing people who do not know sign language often find it hard to communicate with deaf people, so a system that assists them is useful. This project uses computer vision to capture signs from the user and convert them into text in real time. The proposed system contains four modules: image capturing, pre-processing, classification, and prediction. Segmentation is done using image processing. Sign gestures are captured and processed using the OpenCV Python library. The captured image is pre-processed to filter out noise and converted to a single-channel grayscale image. Classification and prediction are done using a convolutional neural network.


The aim of this system is to support people with hearing disabilities and help them socialize with hearing people. Sign language is a form of non-verbal communication. It is structured: each gesture represents a unique element or character. With advances in science and engineering, many researchers are working on methodologies that could take human-computer interaction much further. The computer is trained so that it can translate signs to text for static as well as dynamic frames. The system is designed and implemented to recognize sign language gestures. The signs are captured using a webcam and pre-processed. In the pre-processing stage, we use background subtraction to eliminate the background, which lets the system adapt to any dynamic background. The main difficulty in a software-based implementation is that the image must be properly captured and filtered.


Related work shows that such systems have been implemented using several approaches, namely vision-based approaches, glove-based approaches, fuzzy logic, and soft computing such as neural networks, as well as implementations in MATLAB. A vision-based approach requires a camera to capture images in 2D or 3D format. [1] The proposed system uses the Canny edge detection algorithm, which is reported to give better accuracy while consuming less time. Canny’s algorithm is good at removing noise and producing the clear, accurate input the system requires. It gives a low error rate, well-localized edge points, and a single response per edge point. The authors implemented the system for static images using Java in NetBeans.

[2] This system interprets gestures and decodes them. The decoded, or translated, gestures are expressed in English. The reference video is split into frames, and each frame is pre-processed. In pre-processing, several filters are applied to enhance the useful content of the frame and to reduce unwanted information as much as possible. The features of each processed frame are then extracted using the Fourier descriptor method and stored in a database associated with that gesture.
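As a rough illustration of Fourier descriptors, the sketch below treats a closed contour as a sequence of complex numbers, takes its discrete Fourier transform, and keeps normalized coefficient magnitudes. The function name `fourier_descriptors`, the number of coefficients kept, and the normalization scheme are assumptions made for this example; the cited paper may use a different variant.

```python
import cmath

def fourier_descriptors(contour, n_coeffs=3):
    """Return scale- and translation-invariant descriptors of a closed contour.

    contour is a list of (x, y) boundary points in order. Assumes a
    non-degenerate contour, so the first harmonic is non-zero.
    """
    points = [complex(x, y) for x, y in contour]
    n = len(points)
    # Plain O(n^2) discrete Fourier transform of the boundary signal.
    coeffs = [
        sum(p * cmath.exp(-2j * cmath.pi * k * i / n) for i, p in enumerate(points)) / n
        for k in range(n)
    ]
    # Skipping coeffs[0] removes the contour's position (translation).
    # Taking magnitudes removes rotation and starting-point effects.
    # Dividing by the first harmonic removes overall scale.
    scale = abs(coeffs[1])
    return [abs(c) / scale for c in coeffs[1:1 + n_coeffs]]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
moved_and_scaled = [(3 * x + 5, 3 * y + 7) for x, y in square]
print(fourier_descriptors(square))
print(fourier_descriptors(moved_and_scaled))
```

The two printed lists should agree up to floating-point error, which is the property that makes such descriptors usable as a stored signature for a gesture outline.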

[3] This system is implemented using an SVM (Support Vector Machine). For recognition, simple features were extracted from the images. The authors collected data from 20 students, who were trained in the sign gestures beforehand, using a 1.3-megapixel camera. The reported accuracy is 100%, as only a few signs were considered.

[4] This system proposes a new training method for Haar-like features based on an AdaBoost classifier. It includes a hand detector that combines a skin-color model, Haar-like features, and frame differencing, based on the AdaBoost classifier, to detect a moving right or left hand. It also includes a new tracking method that uses the hand patch extracted from the previous frame to create a new hand patch in the current frame. The detection rate of the system is 99.9%, and more than 97.1% of tracked hands are extracted at the proper size.

[5] This system recognizes sign language using a multilayer perceptron neural network implemented in Python with the SciPy libraries. Performance was evaluated using accuracy, precision, recall, and F1 score. Pruning, which means trimming the size of the network by removing nodes, was used to improve performance. The hidden layer size was set to 10 as a starting point and increased to 120.
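The four evaluation measures named above can be computed from counts of true positives, false positives, and false negatives for a chosen class. The following is a minimal sketch; the function name `classification_metrics` and the toy labels are assumptions for illustration and are not taken from the cited paper.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical predictions for two sign classes, "A" and "B".
y_true = ["A", "A", "B", "B", "A", "B"]
y_pred = ["A", "B", "B", "B", "A", "A"]
print(classification_metrics(y_true, y_pred, positive="A"))
```

For a multi-class sign recognizer, these per-class values are usually averaged across classes.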

[6] The hand sign recognition system is implemented on a board containing an ARM Cortex-A8 processor. The software tool used is OpenCV, which provides real-time image processing capabilities. It uses Haar training features to predict on both positive and non-positive images.

[7] The system was trained using 300 images of each Indian Sign Language numeral, captured using an RGB camera. Training ran on an NVIDIA GeForce 920MX GPU with 2 GB of graphics memory, a 2.7 GHz i5 processor, and 8 GB of RAM. The system attained 99.56% accuracy in 22 epochs. The authors experimented with different learning rates starting from 0.01, and the activations were updated during training. The method uses the Keras API with TensorFlow as the backend. The model was checked with static symbols and showed good results when tested on 100 images from the test dataset. A selective search algorithm was also tried, but it was found to be complex: it created many bounding boxes apart from the object and was not useful in this case.

[8] The proposed CNN architecture uses four convolutional layers with different window sizes, each followed by an activation function, a rectified linear unit, for non-linearity. Three pooling strategies were tested (mean pooling, max pooling, and stochastic pooling), and stochastic pooling was found to be suitable for the application. The feature representation uses two layers of stochastic pooling. Only two pooling layers are used, to avoid a substantial loss of information in the feature representation.
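Stochastic pooling differs from max and mean pooling in that it samples one activation from each window, with probability proportional to the activation's value. The sketch below illustrates that idea on a small feature map. It assumes non-negative activations (for example, after a rectified linear unit), and the function name `stochastic_pool` and the 2×2 window are choices made for this example, not details from the cited paper.

```python
import random

def stochastic_pool(feature_map, size=2, rng=None):
    """Sample one value per size x size window, weighted by activation."""
    rng = rng or random.Random()
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if sum(window) == 0:
                # No activation in this window, so there is nothing to sample.
                out_row.append(0.0)
            else:
                # Larger activations are proportionally more likely to be picked.
                out_row.append(rng.choices(window, weights=window, k=1)[0])
        pooled.append(out_row)
    return pooled

feature_map = [
    [0, 0, 1, 2],
    [0, 5, 3, 4],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(stochastic_pool(feature_map, rng=random.Random(0)))
```

Because the choice is random, repeated passes over the same input can produce different pooled maps, which acts as a form of regularization during training.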

[9] The proposed system uses a CNN (Convolutional Neural Network) model named Inception to extract spatial features from the video stream for Sign Language Recognition (SLR). Then, using an LSTM (Long Short-Term Memory) RNN (Recurrent Neural Network) model, temporal features are extracted from the video sequences via two methods: using the outputs from the softmax layer and from the pool layer of the CNN, respectively. The authors evaluate the CNN and RNN independently, using the same training and test samples for both. This ensures that the test data is not seen during training by either the CNN or the RNN. Both models were trained to minimize loss using a cross-entropy cost function and the Adam optimizer.

[10] This paper proposes a basic two-layer convolutional neural network (CNN) to classify sign language image datasets. Two different models were created, and their classification results were compared. One model is optimized with SGD and the other with Adam; the cost function is categorical cross-entropy. The classifier was found to cope with varying lighting and noisy image datasets. The model classified 6 different sign languages with an accuracy of 99.12% using SGD and 99.51% using Adam, so the Adam optimizer gave the higher accuracy.
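Categorical cross-entropy, the cost function mentioned above, compares a one-hot target with the class probabilities produced by a softmax output. A minimal sketch follows; the logits and the three-class setup are hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss is -sum(t * log(p)); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three sign classes
target = [1, 0, 0]        # the true class is the first one
probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
print(probs, loss)
```

Both SGD and Adam minimize this same loss; they differ only in how they turn its gradient into weight updates.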


We created a dataset containing thousands of images of each category and converted the images into a CSV file to gain speed and accuracy when training the system. For rare gestures, we captured the signs using a mobile phone camera. The images were resized and rotated at random as part of the augmentation.
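The resize-and-rotate augmentation can be sketched without any imaging library by using nearest-neighbour sampling on a grayscale image stored as a list of rows. This is only an illustration of the idea: the helper names, the 28×28 target size used here, and the ±15 degree rotation range are assumptions, and a real pipeline would more likely use an imaging library.

```python
import math
import random

def resize_nearest(image, new_h, new_w):
    """Resize by picking the nearest source pixel for every output pixel."""
    old_h, old_w = len(image), len(image[0])
    return [[image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
            for r in range(new_h)]

def rotate_nearest(image, degrees, fill=0):
    """Rotate about the image centre; pixels with no source are set to fill."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse mapping: find the source pixel that lands on (r, c).
            y, x = r - cy, c - cx
            src_r = round(-sin_t * x + cos_t * y + cy)
            src_c = round(cos_t * x + sin_t * y + cx)
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = image[src_r][src_c]
    return out

def augment(image, size=28, max_angle=15, rng=None):
    """Resize to size x size, then rotate by a random angle (assumed range)."""
    rng = rng or random.Random()
    angle = rng.uniform(-max_angle, max_angle)
    return rotate_nearest(resize_nearest(image, size, size), angle)

sample = [[1, 2], [3, 4]]
print(resize_nearest(sample, 4, 4))
```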


The foremost aim of our system is to enable communication between hearing people and people with hearing impairments without the need for any specific colored background, hand gloves, or sensors. Other systems used image datasets as such, in ‘.jpg’ format. In our system, the pixel values of each image are stored in a CSV file, which reduces the memory requirement of the system. The accuracy of prediction is also high when the CSV dataset is used. The four modules of this system are image capturing, pre-processing, classification, and prediction.


The Python OpenCV library is used to capture sign gestures from the computer’s internal camera. A dataset is collected for the various signs. To predict gestures with high accuracy, around 3000 images are collected for each sign. The collected images are converted to a CSV file containing the pixel values for higher accuracy.
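One plausible layout for such a CSV file is one row per image: a label followed by the flattened pixel values. The sketch below writes that layout with Python's standard `csv` module. The column names (`label`, `pixel0`, ...) and the tiny 2×2 images are assumptions for illustration; the source does not specify the exact file layout.

```python
import csv
import io

def image_to_row(label, image):
    """Flatten a grayscale image (list of rows) into [label, p0, p1, ...]."""
    return [label] + [pixel for row in image for pixel in row]

def write_dataset(samples, handle):
    """Write (label, image) pairs as CSV with a header row."""
    writer = csv.writer(handle)
    first_image = samples[0][1]
    n_pixels = len(first_image) * len(first_image[0])
    writer.writerow(["label"] + [f"pixel{i}" for i in range(n_pixels)])
    for label, image in samples:
        writer.writerow(image_to_row(label, image))

# Two hypothetical 2x2 images; real samples would be 28x28.
samples = [(0, [[0, 255], [128, 64]]), (1, [[10, 20], [30, 40]])]
buffer = io.StringIO()
write_dataset(samples, buffer)
print(buffer.getvalue())
```

Writing to an `io.StringIO` buffer keeps the example self-contained; passing a file opened with `newline=""` would write the same rows to disk.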


The primary focus of the system is to support gesture detection against a dynamic background. To achieve this, the frames are pre-processed and converted to grayscale, and then a background subtraction algorithm is applied. The camera first captures around 10 frames to identify the background and then compares the current frame with the previous frame. If a hand is detected, the background is subtracted and only the hand gesture is converted to grayscale. The hand region is then resized to 28×28 pixels for feature extraction.
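The background-subtraction step can be illustrated on plain grayscale frames stored as lists of rows. This sketch averages the first frames to model the background, then keeps only the pixels that differ from it by more than a threshold. It is a simplified stand-in for the OpenCV routines the system uses: the function names and the threshold of 25 are assumptions, and the comparison is made against the averaged background rather than a single previous frame.

```python
def average_background(frames):
    """Estimate the background as the per-pixel mean of the given frames."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(w)] for r in range(h)]

def foreground_mask(frame, background, threshold=25):
    """Mark pixels that differ from the background by more than threshold."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def extract_foreground(frame, mask):
    """Keep foreground pixels and set everything else to 0."""
    return [[p if m else 0 for p, m in zip(frow, mrow)]
            for frow, mrow in zip(frame, mask)]

# Ten identical frames stand in for the initial background capture.
calibration = [[[50, 50], [50, 50]] for _ in range(10)]
background = average_background(calibration)

current = [[52, 200], [50, 49]]  # one bright pixel plays the role of the hand
mask = foreground_mask(current, background)
print(mask)
print(extract_foreground(current, mask))
```

Small fluctuations such as 52 versus 50 stay below the threshold and are treated as background, which is what makes the approach tolerant of sensor noise.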


After the image dataset has been collected and processed, the images have to be classified. A convolutional neural network (CNN) is used to analyze and classify visual imagery. CNNs are widely used in image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, etc. CNNs are regularized versions of multilayer perceptrons. A CNN consists of convolutional layers, pooling layers, a flattening layer, and fully connected layers, along with activation functions.
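To see how these layers fit together, it helps to trace how the spatial size of a 28×28 input changes from layer to layer. The layer stack below (two 3×3 convolutions, two 2×2 poolings, 64 filters in the last convolution) is a hypothetical configuration chosen for illustration; the source does not state the actual architecture.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def pool_output_size(n, window, stride=None):
    """Spatial size after pooling; stride defaults to the window size."""
    stride = stride or window
    return (n - window) // stride + 1

size = 28                                   # input is 28 x 28
size = conv_output_size(size, kernel=3)     # first convolution
size = pool_output_size(size, window=2)     # first pooling
size = conv_output_size(size, kernel=3)     # second convolution
size = pool_output_size(size, window=2)     # second pooling

filters = 64                                # assumed number of feature maps
flattened = size * size * filters           # length of the flattened vector
print(size, flattened)
```

The flattened length is the number of inputs the first fully connected layer receives.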


Convolution is performed on an image to identify certain features in it. It takes an input signal and applies a filter to it. The convolution layer uses the image matrix and a kernel matrix to extract features from the image. Each output value is the dot product of the kernel matrix with the patch of the image matrix it currently covers.
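The sliding dot product between a kernel and an image can be written directly in a few lines. The sketch below uses no padding and a stride of 1, and, as is usual in CNN layers, does not flip the kernel. The function name and the example matrices are illustrative assumptions.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; each output is a patch-kernel dot product."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # responds to change along the main diagonal
print(convolve2d(image, kernel))  # [[-4, -4], [-4, -4]]
```

A 3×3 image and a 2×2 kernel give a 2×2 output, because the kernel fits in two positions along each axis.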


A convolved image can be too large and therefore needs to be reduced without losing its features. Two types of pooling are max pooling and min pooling. Max pooling picks the maximum value from the selected region, and min pooling picks the minimum value from the selected region.
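Max and min pooling differ only in which value they keep from each region, so one helper can cover both. The sketch below uses non-overlapping 2×2 regions; the function name `pool2d` and the example feature map are assumptions for illustration.

```python
def pool2d(feature_map, size=2, reducer=max):
    """Reduce each non-overlapping size x size region to a single value."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[reducer(feature_map[r + i][c + j]
                     for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
print(pool2d(feature_map, reducer=max))  # [[6, 8], [9, 6]]
print(pool2d(feature_map, reducer=min))  # [[1, 2], [2, 0]]
```

Either way, a 4×4 map shrinks to 2×2, which is the size reduction pooling is used for.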


The flattening layer transforms the multi-dimensional matrix into a 1-D array so that it can be fed into a classifier.
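Flattening simply reads the values out in a fixed order. A minimal sketch, assuming the pooled output is a list of 2-D feature maps and that values are read map by map, row by row:

```python
def flatten(feature_maps):
    """Concatenate every value of every feature map into one flat list."""
    return [value for fmap in feature_maps for row in fmap for value in row]

pooled = [[[1, 2], [3, 4]],   # first feature map
          [[5, 6], [7, 8]]]   # second feature map
print(flatten(pooled))        # [1, 2, 3, 4, 5, 6, 7, 8]
```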


The activation function decides whether a neuron should be activated, based on its input. There are different activation functions, such as sigmoid, tanh, and ReLU. In our system the activation function used is ReLU, one of the most widely used activation functions. If the input to a neuron is negative, the output is 0; otherwise the output equals the input. Because of this, ReLU is simple to compute and gives high accuracy in practice. The ReLU activation function can be represented as f(x) = max(0, x), i.e. f(x) = 0 for x < 0 and f(x) = x for x ≥ 0.
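The three activation functions named above are easy to compare side by side. ReLU returns 0 for negative inputs and passes non-negative inputs through unchanged, while sigmoid and tanh squash their input into a bounded range. The sample inputs below are arbitrary.

```python
import math

def relu(x):
    """f(x) = max(0, x): zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), sigmoid(x), math.tanh(x))
```

ReLU needs only a comparison, whereas sigmoid and tanh each require an exponential, which is one reason ReLU is cheap to compute.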


The proposed system translates sign gestures to text using a CNN. The accuracy of the model is 99.91%. The system does not take facial expressions into account, although they play a vital role in communication. The accuracy of the model was lower in poor lighting. As a future enhancement, the model could be trained on more dynamic video signs involving facial features and expressions.


  1. Amitkumar Shinde, Ramesh Kagalkar. Sign Language to Text and Vice Versa Recognition using Computer Vision in Marathi. In National Conference on Advances in Computing (NCAC 2015).
  2. Purva C. Badhe, Vaishali Kulkarni. Indian Sign Language Translator Using Gesture Recognition Algorithm. In 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS).
  3. Rajesh Mapari, Govind Kharat. Hand Gesture Recognition using Neural Network. In International Journal of Computer Science and Network, December 2012.
  4. S. Wu, H. Nagahashi. Real-time 2D Hands Detection and Tracking for Sign Language Recognition. In 2013 8th International Conference on System of Systems Engineering.
  5. Tülay Karayılan, Özkan Kılıç. Sign Language Recognition.
  6. Geethu G Nath, Anu V S. Embedded Sign Language Interpreter System for Deaf and Dumb People. In 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS).
  7. Sajanraj T D, Beena M V. Indian Sign Language Numeral Recognition Using Region of Interest Convolutional Neural Network. In 2nd International Conference on Inventive Communication and Computational Technologies (ICICCT 2018).
  8. G. Anantha Rao, K. Syamala, P. V. V. Kishore, A. S. C. S. Sastry. Deep Convolutional Neural Networks for Sign Language Recognition.
  9. Kshitij Bantupalli, Ying Xie. American Sign Language Recognition using Deep Learning and Computer Vision. In 2018 IEEE International Conference on Big Data (Big Data).
  10. Surejya Suresh, Mithun Haridas T. P., Supriya M. H. Sign Language Recognition System Using Deep Neural Network. In 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS).

Cite this Page

Sign Language Translation Using Deep Learning. (2022, February 17). Edubirdie. Retrieved February 2, 2023, from