Abstract
We develop a system to detect, interpret, and translate spoken language into sign language and vice versa, for efficient communication with people who have special vocal needs. The system can be integrated with any communication medium that offers a video calling feature. Spoken language detection and interpretation rely mainly on Natural Language Processing, using an N-gram model to convert speech to text. A linear classification model then maps the transcribed text into hand-sign communication in context, rendered with gestures pre-recorded by the person communicating (i.e. their own images). In the reverse direction, imaging hardware such as a camera records the gestures performed by a person, and a Convolutional Neural Network, applied frame by frame together with the Natural Language Processing model, converts them into speech, completing the communication cycle.
Introduction
How do deaf-mutes communicate?
Deaf-mutes communicate with the help of sign language. American Sign Language (ASL) is a complete, natural language that has the same linguistic properties as spoken languages, with grammar that differs from English. ASL is expressed by movements of the hands and face. It is the primary language of many North Americans who are deaf and hard of hearing, and is used by many hearing people as well.
State of the Art (Literature Survey)
A. Speech to American Sign Language
To convert speech to on-screen text or a computer command, a computer has to go through several complex steps. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes separates it into different frequency bands (frequency is the rate at which the sound wave oscillates, heard by humans as differences in pitch). It also normalizes the sound, adjusting it to a constant volume level.
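As a minimal sketch of these digitization steps, the following assumes a WAV file named speech.wav (hypothetical) and uses NumPy/SciPy, neither of which is prescribed by the text:

```python
# Minimal sketch of the digitization steps: read ADC samples, filter noise,
# normalize volume. File name and filter settings are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter

def preprocess(path="speech.wav"):
    rate, samples = wavfile.read(path)           # ADC output: discrete samples
    samples = samples.astype(np.float64)

    # Band-pass to the rough range of human speech to suppress unwanted noise.
    low, high = 300.0, 3400.0
    b, a = butter(4, [low / (rate / 2), high / (rate / 2)], btype="band")
    filtered = lfilter(b, a, samples)

    # Normalize to a constant peak level so volume differences are removed.
    normalized = filtered / (np.max(np.abs(filtered)) + 1e-9)
    return rate, normalized
```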
Early speech recognition systems tried to apply a set of grammatical and syntactical rules to speech. If the words spoken fit into a certain set of rules, the program could determine what the words were. However, human language has numerous exceptions to its own rules, even when it is spoken consistently.
Rule-based systems were unsuccessful because they could not handle these variations. This also explains why earlier systems could not handle continuous speech: you had to speak each word separately, with a brief pause in between. Today's speech recognition systems use powerful and complicated statistical models that rely on probability and mathematical functions to determine the most likely outcome.
The two models that dominate the field today are the Hidden Markov Model and neural networks. These methods involve complex mathematical functions, but essentially they take the information known to the system to figure out the information hidden from it. The Hidden Markov Model is the most common model used for this task. In it, each phoneme is like a link in a chain, and the completed chain is a word. However, the chain branches off in different directions as the program attempts to match the digital sound with the phoneme that is most likely to come next. During this process, the program assigns a probability score to each phoneme, based on its built-in dictionary.
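As a toy illustration of how an HMM scores phonemes and picks the most likely chain, the following sketch runs the Viterbi algorithm over an invented three-phoneme model; the phoneme set, probabilities, and observations are all made up for the example and are not from any real acoustic model:

```python
# Toy Viterbi decoding over hypothetical phoneme states for the word "cat".
import numpy as np

phonemes = ["k", "ae", "t"]                       # invented states
start = np.array([0.8, 0.1, 0.1])                 # P(first phoneme)
trans = np.array([[0.1, 0.8, 0.1],                # P(next phoneme | current)
                  [0.1, 0.1, 0.8],
                  [0.3, 0.3, 0.4]])
emit = np.array([[0.7, 0.2, 0.1],                 # P(acoustic frame | phoneme)
                 [0.1, 0.7, 0.2],
                 [0.1, 0.1, 0.8]])
obs = [0, 1, 2]                                   # indices of observed frames

# Keep, for each state, the probability of the best path ending there.
v = start * emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = v[:, None] * trans * emit[None, :, o]
    back.append(scores.argmax(axis=0))
    v = scores.max(axis=0)

# Trace back the chain of phonemes with the highest overall score.
best = [int(v.argmax())]
for ptr in reversed(back):
    best.append(int(ptr[best[-1]]))
print([phonemes[i] for i in reversed(best)])      # e.g. ['k', 'ae', 't']
```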
B. American Sign Language to Speech
Gesture recognition is an important part of human-robot interaction. To achieve fast and stable gesture recognition in real time without distance restrictions, researchers have used the Microsoft Kinect. The method combines the depth and color information of a target scene with the hand position obtained by a spatial hierarchical scanning method; the region of interest (ROI) in the scene is then extracted by the local neighbor method. In this way, the hand can be identified quickly and accurately in complex scenes and at different distances. Furthermore, a convex hull detection algorithm is used to locate the fingertips within the ROI, so that they can be identified and positioned accurately. The experimental results show that the hand position can be obtained quickly and accurately against a complex background using the improved method, that real-time recognition works over distances of 0.5 m to 2.0 m, and that the fingertip detection rate reaches 98.5% on average. Moreover, the gesture recognition rate of the convex hull detection algorithm exceeds 96%.
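A rough sketch of the convex hull step described above, assuming an already-segmented binary hand mask (the Kinect-based depth segmentation itself is not reproduced here) and OpenCV 4:

```python
# Fingertip candidates from a binary hand mask via convex hull and
# convexity defects; the segmentation producing hand_mask is assumed done.
import cv2
import numpy as np

def fingertip_candidates(hand_mask):
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    hand = max(contours, key=cv2.contourArea)          # largest blob = hand
    hull = cv2.convexHull(hand, returnPoints=False)    # hull as contour indices
    defects = cv2.convexityDefects(hand, hull)         # valleys between fingers
    tips = []
    if defects is not None:
        for start_idx, end_idx, far_idx, depth in defects[:, 0]:
            if depth > 10000:                          # deep defect => finger gap
                tips.append(tuple(hand[start_idx][0]))
                tips.append(tuple(hand[end_idx][0]))
    return tips
```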
Proposed Work
A. Speech to American Sign Language
The method of recording voice remains the same. The only enhancement is a noise-cancelling microphone that does not capture background noise but only nearby voices, even when there are several.
From these we single out the most prominent voice in the audio signal and voices with similar base frequencies. From this digital signal we then take the lower mean of the cumulative frequencies and eliminate all components that fall below that mean.
The digital voice signal is broken down into short snaps of 500 milliseconds. Each snap is then filtered to remove silence and any utterance whose frequency falls below the threshold of normal voice frequency. The reduced signal is mapped against the frequency profiles of all recorded word utterances to find the maximum overlapping set: any word whose profile overlaps by more than 90% is kept and all others are discarded. The best match is taken to be the word that was uttered.
The word mapping is stored as prerecorded data, which can come from different dictionaries, accents, or vernacular tones.
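A minimal sketch of this snap-and-match idea follows; the silence threshold, the use of normalized correlation as the overlap score, and the template store are simplifications of the description above, not the exact method:

```python
# Cut the signal into 500 ms snaps, drop near-silent snaps, and match each
# remaining snap against prerecorded word templates by an overlap score.
import numpy as np

SNAP_MS = 500
SILENCE_RMS = 0.02                      # assumed "normal voice" threshold

def snaps(signal, rate):
    n = int(rate * SNAP_MS / 1000)
    chunks = [signal[i:i + n] for i in range(0, len(signal), n)]
    return [c for c in chunks if np.sqrt(np.mean(c ** 2)) > SILENCE_RMS]

def best_word(snap, templates, min_overlap=0.9):
    """templates: dict mapping a word to a prerecorded snap of similar length."""
    best, best_score = None, 0.0
    for word, ref in templates.items():
        m = min(len(snap), len(ref))
        # crude overlap score: normalized correlation of the two snaps
        score = abs(np.corrcoef(snap[:m], ref[:m])[0, 1])
        if score > best_score:
            best, best_score = word, score
    return best if best_score >= min_overlap else None
```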
B. American Sign Language to Speech
A Convolutional Neural Network (CNN) is a deep learning algorithm that takes an input image, assigns importance (learnable weights) to various aspects of the image, and learns to differentiate one class from another. Its working is loosely inspired by the neurons in the brain. We propose using a CNN to extract features and teach the system what each gesture used in ASL looks like. As a further enhancement, we propose using the OpenPose algorithm to estimate the positions of the fingers for more accurate predictions.
OpenPose is an algorithm that detects keypoints of the human body, including the face, legs, hands, and overall body structure. We propose using the same algorithm to predict the keypoints of the fingers, which can improve the predictions made by the CNN alone.
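As a sketch of such a CNN, the following Keras model assumes 64x64 grayscale gesture images and 26 static letter classes; the layer sizes are illustrative and not taken from this work:

```python
# Minimal CNN sketch for static ASL gesture classification.
from tensorflow.keras import layers, models

def build_gesture_cnn(num_classes=26):
    model = models.Sequential([
        layers.Input(shape=(64, 64, 1)),
        layers.Conv2D(32, 3, activation="relu"),   # learn low-level edges
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),   # learn hand-shape features
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```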
Implementation
A. Speech to American Sign Language
The received audio signal is processed by filtering out the components below the lower mean frequency of the signal. This modulated signal data is then mapped against a speech audio dictionary, and the word with the highest probability is selected. The mapping is audited using NLP, where the context is extracted from the basic root words found in the speech. Once a sentence is formed, its context is fine-tuned to understand the text by means of lexeme analysis.
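A small sketch of the root-word extraction step, using NLTK's WordNet lemmatizer as one possible implementation (NLTK is an assumption; the text names no specific library):

```python
# Reduce each word of a recognized sentence to its root (lexeme) form.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

def root_words(sentence):
    lemmatizer = WordNetLemmatizer()
    # try the verb form first, then fall back to the noun form
    return [lemmatizer.lemmatize(lemmatizer.lemmatize(w, "v"))
            for w in sentence.lower().split()]

print(root_words("she was signing the letters"))
# e.g. ['she', 'be', 'sign', 'the', 'letter']
```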
B. American Sign Language to Speech
The data supplied by the user as input is used to train the CNN model. Once training succeeds, the system can take new samples from the user, recognize them, and perform the associated operation whenever the same gesture is invoked.
The system includes four parts: acquiring a gesture sample, gesture sample processing, run-time gesture recognition, and a control system. For the first step, acquiring a sample from the user, OpenCV activates the system camera to capture the original image; the image is then plotted so we can see what the system sees, and the model performs the corresponding operation. The downloaded dataset contains several different sample sets, and the system is trained on every sample provided in the dataset we assembled.
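A sketch of the acquisition step using OpenCV, with an assumed fixed region of interest and a 64x64 input size matching the CNN sketch above:

```python
# Open the system camera, crop each frame to a hand region, and scale it to
# the CNN input size. ROI coordinates and target size are assumptions.
import cv2

def capture_gesture_frames(roi=(100, 100, 300, 300), size=(64, 64)):
    cam = cv2.VideoCapture(0)                        # default system camera
    try:
        while True:
            ok, frame = cam.read()
            if not ok:
                break
            x, y, w, h = roi
            hand = frame[y:y + h, x:x + w]
            gray = cv2.cvtColor(hand, cv2.COLOR_BGR2GRAY)
            yield cv2.resize(gray, size) / 255.0     # CNN-ready input
            cv2.imshow("gesture", hand)
            if cv2.waitKey(1) & 0xFF == ord("q"):    # press q to stop
                break
    finally:
        cam.release()
        cv2.destroyAllWindows()
```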
The model opens a camera window that takes input from the user. Recognition is performed by the model, and further behaviour is driven by the commands the user has associated with each gesture. At the same time, the mouse pointer movement can be captured and operated without any direct human interaction.
[Figure: a sample image showing how the system converts the input into computer-understandable images.]
Results and Discussion
We report the final performance of the hand gesture recognition system, built with deep learning models, on the hand gesture recognition dataset. The user draws samples from the dataset and checks how accurately the trained model performs on them. [Figure: the model created for training.]
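As an illustration of such an accuracy check, the following reuses the build_gesture_cnn sketch from earlier, with random placeholder arrays standing in for the actual hand gesture dataset:

```python
# Illustrative accuracy check; the arrays below are placeholders only.
import numpy as np

x = np.random.rand(200, 64, 64, 1)        # placeholder gesture images
y = np.random.randint(0, 26, size=200)    # placeholder letter labels

model = build_gesture_cnn()               # from the earlier CNN sketch
model.fit(x[:160], y[:160], epochs=2, verbose=0)
loss, acc = model.evaluate(x[160:], y[160:], verbose=0)
print(f"test accuracy on the held-out split: {acc:.2%}")
```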
Conclusion
From the models developed, we can conclude that the system is able to handle hand gestures provided by any person and to identify what each gesture is. The main point is that the machine is able to understand the images and identify what they show, which is helpful in many ways.
References
- Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields”.
- Kollipara Sai Varun, I. Puneeth, and T. Prem Jacob, “Hand Gesture Recognition and Implementation for Disables Using CNN’s”.