Abstract
Since the outset of human communication, sign language has changed and evolved alongside society. Sign language gives deaf and mute people a chance to communicate with other people, but ignorance of sign language often results in miscommunication and misinterpretation. The language that is supposed to connect people creates a gap instead. This study focused on the design and implementation of a translator from Sign Exact English to voice and text. The system was designed to be used by deaf and mute people to connect with non-signers. A Kinect sensor serves as the image capturing device for the hand gestures, which undergo image processing using a Convolutional Neural Network to produce an output of text and voice translation.
INTRODUCTION
For physically challenged individuals such as the deaf and mute, sign language acts as a mother tongue, giving them a linguistic identity with its own grammar, morphology, and syntax that uses body movements to convey meaning. One of the various forms of sign language is Signing Exact English (SEE), sometimes called Signed Exact English: a system that uses articulation of the hands to convey a message in an exact representation of English grammar and vocabulary. Because of the amount of time needed to learn this language, many people do not want to invest their time in learning it, resulting in a communication gap between signers and non-signers. Since very few people understand sign language and written communication is cumbersome for many deaf and mute people, there is a challenge to make this communication more understandable and accessible. The struggles these people experience in interacting with others constrain their professional and social lives. Furthermore, this type of communication is impersonal and slow in face-to-face conversations, especially in emergency situations where there is a need to communicate quickly. To ease this gap, this paper proposes a text and voice translator for sign language recognition using a Kinect sensor and an artificial neural network.
The main objective of the study is to develop a Sign Exact English sign language to text and voice translator utilizing a Kinect sensor and image processing, to help narrow the communication gap encountered by mute, deaf, and speech-impaired people and those who are not conversant with sign language.
RELATED WORKS
Several inventions have been made possible by continuing research on sign languages, the language of the deaf. Different systems and processes that recognize gestures have been developed with the help of modern technologies such as sensors, Kinect technology, and various algorithms [1], [2], [3], [4], [5]. Sign languages use hand shape together with location and motion to build new words or meanings [6], and a system combining a 6DOF IMU motion sensor and nine flex sensors with image processing focused mainly on hand articulations was developed to translate FSL for medical purposes [1]. This would be helpful for the Deaf community, which favors the cultural view of deafness and rejects the "pathological" view [7], [8]. From the pathological perspective, deafness is regarded as a medical condition that must be compensated and corrected, and people without this disability often find it difficult to grasp the concept of Deaf communities and Deaf culture. Another invention recognizes finger spelling and hand gestures by considering an appropriate number of templates per gesture and applying a template matching algorithm [2]. A glove-based recognition system by Kadous uses a simple Power Glove from a Nintendo game console to capture gestures and recognize isolated signs [3].
Manifold learning has also been used for sign language recognition [9], [10]. Although many technologies are available on the market, Kinect technology is the most widely used for recognizing hand gestures because of its capabilities [11], and it is very useful for tracking hand and finger motion [12]. Several studies used Kinect but with different processes. One of them suggested the use of 3D depth data and BoostMap embedding [13]. Another study suggested using the Kinect sensor to track the feature joint points of the human body [14]. Researchers have also developed an application interface that recognizes human body gestures to control a LEGO robot [4]. Combining CNN, Keras, and the Kinect sensor into a system that recognizes sign languages can greatly increase the recognition rate, because as the system trains on the collected data, the recognition accuracy increases [15], [5].
METHODOLOGY
This section presents the methods that the researchers used in conducting the study. A constructive research method was adopted as the methodological approach.
A. System Architecture
As shown in Fig. 1, the sign language gesture is the input data to be manipulated by the system. The gesture is captured by the Kinect sensor connected to the computer. The acquired gesture undergoes image processing using a Convolutional Neural Network. If the gesture matches a stored sign in the database, it is translated into its voice and text equivalent, with the text displayed on the monitor.
The Kinect sensor serves as the image capturing device for the sign language. It is connected to a computer that identifies and analyzes the captured sign language using a Convolutional Neural Network for feature extraction. Matched images are translated to their corresponding meaning in text and voice.
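The end-to-end flow can be summarized as capture, pre-process, classify, and output. The minimal Python sketch below illustrates this loop only; it assumes the Kinect colour stream is exposed as an OpenCV capture device, and the names `preprocess`, `model`, and `words` stand in for the components described in the later subsections rather than the exact implementation.

```python
# Illustrative sketch of the capture -> CNN -> text output loop.
# Assumes the Kinect colour stream is reachable as an OpenCV capture device;
# in practice a Kinect SDK wrapper would supply the frames.
import cv2
import numpy as np

def translate_loop(model, preprocess, words):
    """model: trained Keras CNN; preprocess: frame -> (1, H, W, C) array;
    words: list mapping class index -> SEE word."""
    cap = cv2.VideoCapture(0)          # stand-in for the Kinect colour stream
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x = preprocess(frame)          # mask + de-noise + resize (see later sections)
        probs = model.predict(x, verbose=0)[0]
        word = words[int(np.argmax(probs))]
        cv2.putText(frame, word, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("SEE translator", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
```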
B. Data Collection Method
Hand gestures based on Signing Exact English (SEE) were used to build the image template database. Thirty (30) volunteers from the Philippine Association of Interpreters for Deaf Empowerment (PAIDE) were exposed to the Kinect sensor and asked to perform 14 hand articulations that correspond to 14 common words. For each hand gesture, shown in Figures 2.a – 2.n, 1,200 images were taken, giving a dataset of 16,800 images in total. The collected gestures are stored in the system database, as shown in Figure 2, to serve as the reference for sign language translation.
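With 14 gesture classes of 1,200 images each, the templates can be kept in one folder per word and loaded with a standard Keras generator. The layout and loader below are only a sketch of how such a database might be organized; the folder names, image size, and train/validation split are assumptions, not the system's documented configuration.

```python
# Sketch: load the 14-class, 16,800-image gesture database with Keras.
# Hypothetical layout: dataset/<word>/<image files> for each of the 14 words.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (64, 64)     # assumed CNN input resolution
BATCH_SIZE = 32

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "dataset", target_size=IMG_SIZE, color_mode="grayscale",
    class_mode="categorical", batch_size=BATCH_SIZE, subset="training")

val_gen = datagen.flow_from_directory(
    "dataset", target_size=IMG_SIZE, color_mode="grayscale",
    class_mode="categorical", batch_size=BATCH_SIZE, subset="validation")
```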
C. Image Acquisition
1) Capturing image for translation
This routine is executed to set the hand, that is, to obtain the skin color of the user of the translator. The user simply needs to cover all the small squares on the screen with his or her hand. When the hand fits the rectangular region, the user presses "c" to capture it.
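A hedged OpenCV sketch of this calibration step is given below: small sampling squares are drawn on the preview, and when the user presses "c" the pixels inside them are read to estimate a skin-colour range. The square size and the HSV margins are illustrative values, not the system's exact parameters.

```python
# Sketch: sample the user's skin colour from small on-screen squares (press 'c').
import cv2
import numpy as np

def sample_skin_range(cap, boxes, margin=(10, 60, 60)):
    """boxes: list of (x, y) top-left corners of 20x20 sampling squares."""
    while True:
        ok, frame = cap.read()
        if not ok:
            continue
        for (x, y) in boxes:
            cv2.rectangle(frame, (x, y), (x + 20, y + 20), (0, 255, 0), 1)
        cv2.imshow("Set hand", frame)
        if cv2.waitKey(1) & 0xFF == ord('c'):
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            samples = np.vstack([hsv[y:y + 20, x:x + 20].reshape(-1, 3)
                                 for (x, y) in boxes]).astype(np.int32)
            lower = np.clip(samples.min(axis=0) - margin, 0, 255).astype(np.uint8)
            upper = np.clip(samples.max(axis=0) + margin, 0, 255).astype(np.uint8)
            cv2.destroyWindow("Set hand")
            return lower, upper
```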
2) Acquiring User’s Mask
The region captured by the set-hand routine is converted to grayscale, with the skin color converted to white and all other colors to black. This mask specifies the color of the hand for gesture identification.
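For illustration, the mask step can be expressed as a simple colour threshold: pixels inside the sampled skin range become white and everything else black. The paper does not specify the exact thresholding used, so the snippet below is only one plausible realization based on the calibration sketch above.

```python
# Sketch: binary hand mask from the sampled skin-colour range.
import cv2

def hand_mask(frame_bgr, lower, upper):
    """lower/upper: HSV bounds from sample_skin_range()."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower, upper)   # skin -> 255 (white), rest -> 0 (black)
    return mask
```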
3) Translation
The corresponding figure shows the translation process, featuring the gesture identification and the display of its equivalent text.
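To make the translation audible as well as readable, the predicted class index can be mapped to its SEE word and passed to a text-to-speech engine. The sketch below uses the pyttsx3 library as one possible offline TTS backend; the paper does not name the speech engine, so this choice, the placeholder word list, and the confidence threshold are all assumptions.

```python
# Sketch: map a CNN prediction to its word and speak it (pyttsx3 is an assumed TTS backend).
import numpy as np
import pyttsx3

# Placeholder labels; the 14 actual SEE words from the dataset are not listed here.
WORDS = ["word_%d" % i for i in range(14)]

engine = pyttsx3.init()

def speak_prediction(probs, words=WORDS, threshold=0.8):
    """probs: softmax output of the CNN for one frame."""
    idx = int(np.argmax(probs))
    if probs[idx] < threshold:             # ignore low-confidence frames
        return None
    word = words[idx]
    engine.say(word)                       # queue the word for speech output
    engine.runAndWait()
    return word
```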
D. Image Pre-processing
Due to the embedded interface and environment of the Kinect sensor, captured images are susceptible to unwanted noise, so image de-noising, a standard pre-processing procedure in computer vision, needs to be employed. First, to maintain the depth of the original image, a 2D convolution filter with a 5x5 kernel is applied: the kernel is centered on each pixel, the weighted pixel values under the kernel are summed, and the output image keeps its original size. Next, a low-pass Gaussian filter is applied to blur the image by convolving each pixel with its neighborhood, which results in a blurred image with more refined boundaries and edges. Finally, an adaptive median blur, as described by (2011), is used for noise suppression. In this method, the kernel frame is centered on each pixel (m, n) of the original image, and the median value of the pixels within the kernel frame is computed; the pixel at coordinates (m, n) of the output image is set to this median value. The method effectively reduces noise without compromising significant features in the image.
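Read as a pipeline, the de-noising stage is a 5x5 2D convolution, followed by a Gaussian low-pass blur, followed by a median filter. The OpenCV sketch below renders these three steps; the Gaussian and median kernel sizes are assumptions, and the plain median blur stands in for the adaptive variant described above.

```python
# Sketch: de-noising pipeline (2D convolution -> Gaussian blur -> median blur).
import cv2
import numpy as np

def denoise(img):
    # 1) 5x5 averaging kernel applied with a 2D convolution;
    #    ddepth=-1 keeps the depth and size of the original image.
    kernel = np.ones((5, 5), np.float32) / 25.0
    img = cv2.filter2D(img, -1, kernel)
    # 2) Low-pass Gaussian filter smoothing each pixel with its neighbourhood.
    img = cv2.GaussianBlur(img, (5, 5), 0)
    # 3) Median filter: each output pixel is the median of its kernel neighbourhood
    #    (standard median blur used here in place of the adaptive variant).
    img = cv2.medianBlur(img, 5)
    return img
```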
E. Image Feature Extraction and Classifier
A Convolutional Neural Network (CNN) is a neural network model analogous to the human nervous system. Just as humans learn through experience, practice, and repetition, the computer should be able to keep a memory and find associated images in the most efficient way. A CNN contains three components, each with multiple filters that extract the pixels of each frame to obtain higher-level features for classification. It includes:
The convolutional layer, which specifies the number of convolutional filters applied to each frame and applies the ReLU activation function to the model's output. The pooling layer, which down-samples the images in the dataset; it follows the convolutional layers and reduces the dimensions of each feature map. A fully connected layer then maps the extracted features to the output classes.
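Given the stated use of a CNN with Keras, the classifier could look like the sketch below: stacked convolution + ReLU + max-pooling blocks followed by a dense softmax over the 14 gesture classes. The filter counts and layer sizes are illustrative, not the paper's exact architecture; the epoch and batch-size values in the comment echo those reported in the conclusion.

```python
# Sketch: a Keras CNN of the kind described above (layer sizes are illustrative).
from tensorflow.keras import layers, models

def build_model(input_shape=(64, 64, 1), num_classes=14):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),      # pooling reduces the feature-map dimensions
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),   # 14 SEE words
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training with the epoch/batch-size pairing reported in the conclusion, e.g.:
# model.fit(x_train, y_train, epochs=600, batch_size=600)
```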
CONCLUSION AND FUTURE WORKS
The rapid increase in the number of powerful devices in our daily life presents great opportunities and challenges for integrating speech- and hearing-impaired individuals into society. In this study, a system for translating sign language to speech and text was developed.
- It is concluded that the combination of text and voice output is the most efficient way of translating sign language, so that non-signers can not only read but also listen to the translation.
- Since the Kinect Sensor is, at its core, a camera, it provided better visuals when capturing the gestures compared to a normal web camera. Using the Kinect Sensor, the noise on the mask is minimized.
- The use of Sign Exact English has worked well for the system as it is syntax-based and is widely used in the Deaf Community but less complicated compared to other types of sign language.
- It is concluded that the epoch and batch size affect the accuracy of the system: the closer the value of the epoch to the batch size, the higher the accuracy. The highest accuracy attained is 96.96%, with an epoch value of 600 and a batch size of 600. The system also passed the performance evaluation in alpha and beta testing, with an average accuracy of 94.65% in alpha testing and 86.44% in beta testing.
Future work should focus on increasing the database of recognized words by adding dynamic gestures, which can be done using video processing, to allow more efficient communication; improving the accuracy of the system using different parameters identified through further experimental evaluation; porting the system to other sign languages such as American Sign Language; and using the latest Graphics Processing Unit (GPU) to improve the training speed for the gestures.
REFERENCES
- I. Lim, J. Lu, C. Ng, T. Ong and C. Ong, 'Sign-language Recognition through Gesture & Movement Analysis (SIGMA),' in DLSU Research Congress 2015, De La Salle University, Manila, Philippines, 2015
- K. C. Carrera, A. P. Erise, E. M. Abrena, S. J. Colot and R. Tolentino, 'Application of Template Matching Algorithm for Dynamic Gesture Recognition of American Sign Language Finger Spelling and Hand Gesture,' Asia Pacific Journal of Multidisciplinary Research, vol. 2, no. 4, 2014
- M. W. Kadous, 'GRASP: Recognition of Australian Sign Language Using Instrumented Gloves,' School of Computer Science and Engineering, University of New South Wales, 1995
- D. Maraj, A. Maraj and A. Hajzeraj, 'Application Interface for Gesture Recognition with Kinect Sensor,' 2016 IEEE International Conference on Knowledge Engineering and Applications (ICKEA), Singapore, 2016
- T. D. Sajanraj and M. Beena, 'Indian Sign Language Numeral Recognition Using Region of Interest Convolutional Neural Network,' in 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 2018
- S. Goldin-Meadow and D. Brentari, 'Gesture, sign and language: The coming of age of sign language and gesture studies,' Behavioral and Brain Sciences, 2017
- M. Regan, 'Deaf Communities,' 2009. [Online]. Available: www.multilingual-matters.com [Accessed: 01-Aug-2018]
- P. Ladd, 'Understanding Deaf Culture: In Search of Deafhood,' 528 pp., 2003
- E. P. Cabalfin, L. B. Martinez, R. C. L. Guevara and P. C. Naval, 'Filipino Sign Language Recognition using Manifold Projection Learning,' in TENCON 2012 IEEE Region 10 Conference, Cebu, Philippines, 2012
- Y. Lin, X. Chai, Y. Zhou and X. Chen, 'Curve Matching from the View of Manifold for Sign Language Recognition,' in ACCV 2014: Computer Vision - ACCV 2014 Workshops, 2014
- R. Sarkar, 'Getting Started with Kinect Sensor,' 2010
- F. Destreza, 'Sign Language to Voice Translator,' IJACTT, 2012
- H. D. Yang, 'Sign Language Recognition with the Kinect Sensor Based on Conditional Random Fields,' Sensors, 2014
- V. Sombandith, A. Walairacht and S. Walairacht, 'Recognition of Lao Sentence Sign Language using Kinect Sensor,' in 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket, Thailand, 2017
- M. ElBadawy, A. S. Elons, H. Shedeed and M. F. Tolba, 'Arabic Sign Language Recognition with 3D Convolutional Neural Networks,' in 2017 Eighth International Conference on Intelligent Computing