Digital Image Authentication of Holy Quran Using Arabic OCR

ABSTRACT

Data integrity is one of the primary concepts in information security. This project lets users authenticate image-based digital copies of the Holy Quran found on the internet. Doing this manually could take a lot of time, and users who have not memorized the Holy Quran would not be able to detect any tampering. The software has two major phases: (1) an OCR system extracts text from the input (digital images of the Holy Quran); (2) the extracted text is compared with a verified database of the Holy Quran through a string matching algorithm to check the input for authenticity. If there is no tampering, the software reports the copy as verified; otherwise it reports the copy as tampered.

Introduction

Optical Character Recognition (OCR) is a field of research in artificial intelligence, pattern recognition, and computer vision. OCR is a common technique for digitizing images of printed or handwritten text so that the text can be electronically edited, searched, and stored more easily and efficiently. Despite a century of research and development in this field, machines are still not close to human reading capabilities. The objective of an OCR system is to recognize text, as people do, in a complex document.

OCR is a widespread technology for recognizing text inside images such as scanned documents and photos. It is used to convert virtually any kind of image containing written text (typed, handwritten, or printed) into machine-readable text data.

Nowadays, many multilingual copies of the Holy Quran are available on the web. The Glorious Qur’an, the Muslims’ sacred and most authentic Book, was revealed in Arabic, the most expansive language in the world. The Prophet Muhammad (PBUH), to whom this Book was revealed over twenty-three (23) years, made complete arrangements for its authenticity by conveying Allah’s message to his Companions through exact word-by-word recitation that preserved the precise order. Today, with the huge advances in information technology (IT), people prefer new smart devices for reading, reciting, and memorizing the Quran over the printed copy of the Holy Quran, which is considered more reliable and authentic. No doubt this advancement has made it easy for everyone to access the Quran anytime and anywhere, but it has also opened a real threat to the authenticity of the digital Quran. In the past, a lot of work has been done to define a proper mechanism for authenticating the digital (online) Quran.

The Holy Quran was originally written in the Arabic language. The exact wording and expressions of the Quranic verses are identical in every Arabic edition of the Quran.[1] However, one issue with translation is that it may change, deliberately or unintentionally, the meaning of certain verses when they are rendered into another language. This is a general issue when dealing with translation. Moreover, within the same language, the same verse may be written in different words because of deliberate falsification or because of translation issues.

Because of the sensitivity and the nature of Quranic verses, there is a vital need to continuously monitor the Quranic verses and passages published on Internet sites and pages to make sure that they are verified and have not been changed or falsified.

Nowadays it is a big challenge for users to identify a valid copy of the digital (online) Quran. Muslims everywhere in the world face a lack of awareness about the distribution of fake digital versions of the Quran that have not been approved by recognized Muslim scholars. Muslims around the world, individuals as well as groups, have been putting huge effort into detecting and eradicating unauthorized copies of the Holy Quran. There should be one committee of Muslim scholars and IT experts that has both technological and Islamic knowledge about the Quran. We therefore propose a centralized digital Quran authentication system that uses a recent authentication approach. The system aims to combine the knowledge of outstanding Muslim scholars and technology experts to provide an authentic, valid, and error-free digital Quran to every Muslim.

Related work

Although a large number of approaches to printed Arabic character recognition have been proposed in the last few years, the recognition rate of Arabic OCR systems still needs to be improved. This section gives an overview of some of these approaches:

  1. Prasad et al. proposed an HMM-based OCR system for machine-printed Arabic documents. They applied a combination of script-independent and script-specific techniques to the glyph models and the language model. The method was tested on machine-printed Arabic documents and showed a 40% reduction in error rate relative to the baseline configuration.
  2. Alma’adeed et al. proposed an OCR model based on HMMs for recognizing Arabic words written by 100 different writers. Normalization is applied first. Next, the skeleton and edges of each word are extracted and used as features. Classification is then performed with an HMM-based approach, and finally the extracted words are compared with entries in a dictionary. The system obtained an accuracy rate of 45%.
  3. Sakhr developed the Sakhr OCR for Arabic character recognition. It uses an artificial neural network (ANN), with a segmentation accuracy of 98% and a recognition accuracy of 99.8% for printed text.
  4. Fehri and Ben Ahmed proposed a hybrid system that uses radial basis function networks and HMMs to recognize printed Arabic text after identifying the font used.
  5. Sabbour and Shafait presented an OCR system for Arabic-script languages that was trained to recognize both the Urdu Nastaleeq and Arabic Naskh fonts. For Arabic text it achieved an accuracy of 86%.
  6. Elgammal et al. proposed an OCR system that uses graph-based segmentation to produce sub-characters and a classifier to recognize them. The system was applied to a printed-text dataset and produced a classification rate of 94%.
  7. Cheung et al. presented an OCR system that uses a recognition-based segmentation technique. An Arabic word segmentation algorithm separates horizontally overlapping Arabic words and sub-words. The system showed a recognition accuracy of 90%.

About Arabic language

Arabic text has 28 characters and 10 numerals. Each character has up to four forms depending on its position in the word (isolated, beginning, middle, and end). Therefore, it is expected that there are 120 different character forms in each font after adding the new character ( ال ), which is created by writing ALIFON ( أ ) after LAMON ( ل ). Unfortunately, a character may have different forms at the same position in the same font. As shown in Fig. 5, BAAON ( ب ) has up to five forms at the beginning of a word in the same font (Traditional Arabic font). Its form depends not only on its neighbor but also on the neighbor of its neighbor, as in BEMA ( بما ) and BEM ( بم ).

Arabic language recognition difficulties

The Arabic language is not an easy language for automatic recognition. Some of the particular difficulties are:

  • Characters are cursive and not separated as is the case with Latin script. Hence recognition requires a sophisticated segmentation algorithm.
  • Characters change shape depending on their position in the word, and much of the distinction between isolated characters is lost when they appear in the middle of a word.
  • Sometimes Arabic writers neglect to include whitespaces between words when the word ends with one of the six letters.
  • Repeated characters are sometimes used, even if this breaks Arabic word rules; especially in online “chat” sites, for example while it is actually.
  • There are two ending letters which sometimes indicate the same meaning but are different characters. For example, and have the same meaning, the first is correct but the second form is often encountered. The same problem exists with the character pair.
  • There is often misuse of the letter ALEF in its different shapes.
  • The individual letter which means 'and' in English is often misused. In this case it is a word and should have whitespace after it, but most of the time Arabic writers neglect to include the whitespace.

About OCR

Pre-processing:

Pre-processing puts the input image into a condition that the later stages of the OCR system can process. It includes many operations, such as gray scaling, thresholding, resizing, binarization, dilation, and erosion.

In our system we first take input from the user, which is either a verse or a single page of the digital Holy Quran, and pass it on for pre-processing. The pre-processing steps are described below:

RESIZING

GRAY SCALE

In this step the image is converted into a gray scale image, that is, an image made up of different shades of gray. The gray scale result is used in the next pre-processing step, binarization.

BINARIZATION

In this step the gray scale image is converted into a binarized image, which uses only black and white. Black represents the pixel value 0 and white represents the pixel value 255. The basic purpose of converting the image to gray scale and then to binary is to save space: an RGB image needs considerably more storage per pixel, and processing the larger data slows computation and makes the overall system less efficient. In our system we use the values 127 and 255 to binarize the image.
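The gray scale and binarization steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function names and the tiny sample image are made up, the luminance weights are the standard ones for RGB-to-gray conversion, and it is assumed that 127 is the threshold and 255 the value given to pixels above it (the behavior of a call such as cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY) in OpenCV).

```python
# Illustrative sketch only: names and sample data are hypothetical.

def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray levels."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=127, max_value=255):
    """Pixels brighter than `threshold` become `max_value`; the rest become 0."""
    return [
        [max_value if pixel > threshold else 0 for pixel in row]
        for row in gray_image
    ]


# A 2x2 sample image: two light pixels and two dark pixels.
sample_rgb = [
    [(255, 255, 255), (10, 10, 10)],
    [(200, 200, 200), (90, 60, 30)],
]
gray = to_grayscale(sample_rgb)
binary = binarize(gray)
```

After this step every pixel is either 0 or 255, so each pixel carries one value instead of three color channels.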

DILATION

In this step the foreground of the binarized image is spread out, so the text region grows, which is useful for segmentation. Every pixel that falls under the kernel window is spread. This technique is also useful when the image contains noise: first apply erosion, the inverse of dilation, which shrinks the white regions and so removes noise, and then apply dilation, which enlarges the text region. The number of iterations depends on the condition of the image and must be an integer, not a floating-point value. In our OCR system we used only a single iteration.
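The erosion-then-dilation sequence described above can be sketched without any library. This is a simplified illustration under stated assumptions: text pixels are white (255) on a black background, the kernel is an odd-sized rectangle of ones, each operation runs for a single iteration, and the helper names are hypothetical. A real pipeline would more likely call cv2.erode and cv2.dilate.

```python
# Illustrative sketch only: a rectangular kernel of ones, one iteration.

def _morph(binary, kernel_h, kernel_w, combine):
    """Apply `combine` (min or max) over a kernel window centered on each pixel."""
    rows, cols = len(binary), len(binary[0])
    half_h, half_w = kernel_h // 2, kernel_w // 2
    out = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            # The window is clipped at the image border.
            window = [
                binary[yy][xx]
                for yy in range(max(0, y - half_h), min(rows, y + half_h + 1))
                for xx in range(max(0, x - half_w), min(cols, x + half_w + 1))
            ]
            out_row.append(combine(window))
        out.append(out_row)
    return out


def dilate(binary, kernel_h=3, kernel_w=3):
    """A pixel turns white if any pixel under the kernel is white."""
    return _morph(binary, kernel_h, kernel_w, max)


def erode(binary, kernel_h=3, kernel_w=3):
    """A pixel stays white only if every pixel under the kernel is white."""
    return _morph(binary, kernel_h, kernel_w, min)


# A 3x3 block of "text" plus one isolated noise pixel at the right edge.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0, 0, 0],
    [0, 255, 255, 255, 0, 0, 255],
    [0, 255, 255, 255, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = dilate(erode(image))  # erosion removes the noise, dilation restores the block
```

In this example the isolated pixel disappears during erosion, while the 3x3 block shrinks to its center and is grown back to its original size by the dilation.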

SEGMENTATION

Dividing a source image into sub-components is known as segmentation. In an Optical Character Recognition system, segmentation is used to locate and separate the text. Different types of segmentation are required before the actual classification and recognition:

  • a. Line segmentation
  • b. Word/ligature segmentation
  • c. Character segmentation

Line Segmentation

Line segmentation is the process of separating individual lines from a text script or document. It is the initial segmentation step in an OCR system and is done after the necessary pre-processing steps. It can be implemented through different techniques, such as a contour-based approach or even machine learning.

In our project, the dilated image is used for further processing. Contours are drawn on the dilated image to segment the lines out of the page. The kernel window in our system is kernel = np.ones((5, 100), np.uint8). This spreads the text in each line with a window of size 5 along the y-axis (top to bottom) and 100 along the x-axis (left to right). Changing the window size changes the behavior and produces different results. Increasing the x-axis size merges all the words together into one solid line. The cropped-out line segments come back in arbitrary order, so they are then sorted from top to bottom of the image.
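The box-finding and sorting part of this step can be sketched as below. It is a stand-in, not the project's implementation: instead of OpenCV contours it labels connected white regions with a breadth-first search and returns one bounding box per region, which plays the role that cv2.findContours followed by cv2.boundingRect would play in an OpenCV pipeline. All function names and the sample page are hypothetical, and the page is assumed to be already dilated so that each text line is one white blob.

```python
# Illustrative sketch only: connected regions stand in for OpenCV contours.
from collections import deque


def bounding_boxes(binary):
    """Return an (x, y, w, h) box for each 8-connected white region."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for y in range(rows):
        for x in range(cols):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            min_x = max_x = x
            min_y = max_y = y
            while queue:
                cy, cx = queue.popleft()
                min_x, max_x = min(min_x, cx), max(max_x, cx)
                min_y, max_y = min(min_y, cy), max(max_y, cy)
                # Visit the 8 neighbors, clipped at the image border.
                for ny in range(max(0, cy - 1), min(rows, cy + 2)):
                    for nx in range(max(0, cx - 1), min(cols, cx + 2)):
                        if binary[ny][nx] != 0 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes


def sort_top_to_bottom(boxes):
    """Order line boxes by their top edge, whatever order they were found in."""
    return sorted(boxes, key=lambda box: box[1])


def crop(binary, box):
    """Cut the region covered by `box` out of the image."""
    x, y, w, h = box
    return [row[x:x + w] for row in binary[y:y + h]]


# A dilated "page" with two line blobs.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 255, 255, 255, 255, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [255, 255, 255, 0, 0, 0],
    [255, 255, 255, 0, 0, 0],
]
lines = sort_top_to_bottom(bounding_boxes(page))
line_images = [crop(page, box) for box in lines]
```

Sorting explicitly by the top coordinate keeps the line order independent of the order in which the detector happens to return the regions.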

Word segmentation:

Word segmentation is the process of segmenting meaningful words out of a text line. This step is done after the lines have been segmented from the text script. To perform it, the image manipulation steps are applied first. There are various techniques for word segmentation, such as histograms, vertical projection profiles, and contouring.
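Of the techniques listed, the vertical projection profile is the simplest to illustrate. The sketch below is an assumption-laden example rather than the project's method: it counts white pixels per column of a binarized line image and cuts the line wherever at least min_gap consecutive columns are empty. The function names, the min_gap value, and the sample line are hypothetical, and the segments are returned right to left on the assumption that Arabic reading order is wanted.

```python
# Illustrative sketch only: word splitting by vertical projection profile.

def vertical_projection(binary_line):
    """Count foreground (non-zero) pixels in each column of a line image."""
    width = len(binary_line[0])
    return [sum(1 for row in binary_line if row[x] != 0) for x in range(width)]


def split_on_gaps(profile, min_gap=2):
    """Return (start, end) column ranges separated by >= min_gap empty columns.

    `end` is exclusive. Shorter gaps are treated as spacing inside a word.
    """
    segments = []
    start = None
    last_ink = None
    gap = 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            last_ink = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start, last_ink + 1))
                start = None
    if start is not None:
        segments.append((start, last_ink + 1))
    return segments


def right_to_left(segments):
    """Order segments from the rightmost to the leftmost."""
    return sorted(segments, key=lambda seg: seg[0], reverse=True)


# A 2-row line image: one empty column inside the first word, three between words.
line = [
    [0, 255, 255, 0, 255, 0, 0, 0, 255, 255],
    [0, 255, 0, 0, 255, 0, 0, 0, 0, 255],
]
profile = vertical_projection(line)
words = right_to_left(split_on_gaps(profile, min_gap=2))
```

The min_gap threshold is the key design choice here: too small and the gaps between ligatures split a word, too large and neighboring words merge.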

Ligature segmentation

Ligature segmentation is the process of splitting words into groups of connected characters. A ligature is a sub-part of a word. There are various techniques for ligature segmentation, such as connected components (which use the spaces inside a word), contouring, and heuristic approaches. Good ligature segmentation improves the results of recognition and classification.

ROI means region of interest, and it plays an important role in segmentation. Selecting it is quite difficult, but it gives efficient and accurate results that are very useful in the later steps of the OCR implementation. The ROI is selected by drawing rectangular bounding boxes on the image with the help of OpenCV in Python. The cv2.findContours() function finds the regions where contours should be drawn by searching for all connected white pixels in the dilated image. The contours are then drawn with the OpenCV function cv2.drawContours(). Finally, after the contours have been found and drawn, the most important step is performed: sorting. Sorting is done once the bounding boxes have been drawn. Each bounding box gets a number that is used in sorting. The sorting technique is the same as the one applied earlier in line segmentation.

Character segmentation:

In character segmentation, we segment individual characters out of a word. Character segmentation is the most difficult step for OCR, as it separates all connected and disconnected characters. Each character can take several shapes, sometimes more than four. This method segments the text into individual characters, including the adhi ashkal (half forms). An inefficient segmentation process leads to incorrect recognition, classification, or rejection. The segmentation process is carried out only after the image has been pre-processed. According to Casey and Lecolinet.[2]

AUTHENTICATION OF QURANIC VERSE USING STRING MATCHING

The last but by no means least stage of our proposed system is text authentication. This stage achieves the main purpose of the proposed system, which is to authenticate Holy Quran verses in images.

We authenticate the text in the images against the verified database in the system. The text extracted from the images by the OCR system is compared with the verified text stored in the database.

The extracted text must match the verified text in the database completely to ensure its authenticity and that no tampering has taken place. If the extracted text matches the verified text, it is authenticated; if not, it is unauthenticated.

This task is carried out by simply searching for the text in the verified database: if the same string is found, the check returns true; if not, it returns false. The verified Holy Quran document is taken from http://tanzil.net in the quran-simple-clean script.
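The lookup described above amounts to a substring search. A minimal sketch follows, assuming the verified text has already been loaded into a single string; the function names are hypothetical, the two sample lines (the opening verses of Surah Al-Fatiha, without diacritics) stand in for the full verified document, and the whitespace normalization is an added assumption rather than something the project states.

```python
# Illustrative sketch only: exact substring matching against verified text.

def normalize(text):
    """Collapse runs of whitespace so spacing differences do not affect matching."""
    return " ".join(text.split())


def is_authentic(extracted_text, verified_text):
    """Return True if the extracted text occurs verbatim in the verified text."""
    needle = normalize(extracted_text)
    return bool(needle) and needle in normalize(verified_text)


# Stand-in for the verified database (normally the full verified document).
verified = "\n".join([
    "بسم الله الرحمن الرحيم",
    "الحمد لله رب العالمين",
])

ocr_output = "بسم  الله الرحمن الرحيم"   # extra space, as OCR output might contain
unknown_text = "نص غير موجود"            # a string that is not in the verified text

result_ok = is_authentic(ocr_output, verified)
result_bad = is_authentic(unknown_text, verified)
```

A plain substring search also accepts fragments of a verse, so a stricter variant could compare the extracted text against whole verses instead.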

CONCLUSION & FUTURE WORK

We achieved all the milestones of our project. Our system can now extract Quranic verses from the Digital Holy Quran (DHQ) into editable form using Arabic OCR. The basic purpose of the OCR is to authenticate a Quranic verse, that is, to decide whether it is an intact verse or a tampered one. The system was developed to check the integrity of the DHQ. It is composed of two main functions: Arabic OCR and string matching. String matching compares the text extracted by the OCR with the authenticated Quranic data stored in the database, and the authentication result is concluded from that match. Users can now confidently recite the DHQ from PDFs or images.

Although some OCR engines already exist, such as Tesseract, ABBYY FineReader, and Asprise OCR SDK, these engines also have some limitations.

To cover the shortcomings of this project, future work can make the OCR system more efficient and scale it up. Because this OCR system has been tested on only one font style and works only for text without diacritics, future work can extend it to different scripts and font styles and to text with diacritics.

Also, because of time limitations the OCR was trained on only one chapter; future work can train the system on the complete Holy Quran. Authentication is done by looking for the extracted text in the verified data, but not chapter by chapter; this part of the system can also be improved. Authentication could be made more efficient by applying machine learning or some other technique that gives efficient, effective, real-time, and accurate results.

REFERENCES

  1. R. Prasad, S. Saleem, M. Kamali, R. Meermeier, and P. Natarajan, 'Improvements in Hidden Markov Model Based Arabic OCR,' Proceedings, vol. 2, no. Conf 19, pp. 769-772, 2008.
  2. S. Alma'adeed, C. Higgens, and D. Elliman, 'Recognition of off-line handwritten ...,' Proceedings of the 16th International Conference on Pattern Recognition.
  3. Sakhr Software, www.sakhrsoft.com, 2003.
  4. Y. LeCun, B. Boser, J. Denke, and L. Jackel, 'Back-Propagation Applied to Handwritten Zip Code Recognition,' Neural Computation, pp. 541-551, 1989.
  5. Sabbour and F. Shafait, 'A segmentation-free approach to Arabic and Urdu OCR,' Document Recognition and Retrieval, vol. 8658, pp. 86580N-86580N-12, 2013.
  6. A. M. Elgammal and M. A. Ismail, 'A graph-based segmentation and feature extraction framework for Arabic text recognition,' Proceedings of the Sixth International Conference on Document Analysis and Recognition, pp. 622-626, 2001.
  7. A. Cheung, M. Bennamoun, and N. W. Bergmann, 'An Arabic optical character recognition system using recognition-based segmentation,' Pattern Recognition, vol. 34, no. 2, pp. 215-233, 2001.

Cite this paper

Digital Image Authentication of Holy Quran Using Arabic OCR. (2022, February 21). Edubirdie. Retrieved November 21, 2024, from https://edubirdie.com/examples/authentication-of-digital-image-based-holy-quran-using-arabic-ocr-string-matching/