Recognition of Off-line Handwritten Arabic Words


Recognition of Off-line Handwritten Arabic Words by Somaya A. S. Al-Ma'adeed, BSc, MSc. Thesis submitted to The University of Nottingham for the degree of Doctor of Philosophy, June 2004.

ABSTRACT

The main steps of document processing have been reviewed, especially those implemented on Arabic writing. The techniques used in this research, such as Vector Quantization (VQ), Hidden Markov Models (HMM), and Induction of Decision Trees (ID3), have been considered, as well as the pre-processing and feature extraction methods used for Arabic writing. Applications that involve pattern recognition usually require large sets of data. Since there are few Arabic databases available, none of which is of a reasonable size or scope, this research built the AHDB database in order to facilitate the training and testing of systems that are able to recognize unconstrained handwritten Arabic text [AHE02a] [AHE03a]. The approach used in this thesis for counting the most popular written Arabic words is a very useful step in the area of Arabic handwriting recognition. The recognition of Arabic characters extracted from words involves several stages, taking the Arabic words from slant correction through to segmentation into characters [AHE01], which are entered as inputs to HMM, ID3, or Multiple-HMM classifiers for recognition. First, an HMM is used to classify handwritten words [AHE02b]. Then a global classifier is used to recognize whole words. The last stage is to combine global and local classifiers to classify the Arabic words [AHE02c]. The main result is a new Multi-HMM approach proposed for handwriting recognition [AHE03b] [AHE04]. Finally, possible further work has been examined to consider where this approach to off-line handwriting recognition is leading. This work presents an off-line cursive Arabic word recognition system, which deals with samples from several writers.

ACKNOWLEDGMENTS

My sincere and deepest gratitude to Nottingham University, Faculty of Computer Science, and to my supervisors Professor Dave Elliman and Dr. Colin Higgins for always being supportive and encouraging throughout my thesis, and for their assistance in the preparation of this manuscript. Thanks to the Arabic writers who filled out the forms that form the core of the database I developed. Thanks also to Qatar University for sponsoring my study and research. Thanks to my mother and my father for always being there for me and constantly providing me with love and encouragement. They have done so much that a simple thanks to them will not suffice. In addition, special thanks to my husband, Sultan, for all his support during the long period that the thesis took up, to my wonderful sons for their patience, to my sisters for their continuously supportive phone calls, and to my brother who showed interest in my research, always asking when (not if) I will finish this thesis. Lastly I acknowledge the reader, who I hope will find the contents of this thesis useful and easy to read.

TABLE OF CONTENTS

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

Chapter 1: INTRODUCTION
  1.1 OPTICAL CHARACTER RECOGNITION
  1.2 THE HISTORICAL BACKGROUND TO OCR RESEARCH
  1.3 BASIC MODEL FOR PROCESSING THE CONCRETE DOCUMENT
  1.4 RECOGNITION STRATEGIES
  1.5 PROBLEM DEFINITION
    1.5.1 Difficulties from Characteristics of the Arabic Writing System
    1.5.2 Difficulties in Handwritten Arabic Characters and their Differences from Latin
    1.5.3 Off-line Versus On-line
  1.6 THE OBJECTIVES OF THIS RESEARCH
  1.7 CONTRIBUTION
  1.8 THE THESIS ORGANIZATION

Chapter 2: THEORY AND LITERATURE REVIEW
  2.1 INTRODUCTION
  2.2 SURVEY OF OFF-LINE HANDWRITTEN WORDS RECOGNITION
    2.2.1 Databases
    2.2.2 Data Capture
    2.2.3 Pre-processing
      The Binarization of Scanned Images
      Skew Detection
      Segmentation
    2.2.4 Feature Extraction
    2.2.5 Classification
    2.2.6 Post-processing
  2.3 OFF-LINE HMMS FOR AN HWR SURVEY
  2.4 ARABIC OCR USING HMM
  2.5 A SURVEY OF OFF-LINE HANDWRITTEN ARABIC WORDS RECOGNITION
  2.6 CONCLUSION

Chapter 3: METHODOLOGY: USEFUL TECHNIQUES
  3.1 OFF-LINE ARABIC WORDS RECOGNITION METHODS
    Feature Extraction Methods
    Segmentation Methods
    Recognition Methods
  3.2 VECTOR QUANTIZATION
    VQ Mathematic Definition
    Optimality Criteria
      Nearest Neighbour Condition
      Centroid Condition
  3.3 HIDDEN MARKOV MODEL (HMM)
    Implementation Strategies
    HMM Theory
    Scoring Problem
    Training Problem
    Recognition Phase
    Post-processing
  3.4 ID3 CLASSIFIER
  3.5 CONCLUSION

Chapter 4: A DATABASE FOR ARABIC HANDWRITTEN TEXT RECOGNITION RESEARCH
  4.1 A NEW ARABIC HANDWRITTEN DATABASE (AHDB)
  4.2 ARABIC WORD COUNTING
  4.3 FORM DESIGN
  4.4 DATA STORING
  4.5 DATA RETRIEVAL
  4.6 CONCLUSION

Chapter 5: A PRE-PROCESSING SYSTEM FOR THE RECOGNITION OF OFF-LINE ARABIC HANDWRITTEN WORDS
  5.1 OVERVIEW
  5.2 PRE-PROCESSING STEPS
    Image Loading
    Slope Correction
    Slant Correction
    Thinning
    Normalization
  5.3 FINDING HANDWRITING FEATURES
    Outer Contour and Loops
    Locating Dots
    Locating Endpoints
    Junctions
    Turning Points
    Right and Left Disconnection
    Detect Strokes
    Pixel Distribution
    Moments Features
    Zonal Features
  5.4 SEGMENTATION STAGE
  5.5 CONCLUSION

Chapter 6: RECOGNITION OF OFF-LINE HANDWRITTEN ARABIC WORDS USING A HIDDEN MARKOV MODEL SYSTEM
  6.1 OVERVIEW
  6.2 PRE-PROCESSING
  6.3 FEATURES USED
  6.4 HMM CLASSIFIER
    States and Symbols for Handwritten Words
    The Calculation of Model Parameters
  6.5 THE SCORING PROBLEM
  6.6 THE TRAINING PROBLEM
  6.7 RECOGNITION PHASE
  6.8 CONCLUSION

Chapter 7: MULTIPLE HIDDEN MARKOV MODELS CLASSIFIER
  7.1 ID3 CLASSIFIER
    Training and Testing Sets
  7.2 MULTIPLE HIDDEN MARKOV MODELS
    Global Classifier
    Local Classifier
  7.3 LOCAL GRAMMAR
  7.4 CONCLUSION

Chapter 8: EXPERIMENTAL RESULTS
  8.1 EXPERIMENTAL TOOLS
  8.2 SOFTWARE USED
  8.3 EXPERIMENTAL DETAILS
    8.3.1 Forms Scanning
    8.3.2 Data Capture and Image Loading
    8.3.3 Pre-processing
    8.3.4 Baseline Detection
    8.3.5 Slant and Slope Correction
    8.3.6 Thinning
    8.3.7 Feature Extraction
    8.3.8 Segmentation
    8.3.9 Normalization
  8.4 CLASSIFICATION USING HMM
  8.5 CLASSIFICATION USING ID3
  8.6 CLASSIFICATION USING MULTIPLE HMM
  8.7 CONCLUSION OF THE EXPERIMENTAL RESULTS

Chapter 9: CONCLUSIONS AND SUGGESTIONS FOR FUTURE RESEARCH
  9.1 CONCLUDING REMARKS
  9.2 CONTRIBUTION TO ARABIC HANDWRITTEN RECOGNITION
  9.3 FUTURE WORK
    The Database
    Pre-processing
    Feature Extraction
    Classification
    Post-processing
  9.4 CONCLUSION

Bibliography
Appendix A
Appendix B

LIST OF FIGURES

Figure 1-1: Basic Model for Document Processing
Figure 1-2: Different shapes of the Arabic letter ( - 'Ain) in: (a) beginning, (b) middle, (c) final and (d) isolated
Figure 1-3: Some Arabic characters that differ only by the position and number of associated dots
Figure 1-4: A handwritten word that can be problematic to segment
Figure 1-5: Three Arabic words with constituent sub-words: (a) flower, (b) Maqdess, (c) Cairo
Figure 1-6: Different Arabic sentences in different styles
Figure 1-7: Arabic Ligatures
Figure 1-8: Ligatures found in the Traditional Arabic font
Figure 2-1: Steps involved in the Optical Character Recognition System
Figure 3-1: Vertical and horizontal scanning of the character: (a) character, (b) horizontal scanning, (c) vertical scanning
Figure 3-2: Major segments of a character
Figure 3-3: An example of segmentation of an Arabic word into characters: (a) Arabic word, (b) histogram, (c) word segmented into characters
Figure 3-4: An example of an Arabic word and its segmentation into characters: (a) Arabic word, (b) histogram, (c) word segmented into characters
Figure 3-5: Segmented Arabic words and the corresponding contour heights, for the words (a) Mahal and (b) Alalamy
Figure 3-6: An example of a segmented sub-word, with start point A, endpoint E, and horizontal lines 2-3 and
Figure 3-7: Example of an Arabic word and different techniques of segmentation
Figure 4-1: One form filled in by one writer
Figure 4-2: Handwritten Arabic words in the AHDB written by three different writers (a, b, and c)
Figure 4-3: Examples containing sentences used in cheque writing in Arabic
Figure 4-4: Examples of free handwriting
Figure 5-1: The pre-processing operations
Figure 5-2: Different examples of pre-processing stages: (a) baseline detection, (b) slant and slope correction, (c) feature extraction, (d) width normalization
Figure 5-3: (a) The word before the operation of slope correction. (b) The word after its slope is corrected horizontally. (c) The same word after slant correction. (d) The operation of thinning
Figure 5-4: The two baselines of the word - five: (a) the second baseline, (b) the main baseline
Figure 5-5: Two words with the features written on them
Figure 5-6: The blobs of the Arabic word ahad
Figure 5-7: Four turning points in different directions: (a) top, (b) down, (c) left, and (d) right
Figure 5-8: The four stroke directions detected in this research for an Arabic word: (a) horizontal, (b) vertical, (c) positive or back diagonal, (d) negative or diagonal
Figure 5-9: Arabic word - five after (a) contour extraction and thinning, (b) width normalization, and (c) segmentation
Figure 5-10: Horizontal histogram and segmentation of words into frames
Figure 6-1: Feature vector for HMM classifier
Figure 6-2: Training and testing phases in the HMM classifier
Figure 6-3: Examples of feature vectors in different Arabic words
Figure 7-1: The Arabic word nine written in different allographs and styles
Figure 7-2: The Arabic word one written in different allographs and styles
Figure 7-3: eighty written in different allographs and styles
Figure 7-4: fifty written in different allographs and styles
Figure 7-5: hundred written in different allographs and styles
Figure 7-6: ninety written in different allographs and styles
Figure 7-7: no written in different allographs and styles
Figure 7-8: ID3 classifier
Figure 7-9: Global features vector
Figure 7-10: Recognition of off-line handwritten Arabic words using Multiple Hidden Markov Models
Figure 7-11: A word recognition using local and global features
Figure 8-1: Stages of this research
Figure 8-2: Colour dropout using software: (a) scanned image, (b) after applying blue channel mode, (c) the image after a stamp filter
Figure 8-3: Colour dropout using hardware
Figure 8-4: Words with touching characters
Figure 8-5: Dot above the last left character noon and below the real baseline
Figure 8-6: Over-segmented words
Figure 8-7: Error from file transformation
Figure 8-8: Wrong baseline for different Arabic words
Figure 8-9: Dots inside loops in the character waw in the word Wahed (one)
Figure 8-10: Arabic letter Alef mistakenly classified as a complementary character
Figure 8-11: Complementary characters above the Arabic letter Alef
Figure 8-12: Example of overwritten dots or unwritten dots in the word Twenty
Figure 8-13: ID3 tree to classify words into four groups
Figure 8-14: The relation between words, groups, and the percentage of each word in each group for Table
Figure 8-15: Recognition rate decreases as the number of iterations increases for all groups (codebook = 90 and twenty states)
Figure 8-16: Recognition rate and codebook size relation for groups two to eight when the number of iterations is constant

LIST OF TABLES

Table 1-1: Arabic alphabet in all its forms
Table 1-2: Supplementary characters ( - Hamza and ~ - Madda) and their position with respect to the main character ( - Alif, - Waow and - Ya)
Table 1-3: Diacritical markings in Arabic writing
Table 1-4: Example of an Arabic word with different diacritics indicating different meanings
Table 1-5: Differences between Latin and Arabic Writing
Table 3-1: A comparison between PD-HMM and MD-HMM strategies
Table 4-1: The twenty most used words in written Arabic, with their meanings in English
Table 5-1: The curve categorization using the coordinates
Table 6-1: Arabic words without dots and other diacritical markings
Table 7-1: Group names and a list of each group
Table 8-1: Results of a series of tests using HMM
Table 8-2: Recognition rate basic statistics
Table 8-3: ID3 classifier results
Table 8-4: The relation between words, groups, and the percentage of each word in each group for some words in the dictionary
Table 8-5: The recognition rate for the global Word Feature Recognition Engine
Table 8-6: Recognition rate for each group and the total recognition rate
Table 8-7: The mean of 20 recognition rates for group six resulting from different states and codebook sizes
Table 8-8: The std. deviation of 20 recognition rates for group six resulting from different states and codebook sizes
Table 8-9: The mean of 20 recognition rates for group two resulting from different states and codebook sizes
Table 8-10: The std. deviation of 20 recognition rates for group two resulting from different states and codebook sizes
Table 8-11: The mean of 20 recognition rates resulting from different states and codebook sizes for group three
Table 8-12: The std. deviation of 20 recognition rates for group three resulting from different states and codebook sizes
Table 8-13: The mean of 20 recognition rates resulting from different states and codebook sizes for group four
Table 8-14: The std. deviation of 20 recognition rates for group four resulting from different states and codebook sizes
Table 8-15: The mean of 20 recognition rates resulting from different states and codebook sizes for group five
Table 8-16: The std. deviation of 20 recognition rates for group five resulting from different states and codebook sizes
Table 8-17: The mean of 20 recognition rates resulting from different states and codebook sizes for group seven
Table 8-18: The std. deviation of 20 recognition rates for group seven resulting from different states and codebook sizes
Table 8-19: The mean of 20 recognition rates resulting from different states and codebook sizes for group eight
Table 8-20: The std. deviation of 20 recognition rates for group eight resulting from different states and codebook sizes

Chapter 1: INTRODUCTION

The handwriting recognition problem arouses great interest in researchers, since there is a high level of ambiguity and complexity in such images, and because of the importance of Optical Character Recognition (OCR) in office automation and many other applications. Recognition of cursive handwritten text is one of the most difficult cases in the domain of OCR. However, the large number of potential applications makes it a very popular research subject. Much less research has been undertaken on the task of recognizing Arabic script, influenced perhaps by the lack of an international database in this field. The objective of this thesis is to provide a better way to recognise Arabic handwritten words. This chapter describes the concept of OCR and its importance. It provides an overview of document structures: both the geometric structure and the logical structure. In addition, there is a discussion of the algorithms used for word recognition. They are classified into three categories, namely the holistic approach, the analytic approach, and feature sequence matching. In section 1.5, the off-line Arabic

handwritten character recognition problem is defined. The particular problems of this application are a result of Arabic writing characteristics, the nature of Arabic handwriting, and the use of off-line recognition. This chapter also summarizes the thesis objective of building an off-line Arabic handwritten character recognition system. Since the proposed system involves several processing steps, it is useful to summarize the stages involved in optically handling a handwritten document, from pre-processing to post-processing. The optical character recognition system comprises five processing steps, namely data capture, pre-processing, feature extraction, classification, and post-processing. An outline of the research approach and the contribution points are discussed. Finally, there is a summary of how this thesis is organized.

1.1 Optical Character Recognition

What is Optical Character Recognition (OCR) and why do we need it? OCR is a process that attempts to turn a paper document into a fully editable form, which can be used in word processing and other applications as if it had been typed through the keyboard. The constant development of computer tools leads to the requirement for simpler interfaces between man and computer. The automatic recognition of handwritten text could be applied in many areas, for example form-filling applications (including handwritten postal addresses, cheques, insurance applications, mail order forms, tax returns, credit card sales slips, customs declarations, and many others). All these applications generate handwritten script from an unconstrained population of writers and writing, which must subsequently be processed off-line by computers [ND94].

1.2 The Historical Background to OCR Research

Character recognition is an area of pattern recognition that has been the subject of considerable research during the last three decades [Na68]. Since the 1960s, much research on document processing has been carried out using OCR [AA94]. Surveys of the underlying techniques have been made by several researchers [Ma86] [IOO91] [MSY92] [Sa94]. Studies of automatic text segmentation and discrimination have been widely conducted since the early 1980s [AWS81][WCW82]. Since then, the application of document image analysis has been growing rapidly due to developments in hardware enabling processing to be performed at a reasonable cost and speed [OK95]. Today, effective OCR packages can be bought for as little as $100 [CL96]. However, these are only able to recognize high quality printed text documents or neatly written hand-printed text [CL96]. To date, many methods have been proposed and many document processing systems have been described. About 750 papers have been presented at the International Conferences on Document Analysis and Recognition (ICDAR 97, ICDAR 99 and ICDAR 01) [ICDAR97, ICDAR99, ICDAR01]. Nine articles have been published in the special issue of the journal Machine Vision and Applications concerned with document analysis and understanding. Many papers have been published describing new achievements in research in these areas [IWF02, ICDAR03]. Several books on these topics have also been published [DI97, OK95, BWB94]. The current focus of OCR research is on systems that can handle documents that are not well recognized by current systems. As improvements in technology continue, document-processing systems will become increasingly common. The automatic acquisition of knowledge from documents such as technical reports, government

files, newspapers, books, journals, magazines, letters, and bank cheques using OCR has become a commercial imperative.

1.3 Basic Model for Processing the Concrete Document

In the Romance and Anglo-Saxon languages there are two types of document, machine-printed text and handwritten text; the latter may be further divided into hand-printed words and cursive words. This research concentrates on the automatic recognition of handwritten Arabic text, which most closely resembles cursive Latin handwriting. The objective of automatic document processing is to recognize text, graphics and digital image pictures and extract the desired information, in an acceptable format for humans [Ob94]. The following principal concepts were proposed in a basic model for processing the concrete document [MSY92]. A concrete document is considered to have two structures: a geometric (layout) structure and a logical structure. The geometric structure represents the objects of a document based on the presentation, and the connections among these objects. The logical structure represents the objects of a document and the connections among these objects, as they would be classified by a person. Document processing is divided into two phases: document analysis, which refers to the extraction of the geometric structure from a document; and document understanding, which refers to mapping the geometric structure into a logical structure.
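To make this two-phase model concrete, the following is a minimal sketch of how the two structures and the analysis/understanding phases might be represented in code. It is an illustration only, not part of the thesis; all class and function names here are assumptions introduced for the example.

```python
from dataclasses import dataclass, field

@dataclass
class GeometricStructure:
    """Layout objects as they appear on the page (the output of document analysis)."""
    blocks: list = field(default_factory=list)    # e.g. {"type": "text", "bbox": (x, y, w, h)}

@dataclass
class LogicalStructure:
    """Objects labelled as a reader would classify them (the output of document understanding)."""
    elements: list = field(default_factory=list)  # e.g. {"role": "paragraph", "bbox": ...}

def analyse(page_image) -> GeometricStructure:
    # Document analysis: extract the geometric (layout) structure from the image.
    # A real system would locate text blocks, figures and tables here.
    return GeometricStructure(blocks=[{"type": "text", "bbox": (0, 0, 100, 20)}])

def understand(layout: GeometricStructure) -> LogicalStructure:
    # Document understanding: map each layout object onto a logical role.
    roles = [{"role": "paragraph", "bbox": b["bbox"]}
             for b in layout.blocks if b["type"] == "text"]
    return LogicalStructure(elements=roles)

logical = understand(analyse(page_image=None))
```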

Once the logical structure has been captured, AI or other techniques can attempt to decode its meaning. In some cases, the boundary between the analysis and understanding phases is not clear. For example, the logical structure of bank cheques may also be found using an analysis by knowledge rules. In Figure 1-1, the relationships among the geometric structure, logical structure, document analysis and document understanding are depicted.

Figure 1-1: Basic Model for Document Processing

1.4 Recognition Strategies

Word recognition algorithms may be classified into the following categories:
- The holistic approach
- The analytic approach
- Feature sequence matching

The holistic approach generally utilizes shape features extracted from the word image in an attempt to recognize the entire word. It is usually

accepted that holistic methods are feasible only when a small number of words are to be recognized. The analytic approach segments the word image into primitive components (typically characters). Character segmentation prior to recognition is called external character segmentation, while concurrent segmentation and recognition is called internal character segmentation. Feature sequence matching extracts features sequentially and derives the word identity from this sequence. For a review of statistical pattern recognition see [JDM00]. The Hidden Markov Model (HMM) has been used widely for recognition based on feature sequences. It should be noted that recognition based on HMMs is often classified as a holistic approach [Na92].

1.5 Problem Definition

The problem of recognizing off-line Arabic handwritten words is important in office automation, as well as in many other applications. Using the analytic approach to extract the features contained in Arabic characters seems to be most appropriate due to the nature of Arabic handwritten characters. A handwritten Arabic character has no fixed pattern, but it does have fixed geometrical features. The shapes of handwritten Arabic characters differ between writers, but the geometrical features are always the same. An important difference between Arabic handwritten characters and Latin ones is the existence of dots. Dots differentiate between characters with the same geometry. Another difference is that there is not one baseline on which the characters are written, but two or more baselines, which makes recognition more difficult. This research deals with the recognition of off-line handwritten Arabic characters. As described by the title of this thesis, the problem of Arabic handwriting recognition is a result of many factors, which can be summarized as follows:

- The thesis studies cursive handwritten Arabic characters, which differ from the machine-printed case (section 1.5.1).
- The study also addresses Arabic writing, which differs from English writing in many ways. Readers can see the differences between English and Arabic writing in the following sections.
- It deals with off-line recognition, which differs in important respects from on-line recognition (section 1.5.3).

1.5.1 Difficulties from Characteristics of the Arabic Writing System

The main characteristics of Arabic writing can be summarized as follows:

Arabic text (machine printed or handwritten) is written cursively and in general from right to left. Arabic letters are normally connected to the baseline. Arabic writing uses an alphabet of 28 basic letters, ten Hindi numerals, punctuation marks, spaces, and special symbols.

Figure 1-2: Different shapes of the Arabic letter ( - 'Ain) in: (a) beginning, (b) middle, (c) final and (d) isolated

An Arabic letter might have up to four different shapes, depending on its relative position in the word, and this increases the number of classes from 28 to 100 (Table 1-1). For example, the letter ( - 'Ain) has four different shapes: one at the beginning of the word, one in the middle, one at the end of the word, and one in isolation as a standalone character. These four shapes of the letter ( - 'Ain) are shown in Figure 1-2. Furthermore, there are two supplementary characters that operate on vowels to create a kind of stress (Hamza) and elongation (Madda); the latter operates only on the character Alif (Table 1-2). The character Lam-Alif is created as a combination of two characters, Lam and Alif, when the character Alif is written immediately after the character Lam. This new character, together with the combinations with Hamza and Madda, increases the number of classes to 120. This is made clear in Table 1-2.

Table 1-1: Arabic alphabet in all its forms (the isolated, start, middle and end shapes of each of the 28 basic letters: Alif, Ba, Ta, Tha, Jeem, Hha, Kha, Dal, Thal, Ra, Zay, Seen, Sheen, Sad, Dhad, Tta, Za, Ain, Gain, Fa, Qaf, Kaf, Lam, Meem, Noon, Ha, Waow and Ya)

Table 1-2: Supplementary characters ( - Hamza and ~ - Madda) and their position with respect to the main character ( - Alif, - Waow and - Ya), shown in isolated, start, middle and end forms
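To see how the positional variants multiply the number of recognition classes, the short sketch below enumerates one class per valid (letter, position) pair. It is only an illustration under my own assumptions; the letter subset, the rule applied to non-joining letters, and the class numbering are not taken from the thesis.

```python
POSITIONS = ("isolated", "start", "middle", "end")

# Illustrative subset of the 28 basic letter names from Table 1-1.
LETTERS = ["Alif", "Ba", "Jeem", "Dal", "Ra", "Ain", "Lam", "Waow", "Ya"]

# Letters that never join to a following letter appear only in
# isolated and end shapes (they also break words into sub-words).
NON_JOINING = {"Alif", "Dal", "Thal", "Ra", "Zay", "Waow"}

def shape_classes(letters):
    """Assign one class index to every valid (letter, position) pair."""
    classes = {}
    for letter in letters:
        forms = ("isolated", "end") if letter in NON_JOINING else POSITIONS
        for position in forms:
            classes[(letter, position)] = len(classes)
    return classes

classes = shape_classes(LETTERS)
print(len(classes))                  # 28 classes for this 9-letter subset
print(classes[("Ain", "middle")])    # class index of the medial form of 'Ain
```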

Table 1-3: Diacritical markings in Arabic writing (single diacritics, double diacritics, Shadda, and combined diacritics)

In the representation of vowels, Arabic uses diacritical markings (Table 1-3). The presence or absence of vowel diacritics indicates different meanings of the same word. If the word is isolated, diacritical marks are essential to distinguish between the two or more possible meanings. Table 1-4 gives an example of an Arabic word with different diacritics indicating four different meanings. If diacritical markings occur in a sentence, contextual information inherent in the sentence can be used to infer the appropriate meaning. In this research, the issue of vowel diacritics is not addressed, since it is more common for Arabic writing not to employ these diacritics.
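Since diacritics are ignored in this research, any text labels attached to word images would normally be normalized by stripping them. The snippet below is a generic sketch of such a step using Unicode combining classes; it is not a procedure from the thesis.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic vowel diacritics (fatha, damma, kasra, tanween, shadda, sukun)
    by dropping combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The fully vowelled form of 'he studied' reduces to the bare consonant skeleton.
print(strip_diacritics("دَرَسَ"))   # -> "درس"
```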

Table 1-4: Example of an Arabic word with different diacritics indicating different meanings (the same consonant skeleton can read as 'he studied', 'a lesson', 'he taught' or 'it was studied')

Figure 1-3: Some Arabic characters that differ only by the position and number of associated dots

Different Arabic characters may have exactly the same shapes, and so are distinguished from each other by the addition of complementary characters (the position and number of the associated dots). Hence, any thinning algorithm needs to deal efficiently with these dots without changing the identity of the character (Figure 1-3). In the segmentation process of a handwritten Arabic word, the characters are more difficult to segment if the dots are not placed exactly under or above the character body (Figure 1-4).

Arabic writing is cursive and words are separated by spaces. Some Arabic characters cannot be connected to the succeeding character. Therefore, if one of these characters exists in a word, it divides that word into two sub-words. These characters appear only at the tail of a sub-word, and the succeeding character forms the head of the next sub-word (Figure 1-5).

Figure 1-4: A handwritten word that can be problematic to segment

Figure 1-5: Three Arabic words with constituent sub-words: (a) flower, (b) Maqdess, (c) Cairo

Arabic writing contains many fonts and writing styles. The letters are overlaid in some of these allographs and styles. Furthermore, characters of the same font have different sizes. Hence, segmentation based on a fixed size or width cannot be applied to Arabic [Ob94]. In Arabic writing it is sometimes difficult to separate words from each other, especially when people write calligraphically. See the following example (Figure 1-6), taken from [Sa03].

Figure 1-6: Different Arabic sentences in different styles
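Returning to the sub-word property described above: the division into sub-words follows directly from the small set of letters that never join to the letter after them. The sketch below illustrates that rule on a Unicode string; it is my own illustration and not the segmentation method developed later in this thesis.

```python
# Letters that do not connect to a following letter: Alif (and its Hamza/Madda
# variants), Dal, Thal, Ra, Zay and Waow. A sub-word ends after any of them.
NON_JOINING = set("اأإآدذرزو")

def split_subwords(word: str):
    """Split an Arabic word into its connected components (sub-words)."""
    subwords, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINING:
            subwords.append(current)
            current = ""
    if current:
        subwords.append(current)
    return subwords

print(split_subwords("القاهرة"))   # Cairo -> ['ا', 'لقا', 'هر', 'ة']
```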

Figure 1-7: Arabic Ligatures

Ligatures are combinations of two, or sometimes three, characters into one shape (see Figure 1-7). Ligature selection is dependent not only on the characters themselves but also on the selected Arabic font. Some allographs do not use ligatures at all and others may have as many as 200 different ligatures defined. Note also that ligatures affect the positioning of diacritical marks [AG02]. Figure 1-8 lists ligatures found in one Arabic font.

Figure 1-8: Ligatures found in the Traditional Arabic font

1.5.2 Difficulties in Handwritten Arabic Characters and their Differences from Latin

Arabic handwritten characters suffer not only from scale, location and orientation variation, but also from person-dependent deformations. These variations are neither predictable nor can they be formulated mathematically. Therefore, research on handwritten character recognition has always been challenging. However, the variation problem needs to be solved before recognition can be used to automate applications such as handwritten mail sorting, handwritten cheque processing, and so on. All of these applications require both high recognition rates and high reliability. In

the system described in Chapter 4, some trials for solving the problem of Arabic handwriting recognition are implemented in the pre-processing steps. The basic problems of handwriting recognition are common to all languages, but the special features, constraints, etc. of each language also need to be considered. It seems that Arabic and cursive connected English handwriting are similar, but researchers [AA92, AH95] have found many differences in the recognition of each handwritten language. Some of these are listed in Table 1-5.

1.5.3 Off-line Versus On-line

Handwritten character recognition systems can be divided into two broad types:
- Optical character readers (OCR): a whole page of handwritten, or handwritten and machine-printed, text (e.g. forms) is processed
- On-line character recognition (OLCR): characters are converted and recognized interactively as they are formed

Abuhaiba et al. [AHD94] mentioned that on-line recognition is less difficult than off-line recognition, since the temporal information in the script is available. Also pen speed and even pressure information may be available. For a comprehensive survey of on-line and off-line handwriting recognition see [PS00].

Table 1-5: Differences between Latin and Arabic Writing

Direction: English is written from left to right; Arabic from right to left.
Connection: in English, each character is in general connected to the next character with diagonal strokes; in Arabic, letters are normally connected to the baseline with horizontal strokes.
Character versions: English characters have few shape variations; an Arabic letter might have up to four different shapes, depending on its relative position in the word.
Features: English writing has specific geometrical features; Arabic writing has a unique feature for each character, especially curves and dots.
Segmentation: in English, any analytical segmentation approach can segment the handwriting into different letters or sub-letters; in Arabic, the letters or segmented sub-letters are different from the segments in English.

1.6 The Objectives of this Research

This research deals with the pre-processing steps and classification of off-line handwritten Arabic words. Some of the methods applied here to handwritten Arabic, such as using an HMM after segmenting words into frames, have not been applied before, as can be seen from the literature survey. The feature extraction process includes locating endpoints, junctions, turning points and loops, generating frames, and detecting strokes. Further features, such as moments, are also extracted from the characters. Future work, as well as suggestions to improve the overall accuracy of the systems, is discussed at the end of the thesis. Before discussing the proposed system, it is necessary to review briefly the nature of handwritten Arabic characters and, hence, the challenges that must be faced when attempting automatic recognition. The thesis objectives can be summarized as follows:
- A survey of off-line handwritten Arabic character recognition
- A review of the difficulties involved in the recognition of Arabic handwritten characters
- Since there is no well-known database containing Arabic handwritten words for researchers to test against, one of the objectives has been to build such a database. The words were collected from several writers
- Building a pre-processing system for recognizing off-line handwritten words. First, the system involves a new implementation of slant correction techniques for off-line handwritten Arabic words.

Second, implementing a slope correction procedure for the first time and, finally, thinning the word into a skeleton
- Constructing a feature extraction process, implemented by extracting geometrical features from each zone of the word, which represent the characters present
- Implementing a segmentation procedure that divides any word into characters or sub-characters using a histogram calculation, and that also extracts other features such as moments
- Building a suitable codebook using Vector Quantization
- Building the HMM for the body of Arabic words
- Training the system
- Testing the system
- Developing a lexicon reduction operation, through a global recognition system which uses a simple classifier
- Further training and testing of the system
- Presentation of the results and conclusions from the experiments

1.7 Contribution

An important contribution of this research lies in the provision of a much needed database. This offers practical benefits for researchers on handwritten Arabic, by providing a testbed to facilitate training and testing.

This research develops a new database for the collection, storage and retrieval of Arabic handwritten text (AHDB), which supersedes previous databases both in terms of size and in the number of different writers involved. With this research the most popular words in Arabic writing have been identified for the first time, using an associated program. A second contribution is to the field of pre-processing and feature extraction: a novel set of handwriting features is combined and tested in the classification stage. A third contribution is in the field of classification: a new HMM approach is used to train and test Arabic handwritten words taken from around 100 different writers. A fourth contribution is the use of a global approach, which is an inexpensive method of feature classification that avoids the problematic segmentation stage. The combination of global and local features to recognize words also improves the recognition rate and has not been used previously in Arabic word recognition.

1.8 The Thesis Organization

As previously mentioned, this chapter describes the concept of OCR and its importance in office automation and other applications, and gives a brief general background of OCR research. It summarizes the basic model for processing any document, gives an overview of the two basic types of recognition system, namely off-line optical character readers (OCR) and on-line character recognition (OLCR), and discusses the nature of handwritten Arabic characters and, hence, the problems that can be faced when recognizing them automatically (optically). The main characteristics of the Arabic writing system and its difficulties are discussed. The chapter also summarizes the thesis objective of building an off-line Arabic handwritten character recognition system. The general

approach of this research is described and the contribution of the work described in this thesis is evaluated. Chapter 2 discusses the steps involved in an OCR system, which are summarized as data capture, pre-processing (binarization of scanned images, skew detection, segmentation), feature extraction, classification, and post-processing. It also surveys existing systems and research results in this field. Since this research uses HMMs, a survey of HMMs for handwriting recognition is presented first, followed by a survey of HMMs used in Arabic OCR. The chapter closes with a review of some of the previous trials in the field of off-line handwritten Arabic character recognition. Chapter 3 reviews useful techniques used in research on the automatic recognition of off-line handwritten Arabic characters (feature extraction methods, segmentation methods, and recognition methods) and then discusses the three main techniques used in this research: vector quantization, HMM, and the ID3 classifier. In Chapter 4 the generation of a database of off-line Arabic hand-printed words, collected from the handwriting of more than 100 writers, is described. That database is one of a kind in Arabic handwriting and is a very useful resource for Arabic handwriting research. Also in Chapter 4, the most used words in Arabic writing are counted for the first time. Chapter 5 describes the operation of the complete pre-processing system for the recognition of a single handwritten Arabic word, from the scanned document to the output of a segmented and connected word. Chapter 6 describes the recognition of handwritten Arabic characters classified by HMMs. In Chapter 2, there were some implementations of

HMMs with Arabic OCR. Those trials do not include the implementation of HMMs on handwritten Arabic words. This chapter includes the implementation of HMMs on Arabic handwriting. Chapter 7 discusses a lexicon reduction system, and further classification using different hidden Markov models. The overall engine resulting from this combination of a global feature scheme with an HMM module is a more capable system. Chapter 8 presents the experimental results, discussing the details of the experiments done throughout this thesis, as well as the results. Chapter 9 presents the conclusions of this research. The objective of this concluding chapter is to provide an overview of the research, to analyze some useful development opportunities in the research and to offer some suggestions about how future research on related topics could be carried out.

Chapter 2: THEORY AND LITERATURE REVIEW

This chapter discusses the steps involved in OCR in general, and surveys the systems and research trials in this field. The steps involved in developing an OCR system include the construction of a testing database, data capture, pre-processing (binarization of scanned images, skew detection, segmentation), feature extraction, classification, and post-processing. Training using prior data is described, followed by a description of past trials on the recognition of handwritten words in general, and then the state of the art in the recognition of handwritten Arabic text. The remainder of the chapter briefly reviews research that has greatly influenced the evolution of handwriting recognition, especially that using HMMs, and then surveys individual approaches to the automatic recognition of off-line handwritten Arabic characters. Most of the published work on the recognition of off-line handwritten Arabic characters assumes that the characters are already segmented. However, this research assumes that the word is not segmented into characters, as the Arabic characters cannot be written separately.

2.1 Introduction

While the early experimental OCR systems were often rule-based, by the 1980s these had been completely replaced by systems based on statistical pattern recognition. For clearly segmented printed materials, such techniques offer virtually error-free OCR for the most important alphabetic systems, including variants of the Latin, Greek, Cyrillic, and Hebrew alphabets. However, when the number of symbols is large, as with the Chinese or Korean writing systems, or the symbols are not separated from one another, as in Arabic or Devanagari text, OCR systems are still far from the error rates achieved by human readers, and the gap between the two is also evident when the image quality is compromised, for example with fax transmission. Until these problems are resolved, OCR is unable to play the central role in the transmission of cultural heritage to the digital age that it is often assumed it can. In the recognition of handprint, algorithms with successive segmentation, classification, and identification (language modelling) stages are still the most successful. For cursive handwriting, HMMs that make segmentation, classification, and identification decisions in parallel have proved to be superior. However, their performance still leaves much to be desired, because they do not necessarily synchronize the spatial and temporal aspects of the written signal (that is, discontinuous constituents arising, for example, when crossing t's and dotting i's), and because the inherent variability of handwriting is far greater than that of speech, to the extent that we often see illegible handwriting but rarely hear unintelligible speech. A comprehensive reference for cursive machine-print is Bazzi et al. (1999) [BSM99]. The state of the art in handwriting recognition is closely tracked by the International Workshop on Frontiers of Handwriting Recognition

(IWFHR) [IWF02]. For language modelling in OCR see Kornai (1994) [Ko94]. A good general introduction to the problems of page decomposition is offered by O'Gorman and Kasturi (1995) [OK95], and to OCR in general by Bunke and Wang (1994) [BWB94]. Contributions to document image analysis from about one hundred papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are summarized in [Na00]. In the next section a general review of the trials done on off-line handwriting recognition is given, and section 2.3 reviews papers published on off-line handwriting recognition using HMMs.

2.2 Survey of Off-line Handwritten Words Recognition

An OCR system consists of the following processing steps [OK95]:
- Data capture: grey-scale scanning at an appropriate resolution (typically dpi)
- Pre-processing (pixel-level processing), which constitutes the following:
  o Binarization (two-level thresholding), using a global or a locally adaptive method
  o Determining the skew (any tilt at which the document may have been scanned)
  o Document layout analysis: finding columns and paragraphs; line, word, and character segmentation: extracting text lines, words, and characters

- Feature extraction
- Classification
- Contextual verification, or post-processing

Figure 2-1: Steps involved in the Optical Character Recognition System
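To show how these stages connect, here is a hedged skeleton of the pipeline in Figure 2-1. Every function body is a placeholder standing in for the corresponding stage; none of it is the implementation used in this thesis.

```python
import numpy as np

def binarize(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Pixel-level processing: turn a grey-level page into a two-level image."""
    return (gray < threshold).astype(np.uint8)          # 1 = ink, 0 = background

def deskew(binary: np.ndarray) -> np.ndarray:
    """Placeholder for skew correction (section 2.2.3.2)."""
    return binary

def segment(binary: np.ndarray) -> list:
    """Placeholder for line/word/character segmentation."""
    return [binary]

def extract_features(region: np.ndarray) -> np.ndarray:
    """Placeholder feature vector, e.g. a simple pixel-distribution profile."""
    return region.mean(axis=0)

def classify(features: np.ndarray) -> str:
    """Placeholder classifier (HMM, decision tree, neural network, ...)."""
    return "?"

def recognise(gray_page: np.ndarray) -> list:
    """Chain the stages of Figure 2-1: capture -> pre-process -> features -> classify."""
    binary = deskew(binarize(gray_page))
    return [classify(extract_features(region)) for region in segment(binary)]

print(recognise(np.random.randint(0, 256, (64, 256))))
```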

In Figure 2-1, the first step, and a part of the second, may be termed geometric structure extraction, or document analysis. The following steps are termed document understanding, or mapping the geometric structure into a logical structure. In the following sub-sections, each of the steps involved in the OCR system (shown in the previous figure) is briefly discussed [TJT96,TLS96]. The next sections summarize research and trials conducted in each area of handwriting recognition.

2.2.1 Databases

A standard database of images is needed to facilitate research in handwritten text recognition. A number of existing databases for English off-line handwriting recognition are summarized in [MB99-MB02], and also in [Na92-JLG78]. For machine-printed Arabic, the Environmental Research Institute of Michigan (ERIM) has created a database of machine-printed Arabic documents. These images are extracted from typewritten and typeset Arabic books and magazines [Sc02].

2.2.2 Data Capture

Data capture is usually carried out by optically scanning a paper document. The resulting data is stored in a file of picture elements (pixels) that are sampled in a grid pattern throughout the document. In general, the grey-level scanning will be performed at a resolution of dots per inch. In this research, the researcher used

samples of Arabic handwritten data, and stored those samples in files to use them off-line later [OK95].

2.2.3 Pre-processing

Pre-processing is a step that improves the quality of feature extraction because it enhances the quality of the image. Pre-processing includes steps such as 1) binarization, 2) skew detection, 3) segmentation, and 4) dissection, which are discussed in the following sub-sections.

2.2.3.1 The Binarization of Scanned Images

The images resulting from the optical scanning process are usually in grey-scale format. There is a need to binarize these images, i.e. to turn them into a two-level format, to enable the subsequent processing steps. The two levels are usually black for character pixels and white for background pixels. Binary scanners, which combine digitization with thresholding, may not produce images with a clear separation between the foreground and background components. There are two ways to improve binarization. Firstly, one can empirically determine the best binarization setting each time the scanning process is to be done. Alternatively, one can start with the grey-scale images resulting from the digitization process and use methods for automatic threshold determination.
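As an example of the second route, Otsu's method is a widely used way to pick a single global threshold automatically from the grey-level histogram. The sketch below is a generic illustration of that method, not the binarization procedure used in this research.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the global threshold that maximizes the between-class variance
    of the grey-level histogram (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0          # mean of the dark class
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1     # mean of the bright class
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

def otsu_binarize(gray: np.ndarray) -> np.ndarray:
    t = otsu_threshold(gray)
    return (gray < t).astype(np.uint8)   # 1 = ink (dark), 0 = background

page = np.random.randint(0, 256, (100, 100)).astype(np.uint8)
print(otsu_threshold(page))
```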

2.2.3.2 Skew Detection

There have been many methods, or techniques, developed to perform the skew detection of an image [OK95]. Akiyama and Hajeta [AH90] developed an automated entry system for skewed documents, but this failed with documents that consist of text blocks, photographs, figures, charts, and tables. The Hough Transform can be applied in skew detection. Hinds, Fisher and D'Amato [TLS96] developed a document skew detection method using run-length encoding and the Hough Transform. In [HFA90], all skews were detected correctly for the thirteen test images of five different types of documents. Nakano, Shima, and Fuzisawa [NSF+90] proposed an algorithm for skew normalization of a document image based on the Hough Transform. These methods can handle documents with limited non-text regions. Ishitani [Is93] proposed a method to detect skew for document images containing a mixture of text areas, photographs, figures, charts and tables. Yu, Tang and Suen [YTS95] developed a method using least squares to handle the multi-skew problem. Approaches based on the horizontal projection histogram as used for Arabic text are presented by [OM02]. They present a method that is completely based on polygonally approximated skeleton processing. However, this method still does not work well with words containing isolated characters. It was also not tested on words with overlapping characters.
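A simple baseline that underlies several of these approaches is to try a range of candidate rotation angles and keep the one whose horizontal projection profile is sharpest, since level text lines produce pronounced peaks and valleys. The sketch below illustrates that idea; it is an assumption made here for illustration and is not the specific algorithm of any of the cited papers.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_correction_angle(binary: np.ndarray,
                              angles=np.arange(-10.0, 10.5, 0.5)) -> float:
    """Return the rotation (in degrees) that maximizes the variance of the
    horizontal projection profile, i.e. that levels the text lines."""
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)        # ink count per row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def correct_skew(binary: np.ndarray) -> np.ndarray:
    return rotate(binary, estimate_correction_angle(binary), reshape=False, order=0)
```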

2.2.3.3 Segmentation

The initial segmentation of characters can make the difference between very good and very poor results from an OCR process. The goal of a character segmentation algorithm is to partition a word image into regions, each containing an isolated complete character. In handwritten words, it is extremely difficult to segment characters without the support of recognition algorithms. Therefore, unlike the problem of machine-printed character recognition, handwritten character segmentation and recognition are closely coupled [LS96]. A character is a pattern that resembles one of the symbols that the system is designed to recognize. To determine such a resemblance, the pattern must be segmented from the document image. Researchers in the 1960s and 1970s observed that segmentation caused more errors than shape distortions in reading unconstrained characters, whether hand or machine printed. Three pure strategies for segmentation, plus numerous hybrid approaches that are weighted combinations of the three, are mentioned in [CL96] and are outlined below:
- The classical approach, in which segments are identified based on character-like properties. This process of cutting up the image into meaningful components is named dissection, referring to the decomposition of the image into a sequence of sub-images using general features.
- Recognition-based segmentation, in which the system searches the image for components that match classes in its alphabet.
- Holistic methods (or the global approach), in which the system seeks to recognize whole words, avoiding the need to segment them into characters.

2.2.4 Feature Extraction

Feature extraction is defined as the problem of extracting (from raw data) the information which is most relevant for classification purposes, with the aim of minimising the within-class pattern variability whilst enhancing the between-class pattern variability [DK82]. Feature extraction is a problematic topic, often art rather

than science, as it is difficult to predict in advance which measures will be useful. Features can be expensive to calculate [Ob94]. Feature extraction methods differ from one application to another. Methods that succeed in one application may not be very useful in another. Feature extraction is, however, an important step in an OCR system, although it is not independent of the other steps (see Figure 2-1). The choice of the feature extraction method limits or dictates the output of the pre-processing step. Some methods work on grey-level sub-images of single characters, whilst others work on solid four- or eight-connected symbols segmented from the binary raster image, thinned symbols or skeletons, or symbol contours. Further, the format of the extracted features must match the requirements of the chosen classifier. Graph descriptions or grammar-based descriptions of the characters are well suited to structural or syntactic classifiers. A literature survey of feature extraction methods is provided by [Al99]. A discussion of feature extraction techniques used in Arabic writing is given in Chapter 3.

2.2.5 Classification

Typical character classification systems extract several features from each character image and then, based on the similarity of the feature vector to the character class, attempt to classify it. Many well-known pattern classification methods, as well as syntactic and structural methods, have been used [MSY92, Na92]. There are different character classifier structures for isolated handwritten character classification, such as simple linear classifiers (one classifier for the whole problem), two-stage hierarchical classifiers, and tree classifiers. The results of experiments on handwritten characters show that combining multiple classifiers is an effective means of producing highly reliable decision classifiers. Intrinsically,

neural networks are suitable to serve as combination functions because they have the following three valuable characteristics. They:
- can infer subtle, unknown relationships from data;
- can generalize, meaning that they can still respond correctly to patterns that are only similar to the original training data;
- are non-linear; that is, they can solve some complex problems more accurately than linear techniques do.

Efforts have been made to improve the performance of OCR by using powerful character feature extraction and classification methods. Further improvement could be obtained by exploiting contextual information [Na92]. The classifiers used in such systems frequently output several classes for each input pattern and associate a degree of confidence with each label. A final class assignment is made after analyzing the outputs from a string of characters, rather than making a decision based on a single character. Because of the large shape variations in human handwriting, the recognition accuracy of cursive handwritten words is rarely satisfactory using a single classifier. In recent years several multiple-classifier combination techniques have been proposed to improve handwritten character recognition performance, and they have been shown to give promising results by a number of different researchers. [XKL02] used HMM classifiers with different architectures and different features to recognize the names of the months, giving an 85% recognition rate.
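To make the combination idea concrete, here is a hedged sketch of a simple weighted linear opinion pool over per-class confidences. It is purely illustrative; it is not the RWOP scheme of [WBR02] nor the classifier combination developed later in this thesis.

```python
import numpy as np

def combine_opinions(posteriors, weights):
    """Weighted linear opinion pool: each classifier contributes a probability
    vector over the candidate classes, weighted by how much it is trusted."""
    posteriors = np.asarray(posteriors, dtype=float)   # shape: (n_classifiers, n_classes)
    weights = np.asarray(weights, dtype=float)
    combined = weights @ posteriors                    # weighted sum per class
    return combined / combined.sum()

# Two classifiers scoring three candidate words; the second is trusted more.
hmm_scores = [0.5, 0.3, 0.2]
global_feature_scores = [0.2, 0.7, 0.1]
fused = combine_opinions([hmm_scores, global_feature_scores], weights=[0.4, 0.6])
print(fused, fused.argmax())   # fused confidences and the winning class index
```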

Wang et al. [WBR02] introduced a framework to combine the results of multiple classifiers and present an intuitive run-time-weighted opinion pool (RWOP) combination approach for recognizing cursive handwritten words. Promising results have been achieved with these methods. A study of multiple expert systems for handprinted numeral recognition was presented by [YNT97], and [LBK97] discusses handprint recognition. A multiple classifier approach to recognizing handwritten characters was studied by [RF97], whilst [Go97] discusses several techniques for a variety of practical tasks. [GB02] introduced new methods for the creation of ensembles based on feature selection algorithms, which are evaluated and compared to the existing approach using HMM. A review of previous trials on handwriting recognition using HMMs is discussed in further detail later in this chapter.

2.2.6 Post-processing

Post-processing systems are designed to correct OCR errors without human intervention. The well-known application of lexical knowledge for contextual post-processing compares dictionary-based (top-down) and statistical (bottom-up) approaches. The advantage of statistical over dictionary-based methods is computational time and memory utilization. On the other hand, lexical knowledge is more accurate when using a dictionary. Finally, the contextual post-processing of OCR results can also take into account knowledge of the context of words. From a linguistic point of view, a technique for contextual post-processing can incorporate a multitude of different knowledge sources, for example frequencies of single words and word combinations, compounds and idioms, and

linguistic structures such as phrases and sentences. An overview of possible knowledge sources for post-processing is presented in [Na92, Sr93].

2.3 Off-line HMMs for an HWR Survey

The discussion here is by no means exhaustive. There is a growing interest in applying HMMs to the problem of document analysis and recognition, and a large body of literature is being published in reputed journals and conference proceedings. Several promising research achievements have been presented at recent conferences and workshops. Results in HMM research for handwriting recognition can be grouped into the on-line and off-line cases. Work done in the field of off-line handwriting recognition is reviewed here and divided into two groups: that done on segmented handwritten words, and that done on non-segmented words. First, the Single Contextual Hidden Markov Model (SCHMM), which was introduced by [KHB98] to recognize hand-printed words, i.e. handwritten words that are naturally segmented, will be discussed. When the letters of the words are naturally segmented, and if these letters are identified as states [KHB98], there is a finite number of predetermined states, for example the 26 letters of the English alphabet. In general, handwritten words are usually not naturally segmented into letters, and a word segmentation algorithm is necessary for such a task. At present, no good segmentation algorithm exists which separates all the letters perfectly and without any spurious segmentation points. [CKZ94] use a more general framework that can be applied to cursive, non-cursive, naturally segmented or any other type of handwritten words. In this approach, a morphology-based segmentation algorithm is first used to divide the word image into a sequence of

segments, which could signify a whole, partial, or joint letter. The sequence of segments is then recognized by an HMM-type stochastic network which can deal with the problems of touching and broken characters. Since touching characters are not guaranteed or required to be split by the segmentation algorithm, the number of states, which depends on the training set, may rise to over 6,000 [CKZ94] for handwritten English words. Consequently, the state assignment for a large training set is rather complicated. Furthermore, this individual segment-based recognition system might never know how well a character is formed by combining several consecutive segments. Nevertheless, the scheme described in [CKZ94] has clearly shown that the application of HMMs to a large-vocabulary HWR problem is, indeed, much more complex than the one described in [NWF86]. To overcome the problems of the SCHMM system, a new system using a Continuous Density Variable Duration Hidden Markov Model (CDVDHMM) [CKS95] was proposed, with the help of an enhanced segmentation algorithm which splits all the touching characters (of course, this leads to more spurious segmentation points). The CDVDHMM defines the 26 letters in the alphabet as 26 different states, and this number is fixed and much smaller than in the previous system described by [CKZ94]. Consequently, the recognition speed is much improved. The implementation and experiments for the CDVDHMM system are discussed in [CKZ94]. Besides the CDVDHMM, another HMM-based system for HWR is the NEHMM (Non-Ergodic HMM) based system proposed by Chen and Kundu [CK94]. The NEHMM system follows the Model Discriminant HMM (MD-HMM) strategy (see section 3.3.1). However, the model parameters can be derived from the statistics of the CDVDHMM, and the NEHMM appears to perform better than the CDVDHMM strategy, albeit at a slower

48 2: Theory and Litrature Review 35 speed. A combination using both CDVDHMM and NEHMM can be considered as a trade-off between performance and speed. One problem with the VDHMM system is ensuring its reliable computation of model probabilities given the limited number of databases that are available at the present time. [CKZ94] has presented an interesting idea to avoid the computation of duration probabilities. By using oversegmentation, this scheme considers many different sub-sets of the segmentation points. Each sub-set leads to one distinct observation sequence. The recognition task is then to find the best segmentation; that is, find the sub-set that contains the correct segmentation points, and the associated optimal state sequence which corresponds to the letter sequence of the word. This philosophy is similar to that of VDHMM. The added complexity of computing the duration probability in each state is avoided in this approach by making a simple, but realistic, assumption which assumes that a character can be broken into (at most) four segments and, therefore, that there are four discrete duration probabilities for each state. Instead of assigning pre-computed duration probabilities to each state, only one duration will be picked (during recognition) by matching one, two, three and four consecutive segments to the symbols in the feature space, and finding the best match and its corresponding number of segments. In this way, the computation of duration probability in each state is avoided without sacrificing the advantage of VDHMM. However, the structure of the Viterbi algorithm used during recognition is substantially altered. The overall performance of this scheme, as expected, is quite similar to the VDHMM based word recognition system [CK95]. In the previous approaches (SCHMM, CDVDHMM, and NEHMM), the models are actually semi-hidden Markov models, i.e. the states of HMMs

49 2: Theory and Litrature Review 36 are transparent during training. Because re-estimation algorithms, such as the Baum-Welch product, do not preserve the correspondence of the states to their semantic meanings, it is not suitable for training the semi-hidden Markov models. Another approach is described using a Multi-Level Hidden Markov Model (MLHMM) which is a doubly embedded network of HMMs, whereby characters are modeled by an HMM and words by a higher-level HMM. The HMM belongs to the Model Discriminant HMM (MD-HMM) strategy at the character level. Since states are not assigned any semantic meaning at the character level, the re-estimation algorithm is applicable. For the word model, on the other hand, both the MD-HMM and One Path Discriminant (PD-HMM) strategies can be used. Another major difference between this new system and the previous approaches is the output-independence assumption of the HMM (see section 3.3.1). The details of this approach are described in [CKZ94]. There are many uncertainties in handwritten character recognition. Stochastic modelling is a flexible and general method for modelling such problems, and entails the use of probabilistic models to deal with uncertain or incomplete information. Cho et al. [CLK95] used another strategy for modelling and recognizing cursive words with HMM. In the proposed method, a sequence of thin vertical frames is extracted from the image, capturing the local features of the handwriting. By quantizing the feature vectors of each frame, the input word image is represented as a Markov chain of discrete symbols. A handwritten word is regarded as a sequence of characters and optional ligatures. Hence, the ligatures are also explicitly modelled. With this view, an interconnected network of character and ligature HMMs is constructed to model words of indefinite length. This model can ideally describe any form of handwritten words, including discretely spaced words, pure cursive words and unconstrained words of mixed styles. Experiments have been conducted with a standard database to

50 2: Theory and Litrature Review 37 evaluate the performance of the overall scheme. The performance of various search strategies based on the forward and backward score has been compared. Experiments on the use of a pre-classifier based on global features show that this approach may even be useful for large-vocabulary recognition tasks. Another method for off-line recognition of cursive handwriting using HMMs is implemented by Bunke et al. [BR95]. The features used in their HMMs are based on the arcs of skeleton graphs of the words to be recognized. An algorithm is applied to the skeleton graph of a word that extracts the edges in a particular order. Given the sequence of edges extracted from the graph, each edge is transformed into a ten-dimensional feature vector. The features represent information about the location of an edge relative to the four reference lines, its curvature and the degree of the nodes incident to the considered edge. The linear model was adopted as basic HMM topology. Each letter of the alphabet is represented by a linear HMM. Given a dictionary of fixed size, an HMM for each dictionary word is built by sequential concatenation of the HMMs representing the individual letters of a word. Training of the HMMs is done using the Baum-Welch Algorithm, while the Viterbi algorithm is used for recognition. An average correct recognition rate of over 98% on the word level has been achieved in experiments with cooperative writers using two dictionaries of 150 words each. Park et al. [PL96] present an efficient scheme for off-line recognition of large-set handwritten characters in the framework of stochastic models, the first-order HMMs. To facilitate the processing of unconnected patterns and patterns with isolated noise, four types of feature vectors, based on the regional projection contour transformation (RPCT), are employed. The character recognition system consists of two phases a training phase where multiple HMMs corresponding to different feature types of RPCT

51 2: Theory and Litrature Review 38 are built, and the classification phase, where the results of individual classifiers are integrated to produce the final recognition result, where each individual HMM classifier produces one score that is the probability of generating the test observation sequence for each character model. In this paper, several methods for integrating the results of different classifiers are considered so that better results can be obtained. In order to verify the effectiveness of the proposed scheme, the most frequently used 520 types of Hangul characters in Korea were considered in experiments. Experimental results suggest the proposed scheme is promising for the recognition of large-set handwritten characters with numerous variations. Other authors who have proposed a recognition system of constrained Handwritten Hangul (Korean character) and alphanumeric characters using discrete HMMs are Kim et al. [KP96]. Hangul shapes are classified into six types with fuzzy inference, and their recognition based on quantized features is performed by optimally ordering features according to their effectiveness in each class. Constrained alphanumerics recognition is also performed using the same features employed in Hangul recognition. The forward-backward, Viterbi and Baum-Welch re-estimation algorithms are used for training and recognition of handwritten Hangul and alphanumeric characters. The simulation result shows that the proposed method recognizes handwritten Korean characters and alphanumeric effectively. [SK98] proposed a Network-based approach to Korean handwriting analysis. The starting point of this research is a network of HMMs, which models whole sets of characters. These are followed by the assertion that the HMM for the on-line script can be applied to not only on-line character recognition, but also to handwriting synthesis and even to pen-trajectory recovery in off-line character images. The solutions to these problems are based on the single network of HMMs and the single principle of DP-based state-observation alignment. Given an observation sequence, the search for

52 2: Theory and Litrature Review 39 the best path in the network corresponds to the recognition whereas with character models, the search for the best observation sequence corresponds to the handwriting generation. Kundu et al. [KHC98] have published work concerning variable duration HMM in handwriting recognition (VDHMM). They showed that if the duration statistics are computed, this could be utilized to implement an MD-HMM approach for better experimental results. They also described a PD-HMM based HWR system where the duration statistics are not explicitly computed, but results are still comparable to a VDHMM based HWR scheme. In recent years, there have been several attempts to extend the onedimensional HMM to two-dimension, for example Park and Lee [PL98]. Unfortunately, previous efforts have not yet achieved a truly twodimensional (2-D) HMM because of the difficulty in establishing a suitable 2-D model and its computational complexity. Park and Lee [PL98] presented a framework for the recognition of handwritten characters using a truly 2-D model: Hidden Markov Mesh Random Field (HMMRF). The HMMRF model is an extension of a 1-D HMM to 2-D HMM, which provides a better description of the 2-D nature of characters. The application of the HMMRF model to character recognition necessitates two phases a training phase and a decoding phase. Their optimization criterion for training and decoding is based on the maximum, marginal and posterior probabilities. They also develop a new formulation of parameter estimation for character recognition. Computational concerns in 2-D, however, necessitate certain simplifying assumptions in the model and approximations on the implementation of the estimation algorithm. In particular, the image is represented by a thirdorder MMRF and the proposed estimation algorithm is applied over the

53 2: Theory and Litrature Review 40 look-ahead observations rather than the entire image. Thus, the formulation is derived from the extension of the look-ahead technique as devised for real-time decoding. Experimental results confirm that the proposed approach offers great potential for solving difficult handwritten character recognition problems under reasonable modelling assumptions. El-Yacoubi et al. [EGS99] used an HMM approach to recognize off-line unconstrained handwritten words for large vocabularies. After preprocessing, a word image is segmented into letters (or pseudoletters) and represented by two feature sequences of equal length, each consisting of an alternating sequence of shape-symbols and segmentation-symbols, which are both explicitly modelled. The word model is made up of the concatenation of appropriate letter models consisting of elementary HMMs and an HMM-based interpolation technique is used to optimally combine the two feature sets. Two rejection mechanisms are considered depending on whether or not the word image is guaranteed to belong to the lexicon. Experiments carried out on real-life data show that the proposed approach can be successfully used for handwritten word recognition. HMM based word recognition can be applied to reading the amount on cheques. Knerr et al. [KAN+98] implemented an HMM based word recognition algorithm to the recognition of legal amounts from French bank cheques. The algorithm starts from images of handwritten words, which have been automatically segmented from binary cheque images. After finding the lower-case zone on the complete amount, words are slant corrected and then segmented into graphemes. Then features are extracted from the graphemes and the feature vectors are vector quantized, resulting in a sequence of symbols for each word. The likelihood of all word classes are computed by a set of HMMs, which have been previously trained using

54 2: Theory and Litrature Review 41 either the Viterbi algorithm or the Baum-Welch Algorithm. The various parameters of the system have been identified and their importance evaluated. Results have been obtained on large real-life databases of French handwritten cheques. More recently, a Neural Network-HMM hybrid has been designed, which produces even better recognition rates. Senior and Robinson [SR98] designed a complete system for the recognition of off-line handwriting. A recurrent neural network is used to estimate probabilities for the characters represented in the skeleton. The operation of HMM which calculates the most appropraite word in the lexicon is also described. As mentioned earlier in this chapter, segmentation recognition schemes are primarily character-based approaches. This means that the basic element of recognition is the character. For small lexicons, as in the bank cheque application, most approaches are global, with words considered as individual entities [GS95]. Guillevic and Suen have published papers on recognition of legal amounts on bank cheques. The overall engine combines a global feature scheme with an HMM module. The global features encode the relative position of the ascenders, descenders and loops within a word. The HMM uses one feature set based on the orientation of contour points, and their distance from the baselines. The system is fully trainable, reducing to a strict minimum the number of hand-set parameters. The system is also modular and independent of specific languages, as they have to deal with at least two languages in Canada, namely English and French. The system can be easily adapted to read other European languages based on the Roman alphabet [GS98]. An HMM has also been used for the linguistic post-processing component of human handwriting recognition applications, by Bouchaffra et al. [BKK+96] and Hull [Hu96]. Article [BKK+96] shows that the SSS algorithm

55 2: Theory and Litrature Review 42 has a direct interpretation as an HMM whose states correspond to words that have been tagged with their parts of speech, and whose observations are discrete recogniser confidences. The HMM interpretation has the added advantage that it can be naturally extended to handle error recovery in the recogniser. Preliminary results indicate that the SSS model is successful in selecting the true path over alternate paths. Hull [Hu96] used an HMM to improve the performance of an algorithm for recognising digital images of handwritten or machine-printed text. A word recognition algorithm first determines a set of words (called a neighbourhood) from a lexicon that is visually similar to each input of the word image. Syntactic classifications for the words and the transition probabilities between those classifications are input to the Viterbi algorithm. The Viterbi algorithm determines the sequence of syntactic classes (the states of an underlying Markov process) for each sentence that has the maximum posterior probability given the observed neighbourhoods. The performance of the word recognition algorithm is improved by removing words from neighbourhoods with classes that are not included on the estimated state sequence. An experimental application is demonstrated with a neighbourhood generation algorithm that produces a number of guesses about the identity of each word in a running text. The use of zero, first and second order transition probabilities, and different levels of noise in estimating the neighbourhood are explored. Post-processing (probabilities between words) has also been used to improve performance.
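To make the mechanics of such Viterbi-based post-processing concrete, the sketch below finds the most likely sequence of syntactic classes for one sentence from class priors, class-to-class transition probabilities and per-position class scores. It is a minimal illustration only: the array layout, function name and scoring convention are assumptions, not details taken from [Hu96].

```python
import numpy as np

def most_likely_class_sequence(class_priors, transitions, class_scores):
    """Viterbi decoding over syntactic classes for one sentence.

    class_priors : (C,) prior probability of each syntactic class
    transitions  : (C, C) P(class j at t+1 | class i at t)
    class_scores : (T, C) support for each class at each word position,
                   e.g. summed recogniser confidences of the candidate
                   words (the "neighbourhood") at that position
    Returns the index of the best class at every position.
    """
    T, C = class_scores.shape
    eps = 1e-12                                   # avoid log(0)
    delta = np.zeros((T, C))                      # best log-score per class
    psi = np.zeros((T, C), dtype=int)             # back-pointers
    delta[0] = np.log(class_priors + eps) + np.log(class_scores[0] + eps)
    for t in range(1, T):
        for c in range(C):
            cand = delta[t - 1] + np.log(transitions[:, c] + eps)
            psi[t, c] = int(np.argmax(cand))
            delta[t, c] = cand[psi[t, c]] + np.log(class_scores[t, c] + eps)
    path = [int(np.argmax(delta[-1]))]            # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Candidate words whose syntactic class does not appear on the decoded path can then be removed from their neighbourhoods, which is the pruning step described above.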

56 2: Theory and Litrature Review Arabic OCR using HMM This section discusses the implementation of HMM on Arabic OCR. The following trials do not include implementation of HMM on handwritten Arabic words. Bazzi et al. [BSM99] present an omni font, unlimited-vocabulary OCR system for English and Arabic that is based on an HMM. They focus on two aspects of the OCR system. They address the issue of how to perform OCR on omni font and multi-style data (such as plain and italic) without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. This paper demonstrates mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Secondly, a method is described which enables a word-based HMM system to perform character recognition with unlimited vocabulary. This method includes the use of a trigram language model on character sequences. Using all these techniques, they have achieved character error rates of 1.1% on data from the University of Washington English Document Image Database, and 3.3% on data from the DARPA Arabic OCR Corpus. The application of HMM to Arabic OCR was first attempted by Amin and Mari [AM89]. They used HMM in the post-processing stage to improve the recognition accuracy, where each word is described by an HMM. As part of a larger project for transcription of the documents in the Ottoman Archives, Atic et al. [AM89] developed a heuristic method for segmentation, feature extraction and recognition of the Arabic script. They developed a geometrical and topological feature analysis method for the

57 2: Theory and Litrature Review 44 segmentation and feature extraction stages. Chain code transformation is applied to the main strokes of the characters that are classified by the HMM in the recognition stage. Experimental results indicate that the performance of the proposed method is satisfactory, as long as the thinning process does not yield spurious branches. Makhoul et al. [MLR+96] used a system that depended on the estimation of character models, a lexicon, and grammar from the training samples. This system was identical to their speech recognition system but replaced speech, phonemes, and phonological rules with scanned images, characters, and orthographic rules, respectively. It also describes each word with separate HMMs, which limited the number of words the system could recognize. Khorsheed and Clocksin [KC99] present a technique for the off-line recognition of cursive Arabic script based on an HMM in which it is not necessary to segment the word. After pre-processing, the thinned binary image of each word is decomposed into a number of curved edges in a certain order. Each edge is transformed into a feature vector, including features of curvature and length normalized to stroke thickness. The observation sequence presented to the HMM consists of codes derived from a vector quantization of the feature vector. The lexicon is represented by a single HMM, where each word is represented by a sequence of states. A modified Viterbi algorithm is used to provide an ordered list of the best paths, indicating candidate transliterations. The HMM was trained using the words written in one typeface and one size, and test samples were written in two different typefaces and in three sizes. Recognition rates ranging from 68% to 73% were achieved depending on the task performed. The system was less affected by distortion and variation than a system that uses the raw pixel data as an observation sequence. However it does not suit Arabic handwritten words, because dots (which are important elements of

handwritten Arabic characters) are not written exactly below or above each character or edge feature as described in the paper. Also, the result (from 68% to 73%) for the Arabic printed words using the Traditional Arabic Font is not high. Dehghan et al. [DF01] presented a holistic system for the recognition of handwritten Farsi/Arabic words using an HMM and a Kohonen self-organizing vector quantization. The image was divided into fixed-width frames, and each frame divided into five zones, each with four features depending on the contour direction. In this way, each frame is represented as a 20-dimensional feature vector. Given the unique properties of handwritten Arabic writing, it is believed that these features are not enough to get a reasonable recognition rate. The recognition rate was 32% without smoothing and 65% with smoothing. With the exception of Khorsheed, and Dehghan et al. [Kh00, DF01], the above experiments using the HMM approach were tested on printed Arabic text, not on handwritten words.

2.5 A Survey of Off-line Handwritten Arabic Words Recognition

Research in the field of Arabic character recognition started as early as 1975, when Nazif presented his thesis [Na75]. However, due to lack of computing power, further significant work was not performed until the 1980s [BS97]. Many papers have been published on the recognition of Latin, Chinese and Japanese characters. However, little research has been conducted towards the automatic recognition of Arabic characters, which are used in several widespread languages. This is because of the strongly cursive nature of Arabic writing rules. In fact, the techniques applied in other

59 2: Theory and Litrature Review 46 languages are not directly applicable to Arabic characters without fundamental modifications. Even less research in the field of handwritten Arabic characters has been published [BWB94]. Amin et al. [AAF96] propose a technique for the recognition of handprinted Arabic characters using neural networks. Firstly, their technique combines rule-based (structural) and classification tests. Secondly, it is more efficient for large complex sets, such as Arabic characters. Thirdly, feature extraction is inexpensive. Finally, the execution time is independent of both the character font and size. This paper describes the neural network method applied in the classification step, the computation of intensive earlier stages being carried out by more classical approaches. Maddouri and Amiri [MA02] propose a recognition system based on combining a global and local vision modelling of the word developed for Latin word recognition by M. Cote. The drawback of this system is in its assumption that diacritical dots are naturally separated, which is not the case with handwritten Arabic, as was shown in Chapter 1. Also, loops are not naturally written in handwritten Arabic, and thus leads to a substantial difference in recognition rate. In the same study, the researchers did one of the experiments using a manual GVM, which proposed a list of possible letters and words containing these characters. Al-Ohali et al. [ACS02] used an HMM to classify handwritten words used in cheque filling applications. The authors segmented the training and testing of sub-words and characters manually. Geometrical features were used.

60 2: Theory and Litrature Review Conclusion In this chapter, various image-processing methods commonly used in the field of document image analysis and character recognition have been presented. These methods are grouped into four processing categories, namely, data capture, pre-processing, feature extraction, classification and contextual verification (or postprocessing). This represents the processing steps used in many document image analysis systems currently in use. Image acquisition describes the process of converting a document into its numerical representation. The pre-processing step for the scanned image can be divided into three sub-sections: binarization, skew detection and segmentation. The features can be fuzzy to define or difficult to extract, so the process of the feature extraction step varies and depends on many factors. The features that result from each image are classified using classification methods. Trials of offline Arabic handwritten recognition systems were also discussed and an overview of handwritten character processing has been presented. A detailed review of handwriting recognition using HMM was also presented suggesting that the field of document understanding and, in particular, this handwriting recognition using HMM is undergoing an exciting phase of research and development. Indeed, the HMM has the potential to become one of the most dominant techniques in this field. Since the HMM has been used as one of the main classification techniques in this research, a further review of research done in this area has been discussed at the end of the chapter. This chapter has reviewed the current state of the field up to the point where the actual text classification begins and described the trials coming out on Arabic text using an HMM. In the next chapter a detailed review of more specific recognition techniques is

61 2: Theory and Litrature Review 48 presented (VQ, HMM and ID3), and the pre-processing methods used for Arabic writing are discussed. Much of the important research is briefly described to present the current status of off-line handwritten Arabic Character Recognition research. It is clear that most of this work assumes that the Arabic characters are already segmented, whilst the database and pre-processing systems are built assuming that words have not been segmented into characters.

This chapter discusses the techniques that have been used in the recognition of printed and handwritten Arabic text, and discusses three important tools which are used throughout this work (the Vector Quantization, HMM and ID3 classifiers). In section 3.2, Vector Quantization (VQ) is discussed, because VQ has been used to recognize segments of words, such as characters or sub-characters, in the HMM classifier of Chapter 6 and the Multiple HMM classifiers of Chapter 7. In section 3.3, HMM techniques are discussed to give an idea of the mathematics underlying the systems used in Chapters 6 and 7. In this work HMM techniques have been used to classify Arabic handwritten words. The ID3 tree technique is discussed in section 3.4 and used in Chapter 7 to classify an Arabic handwritten word into a group of words or a single word.

Off-line Arabic Words Recognition Methods

In the previous chapter, the steps that every OCR system includes (for any language) were described. The main steps that differ when dealing with Arabic writing systems are segmentation and feature extraction, because of the special characteristics of Arabic writing. In the next sub-section, some techniques used for the recognition of off-line Arabic writing (handwritten and printed) are discussed.

Feature Extraction Methods

It is known that features represent the smallest set that can be used for discrimination purposes and for a unique identification of each character. Features can be classified into three categories:

- geometric features (e.g. concave/convex parts, and type of junctions: intersections, T-junctions, endpoints, etc.)
- topological features (connectivity, number of connected components, number of holes, etc.)
- statistical features (Fourier transform, invariant moments, etc.)

Off-line character recognition systems typically use a scanner as the main input device. Off-line recognition can be considered as the most general case, where no special device is required for writing [BWB94]. Since this research deals with the off-line recognition of handwriting, some of the field trials to automatically recognize handwritten Arabic writing are summarized below.
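Before turning to these trials, the three feature categories can be made concrete with a minimal sketch that computes one simple example of each from a binary word image. The particular choices below (projection profiles, connected components, bounding-box aspect ratio) are illustrative only; the feature set actually used in this research is described in Chapter 5.

```python
import numpy as np
from scipy import ndimage

def example_features(binary_image):
    """binary_image: 2-D array with 1 = ink and 0 = background."""
    # Statistical: vertical and horizontal projection profiles.
    v_profile = binary_image.sum(axis=0)
    h_profile = binary_image.sum(axis=1)

    # Topological: number of connected components (main strokes and dots).
    _, n_components = ndimage.label(binary_image)

    # Geometric: aspect ratio of the ink bounding box.
    rows = np.any(binary_image, axis=1).nonzero()[0]
    cols = np.any(binary_image, axis=0).nonzero()[0]
    aspect_ratio = (cols.ptp() + 1) / (rows.ptp() + 1)

    return v_profile, h_profile, int(n_components), aspect_ratio
```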

Abuhaiba [AHD94] produced a paper that deals with three different problems in the processing of binary images of handwritten text documents. Firstly, an integrated algorithm that finds a straight-line approximation of textual strokes is described. The distance transform of thinned binary images has been used to identify spurious bifurcation points (which are unavoidable when thinning algorithms are used), remove them and recover the original ones. Secondly, a method is presented to recover loops that become blobs due to blotting. As reported, it is not possible to recover such loops with a high rate of success. Finally, a method is developed to extract lines from pages of handwritten text by finding the shortest spanning tree of a graph formed from the set of main strokes. At the end, an ordered list of main strokes is obtained. Each combination of main and secondary strokes is the input to a subsequent recognition stage. The method can deal with variable handwriting styles. A similar stroke extraction method has therefore been used in the pre-processing system in Chapter 5, but with a different thinning algorithm (the improved Zhang and Suen method), as in [MIB01], which showed that the Zhang and Suen method gives a better skeleton structure and execution time than other techniques when the resolution is less than 600 dpi. Almuallim and Yamaguchi [AY87] proposed a structural technique for the recognition of Arabic handwritten words. Their system consisted of four phases. The first phase is pre-processing, in which the word is thinned and the middle of the word is calculated. Since it is difficult to segment a cursive word into letters, words are then segmented into separate strokes and classified either as strokes with a loop, strokes without a loop, or complementary characters. These strokes are then further classified using their geometrical and topological properties. Finally, the relative positions of the classified strokes are examined and the strokes are combined in several steps into the string of characters that represents the recognized

65 3: Methodology: Useful Techniques 52 word. The system in Almuallim and Yamaguchi paper [AY87] is too simple to deal with complex Arabic words, since it used simple geometrical features and a small set of testing words. Also loops are difficult to extract from Arabic handwriting. A look-up table can be used for the recognition of isolated hand-written Arabic characters. In this approach, the character is placed in a frame, which is divided into six rectangles, and a contour-tracing algorithm is used for coding the contour as a set of directional vectors by using Freeman coding. However, this information is not sufficient to determine Arabic characters. Extra information related to the number of dots and their position is therefore added. If there is no match, the system will add the feature vector to the table and consider that character as a new entry [SY85]. This method is used to recognize segmented Arabic characters, which is not a real case in Arabic writing, since Arabic text is written cursively and, in general, Arabic characters are difficult to segment in the real handwritten Arabic words. Saleh et al. [Sa94] describe an efficient algorithm for coding handwritten Arabic characters. Certain feature points of the skeleton, which are end, branch, and main connection points, are extracted. Primitives are then assigned according to the sequence of ordering and positioning of these points. Isolated sub-patterns (secondaries) within some Arabic characters are treated separately and then related to the principal patterns of the character. Stability and performance of the algorithm have been established by applying it in several patterns of all Arabic characters as well as in an experimental context-free recognizer. Again this method is not realistic since Arabic letters are separated in nature. A structural approach has also been adopted for recognizing printed Arabic text (Amin and Masini [AM86]). Words and sub-words are segmented into

66 3: Methodology: Useful Techniques 53 characters using a baseline technique. Features such as vertical bars are then extracted from the character using horizontal and vertical projections (Figure 3-1). Four decision trees, which are chosen according to the position of the character within the word and computed in the segmentation process, have been used. The structure of the four decision trees allowed a rapid search for the appropriate character. Furthermore, trees are utilized to distinguish characters that have the same shape but appear in different positions within a word. (a) (b) (c) Figure 3-1: Vertical and horizontal scanning of the character (a) character (b) horizontal scanning (c) vertical scanning. Amin and Mari [AM89] proposed a technique for multifont Arabic text that includes character and word recognition. A character is divided into many segments by a horizontal scan process (Figure 3-2). In this way, segments are connected within the basic shape of the character. Segments that are not connected with any other segment are considered to be complementary characters. By using the Freeman Code [Fr68], the contour detection process is applied in these segments to trace the basic shape of the character and generate a directional vector through a 2*2 window. A decision tree is then used for the recognition of the characters. Finally, a Viterbi algorithm

67 3: Methodology: Useful Techniques 54 [Fo73] is used for Arabic word recognition to enhance the recognition rate. The main advantage of this technique is to allow an automatic learning process to be used. The last two approaches used general features for limited printed words. Some of the features can be extended to be used in handwritten words. Nouh et al. [NST80] suggested a standard Arabic character set to facilitate computer processing of Arabic characters. In this work, thirteen features, or radicals, which represent parts of the characters, are selected by inspection. The recognition is based on a decision tree and a strong correlation measurement. The disadvantage of the proposed system is the assumption that the incoming characters are generated according to specified rules. Figure 3-2: Major segments of character Parhami and Taraghi [PT81] presented a technique for the automatic recognition of machine printed Farsi text (which is similar to Arabic text). The authors first segment the sub-word into characters by identifying a series of potential connection points on the baseline at which line thickness changes from or to the thickness of the baseline. Although they also have some rules to keep characters at the end of a sub-word intact, they segment some of the wider characters (e.g.* ) into up to three segments. Then they

68 3: Methodology: Useful Techniques 55 select twenty features based on creation geometric properties of the Farsi symbols to construct a 24-bit vector, which is compared with entries in a table where an exact match is checked first. The system is heavily font dependent, and the segmentation process is expected to give incorrect rules in some cases. The study reported in [ERK90, AU92] utilizes descriptors to recognize the characters. Other techniques include a set of Fourier descriptors from the coordinate sequences of the outer contour, which is used for recognition [EG88]. Also, Nough [NU87] assign each character with a logical function where characters are re-classified into four groups depending on the existence of certain pixels in a specified location of the image. The last papers examined segmented printed Arabic characters and a similar pixel distribution feature that can be used with recognition of handwritten words, as for the case on feature (section 5.3), used in this research. To enhance the recognition rate of an OCR system, Taylor [Ta00] describes a family of lexical analyzers and text measurement tools. The tools are used to tag verbs, search for roots, and discover morpheme frequencies in Arabic text. The morpheme frequencies can be used to construct relative figures of merit for alternative lexical analyses of an ambiguous word. Amin and Al-Sadoun [AA94] proposed a structural approach for recognizing handprinted (this is not a true case in Arabic writing) Arabic characters. The binary image of the character is first thinned using a parallel thinning algorithm and then the skeleton of the image is traced from right to left using a 3*3 window in order to build a graph to represent the character. Features like straight lines, curves and loops are then extracted from the graph. Finally, a hierarchical classification (similar to a decision tree) is used for the recognition of the characters.

69 3: Methodology: Useful Techniques 56 Obaid [Ob94] introduced Arabic handwritten character recognition by neural nets, using the traditional Multi Layer Perceptron (MLP) with its back propagation learning algorithm to classify handwritten Arabic characters. Since Arabic script is cursive, he assumes that the characters are already segmented and he presents them to the network. When the network is trained, the output layer, in response to a familiar input pattern or one which resembles a familiar pattern, activates the neuron corresponding to this character classification [Ob94]. Al-Badr and Haralick [AH95] proposed a system to recognize machine printed Arabic words without prior segmentation by applying a mathematical morphology operation on the whole page to find the locations where shape primitives are present. They then combine those primitives into characters and print out the character identities and their location on the page. The advantage of the work in that paper is that it optimized the recognition of the symbols with respect to the whole word, without committing itself to a particular segmentation of the word into symbols. In this work a segmentation-free approach has been tested to recognize handwritten Arabic words using the ID3 tree (see sections 3.4, 7.1, and 8.5). Finally, El-Khaly and Sid-Ahmed [ES90] used moment-invariant descriptors to recognize isolated and connected printed Arabic characters. They obtained a 100% recognition rate for isolated printed Arabic characters in one font type. For connected printed characters, they obtained a 95% recognition rate for isolated characters. In this work another set of moments to recognize handwritten Arabic words was used as described in Chapter 5, sub-section

Segmentation Methods

Two techniques have been applied for segmenting machine-printed and handwritten Arabic words into individual characters: implicit and explicit segmentation. Implicit segmentation (straight segmentation): this type of segmentation is usually designed with rules that attempt to identify all the character segmentation points in order to segment the words directly into letters. Explicit segmentation: words are externally segmented into pseudo-letters, which are individually recognized. In all Arabic characters, the width at a connection point is much less than the width of the beginning character. This property is essential in applying the baseline segmentation technique [AM86, AM89]. The baseline is a medium line in the Arabic word on which all the connections between successive characters take place. A vertical projection of bi-level pixels is performed on the word:

$v(j) = \sum_{i} w(i, j)$   Eq. 3-1

where $w(i, j)$ is either zero or one and $i$, $j$ index the rows and columns, respectively. The connection point will have a sum less than the average value (AV):

$AV = \frac{1}{N_c} \sum_{j=1}^{N_c} X_j$   Eq. 3-2

where $N_c$ is the number of columns and $X_j$ is the number of black pixels in the jth column.

Hence, each part with a sum value much less than AV should be a boundary between different characters. However, if the histogram produced for the vertical projection does not follow the condition of Eq. 3-3, the character remains unsegmented, as illustrated in Figure 3-3. By examining typewritten Arabic characters, it is found that the distance between successive peaks in the histogram (Figure 3-3) does not exceed one-third of the width of an Arabic character. That is:

$d_k < d_l / 3$   Eq. 3-3

where $d_k$ is the distance between the kth and (k+1)th peaks, and $d_l$ is the total width of the character. Figure 3-3: An example of segmentation of an Arabic word into characters: (a) Arabic word, (b) histogram, (c) word segmented into characters.

Moreover, at the end of a word or a sub-word, Eq. 3-4 is applied:

$L_{k+1} > 1.5 \, L_k$   Eq. 3-4

by means of some features, such as the existence of a maximum or minimum in either the horizontal or vertical direction of the main stroke, the ratio between length and width, the type of secondary stroke, and other features. Al-Emami and Usher [AU90] presented a system for the on-line recognition of handwritten Arabic words. Words were entered via a graphic tablet and segmented into strokes based on the method proposed by Belaid et al. [BM83, AA92]. In the preliminary learning process, specifications of the strokes of each character are fed to the system, while in the recognition process, the parameters of each stroke are found and special rules are applied to select the collection of strokes that best matches the features of a stored character. However, few words were used in the learning and testing process, which makes the performance of the system questionable. This approach depends heavily on a predefined threshold value relating to the character width. Moreover, this approach will not work effectively for skewed images.
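The baseline segmentation idea of Eq. 3-1 to Eq. 3-3 can be sketched as follows. This is an illustration only: the threshold factor and the way cut points are chosen from runs of low-projection columns are assumptions, not values taken from [AM86, AM89].

```python
import numpy as np

def candidate_cut_columns(binary_word, threshold_factor=0.5):
    """Suggest connection-point columns from the vertical projection.

    binary_word      : 2-D array, 1 = ink, 0 = background
    threshold_factor : fraction of AV below which a column is treated as
                       part of a connection stroke (illustrative value)
    """
    v = binary_word.sum(axis=0)                    # Eq. 3-1: vertical projection
    av = v.mean()                                  # Eq. 3-2: average column value
    low = (v < threshold_factor * av) & (v > 0)    # much lower than AV, but not empty
    cuts, start = [], None
    for j, flag in enumerate(np.append(low, False)):
        if flag and start is None:
            start = j                              # a run of low columns begins
        elif not flag and start is not None:
            cuts.append((start + j - 1) // 2)      # middle of the run as cut point
            start = None
    return cuts
```

A check corresponding to Eq. 3-3 (peak spacing relative to the character width) would then be applied before accepting a cut point.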

Figure 3-4: An example of an Arabic word and its segmentation into characters: (a) Arabic word, (b) histogram, (c) word segmented into characters. Segmentation is also achieved by tracing the outer contour [EG88] of a given word and calculating the distance between the extreme points of intersection of the contour with a vertical line. The segmentation is based on a horizontal scan from right to left of the closed contour using a window of adjustable width (w). For each position of the window, the average vertical distance ($h_{av}$) is calculated across the window. At the boundary between two characters, the following conditions should be met: $h_{av} < T$, in which case a silence region is detected, which means that the average vertical distance over the window should be less than a certain preset threshold; detected boundaries should lie on the same horizontal line (the baseline); and no complementary character should be located either above or below the baseline at a silence region.

74 3: Methodology: Useful Techniques 61 Re-adjustment of parameters w and T, as well as backtracking, may occur if segmentation leads to a rejected character shape. Figure 3-5 illustrates some examples of this method. Figure 3-5: Segmented Arabic word and the corresponding contour heights, for words (a) Mahal and (b) Alalamy El-Khaly and Sid-Ahmed [ES90] segment a thinned word into characters by following the average baseline of the word and detecting when the pixels start going above or below the baseline. Abdelazim and Hashish [AH88] used the technique of an energy curve (similar to that used in speech recognition which discriminates the spoken

75 3: Methodology: Useful Techniques 62 utterance from the silence background), to show the number of black pixels in each column of the digitized word, and hence to segment the word into characters. This curve is traversed and a threshold value is used to select significant primitives, leaving out silent zones. Shoukry [SS91] used a sequential algorithm based on the input-time tracing principle, which depends on the connectivity properties of the acquired text in the binary image domain. This algorithm bears some resemblance to an algorithm devised by Wakayama [Wa82] for the skeletonization of binary pictures. The SARAT system [Ma92] used the outer contour to segment an Arabic word into characters. The word was divided into a series of curves by determining the start and endpoints of the word. Whenever the outer contour changed sign (from a positive to negative curvature) the character was segmented, as illustrated in Figure 3-6. Kurdy and Joukhadar [KJ92] use the upper distance of the sub-word to segment printed Arabic words, which is the set of the highest points in each column. They assign each point of the function a token name by comparing the height of the point to the height and the token name of the point on its right. Using a grammar, they then parse the sequence of tokens of a subword to find the connection points.

76 3: Methodology: Useful Techniques 63 Figure 3-6: An example of a segmented sub-word, with start point A, endpoint E, and horizontal lines 2-3 and 5-6 Finally, Amin and Al-Sadoum [AA92, AA95] adopted a new technique for segmenting Arabic text. The algorithm can be applied to any font and it permits the overlay of characters. There are two major problems with the traditional segmentation method, which refers to the baseline. Firstly, overlapping of adjacent Arabic characters occurs naturally (see Figure 3-7a), hence no baseline exists, a common phenomenon in both typed and handwritten Arabic text. Secondly, the connection between the two characters is often short. Therefore, placing the segmentation points is a difficult task. In many cases, the potential segmentation points will be placed within, rather than between, characters. The word in Figure 3-7(a) is segmented utilizing a baseline technique. Figure 3-7(b) shows the proper segmentation, and the result of the new segmentation method is shown in Figure 3-7(c). Their technique can be divided into four major steps. In the first, the original image is transformed into a binary image utilizing a scanner (300 dpi). Secondly, in the preprocessing step, the Arabic word is thinned using a parallel thinning algorithm. Then, the skeleton of the image is traced from right to left using a 3*3 window, a binary tree is constructed and the Freeman Code [Fr68] is

77 3: Methodology: Useful Techniques 64 used to describe the skeleton shape. Finally, the binary tree is segmented into sub-trees, each tree describing a character in the image. The system was tested on a small set of words. The drawback of this recognition system is that it needs too complicated a tree to recognize and segment more words. For segmentation in this thesis a new histogram calculation was used, which combines more pre-processing operations as well as histogram calculation to suit handwritten Arabic words. Figure 3-7: Example of an Arabic word 2_-^ and different techniques of the segmentation

Recognition Methods

Surveys on Arabic recognition can be found in [Kh02, Al99, AM95]. There are three main strategies which have been applied to printed and handwritten Arabic character recognition, as described in section 1.4. These can be categorized as the holistic approach, the analysis approach and feature sequence matching. Using the holistic approach, recognition is performed globally on the whole representation of the word, with no attempt to identify characters individually. With the analysis approach, recognition is not directly performed at word level but at an intermediate level dealing with units or segments. In this strategy the words are not considered as a whole, but as sequences of small-size units or segments. Feature sequence matching uses methods based on a probabilistic framework such as the HMM. Chapter 2, section 2.4, reviewed previous trials of Arabic recognition using HMMs. Amin [Am00] used a global method to recognize printed Arabic words using machine learning to generate a decision tree. The algorithm resulted in a 92% recognition rate. In this thesis the holistic approach, feature sequence matching, and a combination of the two are used to recognize Arabic words (see the systems described in Chapters 6 and 7).

3.2 Vector Quantization

One application of distance measures that has been important in automatic object and speech recognition is known as Vector Quantization (VQ). VQ is a data reduction method, which means it seeks to reduce the number of dimensions in the input data so that the models used to match unknown characters or segments are as simple as possible. VQ reduces dimensionality quite drastically, since it encodes each vector as a single number [Sc03] indexing the codebook [GN90]. In the early days, the design of a vector quantizer (VQ) was considered to be a challenging problem due to the need for multi-dimensional integration. In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training sequence. The use of a training sequence bypasses the need for multi-dimensional integration. VQs that are designed using this algorithm are referred to as LBG-VQ [Ph03].

VQ Mathematical Definition

The VQ design problem can be stated as follows. Given a vector source with known statistical properties, a distortion measure, and the number of codevectors, find the codebook and a partition which result in the smallest average distortion. Assume that there is a training sequence consisting of M source vectors:

$\mathcal{T} = \{x_1, x_2, \ldots, x_M\}$.   Eq. 3-5

This training sequence can be obtained from some large database. For example, if the source is a speech signal, then the training sequence can be obtained by recording several long telephone conversations. M is assumed

to be sufficiently large so that all the statistical properties of the source are captured by the training sequence. It is assumed that the source vectors are k-dimensional, e.g.,

$x_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,k}), \quad m = 1, 2, \ldots, M$.   Eq. 3-6

Let N be the number of codevectors and let

$C = \{c_1, c_2, \ldots, c_N\}$   Eq. 3-7

represent the codebook. Each codevector is k-dimensional, e.g.,

$c_n = (c_{n,1}, c_{n,2}, \ldots, c_{n,k}), \quad n = 1, 2, \ldots, N$.   Eq. 3-8

Let $S_n$ be the encoding region associated with codevector $c_n$, and let

$P = \{S_1, S_2, \ldots, S_N\}$   Eq. 3-9

denote the partition of the space. If the source vector $x_m$ is in the encoding region $S_n$, then its approximation (denoted by $Q(x_m)$) is $c_n$:

$Q(x_m) = c_n, \quad \text{if } x_m \in S_n$.   Eq. 3-10

Assuming a squared-error distortion measure, the average distortion is given by:

$D_{ave} = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - Q(x_m) \|^2$,   Eq. 3-11

where $\|e\|^2 = e_1^2 + e_2^2 + \cdots + e_k^2$. The design problem can be succinctly stated as follows: given T and N, find C and P such that $D_{ave}$ is minimized.
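A minimal sketch of the LBG training loop, alternating the nearest-neighbour and centroid conditions given in the next sub-section; the random initialisation and fixed iteration count are simplifications for illustration, not the exact LBG splitting schedule.

```python
import numpy as np

def train_codebook(training_vectors, n_codevectors, n_iterations=20, seed=0):
    """Design a VQ codebook C from the training sequence T (an M x k array)."""
    rng = np.random.default_rng(seed)
    T = np.asarray(training_vectors, dtype=float)
    # initialise the codebook with randomly chosen training vectors
    C = T[rng.choice(len(T), size=n_codevectors, replace=False)].copy()
    for _ in range(n_iterations):
        # nearest-neighbour condition: assign each x_m to its closest codevector
        d = ((T[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        # centroid condition: move each codevector to the mean of its region
        for n in range(n_codevectors):
            members = T[nearest == n]
            if len(members):                      # keep the encoding region non-empty
                C[n] = members.mean(axis=0)
    return C

def quantize(x, C):
    """Encode a vector as the index of its nearest codevector, i.e. Q(x)."""
    return int(((C - np.asarray(x)) ** 2).sum(axis=1).argmin())
```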

Optimality Criteria

If C and P are a solution to the above minimization problem, then they must satisfy the following two criteria.

Nearest Neighbour Condition: this condition says that the encoding region $S_n$ should consist of all vectors that are closer to $c_n$ than to any of the other codevectors, and the equation is written as follows:

$S_n = \{x : \|x - c_n\|^2 \le \|x - c_{n'}\|^2, \ n' = 1, 2, \ldots, N\}$   Eq. 3-12

Centroid Condition: this condition says that the codevector $c_n$ should be the average of all those training vectors that are in encoding region $S_n$, as described in the following equation:

$c_n = \frac{\sum_{x_m \in S_n} x_m}{\sum_{x_m \in S_n} 1}, \quad n = 1, 2, \ldots, N$   Eq. 3-13

In implementation, one should ensure that at least one training vector belongs to each encoding region, so that the denominator of the above equation is never zero.

3.3 Hidden Markov Model (HMM)

During the last decade, HMMs have become the predominant approach to automatic speech recognition. The success of HMMs in speech recognition has recently led many researchers to apply them to handwriting recognition by representing each word image as a sequence of observations [Ca01,

82 3: Methodology: Useful Techniques 69 BWB94]. Historically, the HMM has been used in text recognition, as early as 1980, e.g. Cave and Neuwirth [CN80], who analyzed machine-printed text using HMM. For on-line recognition of handwriting, the HMM was first used in [NWF86] where the approach followed the basic HMM scheme each word is modelled by an HMM, like the one used in the recognition of isolated digits of speech. The success of these attempts, however, were limited by constrained experiments, and problems of a single writer, small and fixed vocabulary, small test samples, etc. The application of HMMs to the more general problem of handwriting recognition, involving large dictionaries, off-line data, unconstrained style, etc., was introduced in [BWB94, KHB98, JSW90]. The basic problems of handwriting recognition are common to all languages, but the special features and constraints, for example, of each language, need to be considered as well. For example, the large set of Chinese characters and the complicated combination of strokes make the recognition task difficult. In [JSW90], projection profiles are first obtained from each Chinese character image. The HMM is then used to model the sequence of the histogram of projection profiles. To counter the serious loss of stroke information after the projection of the character image [PL93], the Regional Projection Contour Transform (RPCT) is proposed to transform the character image into the contour of four feature maps. As the pattern transformed by RPCT has only one outer contour and does not contain any internal contour, the HMMs used for 2-D planar shape recognition such as the one proposed in [HK91] are directly applicable. Another interesting application of HMM to the off-line HWR problem was recently introduced in [BR95]. Besides handwriting recognition, HMMs have also been used to analyze document images. Vlontzos and Kung have proposed a multilevel structure of HMMs for the recognition of machine, or hand-printed text [VK92]. Kuo and Agazzi have successfully spotted key words in a poorly printed

document image using the pseudo 2-D HMM [KA93], where the word image is modelled by a hierarchical structure composed of vertical and horizontal HMMs. A complete scheme for encoding the printed document using HMMs, from the pixel level to characters, words, and whole documents, is proposed by Kopec and Chou [KC94].

3.3.1 Implementation Strategies

In practical pattern recognition problems, there are three ways of building the HMM.

1. Model Discriminant HMM (MD-HMM): the patterns are classified by different models. For each class of pattern, one or more HMMs are built. Given an observed pattern, the path probability against each model is calculated, and the pattern is classified to the class whose model leads to the maximum path probability. This kind of model has been highly successful for speech recognition, especially for the recognition of isolated words [Ra89].

2. One Path Discriminant HMM (PD-HMM): a single HMM is built for all the classes, and different paths are used to distinguish one class from the others [LHR89]. A comparison of the two strategies is shown in Table 3-1.

3. Combination of PD-HMM and MD-HMM: this composite approach has been successfully used [LHR89].

From the review of papers written on HMM used to recognize handwritten and printed words (see section 2.3), one can see that there has been little experimentation with the HMM approach to Arabic writing recognition. Until this thesis, there was no implementation of HMM on Arabic handwritten text written by different writers.
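Whatever the strategy, the MD-HMM decision rule itself is simple: score the observation sequence against every class model and pick the best. A minimal sketch follows; the dictionary of scoring functions is a hypothetical interface, not code from the systems cited above.

```python
def md_hmm_classify(observations, word_models):
    """MD-HMM classification: one HMM (scoring function) per word class.

    word_models maps each word label to a callable returning
    log P(O | lambda) for that word's model, e.g. the forward
    algorithm described later in this chapter.
    """
    scores = {word: score(observations) for word, score in word_models.items()}
    return max(scores, key=scores.get)
```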

In this research HMMs have been used in a new way. In Arabic handwriting, dots are important in character recognition, but writers do not always place dots exactly on each character. So dots are omitted in the recognition of words using the HMM, while dots and more general features are used in the global recognition of Arabic words for lexicon reduction, before recognition by the HMM.

Table 3-1: A comparison between PD-HMM and MD-HMM strategies

Memory and dictionary size — MD-HMM: A reasonable approach for vocabularies up to a few hundred words. PD-HMM: Likely to be independent of dictionary size, as one HMM is built for the whole dictionary.

Accuracy — MD-HMM: The MD-HMM often performs better than the PD-HMM, since many modelling constraints can be easily implemented in the MD-HMM approach. PD-HMM: Less accurate recognition results [LHR89], since the states are usually transparent during training and are semantically meaningful.

Portability — MD-HMM: Poor portability (the ability to adapt as the dictionaries change). PD-HMM: Better portability, since the only need for changing a dictionary is to recompute the transition probability.

HMM Theory

Markov Dependency: Assuming that the occurrences of states depend only on their immediate preceding states, for example a first-order Markov

chain, the joint probability P(Q) of a sequence of states $Q = \{q_1, \ldots, q_T\}$ can be defined as:

$P(Q) = P(q_1) P(q_2 \mid q_1) P(q_3 \mid q_2) \cdots P(q_T \mid q_{T-1})$.   Eq. 3-14

Similarly, if we assume that $q_t$ depends only on the n immediately preceding states, n-th order Markov chains can also be defined.

HMM: An HMM is a doubly stochastic process with an underlying Markov process that is not observable (the states are hidden). It can only be observed through another set of stochastic processes, which are produced by the Markov process (the observations are probabilistic functions of the states). Let us assume that a sequence of observations $O = (o_1, \ldots, o_T)$ is produced by the state sequence $Q = (q_1, \ldots, q_T)$, where each observation $o_t$ is from the set of M observation symbols $V = \{v_k;\ 1 \le k \le M\}$ and each state $q_t$ is from the set of N states $S = \{s_i;\ 1 \le i \le N\}$. Thus, an HMM can be characterized by:

$\Pi = \{\pi_i\}$, where $\pi_i = P(q_1 = s_i)$ is the initial state probability;
$A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$ is the state transition probability;
$\Gamma = \{\gamma_j\}$, where $\gamma_j = P(q_T = s_j)$ is the last state probability;
$B = \{b_j(k)\}$, where $b_j(k) = P(o_t = v_k \mid q_t = s_j)$ is the symbol probability;

and they satisfy the probability constraints:

86 3: Methodology: Useful Techniques 73 Σ_(i=1..N) π_i = 1; Σ_(j=1..N) a_ij = 1 for each i; Σ_(j=1..N) γ_j = 1; and Σ_(k=1..M) b_j(k) = 1 for each j. We will denote the HMM by the compact notation λ = {Π, Α, Γ, Β}. Chapter 6 contains a discussion of the implementation of these HMM statistics on handwritten Arabic words. Scoring Problem Given an observation sequence O = o_1, ..., o_T and a model λ = {Π, Α, Γ, Β}, how can one find P(O|λ)? This is the scoring problem. One can find P(O|λ) by the forward algorithm [DHP00], where the forward variable α_t(j) is the probability of being in state s_j at time t having generated the partial observation sequence O_t = o_1, ..., o_t. It can be computed iteratively as: α_t(j) = π_j b_j(o_1) if t = 1, and α_t(j) = [Σ_(i=1..N) α_(t-1)(i) a_ij] b_j(o_t) otherwise. Then it can be shown that: P(O|λ) = Σ_(i=1..N) α_T(i) γ_i. Eq. 3-20
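As a concrete illustration of this scoring computation, the following minimal sketch implements the forward algorithm in Java, assuming the model parameters are stored in plain arrays and the observations are symbol indices; it is written for clarity rather than numerical robustness, and a practical implementation would use scaling or log probabilities to avoid underflow.

class ForwardAlgorithm {

    /**
     * Forward algorithm: computes P(O | lambda) for a discrete HMM with a
     * last-state distribution, following the recursion above and Eq. 3-20.
     * pi[i]    initial state probabilities
     * A[i][j]  state transition probabilities a_ij
     * B[j][k]  symbol probabilities b_j(k)
     * gamma[j] last-state probabilities
     * obs[t]   observation symbol indices o_1..o_T
     */
    static double forwardProbability(double[] pi, double[][] A,
                                     double[][] B, double[] gamma, int[] obs) {
        int N = pi.length, T = obs.length;
        double[] alpha = new double[N];

        // Initialisation: alpha_1(j) = pi_j * b_j(o_1)
        for (int j = 0; j < N; j++) {
            alpha[j] = pi[j] * B[j][obs[0]];
        }

        // Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
        for (int t = 1; t < T; t++) {
            double[] next = new double[N];
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int i = 0; i < N; i++) {
                    sum += alpha[i] * A[i][j];
                }
                next[j] = sum * B[j][obs[t]];
            }
            alpha = next;
        }

        // Termination: P(O | lambda) = sum_i alpha_T(i) * gamma_i
        double p = 0.0;
        for (int i = 0; i < N; i++) {
            p += alpha[i] * gamma[i];
        }
        return p;
    }
}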

87 3: Methodology: Useful Techniques Training Problem Given the training sequence O = o_1, ..., o_T, the task is to adjust the model parameters λ = {Π, Α, Γ, Β} such that P(O|λ) is maximized. The Baum-Welch algorithm will be used as the optimization procedure for finding the Maximum Likelihood (ML) estimate. In general, HMMs can be trained by the Baum-Welch algorithm with satisfactory performance [BWB94] Recognition Phase The modified Viterbi algorithm (MVA) can solve the recognition problem. Given the model parameters λ = {Π, Α, Γ, Β} and a testing sequence O = o_1, ..., o_T, we want to find the optimal state sequence: Q* = argmax_Q P(Q|O, λ) = argmax_Q P(Q, O|λ). Post-processing A post-processing operation is used if a PD-HMM system is implemented, because the output of a PD-HMM system is not guaranteed to be a legitimate word from the given dictionary. 3.4 ID3 Classifier A decision tree is constructed by looking for regularities in data. ID3 (Induction of Decision Trees) is particularly interesting for its representation of learned knowledge, its approach to the management of complexity, its heuristics for selecting candidate concepts, and its potential for

88 3: Methodology: Useful Techniques 75 handling noisy data. ID3 represents concepts as decision trees, a representation that allows us to determine the classification of an object by testing its values for certain properties [WF00]. The learnt decision tree should capture the relevant relationships between attribute values and class information. In addition, such systems typically use information-based heuristics to bias their learning towards shallower trees. ID3 was developed by Quinlan [Qui79, Qui86] and is perhaps the most commonly used ML algorithm in the scientific literature and in commercial systems. A quick introduction is given in [LS93, HS94]. In conclusion, ID3 is an algorithm which has high classification accuracy (even on noisy data sets), a fast learning phase, and low time complexity. ID3 must be supplied with the entire training set at once, but variations with incremental learning exist. The decision tree resulting from ID3 is not easy for humans to interpret when large amounts of data are used [So96]. To perform induction, start with a set of objects (training examples) C. If all the objects in C belong to the same class, stop with a leaf labelled with that class name. Otherwise, choose an attribute A as the root node, create as many children as there are values of A, distribute the objects in C among the children nodes according to their value for A, and iterate the process for each child. The main issue is which attribute to split on at each iteration. Since many DTs exist that correctly classify the training set, a smaller one is usually preferred (Occam's Razor) [Qui79]. However, finding the smallest DT is NP-complete, so we need a heuristic. The paper by Quinlan [Qui79] advocates the use of information theory: the quantity of information I is a real number between 0 and 1 that characterizes the amount of uncertainty in a set C w.r.t. class membership. I = 0 if all objects in C belong to the same

89 3: Methodology: Useful Techniques 76 class. I = 1 if they are evenly distributed between two classes. At each iteration, the heuristic chooses the attribute that minimizes the expected value of I after the split. ID3 uses all the training examples at each step in the search to make statistically-based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g. version space candidate-elimination). One advantage of using statistical properties of all the examples is that the resulting search is much less sensitive to errors in individual training examples. ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data [Ha9]. 3.5 Conclusion In this chapter some of the important techniques used in this thesis have been discussed. Since this thesis deals with Arabic handwriting, a general review of the techniques that have been used for the feature extraction and pre-processing steps in Arabic writing has also been given. This chapter clarifies that segmentation by histogram calculation is still one of the most commonly used techniques for Arabic writing and that marks such as dots are useful features in Arabic writing. This chapter also presented the VQ, HMM and ID3 classifiers, and discussed the mathematics used in the HMM implemented in the system this thesis describes. A new database for off-line Arabic handwriting recognition is discussed in the next chapter. Applications that include some form of pattern recognition usually require the use of large sets of data. Since there are few Arabic databases available, none of which are of a reasonable size or scope, this research built

90 3: Methodology: Useful Techniques 77 the AHDB database in order to facilitate the training and testing of systems that are able to recognize unconstrained handwritten Arabic text [AHE02a] [AHE03a].

91 4.1 A New Arabic Handwritten Database (AHDB) This chapter presents a new database for off-line Arabic handwriting recognition. A new database for the collection, storage and retrieval of Arabic handwritten text (AHDB) has been developed. This supersedes previous databases both in terms of its size and the number of different writers involved. In this chapter the most popular words in Arabic writing are identified for the first time, using an associated program. An off-line handwriting character recognition system is required to perform the automatic transcription of text where only an image of the script is available. Much work has been done on the recognition of Latin characters, covering both the cases of separated (hand-printed) characters and cursive script. Much less research has been undertaken on the task of recognizing Arabic script. The results reported are also applicable to the recognition of

92 4: A Database for Arabic Handwritten Text Recognition Research 79 handwritten text in languages such as Farsi (Persian), Kurdish, and Urdu, which also use Arabic characters in writing but differ in pronunciation. Previous research in this area includes work carried out by Abuhaiba et al. [AMG94], who dealt with some problems in the processing of binary images of handwritten text documents. In this database, the stages of designing, storing, and retrieving information have been considered, as well as the pre-processing of off-line handwritten Arabic words. Successful off-line Arabic character recognition is likely to be a complex process involving many steps that are interdependent and may need to be undone using backtracking algorithms. It is crucial to have a suitable representational scheme to underpin the research. In this chapter the first organized database for Arabic handwritten text and words is described. A significant aspect of handwriting recognition in domains such as bank cheques [FEB+00] and postal address recognition [CKZ94] is that there is no control over the author, writing instrument, or writing style. For example, an arbitrary handwritten word might be produced by a felt pen and could include isolated, touching, or overlapping characters, cursive fragments, or fully cursive words [CK94]. However, these difficulties are offset by the constraint that input words come from a relatively small fixed vocabulary.

93 4: A Database for Arabic Handwritten Text Recognition Research 80 Figure 4-1: One form filled in by one writer A standard database of images is needed to facilitate research in handwritten text recognition. A number of existing databases for English off-line handwriting recognition are summarized by [Su90, Na92], and also

94 4: A Database for Arabic Handwritten Text Recognition Research 81 by [MB99, Zi02]. For machine-printed Arabic, the Environmental Research Institute of Michigan (ERIM) has created a database of machine-printed Arabic documents. These images are extracted from typewritten and typeset Arabic books and magazines [Sc02]. Applications that include some form of pattern recognition usually require the use of large sets of data. Since there are few Arabic databases available, and none of a reasonable size and scope, the AHDB database was built in order to facilitate the training and testing of systems that are able to recognize unconstrained handwritten Arabic text. In Figure 4-1 one can see an example of a form filled in by one writer. There are different approaches to form dropout, some using separate cleaning steps, whilst others use combined cleaning methods for both foreground and background [DI97]. The three most common approaches to form dropout (symbolic subtraction of an image, colour filtering, and thresholding) are described by [CD97]. The approach proposed in this chapter is dropout by colour filtering using hardware (optical filtering), which is faster than the other techniques and more accurate than dropout by symbolic subtraction. Sections 4.4 and 4.5 will discuss how the AHDB is stored and sorted into separate directories for simpler data retrieval. The database created in this research contains Arabic words and text written by 100 different writers. The following sections describe the steps involved in constructing this database. As the AHDB contains the most popular written Arabic words, the next section will discuss in detail how they were identified. 4.2 Arabic Word Counting The aim of this step was to find the most popular words in Arabic writing. First, Arabic texts differing in context and subject matter were copied from several Internet sites. Then a program was written to count the

95 4: A Database for Arabic Handwritten Text Recognition Research 82 repeated words in the text files, which contained more than 30,000 different words. Finally, the words were totalled and sorted using a Microsoft Excel worksheet. From this experiment, the twenty most used words in written Arabic were identified (for the first time), sorted, and are illustrated, along with their English meanings, in Table 4-1. From the table it can be seen that the most popular words in Arabic writing are different from those in English. For example, in English the most popular word is 'the', whereas in Arabic it is the word meaning 'in'. The most popular words have been added to the AHDB to be used as a testbed database for researchers, as is the case for English and other languages. It should be clear that this work has never previously been done for Arabic writing.
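The counting program itself is not reproduced in the thesis; the following is a minimal Java sketch of the same idea, assuming the collected texts are plain UTF-8 files and that words are separated by any non-letter characters.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCounter {

    /** Counts how often each word occurs in the given text files. */
    static Map<String, Integer> countWords(List<Path> textFiles) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (Path file : textFiles) {
            String text = Files.readString(file, StandardCharsets.UTF_8);
            // Split on anything that is not a letter (this covers Arabic letters too).
            for (String token : text.split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Path.of(arg));
        }
        // Print the twenty most frequent words, most frequent first.
        countWords(files).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(20)
                .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}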

96 4: A Database for Arabic Handwritten Text Recognition Research 83 Table 4-1: The twenty most used words in written Arabic, with their meanings in English: 1 In, 2 From, 3 Is, 4 On, 5 To, 6 That, 7 That, 8 About, 9 With, 10 Which, 11 That, 12 Or, 13 Was, 14 Finish, 15 He, 16 No, 17 She, 18 God, 19 Servant, 20 Before.

97 4: A Database for Arabic Handwritten Text Recognition Research 84 (a) (b) (c) Figure 4-2: Handwritten Arabic words in the AHDB written by three different writers (a, b, and c)

98 4: A Database for Arabic Handwritten Text Recognition Research Form Design The form was designed with six pages. The first three pages were filled with 96 words, 67 of which were handwritten words corresponding to numbers that can be used in handwritten cheque writing. The other 29 words were the most popular Arabic words, as identified in section 4.2. The fourth page was designed to contain three sentences of handwritten words representing numbers and quantities that may be written on cheques. The fifth page was lined, and designed to be completed by the writer in freehand on any subject of their choice. The colour of the forms was selected as light blue, with black ink in the foreground, because the scanner can mask blue, green, and red. This means one could print forms in green filled in with blue, red or black ink and get the same result as from a blue form with black writing. In the first three pages, the spaces for handwritten words are equal, so there is no pressure on the writer as to the length of the word. The forms were scanned in black and white using the blue channel as a mask (hardware mask). One hundred and five forms were scanned at 600 dpi. 4.4 Data Storing Every word image is saved with a name and number indicating its writer; for example, the image of the word in for the first writer is saved as in001.tiff. Common file types in bitmap format are JPEG, GIF, TIFF and WMF. The TIFF format was chosen for the AHDB because it can store complex information for the CMYK colour model and can also use JPEG compression techniques. This makes TIFF one of the most robust and well-

99 4: A Database for Arabic Handwritten Text Recognition Research 86 supported image formats available [Ho00]. For easier retrieval of the handwritten images, the Arabic handwritten data was sorted and saved in five sub-directories containing: 1. Wrd_no: words used for numbers and quantities in cheque filling, used in this research as testing and training data because of its use in cheque verification. 2. Wrd_mst: contains the most popular words in Arabic writing (Table 4-1), which were identified in this research as described in section 4.2. 3. Chq: contains sentences used in writing cheques with Arabic words (Figure 4-3), which is useful for cheque verification applications. 4. Page: free-handwriting pages in any area of writer interest (Figure 4-4). 5. Form_Wrd: the first three pages of the forms. The first page is stored as the number of the form followed by _a, the second page as the number of the form followed by _b, and the third page as the number of the form followed by _c; for example, 001_a, 001_b, and 001_c, respectively. These can be seen on the Arabic handwritten DB ftp site ftp://ftp.cs.nott.ac.uk/pub/users/sxa. 4.5 Data Retrieval As mentioned earlier in section 4.4, the data was stored in TIFF format. For image retrieval the system used Lizard's TIFF library for Java [LW0]. The pre-processing operations are implemented on the

100 4: A Database for Arabic Handwritten Text Recognition Research 87 word images (stored in the Wrd_no directory), as described in Chapter 5. Figure 4-3: Examples containing sentences used in cheque writing in Arabic

101 4: A Database for Arabic Handwritten Text Recognition Research 88 Figure 4-4: Examples of free-handwriting 4.6 Conclusion The AHDB has been built, which contains Arabic words and text written by 100 different writers. This database contains words used for numbers and quantities in cheque filling. It also contains the most popular words in Arabic writing (reported for the first time in this thesis). Also contained are sentences used in writing cheques with Arabic words. Finally, it contains free-handwriting pages in any area of writer interest. This database is meant to provide training and testing sets for Arabic text recognition research.

102 4: A Database for Arabic Handwritten Text Recognition Research 89 In the next chapters some useful pre-processing operations are applied to the wrd_no set (containing the words used for numbers and quantities in cheque filling in the AHDB). An innovative, simple, yet powerful tagging procedure was designed for this database, which enables the bitmaps of words to be extracted easily. A pre-processing class, which contains some useful pre-processing operations, was also constructed. These are discussed in the next chapter. The next chapter deals with the pre-processing steps (before classification) of off-line handwritten Arabic words. In this system, some of the methods applied to Arabic handwriting, such as slant correction, slope correction, thinning, segmentation and feature extraction, are discussed. These methods have not been applied to Arabic handwriting in this way before, as mentioned in section 2.3. The system first attempts to remove some of the variation in the images that does not affect the identity of the handwritten word (slant correction, slope correction, and baseline estimation). Next, the system codes the skeleton of the word so that information about the lines is passed on to the recognition system (segmentation and feature extraction).

103 5.1 Overview This chapter describes the operation of the complete pre-processing system for the recognition of a single Arabic word taken from the Arabic handwritten word database. Any word recognition system can be divided into modules, for example pre-processing, recognition and post-processing. In this system the handwritten word is normalized to remove incidental differences in style which are independent of the identity of a word. Then, recognition is carried out by first estimating the likelihoods for each frame of data in the representation (using a suitable classification technique, such as a codebook and HMM in this research), after which post-processing can be carried out to reduce ambiguity. The system built in this chapter concentrates on the pre-processing operations, which are especially important in the recognition process for handwritten Arabic words. The

104 5: A Pre-processing System for the Recognition of Off-Line AHW 91 steps involved in pre-processing are implemented in Java code. The system has three main advantages. Firstly, it deals with non-segmented words. Secondly, it takes advantage of the position of features in the character or sub-character. Thirdly, more than 29 features are calculated and used by the VQ and HMM classifiers in the next two chapters. Figure 5-1: The pre-processing operations (word image, image loading, calculation of the vertical histogram, baseline estimation, slant correction, slope correction, and thinning, producing a skeleton of the word image consisting of vertical letters on a horizontal baseline) 5.2 Pre-processing Steps Pre-processing of the handwritten word image is important in order to organize the information so as to simplify the task of recognition. The most crucial step in the pre-processing stage is normalization, which attempts to remove some of the variations in the images which do not affect the identity of the word [SR98]. The system incorporates normalization for stroke width, slope and height of the letters (see Figure 5-1). The normalization task reduces each word image into one consisting of vertical letters of uniform height on a horizontal baseline and made up of one-pixel-

105 5: A Pre-processing System for the Recognition of Off-Line AHW 92 wide strokes. In this system, the word image is loaded and cropped. Then the slant and slope of the word are corrected and the image is thinned. Features are calculated to represent the useful information contained in the image of the word [AHE01]. Then the word is segmented into frames, so the features in each frame can be found. The following sub-sections of this chapter discuss in detail the steps involved in Arabic handwritten pre-processing that have been used for this research. This system incorporates normalization for each of the following factors: 1. Stroke width: this depends on the writing instrument used, the pressure on the instrument, and its angle with respect to the writing surface. 2. Slant: this is the deviation of strokes from the vertical axis, which varies between words and between writers. 3. Slope: the slope is the angle of the base of a word if it is not written horizontally. 4. Height of the letters: this varies between authors for the same document, and for a given author between different documents. Figure 5-1 shows the processes involved in pre-processing; the details of each of these processes are described in the following sections.
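Both the slant and the slope corrections described in the following sections reduce to shear transforms of the binary word image. The sketch below shows a horizontal shear about a pivot row, which is the operation used for slant removal; slope correction uses the analogous shear parallel to the y-axis with the roles of rows and columns swapped. The boolean-array image representation is an assumption of this sketch, not the thesis's actual data structure.

class ShearTransform {

    /**
     * Applies a horizontal shear to a binary image: each row y is shifted by
     * tan(angle) * (y - pivotRow) pixels, which removes a measured slant about
     * the baseline when angle is the negated slant angle. Pixels sheared
     * outside the canvas are simply dropped.
     */
    static boolean[][] shearHorizontal(boolean[][] img, double angle, int pivotRow) {
        int height = img.length, width = img[0].length;
        boolean[][] out = new boolean[height][width];
        double t = Math.tan(angle);
        for (int y = 0; y < height; y++) {
            int shift = (int) Math.round(t * (y - pivotRow));
            for (int x = 0; x < width; x++) {
                int nx = x + shift;
                if (img[y][x] && nx >= 0 && nx < width) {
                    out[y][nx] = true;
                }
            }
        }
        return out;
    }
}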

106 5: A Pre-processing System for the Recognition of Off-Line AHW 93 Figure 5-2 shows some examples of the pre-processing operations applied to Arabic words in the database. The implementation of the pre-processing operation consists of the following important steps: image loading, slope correction, slant correction, and thinning. Figure 5-2: Different examples of the pre-processing stages: (a) baseline detection, (b) slant and slope correction, (c) feature extraction, (d) width normalization

107 5: A Pre-processing System for the Recognition of Off-Line AHW 94 In the Java implementation, every pre-processing operation has a separate class. In the following sub-sections, the steps (processes) involved in the Arabic handwritten pre-processing that are used in this research are discussed in detail. Image Loading For loading the word image, the system uses a ready-made class library for loading images of type TIFF [LW0]. Before that, the images of Arabic words stored in the database were converted to TIFF group 4 format using the PolyView software. Figure 5-3: (a) The word before the operation of slope correction. (b) The word after its slope is corrected horizontally. (c) The same word after slant correction. (d) The operation of thinning

108 5: A Pre-processing System for the Recognition of Off-Line AHW Slope Correction The slope may be defined as the angle of the baseline of a word that is not written horizontally. Figure 5-3 (a and b) shows a sloped word before and after slope correction. The character height is determined by finding the main baseline. For each Arabic word there are two baselines, as can be seen in Figure 5-4, but the second baseline is difficult to determine since some handwritten Arabic words may have more than one secondary baseline for each segment. So only the main baseline was computed. Figure 5-4: The two baselines of the Arabic word five: (a) the second baseline, (b) the main baseline The heuristic used for the baseline estimation consists of three main steps [RW92]: 1. Calculation of the vertical density histogram for the word image. 2. Baseline correction. 3. Slope correction. The first step is done by counting the number of black pixels in each horizontal line of the image. Then the baseline estimation follows by

109 5: A Pre-processing System for the Recognition of Off-Line AHW 96 rejecting the part of the image likely to be a hooked descender, such as occurs with the letter (-!.!/). Such descenders are indicated by the maximum peak in the vertical density histogram. Finally, the slope correction procedure is carried out as follows:
1. Calculate the slope: (a) find the lowest remaining pixel in each vertical scan line; (b) retain only the points around the minimum of each chain of pixels and discard the points that are too high; (c) find the line of best fit through these points.
2. Slope correction: (a) the image of the word is straightened to make the baseline horizontal by applying a shear transform parallel to the y-axis; (b) the baseline, height, and bounding rectangle of the cropped image are re-estimated, under the new assumption that the image is now horizontal.
Slant Correction The slant is the deviation of strokes from the vertical axis, which varies between words and between writers. In Figure 5-3 (c), a word is seen after correcting its slant. The slant of a word is estimated by finding the average angle of near-vertical strokes [RW92]. This is calculated by using an edge detection filter

110 5: A Pre-processing System for the Recognition of Off-Line AHW 97 to find the edges of strokes. This technique gives a chain of connected pixels representing the edges of strokes. The mode orientation of the edges that are close to the vertical is used to estimate the overall slant. The procedure for slant correction contains the following steps: 1. Thin the image and calculate its endpoints. 2. Find all near-vertical strokes by tracing from each endpoint above the baseline until another endpoint on the baseline is reached. 3. Calculate the average slant for all strokes. 4. Using a shear transform parallel to the x-axis, the slanted word can be corrected. 5. The bounding box and width of the image are re-estimated. Thinning Numerous algorithms have been proposed for thinning (also called skeletonizing) a plane region. This system uses a combined Zhang-Suen/Stentiford/Holt algorithm for thinning binary regions [Ch94]. Normalization Before finding the handwriting features of the word, the original word image can be normalized and encoded in a canonical form so that different images of the same word can be encoded similarly. The normalization task will reduce each word image to one consisting of vertical letters of uniform

111 5: A Pre-processing System for the Recognition of Off-Line AHW 98 height on a horizontal baseline, and made up of one-pixel-wide strokes. The width will be normalized to 64 pixels. Also, segments that do not contain any features are removed, which improves the recognition rate, as will be seen in the experimental results. See Figure 5-9 (b) for an example of width normalization, which is useful in finding the moment and pixel distribution features. 5.3 Finding Handwriting Features This section discusses the method used to represent the useful information contained in the image of the word. The choice of the feature extraction method limits or dictates the nature and output of the pre-processing step [SR98]. Since the word in this system is represented by a thinned pattern, or skeleton, most of the geometrical features used are suited to this representation. The features that capture topological and geometrical shape information, both globally and locally, are the most useful, while features that capture the spatial distribution of black pixels are also important. Some features related to the positional information of segments are also of value. A good mixture of these features was expected to perform well. After experimental investigations with classification results as the selection criterion, the following 29 features were chosen. The geometrical and topological features are: loops with the position of their intersections (up, down, left, right), four curve directions, right- and left-disconnection, four directions of long strokes and their position (above or below the baseline), endpoints, intersections, and the number and positions of endpoints and intersections. The moment features provide information about the global shape. The feature selection technique is empirical but largely reflects the structure of an Arabic handwritten word.

112 5: A Pre-processing System for the Recognition of Off-Line AHW 99 The system used in this thesis uses a skeleton coding scheme. The word is then segmented into frames. The horizontal density histogram is calculated and smoothed. The maxima and minima of the smoothed density histogram are found, and frame boundaries are defined to be the midpoints between adjacent maximum/minimum pairs. To ensure that the frames do not exceed a certain width, more frames are added where the maxima and minima are far apart; the maximum width is chosen according to the character height. Every vertical frame is then segmented into regions or rectangles. For each of these rectangles, four bins are allocated to represent line segments at different angles: vertical, horizontal, and the two directions at 45 degrees from these. The concept of frame segmentation and lines will be discussed in section 5.4. Figure 5-5: Two words with the features written on them
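A minimal sketch of the frame generation scheme just described (and detailed further in section 5.4) is shown below, assuming the pre-processed word is stored as a 2-D boolean array. The smoothing window and the short/long frame thresholds (minWidth, maxWidth) are illustrative choices rather than the thesis's exact values, and the extra check that avoids cutting through loops (section 5.4) is omitted.

import java.util.ArrayList;
import java.util.List;

class FrameSegmenter {

    /**
     * Segments a pre-processed word image into vertical frames using the
     * smoothed density histogram, returning the x-coordinates of the frame
     * boundaries. img[y][x] == true marks a black skeleton pixel.
     */
    static List<Integer> frameBoundaries(boolean[][] img, int minWidth, int maxWidth) {
        int height = img.length, width = img[0].length;

        // 1. Density histogram over columns, lightly smoothed with a 3-column average.
        double[] hist = new double[width];
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                if (img[y][x]) hist[x]++;
        double[] smooth = new double[width];
        for (int x = 1; x < width - 1; x++)
            smooth[x] = (hist[x - 1] + hist[x] + hist[x + 1]) / 3.0;

        // 2. Positions of the local maxima and minima of the smoothed histogram.
        List<Integer> extrema = new ArrayList<>();
        for (int x = 1; x < width - 1; x++)
            if ((smooth[x] - smooth[x - 1]) * (smooth[x + 1] - smooth[x]) < 0)
                extrema.add(x);

        // 3. Candidate boundaries: midpoints between adjacent maximum/minimum pairs.
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);
        for (int i = 0; i + 1 < extrema.size(); i++)
            bounds.add((extrema.get(i) + extrema.get(i + 1)) / 2);
        bounds.add(width - 1);

        // 4. Drop boundaries that would create frames narrower than minWidth and
        //    insert extra boundaries where a frame would be wider than maxWidth.
        List<Integer> result = new ArrayList<>();
        result.add(bounds.get(0));
        for (int i = 1; i < bounds.size(); i++) {
            int cur = bounds.get(i);
            while (cur - result.get(result.size() - 1) > maxWidth)
                result.add(result.get(result.size() - 1) + maxWidth);
            if (cur - result.get(result.size() - 1) >= minWidth || i == bounds.size() - 1)
                result.add(cur);
        }
        return result;
    }
}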

113 5: A Pre-processing System for the Recognition of Off-Line AHW 100 The performance of the recognizer can be improved by passing on more information about salient features in the word. A number of useful features can easily be discerned from the processing that has already been performed on the writing: endpoints, junctions, complementary characters, loops, and turning points. These salient features are defined in more detail in the sub-sections below (through to 5.3.6), which discuss methods for the detection of intersection points and endpoints, all of which operate on a skeletonized bit map. Outer Contour and Loops This system uses a class for detecting the outer contour and inner loops, in order to determine how many segments there are in the word and the area of the loops. The location of each blob can also be found. This procedure works very quickly since it deals with the original file's data. Figure 5-6: The blobs of the Arabic word ahad Locating Dots Dots above or below the letters can be identified with a simple set of rules. Short, isolated strokes occurring on or above the half-line are marked as potential dots. The number of dots and their location relative to the main skeleton of the character have to be identified in every frame. The number of dots can be one, two, or three, and

114 5: A Pre-processing System for the Recognition of Off-Line AHW 101 they can be below or above the main skeleton of the character. Dots are found by tracing every path from every endpoint. If the tracing reaches another endpoint and the path length is less than a threshold, the procedure has found a dot. If the path is more than one pixel long, the centre of the path is recorded as the dot feature. This is then added to the dots array and the endpoint feature is erased at that point (dot). From the contour, if the width of the dot is double its height, then the stroke is considered to be two dots. If the dot stroke has an east or south curve, it is treated as three dots. Locating Endpoints An endpoint is the end or start of a line segment. Endpoints are points in the skeleton with only one neighbour, which mark the ends of strokes, though some are artifacts of the skeletonization algorithm. Endpoints are found by examining all individual one-pixels in the skeletonized bit map image. As a consequence of skeletonization, an endpoint will have one, and only one, of its eight contiguous neighbours as a one-pixel. Therefore, if the sum of the eight neighbours is one, this is an endpoint. Junctions Junctions occur where two strokes meet or cross, and are easily found in the skeleton as points with more than two neighbours. The system proposed in this thesis uses the following algorithm: each of the one-pixels in the image is examined, and the number n of contiguous one-pixels around the focus pixel is counted. If the count n exceeds 2 (n >= 3), then the focus pixel is considered to be an intersection.
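A minimal sketch of the endpoint and junction tests just described, assuming the skeleton is stored as a 2-D boolean array in which true marks a one-pixel; the image border is assumed to contain only background so that the neighbour loop stays in bounds.

class SkeletonPoints {

    /** Counts the eight-connected one-pixel neighbours of (x, y) in a skeleton image. */
    static int neighbourCount(boolean[][] img, int x, int y) {
        int count = 0;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                if ((dx != 0 || dy != 0) && img[y + dy][x + dx]) {
                    count++;
                }
            }
        }
        return count;
    }

    /** An endpoint is a one-pixel with exactly one neighbour. */
    static boolean isEndpoint(boolean[][] img, int x, int y) {
        return img[y][x] && neighbourCount(img, x, y) == 1;
    }

    /** A junction (intersection) is a one-pixel with three or more neighbours. */
    static boolean isJunction(boolean[][] img, int x, int y) {
        return img[y][x] && neighbourCount(img, x, y) >= 3;
    }
}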

115 5: A Pre-processing System for the Recognition of Off-Line AHW 102 Figure 5-7: Four turning points in different directions: (a) top, (b) down, (c) left, and (d) right. Turning Points Points where a skeleton segment changes direction from upward to downward are recorded as top turning points. Similarly, left, right, and bottom turning points can be found, as illustrated in Figure 5-7. Turning points are detected by using multiple fixed windows to examine the variation in the coordinate values of the start, mid, and end points of the curve. It is worth noting that in the input image x increases from left to right and y increases from top to bottom. Table 5-1 summarizes the curve categorization using these coordinates. Table 5-1: The curve categorization using the coordinates. East: x-value start > mid < end. South: y-value start > mid < end. West: x-value start < mid > end. North: y-value start < mid > end.
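A minimal sketch of the categorization in Table 5-1, assuming the start, mid and end points of a traced curve segment are given in image coordinates (x increasing to the right, y increasing downwards); the x tests are checked first, and a curve matching none of the four patterns is reported as NONE.

class CurveCategorizer {

    enum Direction { EAST, SOUTH, WEST, NORTH, NONE }

    /** Categorizes a curve from its start, mid and end coordinates, following Table 5-1. */
    static Direction categorize(int xs, int ys, int xm, int ym, int xe, int ye) {
        if (xs > xm && xm < xe) return Direction.EAST;   // x has a minimum at mid
        if (xs < xm && xm > xe) return Direction.WEST;   // x has a maximum at mid
        if (ys > ym && ym < ye) return Direction.SOUTH;  // y has a minimum at mid (topmost point)
        if (ys < ym && ym > ye) return Direction.NORTH;  // y has a maximum at mid (lowest point)
        return Direction.NONE;
    }
}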

116 5: A Pre-processing System for the Recognition of Off-Line AHW Right and Left Disconnection This feature should be determined after the segmentation stage to identify whether the frame or character is disconnected on its right or left side. This feature is added for letters such as ( = < and = > ) and is calculated as follows: set both the right and left disconnection features to false for each frame; set the right disconnection feature to true for the first frame; set the left disconnection feature to true for the last frame; then, from the first frame to the one before last, examine a vertical line below baselines 1 and 2, and if all of its pixels are zero, set the left disconnection feature of the current frame and the right disconnection feature of the next frame to true. Detect Strokes The purpose of this process is to detect strokes in each segment. The strokes or line segments can be vertical, horizontal, or at 45 degrees from these (see Figure 5-8). Within this framework, the lines of the skeleton image are coarse coded as follows. The one-pixel-wide lines of the skeleton are followed, and wherever a new grid cell is entered, the section in the previous box is coded according to its angle. Segments that are not perfectly aligned with the discrete angles of the lines contribute to the bins representing the two closest orientations.

117 5: A Pre-processing System for the Recognition of Off-Line AHW 104 (a) (b) (c) (d) Figure 5-8: The four stroke directions detected in this research for an Arabic word: (a) horizontal, (b) vertical, (c) positive or back diagonal, (d) negative or diagonal. Pixel Distribution These features are widely used under different names. After all, a binary character image can be described by the spatial distribution of its black pixels. The width-to-length proportion can be used to differentiate characters from each other. To compute the pixel distribution features, the segment image is divided into two zones according to the baseline. The pixel distribution features are computed by counting the number of black pixels in the whole segment, B_all, and then in each zone, represented by B_upper and B_lower respectively. The first pixel distribution feature is the percentage of black pixels relative to all pixels: f_22 = B_all / (B_all + w_all) (Eq. 5-1), where w_all represents the number of white pixels in the image. The next feature represents the percentage of black pixels in the lower zone: f_23 = B_lower / (B_lower + w_lower) (Eq. 5-2),

118 5: A Pre-processing System for the Recognition of Off-Line AHW 105 where w_lower represents the number of white pixels in the lower zone of the image. The next feature represents the percentage of black pixels in the upper zone: f_24 = B_upper / (B_upper + w_upper) (Eq. 5-3), where w_upper represents the number of white pixels in the upper zone of the image. The next feature represents the proportion of black pixels in the lower zone relative to the black pixels in both zones: f_25 = B_lower / B_all (Eq. 5-4). The next feature represents the proportion of black pixels in the upper zone relative to the black pixels in both zones: f_26 = B_upper / B_all (Eq. 5-5). The features are normalized so that they are represented in the range from 0 to 1. Figure 5-9: The Arabic word five after (a) contour extraction and thinning, (b) width normalization, and (c) segmentation
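A minimal sketch of the pixel distribution features of Eq. 5-1 to Eq. 5-5, assuming the segment is a binary array and that the baseline row separating the upper and lower zones is already known; guarding against empty zones is omitted for brevity, and the Eq. 5-2 to Eq. 5-4 labels follow the numbering inferred above.

class PixelDistribution {

    /**
     * Computes the five pixel distribution features f22..f26 for a binary
     * segment, where img[y][x] == true marks a black pixel and baselineRow
     * separates the upper zone (y < baselineRow) from the lower zone.
     */
    static double[] pixelDistributionFeatures(boolean[][] img, int baselineRow) {
        int blackAll = 0, blackUpper = 0, blackLower = 0;
        int whiteAll = 0, whiteUpper = 0, whiteLower = 0;

        for (int y = 0; y < img.length; y++) {
            for (int x = 0; x < img[y].length; x++) {
                boolean black = img[y][x];
                boolean upper = y < baselineRow;
                if (black) { blackAll++; if (upper) blackUpper++; else blackLower++; }
                else       { whiteAll++; if (upper) whiteUpper++; else whiteLower++; }
            }
        }

        double f22 = blackAll   / (double) (blackAll + whiteAll);      // Eq. 5-1
        double f23 = blackLower / (double) (blackLower + whiteLower);  // Eq. 5-2
        double f24 = blackUpper / (double) (blackUpper + whiteUpper);  // Eq. 5-3
        double f25 = blackLower / (double) blackAll;                   // Eq. 5-4
        double f26 = blackUpper / (double) blackAll;                   // Eq. 5-5
        return new double[] { f22, f23, f24, f25, f26 };
    }
}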

119 5: A Pre-processing System for the Recognition of Off-Line AHW Moments Features Moments can be used, like other feature extraction techniques, to differentiate characters from each other. The moment features capture global shape information. Let the coordinates of a generic black pixel be represented by (u, v). For the 2-D binary image of a segment (which could be a complete or partial character), the central moments are given by µ_pq = (1/N) Σ_(i=1..N) (u_i - u_mean)^p (v_i - v_mean)^q (Eq. 5-6), where u_mean = (1/N) Σ_(i=1..N) u_i and v_mean = (1/N) Σ_(i=1..N) v_i (Eq. 5-7), and N is the total number of black points in the image. The following three size-, rotation- and translation-independent features based on moments are described in [Ch94, Th88] as M_2/r^4, M_3/r^6, and M_4/r^6 (Eq. 5-8), where r = (µ_20 + µ_02)^(1/2) (Eq. 5-9), M_2 = (µ_20 - µ_02)^2 + 4µ_11^2 (Eq. 5-10), M_3 = (µ_30 - 3µ_12)^2 + (3µ_21 - µ_03)^2 (Eq. 5-11), and M_4 = (µ_30 + µ_12)^2 + (µ_21 + µ_03)^2 (Eq. 5-12).
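A minimal sketch of the central moments of Eq. 5-6 and Eq. 5-7 for a binary segment image; the invariant combinations of Eq. 5-8 to Eq. 5-12 can then be formed from the returned values, for example r = Math.sqrt(centralMoment(img, 2, 0) + centralMoment(img, 0, 2)).

class CentralMoments {

    /**
     * Central moment mu_pq of a binary segment (Eq. 5-6), where img[v][u] == true
     * marks a black pixel; u is the column and v the row coordinate.
     */
    static double centralMoment(boolean[][] img, int p, int q) {
        double sumU = 0, sumV = 0;
        int n = 0;
        for (int v = 0; v < img.length; v++) {
            for (int u = 0; u < img[v].length; u++) {
                if (img[v][u]) { sumU += u; sumV += v; n++; }
            }
        }
        if (n == 0) return 0.0;
        double meanU = sumU / n, meanV = sumV / n;   // Eq. 5-7

        double mu = 0.0;
        for (int v = 0; v < img.length; v++) {
            for (int u = 0; u < img[v].length; u++) {
                if (img[v][u]) {
                    mu += Math.pow(u - meanU, p) * Math.pow(v - meanV, q);
                }
            }
        }
        return mu / n;
    }
}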

120 5: A Pre-processing System for the Recognition of Off-Line AHW Zonal Features Some topological and geometrical features are more useful when they are associated with their zonal information, as is the case with dots, loops, and strokes. 5.4 Segmentation Stage The positions of complementary characters, endpoints, turning points, and junctions are useful, and they are recorded along with the line-segment features for each of the horizontal strips. For loops, it is only useful to know whether one is present in a particular frame. Thus, for each segment in the frame, twelve features are encoded (four line-segment angles, four turning points, junction, dots, endpoint and loop). Figure 5-10: Horizontal histogram and segmentation of words into frames The frame generation procedure contains the following main steps, which are carried out in the segmentation of Arabic words:

121 5: A Pre-processing System for the Recognition of Off-Line AHW 108 a. Calculate the horizontal histogram for the word image (after the pre-processing stages), as seen in Figure 5-10. b. Find the maximum peaks in the horizontal histogram. c. Find the minimum peaks in the horizontal histogram. d. Make the frames by adding the minimum and maximum peaks to the same array (the frame array). e. Check for short frames and, if there are any, remove them. f. Check for long frames and, if there are any, segment them. After this stage, the image of the word is segmented into frames, as seen in Figure 5-10. By testing the previous algorithm on some Arabic words, it was found that some characters are segmented into more than one frame, so one more step should be added. This step, which is easy to implement, is to avoid segmenting within any loop in a character. Finally, the word image is segmented into frames, so that the features described above can be assigned to the frames in which they occur. 5.5 Conclusion Off-line handwriting recognition is the automatic transcription by computer of handwriting, where only the image of the handwriting is available. In this chapter, pre-processing techniques are described, including segmentation and normalization of word images to give tolerance to scale, slant, slope, and stroke thickness. Representation of the image is discussed, and the skeleton and stroke features which are used are also described.

122 5: A Pre-processing System for the Recognition of Off-Line AHW 109 Also, other feature detection techniques are described in a new way, with a segmentation process, to detect the place of the segment in the word (e.g. above baseline 2, between baseline 1 and baseline 2, and under baseline 1). The features detected and extracted in this chapter were entered as input into an HMM for classification. The next chapter discusses how to classify the segments into words. The features retrieved from the procedures in this chapter were also fed into a classifier in Chapter 7.

123 6.1 System Overview The operation of the complete classification process for a handwriting recognition system for a single Arabic word, from the feature extraction of the handwritten Arabic word in the database to the output of the recognized word, is described in this chapter. The handwritten word is normalized to be presented in a more informative manner by the pre-processing stage (Chapter 5), then recognition is carried out to identify the word. This is done by estimating the data likelihoods for each frame of data in the representation using a vector quantization (VQ) method. The previous chapter described the operation of pre-processing and discussed the features used in such systems. In this chapter, the HMM classifier is discussed, which classifies the features captured from the word image. The states and symbols of the HMMs are also presented in this chapter. Finally, the

124 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 111 training and recognition phases in this hidden Markov model classifier are discussed. Figure 6-1: Feature vector for the HMM classifier (29 features: endpoints, strokes to endpoints, strokes to intersections, curves, Alef (Arabic character), XAlef (Arabic character), loop, Waw or Ra (Arabic characters), left disconnection, right disconnection, upper dots, lower dots, pixel distribution features, and moments, all normalized from 0 to 1) 6.2 Pre-processing The main advantage of pre-processing the handwritten word image is to organize the information to make the subsequent task of recognition simpler. The main part of the pre-processing stage is normalization, which attempts to remove some of the variations in images that do not affect the identity of the word [SR98]. The previous chapter described a system

125 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 112 which incorporates normalization for each of the following factors: stroke width, slope, and height of the letters. The normalization task reduces each word image to one consisting of vertical letters of uniform height on a horizontal baseline made up of one-pixel-wide strokes. In this system, the word image is loaded and cropped; then the slant and slope of the word are corrected, the image is thinned, and features are calculated to represent the useful information contained in the image of the word, as described in Chapter 5. Finally, the word is segmented into frames, so that these features can be distributed over the frames. Figure 6-2: Training and testing phases in the HMM classifier (training phase: feature vector quantization, forward-backward algorithm, HMM parameters; testing phase: feature vector quantization, Viterbi algorithm) 6.3 Features Used To represent each segment image in a more compact form, i.e. a feature vector, a mathematical model of such images with a finite number of

126 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 113 parameters is desired. Unfortunately, no satisfactory mathematical model currently exists for such data. As a result, there is no universally accepted set of feature vectors in document image understanding. It is clearly understood, however, what specific information the features should capture. Features that capture topological and geometrical information, both globally and locally, are the most useful. Features of the spatial distribution of black pixels (assuming that white pixels represent the background) are also very important, as are features relating the position of the small segments (dots and diacritical markings) to the baseline. A good mixture of these features is expected to work well. Based on the experimental results, 29 features are used (Figure 6-1). The feature vector includes moment features, regional and global topological features, and the spatial distribution of pixels. While the moment features provide information about the global shape, the topological features capture many of the regional and local attributes of the shape of the character. The features related to the spatial distribution of pixels capture information about writing style variations. Needless to say, this feature selection technique is empirical and largely subjective. There were two sets of features used in this work: general features extracted from the whole word, and features extracted for each segment of the word. For the HMM in this chapter, the features extracted for each segment are used. In the next chapter the general features are used and discussed. Figure 6-3 shows the features in each segment of different Arabic words. Each line represents the 29 features of each segment in the word, starting from the right-hand side.

127 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 114 ahad001.tif six001.tif ifty001.tif Figure 6-3: Examples of feature vectors in different Arabic words

128 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 115 Table 6-1: Arabic words without dots and other diacritical markings Name Isolated Start Middle End Alif Ba, Ta, Tha, Noon, and Ya Jeem, Hha, and Kha! ( ) * + Dal and Thal 2 Ra and Zay 5 Sad and Tta and Za F G H I Ain and Ghain Fa and Qaf R Kaf X Y Z Lam [ \ ] Meem ^ _ ` The end part of seen, sheen and Noon Ha d e f Waow g Ya j Noon2, and Ha2 ~ ~- Ha_Meem Meem2 _, _a LamAlif T c

129 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model HMM Classifier This section briefly describes some of the details of the implementation of the HMM-based Arabic handwriting recognition used in this thesis. The sequence of feature vectors is used to train the HMMs in the training mode and to recognize unknown words in the recognition mode. The output is either the correctly recognized word or a small set of words that includes the correct word. States and Symbols for Handwritten Words Throughout this research the system was used to classify words without dots. Instead of having all 120 of the Arabic letter states, the system has 55 states, which represent true Arabic letters without dots. Since the recognition system is implemented on the lexicon used on Arabic cheques, the words stored in the database (see Chapter 4) do not contain all of the classes in Table 6-1. The new model originally contained only 34 states. However, this was changed through experimentation, because many handwritten segments can look similar. By training and testing the classification system, the parameters chosen at the first stage were changed, as described in Chapter 8. As the writing style varies amongst individuals, there are many possible representations of the same segment of a letter. This might suggest a Markov model with an increased number of states. However, different letters may look alike, and this occurs a lot in Arabic writing, as explained in the experimental results presented in Chapter 8. The Calculation of Model Parameters The HMM is denoted by the compact notation λ = {Π, Α, Γ, Β}. The 34 letters or sub-letters of the alphabet are defined as the states of the HMM, and the

130 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 117 initial state π_i, transition a_ij, last-state γ_j, and symbol b_j(k) probabilities are initialized as: π_i = (number of words beginning with s_i) / (total number of words in the dictionary) (Eq. 6-1); a_ij = (number of transitions from s_i to s_j) / (number of transitions from s_i) (Eq. 6-2); γ_j = (number of words ending with s_j) / (total number of words in the dictionary) (Eq. 6-3); and b_j(k) = υ_j(k) / υ_j (Eq. 6-4). In these equations, υ_j(k) is the total number of times the symbol v_k is observed while the state is s_j, and υ_j is the total number of times the letter or segment depicting state s_j appears. 6.5 The Scoring Problem Given an observation sequence O = o_1, ..., o_T and a model λ = {Π, Α, Γ, Β}, how do we find P(O|λ)? This is the scoring problem. One can find P(O|λ) using the forward-backward (FB) algorithm [DHP00].
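Before training, the model parameters can be initialized by the counts of Eq. 6-1 to Eq. 6-4. The following minimal sketch assumes each dictionary word is available as a sequence of state (letter) indices and that each training sample provides an aligned sequence of states and quantized symbols per frame (the thesis does not spell out this alignment here); the output arrays are assumed to be zero-initialized and no smoothing of zero counts is applied.

class HmmInitializer {

    /**
     * Counting-based initialization of the HMM parameters (Eqs. 6-1 to 6-4).
     * dictStates:  for each dictionary word, its sequence of state (letter) indices
     * obsStates:   for each training sample, the state index of every frame
     * obsSymbols:  for each training sample, the quantized symbol of every frame
     */
    static void initialize(int N, int M,
                           int[][] dictStates, int[][] obsStates, int[][] obsSymbols,
                           double[] pi, double[][] A, double[] gamma, double[][] B) {
        double[] transFrom = new double[N];
        double[] stateCount = new double[N];

        for (int[] word : dictStates) {
            pi[word[0]]++;                                   // Eq. 6-1 numerator
            gamma[word[word.length - 1]]++;                  // Eq. 6-3 numerator
            for (int t = 0; t + 1 < word.length; t++) {      // Eq. 6-2 counts
                A[word[t]][word[t + 1]]++;
                transFrom[word[t]]++;
            }
        }
        for (int s = 0; s < obsStates.length; s++) {         // Eq. 6-4 counts
            for (int t = 0; t < obsStates[s].length; t++) {
                B[obsStates[s][t]][obsSymbols[s][t]]++;
                stateCount[obsStates[s][t]]++;
            }
        }

        int W = dictStates.length;                           // words in the dictionary
        for (int i = 0; i < N; i++) {
            pi[i] /= W;
            gamma[i] /= W;
            for (int j = 0; j < N; j++)
                if (transFrom[i] > 0) A[i][j] /= transFrom[i];
            for (int k = 0; k < M; k++)
                if (stateCount[i] > 0) B[i][k] /= stateCount[i];
        }
    }
}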

131 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 6.6 The Training Problem Given the training sequence O = o_1, ..., o_T, one wants to adjust the model parameters λ = {Π, Α, Γ, Β} such that P(O|λ) is maximized. An iterative procedure known as the Baum-Welch algorithm was used as the optimization procedure for finding the Maximum Likelihood (ML) estimate (Appendix A). In general, HMMs can be trained by the Baum-Welch algorithm with satisfactory performance [AMG94]. 6.7 Recognition Phase The feature vectors are vector quantized into the possible vectors given by the codebook. The resulting observation sequence and the probability measures for all the HMMs, given by the cell-arrays A_m, B_m, and pi_m, are then used to calculate the log-likelihoods for the HMMs (the column logp). The word associated with the HMM with the highest log-likelihood is declared to be the recognized word, and its index is returned in a matrix. This procedure is repeated for all the words in the data. The Viterbi algorithm was used in the first stage of this research to carry out this calculation [Fo73] (see Appendix A). Then the Modified Viterbi Algorithm (MVA) [BWB94] was used, which can solve the recognition problem and find an ordered list of the best L state sequences (see Appendix A) rather than a single result. 6.8 Conclusion Hidden Markov Models (HMMs) have been used with some success in recognizing printed Arabic words. In this chapter, a complete scheme for totally unconstrained Arabic handwritten word recognition based on a

132 6: Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model 119 model discriminant HMM was presented. A system able to classify Arabic handwritten words from 100 different writers was proposed and discussed. In the previous chapter some of the variations in the images that do not affect the identity of the handwritten words were removed (using slant correction, slope correction, normalization, etc.). Next, the skeleton and edges of the words were coded so that feature information about the lines in the skeleton could be extracted. In this chapter a classification process based on the HMM approach is used. The output is a word in the dictionary. A detailed experiment is carried out and successful recognition results are reported (Chapter 8).

133 "! T he previous chapter described a system to classify features of handwritten words. Because Arabic handwriting is difficult to classify, more trials are examined in this chapter with the goal of raising the recognition rate. From the confusion matrix in the results of experements presented in the last chapters (see section 8.5), some words which have been mis-classified as other words can easily be distinguished from each other by specific general features, so at first the ID3 classifier was used to classify groups of words that have similar features. Several ways were examined to find a suitable group of features (as described in Chapter 8) to be able to classify the Arabic-handwritten words of about 100 different writers, a classification process based on the rule-based classifier is used as a global recognition engine to classify words into eight groups. Then, for each group, the HMM approach is used for classification. The output is a word in the dictionary. This approach raised the recognition rate, as will be explained in Chapter 8.

134 7: Multiple Hidden Markov Models Classifier 121 Figures 7-1 through to 7-7 show examples of single words written in the different allographs and styles identified in this thesis. Because Arabic writers write the same handwritten word in different allographs and styles, it is difficult to have one single classifier or set of features that gives a reasonable recognition rate. For example, two or more differently written words can have the same meaning. By training and testing it has been found that some words in the database are clearly distinguished by a single set of features, giving almost a 100% recognition rate for words that are written in one kind of writing style. As a result, two substantially different recognition engines, both based on a segmentation-free approach, were developed. The first engine is a global feature scheme using some ascender and descender features and making use of a rule-based classification engine. The second scheme is based on a set of features using an HMM classifier. Figure 7-1: The Arabic word nine written in different allographs and styles

135 7: Multiple Hidden Markov Models Classifier 122 Figure 7-2: The Arabic word one written in different allographs and styles Some researchers seem to agree that the future of reliable recognition systems lies in the combination of multiple classifiers [PL93]. As such, these two different recognition engines were incorporated into one hybrid system. The engine with global features is first used to dynamically reduce the original lexicon to a restricted number of possible choices for each input word. Then the second engine is used to accurately recognize the handwritten data. Figure 7-3: The Arabic word eighty written in different allographs and styles

136 7: Multiple Hidden Markov Models Classifier 123 Figure 7-4: The Arabic word fifty written in different allographs and styles Figure 7-5: The Arabic word hundred written in different allographs and styles

137 7: Multiple Hidden Markov Models Classifier 124 Figure 7-6: The Arabic word ninety written in different allographs and styles Figure 7-7: The Arabic word no written in different allographs and styles

138 7: Multiple Hidden Markov Models Classifier ID3 Classifier As described in Chapter 3, ID3 is a decision-tree learning system that can handle noisy data and can recognize an object by testing its values for certain properties. The system has two phases, training and testing, as illustrated in Figure 7-8. The global features described in Chapter 5 are used: upper dots, lower dots, the number of Alefs, the number of x-Alefs, the number of Waws, the number of segments, and loops. The ID3 classifier is used to classify words into groups of words or into a single word, as can be seen in the experimental results in Chapter 8. So more than one ID3 tree was built and several feature vectors were tested (a detailed experiment is described in section 8.5). The database of handwritten words was split into two parts, for training and testing, as described in the following sub-sections. Figure 7-8: The ID3 classifier (training phase: pre-processing, feature extraction, training feature vectors, building the ID3 tree; testing phase: testing feature vectors are passed through the ID3 tree to give the result)
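The ID3 trees in this thesis were built with the Weka implementation (see Chapter 8). Purely as an illustration of the information-based splitting heuristic reviewed in Chapter 3, the following sketch computes the entropy of a set of class labels and the information gain of a nominal attribute; the attribute with the largest gain would be chosen as the split.

import java.util.HashMap;
import java.util.Map;

class InformationGain {

    /** Entropy (in bits) of a set of class labels. */
    static double entropy(int[] classLabels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int c : classLabels) counts.merge(c, 1, Integer::sum);
        double h = 0.0, n = classLabels.length;
        for (int count : counts.values()) {
            double p = count / n;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Information gain of splitting on a nominal attribute with values 0..numValues-1. */
    static double gain(int[] classLabels, int[] attributeValues, int numValues) {
        double n = classLabels.length;
        double remainder = 0.0;
        for (int v = 0; v < numValues; v++) {
            // Collect the class labels of the examples with attribute value v.
            int size = 0;
            for (int a : attributeValues) if (a == v) size++;
            if (size == 0) continue;
            int[] subset = new int[size];
            for (int i = 0, j = 0; i < classLabels.length; i++)
                if (attributeValues[i] == v) subset[j++] = classLabels[i];
            remainder += (size / n) * entropy(subset);
        }
        return entropy(classLabels) - remainder;  // ID3 splits on the attribute with the largest gain
    }
}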

139 7: Multiple Hidden Markov Models Classifier 126 Figure 7-9: The global features vector (U_dots, L_dots, no_alef, no_xalef, no_waw, N0outBlobs, Loops) Training and Testing Sets Although a (50:50) split is usually used for training and testing data, a two-to-one (2:1) split is used in this research due to the limited amount of available data, despite a quite large amount having been collected. Two thirds of the data were used for training and building a suitable ID3 decision tree giving better recognition rates. One third of the data was used for testing the ID3 decision tree. 7.2 Multiple Hidden Markov Models This section describes the operation of the complete classification process for a handwriting recognition system for a single Arabic word, from the features extracted from the handwritten Arabic word in the database to the output of the recognized word. This is done by firstly using the global features engine to reduce the original lexicon. Some words can appear in more than one group, to reduce the error rate in the global classifier. Then the HMM recognizer (Chapter 6) is applied to further reduce the lexicon. The data likelihoods for each frame of data in the representation are then estimated using a vector quantization (VQ) method. The proposed system has two main advantages. Firstly, it deals with similar Arabic words with different meanings. Secondly, it takes advantage of the position of the features in the character or sub-character, to be used by the VQ and HMM classifier. The following sub-sections describe the operation of the global word classifier, which classifies the word into a suitable word group, and the HMM classifier, which classifies the features captured from the word image.

140 7: Multiple Hidden Markov Models Classifier 127 Figure 7-10 illustrates the recognition system, which is based on the combination of two rather different segmentation-free engines based, respectively, on global features and HMM schemes. In the recognition phase of this system the global features are used to classify a word into a suitable group, say group-x; then the local features for this word are used to classify it using the HMMs for group-x (the HMMs trained for group-x). The result is a word in the dictionary (see Figure 7-11 for an illustration of the steps). In the next sub-sections these steps are discussed. Figure 7-10: Recognition of off-line handwritten Arabic words using multiple Hidden Markov Models (word features are passed to the global recognition stage, which classifies them into one of the eight groups, and the words are then classified within that group) Global Classifier In the classification scheme, the global and local features have different roles. The global features are used for the global rule-based classifier

141 7: Multiple Hidden Markov Models Classifier 128 which classifies data into groups of words, whereas the local features (Chapter 6) are used for the HMM. From [HK91] it was found that the recognition rate was low because some words conflict with each other. Words that conflict with each other can be distinguished by three main features: the number of upper dots, the number of lower dots, and the number of segments. By using these three features and setting each feature to 1 or 0, the result is 2^3, which equals eight groups. The words belonging to each group can be seen in Table 7-1; some words are repeated in several groups. Figure 7-10 describes the relationship between words and groups in the training phase. Table 7-1: Group names and the number of words in each group: Group1 (1 word), Group2 (2 words), Group3 (11 words), Group4 (3 words), Group5 (2 words), Group6 (18 words), Group7 (15 words), Group8 (8 words).
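A minimal sketch of this grouping rule, assuming each of the three features is first reduced to a binary indicator; the exact thresholds and the numbering of the groups in Table 7-1 are not reproduced here, so the mapping below is illustrative only.

class WordGrouper {

    /**
     * Maps a word to one of the 2^3 = 8 groups from three binary indicators
     * derived from the global features: upper dots present, lower dots present,
     * and more than one segment (outer blob) present.
     */
    static int groupIndex(int upperDots, int lowerDots, int numSegments) {
        int hasUpper = upperDots   > 0 ? 1 : 0;
        int hasLower = lowerDots   > 0 ? 1 : 0;
        int multiSeg = numSegments > 1 ? 1 : 0;
        // Treat the three bits as a binary number, giving a group index from 0 to 7.
        return (hasUpper << 2) | (hasLower << 1) | multiSeg;
    }
}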

Figure 7-11: A word recognition using local and global features. A word's global features vector (U_dots, L_dots, no_alef, no_xalef, no_waw, N0outBlobs, Loops) transfers the word into a suitable group (say group-x) that it might belong to, using a global set-of-rules classifier; the word's local features vector is then classified using the group-x HMMs λx = {Πx, Αx, Γx, Βx} that maximize P(O|λx) for group-x, followed by the local grammar, giving the classified word.

Local Classifier

The local classifier classifies the groups into words using a different HMM for each group, as described in Chapter 6.

However, in the new system each group will have a different set of parameters: the number of states, the number of symbols, and the stopping criteria. These parameters were chosen by training and testing on the data; the set of parameters that gave the highest recognition rate was selected.

7.3 Local Grammar

In Arabic handwriting the same word can be written in different ways. The 47 words in the lexicon are actually 25 words that have the same meaning but are written in slightly different ways (see the examples in Figures 7-1 through 7-7). So a local grammar is used to identify words that are mis-classified as words that have the same meaning.

7.4 Conclusion

In this chapter a global classifier based on global features (features that are taken from words without any segmentation) has been described. Then a novel approach based on combining local and global classifiers, using local and global feature vectors, was explained. Detailed experimental results are given in the next chapter, in sections 8.5 and 8.6.

Chapter 8: EXPERIMENTAL RESULTS

The research can be divided into four main steps, as illustrated in Figure 8-1. The first step was the database building stage, the second the pre-processing stage, the third was feature extraction, and the fourth was classification. This chapter describes the experimental details of each of those steps, as well as the experimental tools and software involved in each step.

Figure 8-1: Stages of this research (DB, pre-processing, feature extraction, classification)

8.1 Experimental Tools

The word image acquisition in this system was carried out using a Hewlett Packard 6350 scanner. In this experiment, the Pronote computer (IBM compatible) that was used had the following specifications: Microsoft Windows NT system, year 2000 compliant, Pentium III processor

with 128 MB of RAM. For multiple HMMs there was an insufficient-memory problem, so the RAM was increased to 192 MB.

8.2 Software Used

The Arabic words were segmented using a graphics program (Photo Shop for Windows) and then saved in files to be used as the testbed for this system. For image loading, the images in the AHDB database were first converted from TIFF group8 to TIFF group4 using the PolyView software. As mentioned earlier in Chapter 4, the data was stored in TIFF format, but there were problems with some of the conversion procedures (see Figure 8-7). Some images were corrupted with a long straight line at the far right side (because of the image conversion from TIFF group8 to TIFF group4), so this line was removed in the pre-processing stage. However, a few images remained that included this line and needed to be corrected with a photo editor. These words were removed from the training and testing stages for each classification experiment. Figure 8-7 shows an example of such words. To retrieve the images the system used Lizard's TIFF library for Java [LW0]. For the pre-processing, segmentation, and feature extraction techniques, a program was written in Java. It is capable of showing the pre-processing steps graphically, so that any shortcoming in the implementation of the techniques can be quickly observed and acted upon. The Weka program [WF00] was used to classify features into possible words using ID3. MATLAB was used for HMM classification because it is a core programming language and environment around which a variety of tools are easily developed to support mathematically based analysis.

It also minimizes the time needed to obtain numerical and graphical solutions to a problem, rather than minimizing computer memory and execution time. It is thus a tool that can be used to analyse and solve problems, rather than a language to produce executable applications [Ha99].

8.3 Experimental Details

The recognition process described here has four main steps with sub-steps as follows: database building (form scanning, data capture and image loading, sections 8.3.1 and 8.3.2), pre-processing (section 8.3.3), and feature extraction (baseline detection, slant and slope correction, thinning, feature extraction, segmentation and normalization, sections 8.3.4 to 8.3.9). The three main methods of implementing the fourth step, namely HMM, ID3, and multiple HMMs, are described in sections 8.4, 8.5 and 8.6 respectively. Each step in this research has its own experimental details, discussed in the following sub-sections.

Figure 8-2: Colour dropout using software: a) scanned image, b) after applying blue channel mode, c) the image after the stamp filter

Figure 8-3: Colour dropout using hardware

Forms Scanning

Several different methods of scanning were tried before actually scanning the completed forms. The first forms were scanned in colour mode (600 dpi), and then the blue channel mode was applied using Photo Shop. However, the images still had a grey shadow in place of the blue colour in the forms. Subsequently, a stamp filter was applied, with good results (Figure 8-2). However, there are two disadvantages of colour scanning: the first is that it is more time-consuming; the second is that it takes up excessive storage space. Finally, the forms were scanned in black and white using the blue channel as a mask (hardware mask). The results look similar to those of the first approach, but were obtained faster and using much less memory (Figure 8-3). One hundred and five forms were scanned using a Hewlett Packard 6350 scanner. The images were scanned at 600 dpi, requiring about one minute per image for scanning and colour dropout.

Figure 8-4: Words with touching characters

Data Capture and Image Loading

Samples of about 500 Arabic words were gathered and stored in separate files. Each word image is loaded using a ready-made class library. The image object class library is linked with this system, allowing functions derived from this class to be used for loading. Finally, a cropping function is applied to each word to facilitate the subsequent pre-processing steps. The Lizard TIFF library currently handles raw, packbits, G3, G4, and JPEG decompression (section 4.4). This library decompresses the file to a java.awt.Image, which requires a byte array for all the pixels. Java limits the amount of memory to 8 MB, so memory will run out for images larger than letter size with this library.

Figure 8-5: Dot above the last left character (0, noon) and below the real baseline

Figure 8-6: Over-segmented words

Figure 8-7: Error from file transformation

Figure 8-8: Wrong baseline for different Arabic words

Figure 8-9: Dots inside loops in the character waw in the word 2 (Wahed, one)

Figure 8-10: Arabic letter T[ (Alef) mistakenly classified as a complementary character

Figure 8-11: Complementary characters above the Arabic letter T[ (Alef)

Figure 8-12: Example of overwritten or unwritten dots in the word ch5< (Twenty)

Pre-processing

The following are some remarks on the different steps of the experiments carried out during this research, with comments on the problems faced and the results obtained. The site contains most of the word images after pre-processing, and Appendix B lists a few examples of those images.

Baseline Detection

For baseline detection, it was found that choosing the lower baseline that has the maximum peak in the vertical density histogram is more accurate than choosing minimum peaks below this point, since Arabic characters are connected along the baseline. In some cases baseline detection failed. This may especially be the case in words that contain the character ( ra, /), like (tun, un four). Figure 8-8 shows examples of words with incorrect baseline detection, where the baseline is calculated above the true baseline. Since this system deals with calculating the baseline for a single word, in a practical system this problem could be overcome by calculating the baseline over a line of several words.

Slant and Slope Correction

Slant and slope correction was carried out on each word. This was done after calculating the slope of the word along the baseline. Some words were not corrected properly because the characters of the word were not written in the same direction. Some examples in Figure 8-8 illustrate the error in slope correction for some words.
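A minimal sketch of the projection-based baseline detection described in the Baseline Detection step above is given here, assuming a binary word image in which true marks an ink pixel. The class and method names are illustrative, not the thesis code; picking the row with the maximum projection follows from the observation that Arabic characters are connected along the baseline.

```java
// Illustrative baseline detection on a binary word image:
// the baseline row is taken as the image row with the maximum
// number of ink pixels (the peak of the projection profile).
public class BaselineDetector {

    static int baselineRow(boolean[][] image) {
        int bestRow = 0;
        int bestCount = -1;
        for (int y = 0; y < image.length; y++) {
            int count = 0;
            for (int x = 0; x < image[y].length; x++) {
                if (image[y][x]) count++;          // count ink pixels in this row
            }
            if (count > bestCount) {               // keep the densest row so far
                bestCount = count;
                bestRow = y;
            }
        }
        return bestRow;
    }
}
```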

Thinning

The Zhang-Suen/Stentiford/Holt thinning algorithm [Pa97] was applied (section 5.2.4), as this algorithm was tested by Melhi et al. [MIB01] and shown to be the best performing thinning algorithm. The skeleton of the word after thinning is shown in Figure 8-12, which illustrates examples of words after thinning. Some extra details (short lines below words and short lines far above the baseline) were deleted from the skeleton of the word after applying the thinning algorithm, because this enhanced the thinning and classification results.

Feature Extraction

On the site one can see the feature set calculated for each word. It also contains the word features in each segment of the words used to train and test the system. The HMM classifier classifies the words according to these features. Feature classification must be optimal in order to extract the distinct primitives and correctly identify the character, but this is not always the case in practice because of the difficulties of handwritten Arabic (see section 1.5). Also, handprinted characters tend to have hairs, which create problems in generalizing the primitives. Another problem encountered with feature extraction was that of establishing the direction of curves, as mentioned in an earlier section. The methods for the detection of intersection points, endpoints, and loops all operate on a skeletonised bit map. These features are detected 100% successfully. Some details were added to the technique used to make it suitable for Arabic characters throughout the experiment. This step takes more memory and approximately one to five seconds to determine the salient features (the geometric features described in section 3.1.1) in the words.

At first, loop detection consumed most of the processing time because it builds a linked list for each pixel in the loop. This loop detection object was later swapped for a better one, which uses classes for detecting contours and inner loops. Thus the number of segments in the word and the area of the loops are known, and the location of each blob can be found. This procedure works very rapidly since it deals with the original data file, as described in Chapter 5. Since the loop detection algorithm detects loops that are not fully closed, it also creates spurious non-loop features by closing semi-loops. However, these wrong loops always appear small relative to the word and character size, so deleting any extra small loops (smaller than four pixels) usually solves this problem. Algorithms for detecting other features such as curves and strokes were also tested. Some curves were not correctly detected. This is because the thinned image still has points with more than two neighbours that are not intersections, so any adjacent pixel found was followed in the path detection algorithm. Also, estimating the curve magnitude was a problem, because Arabic handwritten characters do not have a consistent curve magnitude, so this was calculated relative to the frame size. For detecting curves, several window sizes with several magnitudes were taken relative to the frame sizes. The curves were detected successfully 80% of the time. Contextual information that can be obtained before the segmentation stage is helpful. For example, in the character pairs (6-th, '-noon), (<-alef, >-lam) and (]-ka, -hamza), the first character cannot be connected from the left side, while the second must be. In (-ain, -hamza), the first character must be at the beginning of a word (or sub-word) while the second must be at the end.

In the first experiment, the features concerning connection from the left and right were not included, but adding two more features (left and right connection), which indicate whether the character or sub-character is connected on the right or left side, improved the system's accuracy. These two features are successfully detected in the images of 95% of words, unless there is incorrect segmentation or overlapping characters. For the dot detection feature, sometimes two dots were classified as one dot or three dots, either because the writer wrote the two dots too close together (one dot) or curved (three dots); the same problem also happened with the three-dot and one-dot features. To determine upper and lower dots, at first lower dots were defined as dots below the baseline and upper dots as dots above the baseline. In practice this is not always true, because some characters are written below or above the word baseline. In Figure 8-5 one can see how the single dot of the character -noon appears on the main baseline or below it. This is why it is better to define upper dots as dots above the character baseline rather than the word baseline, and lower dots as dots below the character baseline. Dots inside a loop were also deleted, as seen inside the character waw in Figure 8-9. Sometimes a complete character is detected as complementary characters (the position and number of dots associated with a character). As seen in Figure 8-10, the character alef was detected as a dot feature because it is too small. In Arabic writing some writers write complementary characters above the character Alef (see Figure 8-11) and some do not. In future work, complementary characters above the character Alef could be deleted. This was also done for the character ^-ha and the character 0-noon at the end of some words.

In the database some writers wrote extra dots above letters, as seen in the first image in Figure 8-10, or forgot to write dots above or below characters, as seen in the second image in Figure 8-12. All these effects adversely affected the recognition rate.

Segmentation

Segmentation caused most of the difficulties in the recognition system. A word is segmented into sub-characters (frames) by calculating a histogram, and then each frame is segmented again into more segments or zones. Through experiments it was found that there were three zones for each extracted character, each zone located between the maximum and minimum histogram peaks that should be chosen as separators between character segments (frames). This is effective for English handwriting, but in practice the second baseline is difficult to extract because Arabic handwriting may have different secondary baselines, as shown in Figure 3-7. Figure 8-6 illustrates words that are over-segmented because of writing style, and Figure 8-4 shows touching characters (touching because of the writer's errors) and segments that affect recognition.

Normalization

It was also found empirically that if the frame contains a bigger part of the target character, the result will be more accurate. At the beginning of the experiment the frames were segmented into five parts (zones) that reflected the position of the features in the word, based on the baseline (upper and lower baseline) positions. But the results were not encouraging. The number of segments in each frame was decreased to three, and the HMM system was trained again.
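A simplified sketch of the histogram-based segmentation into frames described in the Segmentation step above is shown here. The zero-density cut criterion and the handling of the image edge are assumptions made for this illustration; the thesis procedure additionally splits each frame into zones.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative segmentation of a binary word image into frames:
// column densities are computed and cut points are placed where the
// density falls to zero, i.e. at gaps between connected strokes.
public class FrameSegmenter {

    /** Returns a list of [firstColumn, lastColumn] pairs, one per frame. */
    static List<int[]> frames(boolean[][] image) {
        int width = image[0].length;
        int[] density = new int[width];
        for (boolean[] row : image)
            for (int x = 0; x < width; x++)
                if (row[x]) density[x]++;          // vertical projection histogram

        List<int[]> frames = new ArrayList<>();
        int start = -1;
        for (int x = 0; x < width; x++) {
            if (density[x] > 0 && start < 0) start = x;            // a frame opens
            if ((density[x] == 0 || x == width - 1) && start >= 0) {
                frames.add(new int[] {start, x});                   // the frame closes
                start = -1;
            }
        }
        return frames;
    }
}
```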

Some characters were segmented into sub-characters: (8-seen) is segmented into two parts. This is solved by giving each part of the character a separate target or class: ( EwA v=v:vbv>vv ) are segmented into parts. The first part of these characters is captured as ( D v@ v< v9 vc v? v;v8 ); they are the same character but in different locations. The second part of all of these characters will have the same target and the same shape (U). Some characters were overlapping, especially in words that contain the characters shown in Figure 1-8. Also, in Arabic, writers can write the same letter of the alphabet or the same word long or short; for example, the letter (8-seen) can be written longer, which creates a problem in recognition. So the character separators were omitted before recognition. Also, segments that do not contain any features were removed, which improved the recognition rate, as seen in the experimental results (Table 8-1), especially for the HMM recogniser (section 8.4).

8.4 Classification Using HMM

The HMM is used to classify handwritten words. The process then returns to the pre-processing stage to improve its performance, and to the feature extraction process to add more features. A series of experiments was carried out to improve recognition results using the HMM. The system described in this research has been applied to a database of handwritten words produced by approximately 100 different writers. It was created especially for this application since there was not previously a public database of Arabic handwritten characters available to use as a standard test bed, as there was for English writing.

Samples of about 4700 Arabic words for the lexicon used in cheque filling were gathered and stored in separate files. About 9% of the data was removed from the testing and training data because of errors in baseline detection and pre-processing, which is an interesting area for new research and might well be improved in future work. Feature classification was also not optimal and could be improved. The database was divided into training and testing data. In the training phase the words were segmented into characters and sub-characters. Feature extraction then transformed the segmented images into numeric feature vectors, which were partitioned into several groups by the clustering algorithm. The cluster centroids (the codebook) from this part of the training are kept for further use. For each word, the (quantized) observation sequences are then used to train an HMM for this word, using the Forward-Backward re-estimation algorithm. In the testing phase a frame-based analysis (pre-processing, feature extraction, and segmentation) is performed to give observation vectors, which are then vector quantized against the codebook vectors. The recognition rate at the early stage of testing was 23%. Here the dot features were not included in the feature set and the normalization stage was not included. Then the dot feature was added, along with more normalisation stages (as described in Chapter 5), such as removing segments that do not affect the identity of the word, i.e. segments that contain a horizontal line without any features, spaces, or repeated segments. The number and position of the dots were also normalized. The recognition rate for the second set of tests increased to 40% after adding the dot feature and the extra stages of normalisation.
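The vector quantization step that turns each frame's feature vector into a discrete symbol can be sketched as a nearest-centroid lookup against the codebook produced by the clustering stage. The class and method names below are illustrative assumptions, not the thesis code.

```java
// Illustrative vector quantization: each frame's feature vector is
// replaced by the index of the nearest codebook centroid (squared
// Euclidean distance), producing the discrete observation sequence
// that is fed to the HMMs.
public class VectorQuantizer {

    private final double[][] codebook;   // centroids from the clustering stage

    VectorQuantizer(double[][] codebook) { this.codebook = codebook; }

    int[] quantize(double[][] frameFeatures) {
        int[] observations = new int[frameFeatures.length];
        for (int t = 0; t < frameFeatures.length; t++)
            observations[t] = nearest(frameFeatures[t]);
        return observations;
    }

    private int nearest(double[] v) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < codebook.length; k++) {
            double d = 0.0;
            for (int i = 0; i < v.length; i++) {
                double diff = v[i] - codebook[k][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = k; }
        }
        return best;
    }
}
```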

Subsequently another set of experiments was performed with different sets of parameters and a different combination of features. It was found that using a combination of all features gave the best results (Table 8-1). The resulting observation sequence and the probability measures for all the HMMs, given by λ = {Π, Α, Β}, were then used to calculate the log-likelihoods for the HMMs. The word associated with the HMM with the highest log-likelihood was declared to be the recognized word, and its index was returned. Two-thirds of the data was used for training and the rest for testing. By training and testing the system on the database of handwritten Arabic words, the system achieved a recognition rate of 51% (without any post-processing). It also achieved a 63% recognition rate if the criterion for success was that the word to be recognized be in the first two positions, 69% if the word was in the first three, and 77% if the word was in the best five words of the test results. If a simple grammar was used, as described in Chapter 7, section 7.2.2, the recognition rates increase to 56%, 77%, 91% and 100% respectively (as described in Table 8-1). The codebook size was chosen, after testing, to be 120, the number of states was 18, and the number of iterations was one. Those parameters gave the best recognition results after many training and testing phases, with 0.4 as the standard error (as described in Table 8-2, which gives the basic statistics of the recognition rates when the codebook size equals 120, the number of states equals 18, and one iteration is used, for an experiment repeated five times).
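The log-likelihood computation that selects the recognized word can be illustrated with the standard scaled forward algorithm for a discrete HMM λ = {Π, Α, Β}. This is a textbook sketch (cf. [Ra89]), not the MATLAB code actually used in the experiments.

```java
// Illustrative scaled forward algorithm for a discrete HMM: returns
// log P(O | lambda). The word whose model gives the largest value is
// reported as the recognized word.
public class DiscreteHmm {

    private final double[] pi;     // initial state probabilities
    private final double[][] a;    // state transition matrix
    private final double[][] b;    // emission matrix: state x symbol

    DiscreteHmm(double[] pi, double[][] a, double[][] b) {
        this.pi = pi; this.a = a; this.b = b;
    }

    double logLikelihood(int[] obs) {
        int n = pi.length;
        double[] alpha = new double[n];
        double logProb = 0.0;

        // Initialisation: alpha_1(i) = pi_i * b_i(o_1), with scaling.
        double scale = 0.0;
        for (int i = 0; i < n; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];
            scale += alpha[i];
        }
        if (scale == 0.0) return Double.NEGATIVE_INFINITY;  // impossible sequence
        for (int i = 0; i < n; i++) alpha[i] /= scale;
        logProb += Math.log(scale);

        // Induction with per-frame scaling to avoid numeric underflow.
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            scale = 0.0;
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) sum += alpha[i] * a[i][j];
                next[j] = sum * b[j][obs[t]];
                scale += next[j];
            }
            if (scale == 0.0) return Double.NEGATIVE_INFINITY;
            for (int j = 0; j < n; j++) next[j] /= scale;
            logProb += Math.log(scale);
            alpha = next;
        }
        return logProb;
    }
}
```

Because the per-frame scaling factors are accumulated in log space, models of different lengths and groups can be compared directly by their returned values.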

Table 8-1: Results of a series of tests using HMM
Test 1, without normalization and dot features: 23%
Test 2, with normalization and all features: 40%
Test 3, further normalization and testing: 51%
With simple grammar: 56%
If the word is in the best first two results: 63%
If the word is in the best first three results: 69%
If the word is in the best first four results: 74%
If the word is in the best first five results: 77%
If the word is in the best first two results, with simple grammar: 77%
If the word is in the best first three results, with simple grammar: 91%
If the word is in the best first four results, with simple grammar: 98%
If the word is in the best first five results, with simple grammar: 100%

The variation in recognition rate is due to one of the problems HMM has with learning, namely its sensitivity to initialisation. The Markov chain can fail to converge to the stationary distribution of the posterior probabilities. A possible reason for this is the failure to visit all highly probable regions of the parameter space because of local maxima in the likelihood curve [Ra89]. This problem can be addressed by maximum-likelihood algorithms [DWL00] and Monte-Carlo (MCMC) techniques [HLM+02] (see future work in section 9.3.4).

Table 8-2: Recognition rate basic statistics (maximum, mean, median, minimum, standard deviation, variance, and standard error of the mean of the total recognition rate)

8.5 Classification Using ID3

The ID3 tree classifier is used to recognize whole words using a global classifier. A program called Weka3 [WF00] that contains an ID3 classifier was applied to the handwritten database used for cheque filling. A series of experiments was carried out to classify the tested words into words, or groups of words, in the dictionary using ID3. Figure 8-13 shows the output of the Weka program, which is a short ID3 tree that groups handwritten words (used in cheque filling and written by approximately 100 writers) into four groups that have similar features. One can see that three global features, or attributes, have been used: the number of segments, the number of upper dots, and the number of lower dots. Each attribute or feature has four values, a, b, c, and d, which are equal to 0, 1, 2, 3. From the tree in Figure 8-13 one can see how the ID3 tree splits according to the training vectors. The root node of the tree is the number of segments, which has four branches according to its values. There are then branches for the number of upper dots and the number of lower dots attributes according to their values, as seen in the ID3 tree. Several experiments were done on the global features using an ID3 tree to classify words into a group of words that have similar features, or into one word in the lexicon, with different combinations of features and different attribute values.

Figure 8-13 shows the best result of the previous experiments. For all experiments, 66% of the data was used for training the ID3 classifier and the remainder for testing. Different values for each attribute were examined. For example, it was found that giving the upper dots only three values was more accurate than giving them more values, because upper dots are difficult to count exactly and the count depends on the writer's style. To normalize each attribute or feature and test the ID3 classifier with different values, a small program was written in Java to convert the input files into appropriate Weka files with different values for each attribute. Also, one should be aware that the groups have been distributed according to the confusion matrix from the experiment that maps words onto a single word in the dictionary. For example, the words nine and seven can easily be differentiated by the lower dot feature, yet in the confusion matrix produced by ID3 the word nine was misclassified as seven 30.4% of the time.
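The attribute chosen at each node of the ID3 tree in Figure 8-13 is the one with the highest information gain over the training vectors. A minimal sketch of that computation is given below; the class is a generic illustration of the ID3 split criterion, not the Weka implementation, and any arrays passed to it would be toy data rather than the thesis training set.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative computation of the information gain ID3 uses to pick an
// attribute (e.g. "number of segments" vs "upper dots") at a tree node.
public class InformationGain {

    /** Shannon entropy of the class labels, in bits. */
    static double entropy(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v). */
    static double gain(String[] labels, String[] attributeValues) {
        Map<String, java.util.List<String>> partition = new HashMap<>();
        for (int i = 0; i < labels.length; i++)
            partition.computeIfAbsent(attributeValues[i], k -> new java.util.ArrayList<>())
                     .add(labels[i]);
        double remainder = 0.0;
        for (java.util.List<String> subset : partition.values())
            remainder += (double) subset.size() / labels.length
                       * entropy(subset.toArray(new String[0]));
        return entropy(labels) - remainder;
    }
}
```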

segments = a: null
segments = b
  upperd = a: null
  upperd = b: fn
  upperd = c: fn
  upperd = d: fn
segments = c
  upperd = a
    lowerd = a: 1
    lowerd = b: null
    lowerd = c: ml
    lowerd = d: 4
  upperd = b
    lowerd = a: fn
    lowerd = b: ml
    lowerd = c: ml
    lowerd = d: ml
  upperd = c: 8
  upperd = d
    lowerd = a: 8
    lowerd = b: 8
    lowerd = c: 8
    lowerd = d: 8
segments = d
  lowerd = a
    upperd = a: 1
    upperd = b: 8
    upperd = c: 8
    upperd = d: 8
  lowerd = b
    upperd = a: 4
    upperd = b: 4
    upperd = c: 4
    upperd = d: 4
  lowerd = c
    upperd = a: 4
    upperd = b: 4
    upperd = c: 8
    upperd = d: 8
  lowerd = d
    upperd = a: 4
    upperd = b: 4
    upperd = c: 4
    upperd = d: 8

Figure 8-13: ID3 tree to classify words into four groups

Table 8-3: ID3 classifier results
Columns: Correctly Classified Instances, Incorrectly Classified Instances, Unclassified Instances, Mean absolute error, Root mean squared error
Five groups: 93% 5% 0.8%
Nine groups: % 20% 0.6%
Eight groups: % 2% 0.6%
Into words: %

Table 8-4: The relation between words, groups, and the percentage of each word in each group, for some words in the dictionary
Columns: group1, group2, group3, group4, group5, group6, group7, group8
ahad: 0.0% 0.0% 0.0% 96.7% 1.7% 1.7% 0.0% 0.0%
ahda: 0.0% 0.0% 0.0% 91.5% 1.7% 5.1% 1.7% 0.0%
eight: 0.0% 0.0% 0.0% 0.0% 0.0% 6.4% 92.1% 1.6%
eightb: 0.0% 0.0% 0.0% 0.0% 0.0% 96.8% 3.2% 0.0%
eightyb: 0.0% 0.0% 0.0% 0.0% 0.0% 1.5% 98.5% 0.0%
ethna: 0.0% 0.0% 0.0% 0.0% 0.0% 94.7% 5.3% 0.0%
fifty: 0.0% 0.0% 11.1% 0.0% 0.0% 88.9% 0.0% 0.0%
fiftyb: 0.0% 0.0% 1.6% 0.0% 0.0% 0.0% 3.1% 95.3%
four: 0.0% 0.0% 0.0% 2.9% 11.4% 0.0% 85.7% 0.0%

Figure 8-14: The relation between words, groups, and the percentage of each word in each group, for Table 8-4

8.6 Classification Using Multiple HMM

The last stage was to combine the general and local classifiers to classify the Arabic words, using a series of trial experiments. The system described in this dissertation has been applied to a part of the AHDB database of handwritten words produced by different writers. Samples of about 4700 Arabic words for the lexicon used in cheque filling were gathered, stored in separate files, and used as training and testing data. As for the tests in the previous experiments (sections 8.4 and 8.5), about 9% of the data was removed from the testing and training data because of errors in baseline detection and pre-processing, which presents an interesting area for future research.

As for the global features, at first seven features were used, these being the number of loops, the position of ascenders (ra / and waw) and descenders (alef), the number of Alefs, lower dots, upper dots, and the number of segments. By testing on the database, the last three features were found to give the best recognition rate, as illustrated in Table 8-5. The database was divided into training and testing data. In the training phase the words were segmented into characters or sub-characters. Feature extraction then transformed the segment images into numeric feature vectors, which were partitioned into several groups by the clustering algorithm. The cluster centroids (the codebook) from this part of the training were kept for further use. For each word, the (quantized) observation sequences were then used to train an HMM for this word, using the Forward-Backward re-estimation algorithm. In the testing phase a frame-based analysis (pre-processing, feature extraction, and segmentation) was performed to give observation vectors, which are then vector quantized against the codebook vectors.

Table 8-5: The recognition rate for the global word feature recognition engine
Features used: 1, 2, 3
Recognition rate into groups (choosing the best rate): 94.9%
Recognition rate into groups (choosing the best two rates): 98.5%
Recognition rate into groups (choosing all rates): 99.1%

More than 400 experiments have been run for testing and training each group in order to find a suitable set of parameters for each group. After empirical testing, the codebook size was chosen as 92 for the second group, 116 for the next group, 114 for group four, 100 for group five, 98 for group six, 88 for group seven, and 114 for group eight. It was found that deleting feature three (right stroke) raises the recognition rate by one percent for group seven (when the codebook size is 98 and the number of states is 10). Deleting feature five (horizontal stroke) raises the recognition rate for group six by almost one percent (for a codebook size of 90 and 26 states). Deleting feature 18 (left disconnection) makes the recognition rate half a percent higher for group three (when the codebook size is 90 and the number of states is 32). The resulting observation sequence and the probability measures for all the HMMs, given by λ = {Π, Α, Β}, are then used to calculate the log-likelihoods for each HMM. The word associated with the HMM of highest log-likelihood is declared to be the recognized word, and its index is returned. Two-thirds of the data were used for training and the rest for testing. By training and testing the system on the database of handwritten Arabic words, the system obtained a recognition rate of nearly 61% (without using post-processing). (Table 8-6 gives the mean of three recognition rates for each group.) The first group, which contains one word, has a 97% recognition rate and, according to [ACS02], this word is the most frequently used in the cheque-filling application.
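The per-group parameter tuning amounts to a search over codebook sizes and state counts, keeping the pair with the best test recognition rate. The sketch below shows the shape of that loop; the candidate values and the Evaluator interface are placeholders, not the actual 400-experiment protocol used in the thesis.

```java
// Illustrative grid search over codebook size and number of states for
// one group's HMMs.
public class ParameterSearch {

    interface Evaluator {                        // returns a recognition rate in [0, 1]
        double trainAndTest(int codebookSize, int numStates);
    }

    static int[] bestParameters(Evaluator eval) {
        int[] codebookSizes = {64, 88, 92, 100, 116};   // example candidates
        int[] stateCounts   = {10, 14, 18, 22, 26};     // example candidates
        double bestRate = -1.0;
        int[] best = new int[2];
        for (int cb : codebookSizes) {
            for (int s : stateCounts) {
                double rate = eval.trainAndTest(cb, s);
                if (rate > bestRate) {
                    bestRate = rate;
                    best = new int[] {cb, s};
                }
            }
        }
        return best;   // {codebook size, number of states} with the highest rate
    }
}
```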

Table 8-6: Recognition rate for each group and the total recognition rate
Rows: features used; recognition rate for Group1 to Group8, each given as a mean recognition rate with its standard deviation, together with the codebook size and number of states used; total recognition rate; total recognition rate with feature selection; recognition rate with simple grammar.

Figure 8-15: Recognition rate decreases as the number of iterations increases, for all groups (codebook size 90 and twenty states); one panel per group (Group2 to Group8), each plotting recognition rate against the number of iterations.

As seen in Figure 8-15, the recognition rate decreases for all groups as the number of iterations increases. So the behaviour of the other variables that affect the recognition rates (the number of states, here 20, and the codebook size, here 90) was studied with the number of iterations fixed at one, which gives the highest recognition rate. The effect of codebook size on the recognition rates for the different groups is more stable if the number of iterations is one. This can be seen in Figure 8-16: the recognition rate increases as the codebook size increases, until at some point the change becomes minor for all groups.

Figure 8-16: Recognition rate versus codebook size for groups two to eight, when the number of iterations is constant (codebook sizes are shown relative to a reference codebook size S).

To study the variation in the recognition rate results, even with the same values of codebook size, number of states and number of iterations, some points were selected and the experiments repeated twenty times. Table 8-7 through Table 8-19 describe the results of these experiments and the mean recognition rates over twenty repeated experiments for each pair of codebook size and number of states, with the number of iterations equal to one. The variation in recognition rate was slight in most groups, except groups five and two.

Table 8-7: The mean of 20 recognition rates for group six, for different numbers of states and codebook sizes.

Table 8-8: The standard deviation of 20 recognition rates for group six, for different numbers of states and codebook sizes.

Table 8-7 shows that for group six the recognition rate increases as the codebook size increases (from 4 to 88). There is an obvious improvement in recognition rate as the codebook size increases from 13 to 45. The highest recognition rate occurs when the codebook size is 116; after that point the recognition rate starts to decrease slightly. The number of states does not affect the recognition rate in this pattern.

Table 8-9: The mean of 20 recognition rates for group two, for different numbers of states and codebook sizes.

Table 8-10: The standard deviation of 20 recognition rates for group two, for different numbers of states and codebook sizes.

Table 8-11: The mean of 20 recognition rates for group three, for different numbers of states and codebook sizes.

Table 8-12: The standard deviation of 20 recognition rates for group three, for different numbers of states and codebook sizes.

Table 8-9 shows that the data for group two follow the same pattern as group six in the effect of codebook size and number of states on recognition rate. The difference in this group is that the recognition rate was high even at the start, and was higher than the recognition rate for group six even with a small codebook size.

In group two there is also a slight change in recognition rate as the number of states increases, but after 144 it was stable. In Table 8-11 there is no difference between the recognition rate for group six and the recognition rate for group three, keeping the same settings for codebook size and number of states. Also, there is a significant increase in recognition rate, especially when the codebook size applied to all groups is small (4 to 32). The group four recognition rates in Table 8-13 follow the same pattern, but the recognition rate was high compared to group six and group two.

Table 8-13: The mean of 20 recognition rates for group four, for different numbers of states and codebook sizes.

Table 8-14: The standard deviation of 20 recognition rates for group four, for different numbers of states and codebook sizes.

Table 8-15: The mean of 20 recognition rates for group five, for different numbers of states and codebook sizes.

Table 8-16: The standard deviation of 20 recognition rates for group five, for different numbers of states and codebook sizes.

In Table 8-15 the number of states and codebook size have no effect on the recognition rate for group five, where the recognition rate values fall between 60 and 90, showing that there is no regular pattern. Regardless of the codebook size and the number of states, the result is a random pattern, because in this group there are only two words. One of those words had a problem in baseline detection (as described earlier in this chapter, section 8.3.4), which affects the recognition rates and leads to vastly different recognition rates according to the random initialisation of the HMM parameters; sometimes learning can get stuck in local maxima. This requires some method of escaping such maxima (see section 9.3.4) that can be implemented in the future. For group five the results also suggest that a high codebook size (above 116) may lead to a drop in recognition rate.

Table 8-17: The mean of 20 recognition rates for group seven, for different numbers of states and codebook sizes.

Table 8-18: The standard deviation of 20 recognition rates for group seven, for different numbers of states and codebook sizes.

Table 8-19: The mean of 20 recognition rates for group eight, for different numbers of states and codebook sizes.

Table 8-20: The standard deviation of 20 recognition rates for group eight, for different numbers of states and codebook sizes.

Table 8-17 and Table 8-19 describe groups seven and eight, which follow the same pattern as group six. There is a quick increase in recognition rate, especially when the codebook size is small, and the number of states has no effect.

The standard deviation was small for most groups, except for group five, where there was a big variation in the results, and for group two, where the range is slightly higher than for the other groups. Group five's standard deviation was high because one of the problems with HMM learning is its sensitivity to initialisation, especially as group five has only two words. This also applies to group two.

8.7 Conclusion of the Experimental Results

In this chapter the experimental details have been explained. First the software and hardware used for this thesis were listed. Then the details of the pre-processing system were explained, such as the difficulties in building the AHDB database, baseline detection, and locating Arabic features. This chapter also presented experimental examples carried out using real data. Problems were experienced in the pre-processing stage; these problems present different characteristics and restrictions that have to be taken into account at this stage. After that, three classification strategy experiments were carried out using features extracted from the pre-processing stage. The first strategy was an HMM used as a classifier based on local features. The second strategy used ID3 to classify using general features, and this resulted in a better recognition rate than the HMM. The third strategy was a classifier that combined general and local features using a multiple HMM classifier for each group of data. This last classifier improved the recognition rates further. For the HMM classifiers there is some variation in recognition rates, as has been explained (sections 8.4 and 8.6).

The effectiveness of combining multiple HMM classifiers for handwritten word recognition has been verified. This evaluation was made using HMM, ID3, and multiple HMM schemes with similar feature sets, and therefore the results are influenced mainly by the classifiers' performance. The results show that the classifier with the best individual performance is the multiple HMM. The difference in performance is probably due to the distinct model generation techniques used by the classifiers. While ID3 trains on all classes and searches for an optimum solution for separating them using a greedy algorithm, the HMM trains individual models for each class, without sharing information between them. However, the disadvantage of HMM classifiers is their sensitivity to initialisation, which can result in high variation in accuracy. This is most obvious in the HMMs for groups two and eight described in section 8.6. By using a number of random starting positions it is possible to be reasonably confident that a good solution has been found, as several will usually be a close approximation to it; solutions in a false minimum are almost always very poor. The results also highlight the importance of using more sophisticated algorithms than those used currently to overcome the problems related to poor accuracy. The main conclusion is that the HMM and ID3 classifiers are complementary and provide different information. The use of multiple HMM classifiers as proposed in section 8.6 further enhances their complementarity. If the results are compared with the very first version of the system (section 8.4), which had less than a 25% recognition rate under the same testing conditions, it is clear that very considerable progress has been made.

The final system, that is, the Multiple HMM classifier presented in this chapter (section 8.6), achieved over two and a half times that recognition rate. It is hoped that the ideas presented in this thesis, combined with the AHDB, will make it possible for others to use and enhance this work. It would give the author great pleasure if, in ten years' time, a full system able to recognise handwritten Arabic words were in use in a real commercial application.

Chapter 9: CONCLUSION AND FUTURE TRENDS

Section 9.1 discusses the problems addressed by this research and summarizes how its objectives have been achieved. The contribution of this work to the field of off-line Arabic handwritten recognition follows in section 9.2. Section 9.3 suggests lines of enquiry for future work. Finally, section 9.4 presents the conclusions of this thesis.

9.1 Concluding Remarks

Off-line handwriting recognition is automatic transcription by computer where only an image of the handwriting is available. Much work has been done on the recognition of English characters, covering both hand-printed and cursive script. The study of recognizing Arabic cursive script, however, has been much more limited.

The recognition of Arabic characters is also important for some non-Arabic languages, such as Farsi, Kurdish, Persian, and Urdu. These groups use Arabic characters in writing, although with different pronunciations. Most previous studies (see section 3.1) were done assuming that each Arabic handwritten word has already been segmented into separate characters before recognition, which is rarely the case in natural Arabic writing. In contrast, this thesis assumed that the words have not been segmented into characters, since the off-line Arabic character recognition operation has many steps that cannot be separated from one another. Among these steps are pre-processing, segmentation, feature extraction, and classification. This thesis deals with Arabic character recognition aided by knowledge of the position of each character in the word, but without pre-segmentation. A standard database of images is needed to facilitate research in handwritten text recognition. In this dissertation a new database for off-line Arabic handwriting recognition was presented, together with associated pre-processing procedures. A new database for the collection, storage and retrieval of Arabic handwritten text (AHDB) has been developed, which supersedes previous databases both in terms of size and the number of different writers involved. In this thesis the most popular words in Arabic writing have been identified for the first time, using an associated program. This thesis also describes pre-processing techniques implemented for Arabic handwriting, including the segmentation of word images to give invariance to scale, slant, slope, and stroke thickness. The representation of the image is discussed and the skeleton and stroke features used are described. Features were then extracted from the words. The pre-processing operations carried out in this research have three main advantages.

Firstly, they deal with non-segmented words; secondly, they take advantage of the position of the features in the characters or sub-characters; and thirdly, they extract moments and other features for the first time in Arabic handwritten text. After feature extraction, recognition is carried out to identify the word. This is done by first estimating the data likelihoods for each frame of data in the representation using a vector quantization (VQ) method. The features calculated in pre-processing are used by both the VQ and the HMM classifier. The HMM classifier, which classifies the features captured from the word image, was discussed. The states of the HMM are identified with the letters of the alphabet or with segments. Once the model is established, the Viterbi algorithm is used to recognize the sequence of letters composing the word. Because of the difficulties of Arabic handwriting (see section 1.5), several other classifiers have been examined. The ID3 classifier was used to classify groups of words that have similar features. A classification process based on this rule-based classifier was also used as a global recognition engine, to classify words into eight groups. Then, for each group, the HMM approach was used for classification. The output is a word in the dictionary. As reported in Chapter 8, this approach has a better recognition rate than the HMM alone. The prototype system described here is promising; however, there remains room for improvement in terms of early use of the dictionary. Comparison of the final results obtained in this research with other research is difficult because of differences in experimental details, the actual handwriting used, the method of data collection, and the handling of real Arabic off-line handwritten words. If this work is compared to other research (section 2.5) on the recognition of handwritten Arabic words, it is the first to use multiple HMMs to recognize Arabic handwritten words. This also means that it uses novel sets of features and segmentation techniques.

9.2 Contribution to Arabic Handwritten Recognition

This section revisits the contributions set out in Chapter 1 and demonstrates that the key objectives (section 1.6) have been accomplished. The contributions can be summarized as follows:

- The most popular words in Arabic writing have been identified for the first time.
- A new database (AHDB) for the collection, storage and retrieval of Arabic handwritten text has been generated and used in this thesis, which supersedes previous databases both in terms of the size of the database and the number of different writers involved.
- In the field of pre-processing and feature extraction, a new set of handwritten features is combined and tested in the classification stage.
- A new HMM approach is used to train and test Arabic handwritten words taken from around 100 different writers.
- A global approach has been developed, which is inexpensive in terms of feature classification and avoids the problematic segmentation stage.
- The combination of global and local features to recognize words also improves the recognition rate and has not been used previously in Arabic word recognition.

9.3 Future Work

Although fully automated off-line handwritten Arabic document recognition is not likely to be achieved in the near future, this research, together with other work in the field of off-line handwritten character recognition, is a significant step towards a completely automatic off-line handwritten Arabic document recognition system. The following sections suggest improvements to the performance of each step involved in Arabic handwritten character recognition. The results obtained in this thesis encourage further research in several areas.

9.3.1 The Database

As part of the future work, the algorithm will be applied to other data sets in order to further verify the pre-processing and classification processes. More written words for the same lexicon, by new authors, will be added to the AHDB in order to improve the recognition rate and to deal with the variety of writing styles found in Arabic handwriting. The system described in this thesis has been applied to a database of handwritten words written by more than 100 different writers, and each word can be written in different styles. Some styles are common, whilst others are rare. For this reason there is not enough data for training the system, and it is clear that the system needs more patterns to account for rarer writing styles. This will also allow more features to be developed for use in future work. Adding a simple statistical digram approach to adjust the likelihood of each word, using data from a sample of cheques, would produce a high percentage of correctly recognized sentences.

9.3.2 Pre-processing

The system was implemented for single word images, but it can easily be adapted to complete documents with the addition of software for detecting the skew of the writing and for segmenting the text into lines and words. For the baseline detection step, which can be problematic, one could test other techniques such as the Hough Transform, or compute the baseline of a word as the average over the words in one line of writing. The segmentation could be more accurate if some constraints were added to the zoning techniques to cut characters like ( -RA, 6-ZA, and -WAW). Some characters like ( -LA, HA, AIN) were segmented into two parts; in future, joining the two frames that split the loops in such characters would solve that, increasing the recognition rate.

9.3.3 Feature Extraction

More features can be added and examined, for example Fourier descriptors and other moment features. Adding different feature topologies may increase the recognition rate. The dot feature could be changed to three input features containing the dot or cluster of dots and their position relative to the body of the other characters. Stroke features could be added for more accurate recognition of characters like (_-Ta, `-tha, a-ka). The addition of more secondaries like (hamza and mada) will help to recognize the supplementary characters mentioned in Table 2.3. From the training and testing process in this thesis it has been found that some people write secondaries like (hamza and mada) on the character (alef) and some do not. The results would be more accurate if these secondaries were deleted over Alef in the normalization process. Also, some people write dots at the end of a word above characters like (ha and noon), and often some do not.

In the future one should delete or add such dots over those characters. More samples of frames of words can be added to enhance the recognition rate. Another step of the process that needs more examination is feature selection. With appropriate feature selection the data might be better represented, improving performance or increasing the recognition rate.

9.3.4 Classification

As mentioned in the previous chapter, the forward-backward (FB) algorithm can become stuck in a poor local maximum during the learning phase. This could be addressed in the future by using smoothing in the training phase; a traditional method of fighting this problem is to repeat the training procedure with different randomly chosen initial parameters and then select the best local maximum found [DWL00], or to use Monte-Carlo (MCMC) techniques [HLM+02]. Because Arabic writing is complicated, the classification can be improved by adding and testing more classifiers and by combining those classifiers. For example, a recurrent neural network may be used to increase the recognition rate. Alternatively, a recurrent neural network combined with HMMs is known to give a reasonable recognition rate in the case of English handwriting recognition. Techniques for the combination of results from multiple classifiers are now reasonably well established.

9.3.5 Post-processing

Since there is no objective measure of good handwriting, it is very difficult to discriminate between what is acceptable and what is not. Style, mood, background, etc. are all factors that influence the way a person writes.

Even human recognition of average handwriting is estimated to involve a 4% mis-classification level [Ob94]. Moreover, humans usually use context as the main discriminator. This means that achieving a 100% recognition rate for handwritten characters without contextual information is likely to be impossible. The previous work on Arabic handwritten OCR definitely needs a post-processing operation. As a suggestion, this post-processing operation might be done using an HMM to combine the data likelihoods of words. This would find the best choice of sentence for the observed data, and should enhance the overall recognition accuracy of the system.

9.4 Conclusion

In this thesis the problem of off-line Arabic handwritten recognition has been addressed. A layered framework with which to structure the discussion of my vision of Arabic handwriting recognition has been proposed. Within each layer, different areas of research in handwriting recognition can be related. The first and outermost layer of the framework is concerned with data capture: the most popular words in Arabic writing have been found and a new database of Arabic handwriting made available. The second layer of the framework comprises a new pre-processing and feature extraction method for Arabic handwritten words. The final core of the framework relates to classification, which establishes the relationships between an unknown input symbol and the models of symbols known to a system.

Classification has been carried out using three main approaches. In the first approach a new classifier based on HMMs was implemented. The second approach is based on ID3. The third approach, built on this model, is a multilevel word recognition system, which uses two types of representation based on global features and on local features, using multiple HMMs. Comparing this work with [DF01], the best system reported to date, which has a 32% recognition rate, and with the one proposed in Chapter 6, which has a 50% recognition rate, where both used an HMM (implemented on unconstrained Arabic handwriting), the new approach has clearly been much more successful. The system in Chapter 8, based on multiple classifiers, improves performance even further.

BIBLIOGRAPHY

[AA92] Amin, A. and Al-Sadoun, H., A segmentation technique of Arabic text, 11th Int. Conf. on Pattern Recognition, 1992.
[AA94] Amin, A. and Al-Sadoun, H., Handprinted Arabic character recognition system, 12th Int. Conf. on Pattern Recognition, 1994.
[AA95] Al-Sadoun, H. and Amin, A., A new structural technique for recognizing printed Arabic text, Int. J. of Pattern Recognition and Artif. Intell., 9(1), 1995.
[AAF96] Amin, A., Al-Sadoun, H. and Fischer, S., Handprinted character recognition system using artificial network, Pattern Recognition, 29(4), 1996.
[ACS02] Al-Ohali, Y., Cheriet, M. and Suen, C. Y., Dynamic Observation and Dynamic State Termination for Off-Line Handwritten Word Recognition Using HMM, Proceedings of IWFHR 02, Ontario, Canada, 2002.
[AG02] AramediA Group, Arabic Software Desktop Publishing Translation OCR ASR TTS MultimediA, on-line reference.
[AH88] Abdelazim, H. and Hashish, M., Arabic Reading machine, 10th National Computer Conf., Riyadh, Saudi Arabia, 1988.
[AH90] Akiyama, T. and Hagita, N., Automated entry system for printed documents, Pattern Recognition, 23(11), 1990.
[AH95] Al-badar, B. and Haralick, R., Segmentation free word recognition with application to Arabic, 3rd Intl. Conf. on Document Analysis and Recognition, Montreal, 1995.

[AHD94] Abuhaiba, I., Holt, M. and Datta, S., Processing of binary images of handwritten cursive Arabic characters, IEEE Trans. Pattern Analysis and Machine Intelligence, 29(4), June 1994, pages
[AHE01] Alma'adeed, S., Higgins, C. and Elliman, D., A New Preprocessing System for the Recognition of Off-line Handwritten Arabic Words, IEEE International Symposium on Signal Processing and Information Technology, December 2001.
[AHE02a] Alma'adeed, S., Higgins, C. and Elliman, D., A Database for Arabic Handwritten Text Recognition Research, Proc. 8th IWFHR, Ontario, Canada, 2002.
[AHE02b] Alma'adeed, S., Higgins, C. and Elliman, D., Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model Approach, ICPR 2002, Quebec City, August 2002.
[AHE02c] Alma'adeed, S., Higgins, C. and Elliman, D., Recognition of Off-Line Handwritten Arabic Using ID3, Doha, Qatar, 2002.
[AHE03a] Alma'adeed, S., Higgins, C. and Elliman, D., A Database for Arabic Handwritten Text Recognition Research, accepted for publication in the International Arab Journal of Information Technology.
[AHE03b] Alma'adeed, S., Higgins, C. and Elliman, D., Recognition of Off-Line Handwritten Arabic Words Using Multiple Hidden Markov Models, AI03, Oxford, UK, December 2003.
[AHE04] Alma'adeed, S., Higgins, C. and Elliman, D., Recognition of Off-Line Handwritten Arabic Words Using Multiple Hidden Markov Models, accepted for publication in Knowledge-Based Systems.
[AKM+71] Ascher, R. N., Koppelman, G. M., Miller, M. J., Nagy, G. and Shelton, Jr., G. L., An interactive system for reading unformatted printed text, IEEE Trans. Comput., C-20(12), 1971, pages

[Al99] Alma'adeed, S., Computer vision application to automatically recognize handwritten Arabic characters, MSc Thesis, Alexandria University, Mathematics Department, Egypt, 1999.
[Am00] Amin, A., Recognition of printed Arabic text based on global features and decision tree learning techniques, Pattern Recog., 33, 2000, pages
[AM86] Amin, A. and Masini, G., Machine recognition of multi-font printed Arabic texts, 8th Intl. Conf. on Pattern Recognition, Paris, 1986, pages
[AM89] Amin, A. and Mari, J., Machine recognition and correction of printed Arabic text, IEEE Trans. on Systems, Man, and Cybernetics, 19(5), 1989, pages
[AM89] Amin, A. and Mari, J. F., Machine recognition on printed Arabic text, IEEE Trans. Man CYBERN, 9(1), 1989, pages
[AM95] Al-Bader, B. and Mahmoud, S., Survey and bibliography of Arabic optical text recognition, Signal Processing, 41, 1995, pages
[AMG94] Abuhaiba, I., Mahmoud, S. and Green, R., Recognition of Handwritten Cursive Arabic Characters, IEEE Trans. Pattern Analysis and Machine Intelligence, 16(6), June 1994, pages
[AU90] Al-Emami, S. and Usher, M., On-line recognition of handwritten Arabic characters, IEEE Trans. Pattern Anal. Machine Intell., PAMI-12, 1990, pages
[AU92] Al-Yousefi, H. and Udpa, S. S., Recognition of Arabic characters, IEEE Trans. Pattern Anal. and Machine Intell., PAMI-14, 1992, pages
[AWS81] Abeli, L., Wahl, F. and Scheri, W., Procedures for automatic segmentation of text, graphic and halftone regions in documents, Proc. 2nd Scandinavian Conference on Image Analysis, 1981, pages
[AY87] Almuallim, H. and Yamaguchi, S., A method of recognition of Arabic cursive handwriting, IEEE Trans. Pattern Anal. Machine Intell., PAMI-9, 1987, pages

[AY97] Atici, A. A. and Yarman-Vural, F. T., A heuristic algorithm for optical character recognition of Arabic script, Signal Processing, 62(1), Oct. 1997, pages
[BH84] Belaid, A. and Haton, J. P., A syntactic approach for handwritten mathematical formula recognition, IEEE Trans. Pattern Anal. Machine Intell., PAMI-6, 1984, pages
[BKK+96] Bouchaffra, D., Koontz, E., Krpasundar, V. and Srihari, R. K., Incorporating diverse information sources in handwriting recognition postprocessing, International Journal of Imaging Systems and Technology, 7(4), 1996, pages
[BM83] Belaid, A. and Masini, G., Segmentation of line drawings for recognition and interpretation, Technology and Sc. Informatics, 1(2), 1983, pages
[BR95] Bunke, H., Roth, M. and Schukat-Talamazzini, Off-line cursive handwriting recognition using hidden Markov models, Pattern Recognition, 28(9), Sep. 1995, pages
[BS97] Bushofa, B. M. and Spann, M., Segmentation and recognition of Arabic characters by structural classification, Image and Vision Computing, 15, 1997, pages
[BSM99] Bazzi, I., Schwartz, R. and Makhoul, J., An omnifont open-vocabulary OCR system for English and Arabic, IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(6), 1999, pages
[BWB94] Bunke, H., Wang, P. S. P. and Baird, H. S. (eds.), Handbook of Character Recognition and Document Image Analysis, World Scientific, Singapore, 1994.
[Ca01] Cappé, O., Ten years of HMMs, On-line reference.

[CD97] Cracknell, C. and Downton, A. C., A colour approach to form dropout, in A. Downton and S. Impedovo, Progress in Handwriting Recognition, World Scientific, UK, 1997.
[CDP88] Ciardiello, G., Degrandi, M. T., Poccotelli, M. P., Seafuro, G. and Spada, M. R., An experimental system for office document handling and text recognition, Proc. 9th Int. Conf. on Pattern Recognition, 1988, pages
[CFM+92] Casey, R. G., Ferguson, D. R., Mohiuddin, K. M. and Walsh, E., An intelligent forms processing system, Machine Vis. Appl., 5(3), 1992, pages
[Ch94] Chen, M. Y., Off-line Handwritten Word Recognition Using a Hidden Markov Model Type Stochastic Network, IEEE Trans. Pattern Analysis and Machine Intelligence, 16(5), May 1994, pages
[CK94] Chen, M. Y. and Kundu, A., A complement to variable duration hidden Markov model, in Proc. IEEE Int. Conf. on Image Processing, Austin, Texas, Nov. 1994, pages
[CK95] Chen, M. Y. and Kundu, A., Multi-level hidden Markov model handwritten word recognition, in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, Detroit, Michigan, May 1995, pages
[CKS95] Chen, M. Y., Kundu, A. and Srihari, S. N., Unconstrained handwritten word recognition using continuous density variable duration hidden Markov models, Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, Minneapolis, MN, May 1993, pages (also see IEEE Trans. on Image Processing, 4, 1995).
[CKZ94] Chen, M. Y., Kundu, A. and Zhou, J., Off-line handwritten word recognition using a hidden Markov model type stochastic network, IEEE Trans. Pattern Anal. Machine Intell., 16, 1994, pages
[CL96] Casey, R. and Lecolinet, E., A Survey of methods and strategies in character segmentation, IEEE Trans. Pattern Analysis and Machine Intelligence, 18(7), July 1996, pages

[CLK95] Cho, W. Y., Lee, S. W. and Kim, J. H., Modeling and recognition of cursive words with hidden Markov models, Pattern Recognition, 28(12), Dec. 1995, pages
[CN80] Cave, R. L. and Neuwirth, L. P., Hidden Markov models for English, in Proc. of the Symp. on the Application of Hidden Markov Models to Text and Speech, ed. J. D. Ferguson, Princeton, 1980, pages
[CN91] Casey, R. G. and Nagy, G., Document Analysis - a broader view, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, September 30th-October 2nd, 1991, pages
[DF01] Dehghan, M., Faez, K., Ahmadi, M. and Shridhar, M., Handwritten Farsi (Arabic) word recognition: a holistic approach using discrete HMM, Pattern Recognition, 34, 2001, pages
[DHP00] Deller, J. R., Hansen, J. H. and Proakis, J. G., Discrete-time processing of speech signals, IEEE Press, New York, 2000.
[DI97] Downton, A. and Impedovo, S., Progress in Handwriting Recognition, World Scientific, UK, 1997.
[DK82] Devijver, P. A. and Kittler, J., Pattern recognition: a statistical approach, Prentice Hall, London, 1982.
[DWL00] Davis, R. I. A., Walder, C. J. and Lovell, B. C., Improved Classification Using Hidden Markov Averaging From Multiple Observation Sequences, in Proceedings of WOSPA, Brisbane, December 17-18, 2002, pages
[EG88] El-Sheikh, T. and Guindi, R., Computer recognition of Arabic cursive script, Pattern Recogn., 21(4), 1988, pages
[EGS99] El-Yacoubi, A., Gilloux, M. and Sabourin, R., An HMM-Based Approach for Off-Line Unconstrained Handwritten Word Modeling and Recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(8), 1999, pages

[EMS+90] Esposito, F., Malirda, D., Senerio, G., Annesc, E. and Sceafuro, G., An experimental page layout recognition system for office document automatic classification: An integrated approach for inductive generalization, Proc. 10th Int. Conf. on Pattern Recognition, 1990, pages
[ERK90] ElDabi, S., Ramsis, R. and Kamel, A., Arabic character recognition system: Statistical approach for recognizing cursive typewritten text, Pattern Recogn., 23(5), 1990, pages
[ES90] El-Khaly, F. and Sid-Ahmed, M., Machine recognition of optically captured machine printed Arabic text, Pattern Recog., 23(11), 1990, pages
[Fe99] Fernandez, F., Definitions, On-line reference, tml.
[FEB+00] Freitas, C. O., El Yacoubi, A., Bortolozzi, F. and Sabourin, R., Brazilian Bank Check Handwritten Legal Amount, Proceedings of the XIII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 00), Brazil, 2000.
[Fo73] Forney, D., The Viterbi Algorithm, Proc. IEEE, 61(3), 1973, pages
[Fr68] Freeman, H., On the encoding of arbitrary geometric configuration, IEEE Trans. Electronic Comp., EC-10, 1968, pages
[GB02] Gunter, S. and Bunke, H., Creation of classifier ensembles for handwritten word recognition using feature selection algorithms, Proceedings of IWFHR 02, Ontario, Canada, 2002, pages
[GN90] Gray, R. M. and Neuhoff, D. L., Quantization, IEEE Transactions on Information Theory, 44(6), 1990, pages
[Go97] Goski, N., Practical Compilation of Multiple Classifier, in A. C. Downton and S. Impedovo, Progress in Handwriting Recognition, World Scientific, London, 1997, pages

[GS95] Guillevic, D. and Suen, C. Y., Cursive script recognition applied to the processing of bank cheques, Proceedings of the International Conference on Document Analysis and Recognition, Montreal, Canada, August 1995, pages
[GS98] Guillevic, D. and Suen, C. Y., Recognition of legal amounts on bank cheques, Pattern Analysis and Applications, 1(1), 1998, pages
[Ha98] Hamilton, J. H., Decision Tree Construction, On-line reference: trees/4_dtrees2.html
[Ha99] Hart, D., Getting Started With MATLAB, On-line reference, d/#whatis.
[HFA90] Hinds, S. C., Fisher, J. L. and D'Amato, D. P., A Document Skew Detection Method using Run-length Encoding and Hough Transform, Proc. 10th Int. Conf. on Pattern Recognition, 1990, pages
[HK91] He, Y. and Kundu, A., 2-D shape classification using hidden Markov model, IEEE Trans. Pattern Anal. Machine Intell., 13, 1991, pages
[HLM+02] Huelsenbeck, J., Larget, B., Miller, R. and Ronquist, F., Potential applications and pitfalls of Bayesian inference of phylogeny, Syst. Biol., 51(5), 2002, pages
[Ho00] Howell, D., Getting to grips with graphic file format, Computer Publishing, 9, 2000.
[HS94] Holsheimer, M. and Siebes, A. P. J. M., Data mining: the search for knowledge in databases, Technical Report CS-R9406, CWI, P.O. Box 94079, 1090 BG Amsterdam, The Netherlands, 1994.
[Hu94] Hull, J. J., A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Machine Intell., 16(5), 1994, pages

[Hu96] Hull, J. J., Incorporating language syntax in visual text recognition with a statistical model, IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(12), Dec. 1996, pages
[IA91] Ingold, R. and Armangil, D., A top-down document analysis method for logical structure recognition, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, September 30th-October 2nd, 1991, pages
[ICD91] ICDAR 91, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, September 30th-October 2nd, 1991.
[ICDAR01] 6th International Conference on Document Analysis and Recognition (ICDAR 2001), September 2001, Seattle, WA, USA, IEEE Computer Society, 2001, ISBN
[ICDAR03] 7th International Conference on Document Analysis and Recognition (ICDAR 03), 3-6 August 2003, Edinburgh Conference Centre, Edinburgh, Scotland.
[ICDAR97] 4th International Conference on Document Analysis and Recognition (ICDAR 97), 2-Volume Set, August 18-20, 1997, Ulm, Germany, Proceedings, IEEE Computer Society, 1997, ISBN
[ICDAR99] 5th International Conference on Document Analysis and Recognition (ICDAR 1999), September 1999, Bangalore, India, IEEE Computer Society, 1999.
[IOO91] Impedovo, S., Ottaviano, L. and Occhinegro, S., Optical Character Recognition - a survey, Int. J. Pattern Recognition and Artificial Intelligence, 5(1), 1991, pages
[Is93] Ishitani, Y., Document Skew Detection Based on Local Region Complexity, Proc. of the 2nd Int. Conf. on Document Analysis and Recognition, Tsukuba Science City, Japan, Oct. 1993, pages
[IWF02] IWFHR 02, Proc. 8th Int. Workshop on Frontiers in Handwriting Recognition, Ontario, Canada, August 6-8, 2002.

[Ja91] Jampi, K., Arabic character recognition: Many approaches and one decade, Arabian J. Sc. Eng., 16(4), 1991, pages
[JDM00] Jain, A. K., Duin, R. P. W. and Mao, J., Statistical Pattern Recognition: A review, Pattern Analysis and Machine Intelligence, 22(1), 2000, pages
[JLG78] Johansson, S., Leech, G. N. and Goodluck, H., Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers, Department of English, University of Oslo, Norway, 1978.
[JSW90] Jeng, B. S., Sun, F. W. and Wu, T. M., Hidden Markov model based on optical character recognition - a novel approach, in Proc. IEEE Int. Symp. on Information Theory, San Diego, CA, Jan. 1990, pages
[KA93] Kuo, S. S. and Agazzi, O. E., Machine vision for keyword spotting using pseudo 2-D hidden Markov models, in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, Minneapolis, Minnesota, April 1993, V:81-84.
[KAN+98] Knerr, S., Augustin, E., Baret, O. and Price, D., Hidden Markov model based word recognition and its application to legal amount reading on French checks, Computer Vision and Image Understanding, 70(3), Jun. 1998, pages
[KC94] Kopec, G. E. and Chou, P. A., Document image decoding using Markov source models, IEEE Trans. Pattern Anal. Machine Intell., 16, 1994, pages
[KC99] Khorsheed, M. and Clocksin, W., Structural Features of Cursive Arabic Script, The 10th British Machine Vision Conference, University of Nottingham, Nottingham, UK, September 1999.
[Ke91] Kerpedjiev, S. M., Automatic extraction of information structures from documents, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, September 30th-October 2nd, 1991, pages
[Kh00] Khorsheed, M., A dissertation for the degree of Doctor of Philosophy, University of Cambridge, June 2000.

[Kh02] Khorsheed, M., Off-Line Arabic Character Recognition - A Review, Pattern Anal. Appl., 5(1), 2002, pages
[KHB98] Kundu, A., He, Y. and Bahl, P., Recognition of handwritten word: first and second order hidden Markov model based approach, Pattern Recognition J., 22, 1989, pages
[KHC98] Kundu, A., He, Y. and Chen, M. Y., Alternatives to variable duration HMM in handwriting recognition, IEEE Trans. Pattern Anal. Machine Intell., 20(11), Nov. 1998, pages
[KJ92] Kurdy, B. M. and Joukhadar, A., Multifont recognition system for Arabic characters, 3rd Int. Conf. and Exhibition on Multi-lingual Computing (Arabic and Roman Script), UK, 1992, pages
[Ko94] Kornai, A., Language models: where are the bottlenecks?, AISB Quarterly, 88, 1994; also in Bunke, H. and Wang, P. S. P., Handbook of Character Recognition and Document Image Analysis, World Scientific, Singapore, 1997, pages
[KP96] Kim, W. S. and Park, R. H., Off-line recognition of handwritten Korean and alphanumeric characters using hidden Markov models, Pattern Recognition, 29(5), May 1996, pages
[LBK97] Lindwurm, R., Breuer, T. and Kreuzer, K., Multi Expert System for Handprint Recognition, in A. C. Downton and S. Impedovo, Progress in Handwriting Recognition, World Scientific, London, 1997, pages
[Le99] Lee, S., Advances in Handwriting Recognition, World Scientific, UK, 1999.
[LHR89] Lee, K. F., Hon, H. W. and Reddy, R., An overview of the SPHINX speech recognition system, IEEE Proc., 77, 1989, pages
[LL91] Ljolje, A. and Levinson, S. E., Development of an acoustic-phonetic hidden Markov model for continuous speech recognition, IEEE Trans. Signal Processing, 39, 1991, pages

[LS91] Lam, S. W. and Srihari, S. N., Multi-domain document layout understanding, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, September 30th-October 2nd, 1991, pages
[LS93] Luger, G. F. and Stubblefield, W. A., Artificial Intelligence: Structures and Strategies for Complex Problem Solving, second edition, The Benjamin/Cummings Publishing Company, Inc., 1993.
[LS96] Lu, Y. and Shridhar, M., Character segmentation in handwritten words, Pattern Recognition, 29(1), 1996, pages 77-96.
[LW00] LizardWorks, Inc., On-line reference.
[MA02] Maddouri, S. and Amiri, H., Combination of Local and Global Vision Modelling for Arabic Handwritten Word Recognition, Proceedings of IWFHR 02, Ontario, Canada, 2002, pages
[Ma86] Mantas, J., An overview of character recognition methodologies, Pattern Recognition, 19(6), 1986, pages
[Ma92] Margner, V., SARAT - A system for the recognition of Arabic printed text, 11th Int. Conf. on Pattern Recognition, 1992, pages
[MB02] Marti, U. and Bunke, H., Handwritten Sentence Recognition, Proc. of the 15th Int. Conf. on Pattern Recognition, Barcelona, Spain, 2000, Vol. 3, pages
[MB99] Marti, U. and Bunke, H., A full English sentence database for off-line handwriting recognition, Proc. of the 5th Int. Conf. on Document Analysis and Recognition, ICDAR 99, Bangalore, 1999, pages
[MG96] Mohamed, M. and Gader, P., Handwritten word recognition using segmentation-free HMM and segmentation-based dynamic programming techniques, IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(5), 1996, pages

[MIB01] Melhi, M., Ipson, S. and Booth, W., A novel triangulation procedure for thinning hand-written text, Pattern Recognition Letters, 22, 2001, pages
[MLR+96] Makhoul, J., LaPre, C., Raphael, C., Schwartz, R. and Zhao, Y., Towards language-independent character recognition using speech recognition methods, in The 5th International Conference and Exhibition on Multi-Lingual Computing, Cambridge University Press, 1996.
[MSY92] Mori, S., Suen, C. Y. and Yamamoto, K., Historical review of OCR research and development, Proc. IEEE, 80(7), 1992, pages
[Na00] Nagy, G., Twenty years of document image analysis in PAMI, Pattern Analysis and Machine Intelligence, 22(1), January 2000, pages
[Na68] Nagy, G., A preliminary investigation of techniques for the automatic reading of unformatted text, Comm. ACM, 11(7), 1968, pages
[Na75] Nazif, A., A system for the recognition of the printed Arabic characters, Master's Thesis, Faculty of Engineering, Cairo University, 1975.
[Na92] Nagy, G., At the frontiers of OCR, Proc. IEEE, 80(7), 1992, pages
[Na92] Nagy, G., Optical Character Recognition and Document Image Analysis, Rensselaer Video, Clifton Park, New York, 1992.
[Na92] Nagy, G., What does a machine need to know to read a document?, Proc. of the First Annual Symp. on Document Analysis and Information Retrieval, Las Vegas, Nevada, USA, March 1992, University of Nevada, pages
[ND94] Newman, R. and Downton, A., An Offline Script and Character Recognition Toolset (OSCAR), On-line reference, ple/oscar.html, Jan 1997.

[NFK+86] Nakano, Y., Fujisawa, H., Kunisaki, O., Okada, K. and Hananoi, T., A document understanding system incorporating character recognition, Proc. 8th Int. Conf. on Pattern Recognition, 1986, pages
[NSF+90] Nakano, Y., Shima, Y., Fujisawa, H., Higashino, J. and Fujiwara, M., An Algorithm for Skew Normalization of Document Image, Proc. 10th Int. Conf. on Pattern Recognition, 2, 1990, pages
[NST80] Nough, A., Sultan, A. and Tulba, R., An approach for Arabic character recognition, J. Eng. Sc., 6(2), 1980, pages
[NU87] Nough, A., Ula, A. and Sharaf-Edin, A., Boolean recognition technique for typewritten Arabic character set, Proc. 1st King Saud Univ. Symp. on Computer Arabization, Riyadh, 1987, pages
[NWF86] Nag, R., Wong, K. H. and Fallside, F., Script recognition using hidden Markov model, in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, Tokyo, Japan, 1986, pages
[Ob94] Obaid, A. M., Arabic handwritten character recognition by neural nets, Journal on Communications, 45, July-Aug. 1994, pages
[OK95] O'Gorman, L. and Kasturi, R., Document Image Analysis, IEEE Computer Society Press, Los Alamitos, 1995.
[OM02] Pechwitz, M. and Margner, V., Baseline Estimation for Arabic Handwritten Words, Proceedings of IWFHR 02, Ontario, Canada, 2002, pages
[Pa97] Parker, J. R., Algorithms for Image Processing and Computer Vision, John Wiley & Sons, Inc., USA, 1997.
[PG00] Park, J. and Govindaraju, V., OCR in a Hierarchical Feature Space, IEEE Trans. Pattern Analysis and Machine Intelligence, 22(4), April 2000, pages

[Ph03] Phamdo, N., Vector Quantization, On-line reference.
[Pl00] Plamondon, R., On-line and Off-line Handwriting Recognition: A Comprehensive Survey, IEEE Trans. Pattern Analysis and Machine Intelligence, 22(1), January 2000, pages
[PL93] Park, H. S. and Lee, S. W., Off-line recognition of large-set handwritten Hangul with hidden Markov models, in Proc. Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, New York, May 1993, pages
[PL96] Park, H. S. and Lee, S. W., Off-line recognition of large-set handwritten characters with multiple hidden Markov models, Pattern Recognition, 29(2), Feb. 1996, pages
[PL98] Park, H. S. and Lee, S. W., A truly 2-D hidden Markov model for off-line handwritten character recognition, Pattern Recognition, 31(12), Dec. 1998, pages
[PS00] Plamondon, R. and Srihari, S. N., On-line and off-line handwriting recognition: a comprehensive survey, Pattern Analysis and Machine Intelligence, 22(1), 2000, pages
[PT81] Parhami, B. and Taraghi, M., Automatic recognition of printed Farsi text, Pattern Recog., 14(6), 1981, pages
[Qui79] Quinlan, J. R., Discovering rules by induction from large numbers of examples: a case study, in D. Michie (ed.), Expert Systems in the Micro-electronic Age, Edinburgh University Press, 1979.
[Qui86] Quinlan, J. R., Induction of decision trees, Machine Learning, vol. 1, Kluwer Academic Publishers, 1986, pages
[Ra89] Rabiner, L. R., A tutorial on hidden Markov models and selected applications in speech recognition, IEEE Proc., 77, 1989, pages

[RF97] Rahman, A. F. and Fairhurst, M. C., A New Approach to Handwritten Character Recognition using Multiple Experts, in A. C. Downton and S. Impedovo, Progress in Handwriting Recognition, World Scientific, London, 1997, pages
[RW92] Rafael, C. and Woods, R. E., Digital Image Processing, Addison-Wesley, USA, 1992.
[Sa03] Salam, Calligraphy, On-line reference, php?l=4.
[Sa92] Sabourin, M., Optical character recognition by neural network, Neural Networks, 5(5), 1992, pages
[Sa94] Saleh, A., A method of coding Arabic characters and its application to context free grammar, Pattern Recognition Letters, 15(12), 1994, pages
[Sc02] Schlosser, S. G., ERIM Arabic Document Database, On-line reference: abic_db.html.
[Sc03] Schremmer, C., Speech Recognition, Vector Quantisation, On-line reference.
[SD94] Spitz, A. L. and Dengel, A., Document Analysis Systems, World Scientific, UK, 1994.
[SK98] Sin, B. K. and Kim, J. H., Network-based approach to Korean handwriting analysis, International Journal of Pattern Recognition and Artificial Intelligence, 12(2), Mar. 1998, pages
[So96] Solheim, H. G., ID3, On-line reference.

[Sr93] Srihari, S. N., From Pixels to Paragraph: The use of contextual models in text recognition, Proc. of the Second Intl. Conf. on Document Analysis and Recognition, Tsukuba Science City, Japan, IEEE Computer Society Press, Los Alamitos, California, USA, October 1993, pages
[SR98] Senior, A. W. and Robinson, A. J., An Off-Line Cursive Handwriting Recognition System, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(3), 1998, pages
[SS91] Sokry, A., A Sequential algorithm for the segmentation of typewritten Arabic digitized text, Arabian J. Sc. and Eng., 16(4), 1991, pages
[Su90] Suen, C. Y., Proceedings of the First International Workshop on Frontiers in Handwriting Recognition, Montreal, Canada, Apr. 2-3, 1990.
[SY85] Saadallah, S. and Yacu, S., Design of an Arabic character reading machine, Proc. of Computer Processing of Arabic Language, Kuwait, 1985.
[Ta00] Taylor, S., Measuring Morphology Features in Arabic Verbs, Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX 2000), Kato Achaia, Greece, September 2000.
[Th88] Teh, C. H., On Image Analysis by the Method of Moments, IEEE Trans. Pattern Analysis and Machine Intelligence, 10(4), July 1988, pages
[TJT96] Trier, O. D., Jain, A. and Taxt, T., Feature Extraction Methods for Recognition, Pattern Recognition, 29(4), 1996, pages
[TLS96] Tang, Y., Lee, S. and Suen, C., Automatic Document Processing, Pattern Recognition, 29(12), 1996, pages
[VK92] Vlontzos, J. A. and Kung, S. Y., Hidden Markov models for character recognition, IEEE Trans. Image Processing, 1, 1992, pages

[Wa82] Wakayama, T., A core-line tracing algorithm based on maximal square moving, IEEE Trans. Pattern Anal. Machine Intell., PAMI-4, 1982, pages
[WBR02] Wang, W., Brakensiek, A. and Rigoll, G., Combination of Multiple Classifiers for Handwritten Word Recognition, Proceedings of IWFHR 02, Ontario, Canada, 2002, pages
[WCW82] Wong, K. Y., Casey, R. G. and Wahl, F. M., Document analysis system, IBM J. Research Develop., 26(6), 1982, pages
[WF00] Witten, I. and Frank, E., Data Mining, Academic Press, London, 2000.
[XKL02] Xu, Q., Kim, J. H., Lam, L. and Suen, C. Y., Recognition of Handwritten Month Words on Bank Cheques, Proceedings of the 8th IWFHR, Ontario, Canada, 2002, pages
[YNT97] Yamaguchi, S., Nagata, K. and Tsutsumida, T., Study on Multi-Expert Systems for Handprinted Numeral Recognition, in Downton, A. and Impedovo, S., Progress in Handwriting Recognition, World Scientific, UK, 1997, pages
[YTS95] Yu, C. L., Tang, Y. Y. and Suen, C. Y., Document Skew Detection Based on the Fractal and Least Squares Method, Proc. 3rd Int. Conf. on Document Analysis and Recognition, Montreal, Canada, October 1995, pages
[Zi02] Zimmermann, M., The Homepage of the IAM Database, On-line reference.

APPENDIX A

Some HMM Algorithms

A.1 Baum-Welch Re-estimation Algorithm

Like $\alpha_t(i)$, we define the backward variable $\beta_t(i)$ as

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = s_i, \lambda)$

Thus,

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

The initial conditions are

$\alpha_1(i) = \pi_i\, b_i(o_1); \qquad \beta_T(i) = 1; \qquad 1 \le i \le N$

We then define

$\gamma_t(i) = \dfrac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}; \qquad \xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$

Using these definitions, the HMM parameters are re-estimated as

$\hat{\pi}_i = \gamma_1(i)$

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

$\hat{b}_j(k) = \dfrac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$

Using the current values of $a_{ij}$, $b_j(k)$ and $\pi_i$, the quantities $\gamma$ and $\xi$ are evaluated, and these are then used to re-compute the $A$, $B$ and $\pi$ parameters iteratively.
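The re-estimation step above can be written out directly. The following is a minimal sketch for a single observation sequence and a discrete-output HMM, assuming NumPy; the function and variable names (baum_welch_step, A, B, pi, obs) are illustrative and not taken from the thesis, and no scaling is applied, so it is only suitable for short sequences.

import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete HMM (illustrative sketch).

    A: (N, N) transition matrix, B: (N, M) emission matrix,
    pi: (N,) initial distribution, obs: length-T sequence of symbol indices.
    Returns the re-estimated (A, B, pi).
    """
    N, M = B.shape
    T = len(obs)

    # Forward variable alpha_t(i) and backward variable beta_t(i)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    prob = alpha[T - 1].sum()          # P(O | lambda)

    # gamma_t(i) and xi_t(i, j) as defined above
    gamma = alpha * beta / prob
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / prob

    # Re-estimation formulas
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros((N, M))
    for k in range(M):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new

In practice the forward and backward variables are scaled, or the computation is done in the log domain, to avoid numerical underflow on long observation sequences.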

A.2 Modified Viterbi Algorithm

A formal technique for finding the best single state sequence is the Viterbi algorithm. The formal statement of the algorithm follows:

Initialization: for $1 \le i \le N$

$\delta_1(i) = \pi_i\, b_i(o_1); \qquad \Psi_1(i) = 0$

Recursion: for $2 \le t \le T$, $1 \le j \le N$

$\delta_t(j) = \max_{1 \le i \le N}\left[\delta_{t-1}(i)\, a_{ij}\right] b_j(o_t); \qquad \Psi_t(j) = \arg\max_{1 \le i \le N}\left[\delta_{t-1}(i)\, a_{ij}\right]$

Termination:

$P^* = \max_{1 \le i \le N}\left[\delta_T(i)\right]; \qquad i_T^* = \arg\max_{1 \le i \le N}\left[\delta_T(i)\right]$

Path backtracking: for $t = T-1, T-2, \dots, 1$

$i_t^* = \Psi_{t+1}(i_{t+1}^*)$

The Viterbi algorithm is an efficient search for the globally best path. In HWR, a considerable performance improvement can be obtained if knowledge of the globally second best path, the third best path, and so on, can be utilized during post-processing using modified Viterbi algorithms, which provide an ordered list of the best L state sequences.
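For completeness, a compact sketch of the Viterbi recursion above is given below, written in the log domain for numerical stability; the names are illustrative and this is not the thesis implementation. The modified, L-best variant described above is usually obtained by keeping the L best partial scores and back-pointers per state at each time step instead of a single maximum.

import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path for a discrete HMM (log domain).

    Implements the initialization, recursion, termination and
    path-backtracking steps stated above. Illustrative sketch only.
    """
    N = len(pi)
    T = len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

    delta = np.zeros((T, N))           # best log-score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers

    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA        # delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]

    # Termination and path backtracking
    path = np.zeros(T, dtype=int)
    path[T - 1] = delta[T - 1].argmax()
    best_log_prob = delta[T - 1].max()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, best_log_prob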

APPENDIX B

Pre-processing Implementation

This appendix illustrates the pre-processing stage implemented on the Arabic words used in cheque-filling applications. It presents example images of the pre-processing and segmentation of the Arabic word "Ahad", written by approximately one hundred writers, followed by the feature vectors associated with each word.

B.1 Preprocessing Images
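The thesis does not reproduce code for this stage; purely as an illustration of the kind of operations involved (grey-scale loading, binarization and size normalization), a minimal sketch is given below. It assumes OpenCV and NumPy as dependencies, and the function name, file name and fixed height are arbitrary choices, not taken from the thesis. The example images follow.

import cv2
import numpy as np

def preprocess_word_image(path, height=64):
    """Binarize a scanned word image and normalize its height.

    A stand-in sketch for the pre-processing illustrated in this appendix:
    grey-scale load, Otsu binarization, crop to the ink bounding box and
    scale to a fixed height while keeping the aspect ratio.
    """
    grey = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu threshold; THRESH_BINARY_INV makes ink pixels white (255)
    _, binary = cv2.threshold(grey, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(binary)
    cropped = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    scale = height / cropped.shape[0]
    width = max(1, int(round(cropped.shape[1] * scale)))
    return cv2.resize(cropped, (width, height), interpolation=cv2.INTER_NEAREST)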

[Pre-processing and segmentation example images for the word "Ahad".]

B.2 Feature Vectors for the Words "Ahad" and "Ahda"

Training Data: ahda001.tif


More information

International Journal of Advance Research in Engineering, Science & Technology

International Journal of Advance Research in Engineering, Science & Technology Impact Factor (SJIF): 4.542 International Journal of Advance Research in Engineering, Science & Technology e-issn: 2393-9877, p-issn: 2394-2444 Volume 4, Issue 4, April-2017 Planning & Design Of New By-pass

More information

A Novel Approach to Predicting the Results of NBA Matches

A Novel Approach to Predicting the Results of NBA Matches A Novel Approach to Predicting the Results of NBA Matches Omid Aryan Stanford University aryano@stanford.edu Ali Reza Sharafat Stanford University sharafat@stanford.edu Abstract The current paper presents

More information

Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight

Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight Michelin Chabot (michelinchabot@gmail.com) February 2015 Abstract The article entitled Improved Cube Handling in

More information

Linear Compressor Suction Valve Optimization

Linear Compressor Suction Valve Optimization Purdue University Purdue e-pubs International Compressor Engineering Conference School of Mechanical Engineering 2016 Linear Compressor Suction Valve Optimization Rinaldo Puff Embraco, Brazil, rinaldo.puff@embraco.com

More information

Analysis of Shear Lag in Steel Angle Connectors

Analysis of Shear Lag in Steel Angle Connectors University of New Hampshire University of New Hampshire Scholars' Repository Honors Theses and Capstones Student Scholarship Spring 2013 Analysis of Shear Lag in Steel Angle Connectors Benjamin Sawyer

More information

C. Mokkapati 1 A PRACTICAL RISK AND SAFETY ASSESSMENT METHODOLOGY FOR SAFETY- CRITICAL SYSTEMS

C. Mokkapati 1 A PRACTICAL RISK AND SAFETY ASSESSMENT METHODOLOGY FOR SAFETY- CRITICAL SYSTEMS C. Mokkapati 1 A PRACTICAL RISK AND SAFETY ASSESSMENT METHODOLOGY FOR SAFETY- CRITICAL SYSTEMS Chinnarao Mokkapati Ansaldo Signal Union Switch & Signal Inc. 1000 Technology Drive Pittsburgh, PA 15219 Abstract

More information

THE 2018 ROSENTHAL PRIZE for Innovation in Math Teaching. Geometry Project: DARTBOARD

THE 2018 ROSENTHAL PRIZE for Innovation in Math Teaching. Geometry Project: DARTBOARD THE 2018 ROSENTHAL PRIZE for Innovation in Math Teaching Geometry Project: DARTBOARD Geometric Probability Theoretical Probability and Experimental Probability Elizabeth Masslich Geometry grades 6-12 Table

More information

VALVE CRITICALITY MODELING

VALVE CRITICALITY MODELING Casey, R., Boulos, P., Orr, C., and Bros, C. (2008). Valve Criticality Modeling. Eighth Annual ASCE Water Distribution Systems Analysis Symposium, Cincinnati, Ohio, August 27-30, 2006: pp. 1-8. VALVE CRITICALITY

More information

Using MATLAB with CANoe

Using MATLAB with CANoe Version 2.0 2017-03-09 Application Note AN-IND-1-007 Author Restrictions Abstract Vector Informatik GmbH Public Document This application note describes the usage of MATLAB /Simulink combined with CANoe.

More information

Iteration: while, for, do while, Reading Input with Sentinels and User-defined Functions

Iteration: while, for, do while, Reading Input with Sentinels and User-defined Functions Iteration: while, for, do while, Reading Input with Sentinels and User-defined Functions This programming assignment uses many of the ideas presented in sections 6 and 7 of the course notes. You are advised

More information

Exhibit 1 PLANNING COMMISSION AGENDA ITEM

Exhibit 1 PLANNING COMMISSION AGENDA ITEM Exhibit 1 PLANNING COMMISSION AGENDA ITEM Project Name: Grand Junction Circulation Plan Grand Junction Complete Streets Policy Applicant: City of Grand Junction Representative: David Thornton Address:

More information

Vibration Analysis and Test of Backup Roll in Temper Mill

Vibration Analysis and Test of Backup Roll in Temper Mill Sensors & Transducers 2013 by IFSA http://www.sensorsportal.com Vibration Analysis and Test of Backup Roll in Temper Mill Yuanmin Xie College of Machinery and Automation, Wuhan University of Science and

More information

Lecture 10. Support Vector Machines (cont.)

Lecture 10. Support Vector Machines (cont.) Lecture 10. Support Vector Machines (cont.) COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Soft margin SVM Intuition and problem

More information

Line Following with RobotC Page 1

Line Following with RobotC Page 1 Line Following with RobotC Page 1 Line Following with By Michael David Lawton Introduction Line following is perhaps the best way available to VEX Robotics teams to quickly and reliably get to a certain

More information

Analysis of Run-Off-Road Crashes in Relation to Roadway Features and Driver Behavior

Analysis of Run-Off-Road Crashes in Relation to Roadway Features and Driver Behavior Analysis of Run-Off-Road Crashes in Relation to Roadway Features and Driver Behavior Ertan Örnek University of Wisconsin, Madison Traffic Operations Lab 141 NW Barstow Street Waukesha, WI 53187 ornek@wisc.edu

More information

67. Sectional normalization and recognization on the PV-Diagram of reciprocating compressor

67. Sectional normalization and recognization on the PV-Diagram of reciprocating compressor 67. Sectional normalization and recognization on the PV-Diagram of reciprocating compressor Jin-dong Wang 1, Yi-qi Gao 2, Hai-yang Zhao 3, Rui Cong 4 School of Mechanical Science and Engineering, Northeast

More information

DRAFT. Lecompton. Table of Contents. Background

DRAFT. Lecompton. Table of Contents. Background Lecompton Table of Contents Existing Infrastructure 98 Funding 100 Policy and Program Recommendations 102 Infrastructure Recommendations 104 Conclusion 109 Douglas County Lecompton Background Located at

More information