Guide to OCR for Indic Scripts (eBook)

Document Recognition and Retrieval

Venu Govindaraju, Srirangaraj (Ranga) Setlur (Editors)

eBook Download: PDF

2009
XXI, 325 pages
Springer London (Publisher)
978-1-84800-330-9 (ISBN)

This is the first comprehensive text on Optical Character Recognition for Indic scripts. It covers both the recognition and the retrieval of Indic documents and describes OCR systems for eight scripts: Bangla, Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Tamil and Urdu.

Foreword 4
Preface 6
1 Part I: Recognition of Indic Scripts 9
2 Part II: Retrieval of Indic Documents 11
3 Target Audience 11
Acknowledgments 13
Contents 14
Contributors 16
Part I Recognition of Indic Scripts 19
Building Data Sets for Indian Language OCR Research 20
1 Introduction 20
2 Datasets 21
2.1 Image Corpus 21
2.1.1 Digitization 22
2.1.2 Processing and Storage 22
2.2 Text Corpus 23
2.3 Annotated Data Sets 23
3 Annotation 24
3.1 Hierarchical Annotation 26
3.1.1 Different Levels of Annotation 26
3.1.2 Methods of Annotation 27
3.2 Annotation Process 28
3.2.1 Segmentation 28
3.2.2 Components Labeling 29
3.2.3 Annotation Tools 31
4 Representation and Access 32
4.1 Sources of Metainformation 33
4.2 Recognizer-Specific Metainformation 34
4.3 Digitization Meta Information 34
4.4 Annotation Data 35
4.4.1 Page Structure Information 36
4.4.2 Text Block Structure Information 36
4.4.3 Akshara Structure Information 37
4.5 Representation Issues 37
4.5.1 Complex Layout 37
4.5.2 Indian Language Script Issues 37
4.6 Data Access 38
5 Implementation and Execution 39
5.1 Organization of Tasks 39
5.2 Status of the Data Sets 40
6 Conclusions 40
References 41
On OCR of Major Indian Scripts: Bangla and Devanagari 43
1 Introduction 43
2 Basic OCR System 45
2.1 Group and Individual Character Classifiers 48
3 Quantification of Errors 50
4 Post-recognition Error Correction 52
4.1 Forward-Backward Error Correction Scheme 53
5 Discussion 57
References 57
A Complete Machine-Printed Gurmukhi OCR System 59
1 Introduction 59
2 Characteristics of Gurmukhi Script 60
2.1 Character Set 60
2.2 Connectivity of Symbols 60
2.3 Word Partitioning into Zones 61
2.4 Frequently Touching Characters 62
2.5 Broken Characters and Headlines 62
2.6 Similarity of Group of Symbols 62
3 System Overview 62
4 Digitization and Pre-processing 62
5 Splitting Text into Horizontal Text Strips 64
6 Word Segmentation 67
7 Sub-division of Strips into Smaller Units 68
8 Repairing the Word Shape 69
9 Thinning 70
10 Repairing Broken Characters 72
11 Character Segmentation 74
11.1 Touching Characters 77
12 Recognition Stage 78
12.1 Feature Extraction 78
12.2 Classification 80
12.2.1 Design of the Binary Tree Classifier 81
12.3 Merging Sub-symbols 81
13 Post-Processing 84
13.1 Check for the Existence of a Word in the Corpus 84
13.2 Perform Holistic Recognition of a Word 84
14 Experimental Results 85
15 Conclusion 86
References 87
Progress in Gujarati Document Processing and Character Recognition 88
1 Introduction 88
2 Gujarati Script: OCR Perspective 89
3 Segmentation 91
4 Zone Boundary Identification 92
4.1 Using Slopes of the Imaginary Lines Joining Top Left (Bottom Right) Corners 93
4.2 Dynamic Programming Approach 95
5 Extracting Recognizable Units 98
6 Recognition 98
6.1 Feature Extraction 99
6.1.1 Fringe Map 100
6.1.2 Discrete Cosine Transform 100
6.1.3 Wavelet Transform 101
6.1.4 Zone Information 102
6.1.5 Aspect Ratio 102
6.2 Classification 102
6.2.1 Nearest Neighbor Classifier 102
6.2.2 Artificial Neural Networks [25, 26] 103
6.2.3 Multi-layer Perceptron (MLP) [25] 103
6.2.4 Radial Basis Function (RBF) Networks 103
6.2.5 General Regression Neural Network (GRNN) 104
6.3 Experimental Setup and Results 106
7 Text Generation 107
8 Post-processing 108
9 Conclusion 108
References 109
Design of a Bilingual Kannada-English OCR 111
1 Introduction 111
2 Kannada Script 112
3 Segmentation 112
3.1 Line Segmentation Based on Connected Components 114
3.2 Word and Character Segmentation 115
4 Script Recognition 115
4.1 Gabor and DCT-Based Identification 116
4.2 Results of Script Identification 117
5 Component Classification 119
5.1 Introduction 119
5.2 Graph Representations for Components 120
5.3 Distance Measures 122
5.4 Classification Strategy 123
5.5 Training 123
5.6 Prediction 124
5.7 Experiments, Results and Discussion 124
5.7.1 Data Sets 124
5.7.2 Features for SVM Classifiers 126
5.7.3 Pre-processing 128
5.7.4 Results and Discussions 128
6 Conclusion 137
References 137
Recognition of Malayalam Documents 139
1 Introduction 139
1.1 The Malayalam Language 140
1.1.1 Origin 140
1.1.2 Literary Culture 140
1.1.3 Word and Sentence Formation 141
1.2 The Malayalam Script 141
1.2.1 Script Revision 143
1.3 Evolution of Printing and Publication 144
1.4 Challenges in Malayalam Recognition 145
2 Character Recognition 146
2.1 Overview of the Approach 146
2.2 Design Guidelines 147
2.3 Features for Component Classification 148
2.4 Classifier Design 148
2.5 Beyond Recognition of Isolated Symbols 150
3 Recognition of Online Handwriting 151
3.1 Stroke Recognition 152
3.1.1 Dealing with Similar Strokes 153
3.2 Word Recognizer 154
4 Experimental Results 154
4.1 Overview of the Data Set 154
4.2 Classifier and Feature Comparisons 155
4.3 Recognition of Online Handwriting 157
5 Conclusions 158
References 159
A Complete OCR System for Tamil Magazine Documents 161
1 Introduction and Background 161
1.1 Preprocessing 162
1.1.1 Skew Estimation 163
1.1.2 Binarization 163
1.2 Page Segmentation and Classification 163
1.2.1 Page Segmentation 163
1.2.2 Block Classification 164
1.3 Optical Character Recognition (OCR) 164
1.3.1 Character Segmentation 164
1.3.2 Character Recognition 165
1.4 Logical Structure 165
1.4.1 Document Models 166
2 Preprocessing 166
2.1 Image Size Reduction 166
2.2 Skew Correction 167
2.2.1 Text Recognition 167
2.2.2 Skew Estimation 168
2.3 Binarization 168
2.4 Noise Removal 168
3 Segmentation and Classification 168
3.1 Page Segmentation 169
3.2 Classification of the Blocks 169
4 Optical Character Recognition 170
4.1 Line, Word, and Character Segmentation 170
4.2 Recognition of Characters 171
5 Reconstruction of the Document Image 171
5.1 Logical Structure Derivation 171
5.2 Reconstruction into HTML Format 172
6 Results and Conclusions 172
6.1 Results 173
6.2 Conclusions 174
References 175
Experiments on Urdu Text Recognition 177
1 Introduction 177
2 Urdu Language Resources 180
3 Prior Work in Urdu Recognition Systems 181
4 Prior Work in Urdu Document Preprocessing 182
5 Experiments 183
References 184
The BBN Byblos Hindi OCR System 186
1 Introduction 186
1.1 Background 186
1.2 Review of Basic OCR System 187
1.3 Model Training and Recognition 188
2 Data 189
2.1 Hindi Character Set 189
2.2 Corpus 191
3 Experimental Results 191
3.1 Model Configuration 191
3.2 Recognition Performance 192
4 Conclusions 192
References 193
Generalization of Hindi OCR Using Adaptive Segmentation and Font Files 194
1 Introduction 194
1.1 Challenges of Segmentation 195
1.2 Feature Extraction and Classification 196
2 Base Devanagari OCR System 197
2.1 Background 197
2.2 System Design 198
2.3 Character Segmentation 200
2.3.1 Devanagari Script Overview 200
2.3.2 Hindi Character Segmentation 200
2.4 Feature Extraction 206
2.5 Classification 208
2.5.1 Template Matching 208
2.5.2 Generalized Hausdorff Image Comparison (GHIC) 208
2.5.3 Nearest Neighbor Classifier and Weighted Euclidean Distance 209
2.5.4 Hierarchical Classification 209
2.6 Devanagari OCR Evaluation 210
2.7 Additional Challenges 210
3 Font-Based Intelligent Character Segmentation 212
3.1 Benefits and Font Models 212
3.2 Training Using Font Files 214
3.3 Segmentation and Recognition 214
4 Experiments 215
4.1 Data Sets 216
4.2 Protocols for Evaluation 217
4.3 Character Segmentation 217
4.4 Feature Extraction 217
4.5 Recognition Results 218
5 Conclusion and Future Work 218
References 219
Online Handwriting Recognition for Indic Scripts 221
1 Introduction 221
2 The Structure of Indic Scripts 222
3 Challenges for Online HWR 224
3.1 Large Alphabet Size 224
3.2 Two-Dimensional Structure 225
3.3 Inter-class Similarity 225
3.4 Issues with Writing Styles 226
3.5 Language-Specific and Regional Differences in Usage 227
4 Recognition of Isolated Characters 228
4.1 Strategies 229
4.2 Preprocessing 230
4.3 Features 230
4.4 Classification 231
5 Word Recognition 234
5.1 Preprocessing 235
5.2 Analytic Approaches Based on Explicit Segmentation 235
5.3 Analytic Approaches Based on Implicit Segmentation 236
5.4 Holistic Approaches 237
5.5 Language Models 238
6 Applications 238
7 Resources 240
7.1 Data Set Standards 241
7.2 Tools 241
7.3 Data Sets 242
8 Summary 242
References 243
Part II Retrieval of Indic Documents 247
Enhancing Access to Primary Cultural Heritage Materials of India 248
1 Introduction 248
2 Linguistic Tools 251
3 Image-Processing Tools 256
Digital Image Enhancement of Indic Historical Manuscripts 259
1 Introduction 259
2 Image Enhancement 261
2.1 Background Normalization 261
2.1.1 Background Normalization Using a Piece-Wise Linear Model 262
2.1.2 Background Normalization Using a Nonlinear Model 264
2.2 Image Normalization 266
2.3 Background Normalization for Color Images 267
2.4 Color Document Image Enhancement 268
3 Experiments 269
4 Extract Text Lines from Images 270
4.1 ALCM Method 272
4.1.1 ALCM Transform 272
4.1.2 Locations of Possible Text Lines 274
4.1.3 Extraction of Text 275
5 Conclusion 276
References 276
GFG-Based Compression and Retrieval of Document Images in Indian Scripts 278
1 Introduction 278
2 Geometric Feature Graph (GFG) of a Word Image 280
2.1 GFG Extraction 281
2.2 Converting the GFG to a String Representation 282
2.3 Reconstruction of Word Images Using GFG 283
2.4 GFG Compression 284
3 GFG-Based Indexing 285
4 Latent Semantic Indexing Using GFG 285
4.1 Results of Using LSA and PLSA 287
5 Ontology-Based Access with GFG 290
5.1 Concept-Driven Document Image Retrieval 290
5.2 Results 291
6 Conclusion 292
References 293
Word Spotting for Indic Documents to Facilitate Retrieval 294
1 Introduction 294
2 Related Work 296
3 Proposed Methodologies 297
3.1 Recognition-Based Keyword Spotting 297
3.1.1 Performance 302
3.2 Recognition-Free Keyword Spotting 303
3.2.1 Performance 307
4 Conclusion 307
References 308
Indian Language Information Retrieval 309
1 Introduction 309
1.1 Background 311
2 Overview of Indian Language IR 311
2.1 Information Sources 311
2.2 Research Efforts 312
2.2.1 Text Retrieval 313
2.2.2 Information Extraction 316
2.2.3 Question Answering 317
2.2.4 Topic Detection and Tracking 317
2.2.5 Indian Language Subtrack at CLEF 2007 318
3 The CLIA Project 319
3.1 The Forum for Information Retrieval Evaluation (FIRE) 320
4 Conclusion 320
References 321
Colour Plates 323
Index 329

Publication date (per publisher)	25.9.2009
Series	Advances in Computer Vision and Pattern Recognition
Additional info	XXI, 325 p. 161 illus., 11 illus. in color.
Place of publication	London
Language	English
Subject area	Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
Keywords	Digital Libraries • Document Retrieval • Handwriting Recognition • Indic Scripts • OCR • Text Recognition
ISBN-10	1-84800-330-7 / 1848003307
ISBN-13	978-1-84800-330-9 / 9781848003309

PDF (watermark)

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost all devices but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.