Automatic Digital Document Processing and Management (eBook)
XXVI, 297 pages
Springer London (publisher)
978-0-85729-198-1 (ISBN)
This text reviews the issues involved in handling and processing digital documents. Examining the full range of a document's lifetime, the book covers acquisition, representation, security, pre-processing, layout analysis, understanding, analysis of single components, information extraction, filing, indexing and retrieval. Features: provides a list of acronyms and a glossary of technical terms; contains appendices covering key concepts in machine learning, and providing a case study on building an intelligent system for digital document and library management; discusses issues of security, and legal aspects of digital documents; examines core issues of document image analysis, and image processing techniques of particular relevance to digitized documents; reviews the resources available for natural language processing, in addition to techniques of linguistic analysis for content handling; investigates methods for extracting and retrieving data/information from a document.
Foreword
Preface
Acknowledgments
Contents
Acronyms
Digital Documents
Documents
A Juridic Perspective
History and Trends
Current Landscape
Types of Documents
Document-Based Environments
Document Processing Needs
References
Digital Formats
Compression Techniques
RLE (Run Length Encoding)
Huffman Encoding
LZ77 and LZ78 (Lempel-Ziv)
LZW (Lempel-Ziv-Welch)
DEFLATE
Non-structured Formats
Plain Text
ASCII
ISO Latin
UNICODE
UTF
Images
Color Spaces
RGB
YUV/YCbCr
CMY(K)
HSV/HSB and HLS
Comparison among Color Spaces
Raster Graphics
BMP (BitMaP)
GIF (Graphics Interchange Format)
TIFF (Tagged Image File Format)
JPEG (Joint Photographic Experts Group)
PNG (Portable Network Graphics)
DjVu (DejaVu)
Vector Graphic
SVG (Scalable Vector Graphic)
Layout-Based Formats
PS (PostScript)
PDF (Portable Document Format)
Content-Oriented Formats
Tag-Based Formats
HTML (HyperText Markup Language)
XML (eXtensible Markup Language)
Office Formats
ODF (OpenDocument Format)
References
Legal and Security Aspects
Cryptography
Basics
Short History
Digital Cryptography
DES (Data Encryption Standard)
IDEA (International Data Encryption Algorithm)
Key Exchange Method
RSA (Rivest, Shamir, Adleman)
DSA (Digital Signature Algorithm)
Message Fingerprint
SHA (Secure Hash Algorithm)
Digital Signature
Management
DSS (Digital Signature Standard)
OpenPGP Standard
Trusting and Certificates
Legal Aspects
A Law Approach
Public Administration Initiatives
Digital Signature
Certified e-mail
Electronic Identity Card & National Services Card
Telematic Civil Proceedings
References
Document Analysis
Image Processing
Basics
Convolution and Correlation
Color Representation
Color Space Conversions
RGB-YUV
RGB-YCbCr
RGB-CMY(K)
RGB-HSV
RGB-HLS
Colorimetric Color Spaces
XYZ
L*a*b*
Color Depth Reduction
Desaturation
Grayscale (Luminance)
Black & White (Binarization)
Otsu Thresholding
Content Processing
Geometrical Transformations
Edge Enhancement
Derivative Filters
Connectivity
Flood Filling
Border Following
Dilation and Erosion
Opening and Closing
Edge Detection
Canny
Hough Transform
Polygonal Approximation
Snakes
References
Document Image Analysis
Document Structures
Spatial Description
4-Intersection Model
Minimum Bounding Rectangles
Logical Structure Description
DOM (Document Object Model)
Pre-processing for Digitized Documents
Document Image Defect Models
Deskewing
Dewarping
Segmentation-Based Dewarping
Content Identification
Optical Character Recognition
Tesseract
JTOCR
Segmentation
Classification of Segmentation Techniques
Pixel-Based Segmentation
RLSA (Run Length Smoothing Algorithm)
RLSO (Run-Length Smoothing with OR)
X-Y Trees
Block-Based Segmentation
The DOCSTRUM
The CLiDE (Chemical Literature Data Extraction) Approach
Background Analysis
RLSO on Born-Digital Documents
Document Image Understanding
Relational Approach
INTHELEX (INcremental THEory Learner from EXamples)
Description
DCMI (Dublin Core Metadata Initiative)
References
Content Processing
Natural Language Processing
Resources-Lexical Taxonomies
WordNet
WordNet Domains
Senso Comune
Tools
Tokenization
Language Recognition
Stopword Removal
Stemming
Suffix Stripping
Part-of-Speech Tagging
Rule-Based Approach
Word Sense Disambiguation
Lesk's Algorithm
Yarowsky's Algorithm
Parsing
Link Grammar
References
Information Management
Information Retrieval
Performance Evaluation
Indexing Techniques
Vector Space Model
Query Evaluation
Relevance Feedback
Dimensionality Reduction
Latent Semantic Analysis and Indexing
Concept Indexing
Image Retrieval
Keyword Extraction
TF-ITP
Naive Bayes
Co-occurrence
Text Categorization
A Semantic Approach Based on WordNet Domains
Information Extraction
WHISK
A Multistrategy Approach
The Semantic Web
References
Appendix A A Case Study: DOMINUS
General Framework
Actors and Workflow
Architecture
Functionality
Input Document Normalization
Layout Analysis
Kernel-Based Basic Blocks Grouping
Document Image Understanding
Categorization, Filing and Indexing
Prototype Implementation
Exploitation for Scientific Conference Management
GRAPE
Appendix B Machine Learning Notions
Categorization of Techniques
Noteworthy Techniques
Artificial Neural Networks
Decision Trees
k-Nearest Neighbor
Inductive Logic Programming
Naive Bayes
Hidden Markov Models
Clustering
Experimental Strategies
k-Fold Cross-Validation
Leave-One-Out
Random Split
Glossary
Bounding box
Byte ordering
Ceiling function
Chunk
Connected component
Heaviside unit function
Heterarchy
KL-divergence
Linear regression
Run
Scanline
References
Index
Published (per publisher) | 3 January 2011 |
---|---|
Series | Advances in Computer Vision and Pattern Recognition |
Additional info | XXVI, 297 p. |
Place of publication | London |
Language | English |
Subject areas | Computer Science ► Graphics / Design ► Digital Image Processing |
 | Social Sciences ► Communication / Media ► Book Trade / Library Science |
ISBN-10 | 0-85729-198-X / 085729198X |
ISBN-13 | 978-0-85729-198-1 / 9780857291981 |
Digital Rights Management: DRM-free
This eBook contains no DRM or copy protection. Passing it on to third parties is nevertheless not legally permitted, because your purchase grants only the rights for personal use.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables and figures. A PDF can be displayed on almost all devices, but is only of limited use on small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not, however, compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.