Similarity Search - Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Blick ins Buch

Similarity Search (eBook)

The Metric Space Approach

Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko (Autoren)

eBook Download: PDF

2006 | 2006
XVII, 220 Seiten
Springer US (Verlag)
978-0-387-29151-2 (ISBN)

Lese- und Medienproben

Ebook-Leseprobe (PDF)

The area of similarity searching is a very hot topic for both research and c- mercial applications. Current data processing applications use data with c- siderably less structure and much less precise queries than traditional database systems. Examples are multimedia data like images or videos that offer query by example search, product catalogs that provide users with preference based search, scientific data records from observations or experimental analyses such as biochemical and medical data, or XML documents that come from hetero- neous data sources on the Web or in intranets and thus does not exhibit a global schema. Such data can neither be ordered in a canonical manner nor meani- fully searched by precise database queries that would return exact matches. This novel situation is what has given rise to similarity searching, also - ferred to as content based or similarity retrieval. The most general approach to similarity search, still allowing construction of index structures, is modeled in metric space. In this book. Prof. Zezula and his co authors provide the first monograph on this topic, describing its theoretical background as well as the practical search tools of this innovative technology.

Contents 7
Foreword 13
Preface 15
Acknowledgments 17
PART I METRIC SEARCHING IN A NUTSHELL 18
Chapter 1 FOUNDATIONS OF METRIC SPACE SEARCHING 22
1. The Distance Searching Problem 23
2. The Metric Space 25
3. Distance Measures 26
3.1 Minkowski Distances 27
3.2 Quadratic Form Distance 28
3.3 Edit Distance 29
3.4 Tree Edit Distance 30
3.5 Jaccard's Coefficient 30
3.6 Hausdorff Distance 31
3.7 Time Complexity 31
4. Similarity Queries 32
4.1 Range Query 32
4.2 Nearest Neighbor Query 33
4.3 Reverse Nearest Neighbor Query 34
4.4 Similarity Join 34
4.5 Combinations of Queries 35
4.6 Complex Similarity Queries 35
5. Basic Partitioning Principles 37
5.1 Ball Partitioning 37
5.2 Generalized Hyperplane Partitioning 38
5.3 Excluded Middle Partitioning 38
6. Principles of Similarity Query Execution 39
6.1 Basic Strategies 39
6.2 Incremental Similarity Search 42
7. Policies for Avoiding Distance Computations 43
7.1 Explanatory Example 44
7.2 Object-Pivot Distance Constraint 45
7.3 Range-Pivot Distance Constraint 47
7.4 Pivot-Pivot Distance Constraint 48
7.5 Double-Pivot Distance Constraint 50
7.6 Pivot Filtering 51
8. Metric Space Transformations 52
8.1 Metric Hierarchies 53
8.1.1 Lower-Bounding Functions 53
8.2 User-Defined Metric Functions 55
8.2.1 Searching Using Lower-Bounding Functions 55
8.3 Embedding Metric Space 56
8.3.1 Embedding Examples 56
8.3.2 Reducing Dimensionality 57
9. Approximate Similarity Search 58
9.1 Principles 58
9.2 Generic Algorithms 61
9.3 Measures of Performance 63
9.3.1 Improvement in Efficiency 63
9.3.2 Precision and Recall 63
9.3.3 Relative Error on Distances 65
9.3.4 Position Error 66
10. Advanced Issues 67
10.1 Statistics on Metric Datasets 68
10.1.1 Distribution and Density Functions 68
10.1.2 Distance Distribution and Density 69
10.1.3 Homogeneity of Viewpoints 71
10.2 Proximity of Ball Regions 72
10.3 Performance Prediction 75
10.4 Tree Quality IMeasures 77
10.5 Choosing Reference Points 80
Chapter 2 SURVEY OF EXISTING APPROACHES 84
1. Ball Partitioning IMethods 84
1.1 Burkhard- Keller Tree 85
1.2 Fixed Queries Tree 86
1.3 Fixed Queries Array 87
1.4 Vantage Point Tree 89
1.5 Excluded Middle Vantage Point Forest 92
2. Generalized Hyperplane Partitioning Approaches 93
2.1 Bisector Tree 93
2.2 Generalized Hyperplane Tree 94
3. Exploiting Pre- Computed Distances 95
3.1 AESA 95
3.2 Linear AESA 96
3.3 Other IMethods 97
4. Hybrid Indexing Approaclies 98
4.1 Multi Vantage Point Tree 98
4.2 Geometric Near-neighbor Access Tree 99
4.3 Spatial Approximation Tree 102
4.4 M-tree 104
4.5 Similarity Hashing 105
5. Approximate Similarity Search 106
5.1 Exploiting Space Transformations 106
5.2 Approximate Nearest Neighbors with BBD Trees 107
5.3 Angle Property Technique 109
5.4 Clustering for Indexing 111
5.5 Vector Quantization Index 112
5.6 Buoy Indexing 114
5.7 Hierarchical Decomposition of IMetric Spaces 114
5.7.1 Relative Error Approximation 115
5.7.2 Good Fraction Approximation 115
5.7.3 Small Chance Improvement Approximation 115
5.7.4 Proximity-Based Approximation 116
5.7.5 PAC Nearest Neighbor Search 116
PART II METRIC SEARCHING IN LARGE COLLECTIONS OF DATA 118
Chapter 3 CENTRALIZED INDEX STRUCTURES 122
1. M-tree Family 122
1.1 The M-tree 122
1.2 Bulk-Loading Algorithm of IVl-tree 126
1.3 Multi-Way Insertion Algorithm 129
1.4 The Slim Tree 130
1.4.1 Slim-Down Algorithm 131
1.4.2 Generalized Slim-Down Algorithm 133
1.5 Pivoting M-tree 135
1.6 The M+-tree 138
1.7 The M2 -tree 141
2. Hash-based metric indexing 142
2.1 The D-index 143
2.1.1 Insertion and Search Strategies 146
2.2 The eD-index 148
2.2.1 Similarity Self-Join Algorithm with eD-index 150
3. Performance Trials 153
3.1 Datasets and Distance Measures 154
3.2 Performance Comparison 155
3.3 Different Query Types 157
3.4 Scalability 158
Chapter 4 APPROXIMATE SIMILARITY SEARCH 162
1. Relative Error Approximation 162
2. Good Fraction Approximation 165
3. Small Chance Improvement Approximation 167
4. Proximity-Based Approximation 169
5. PAC Nearest Neighbor Searching 170
6. Performance Trials 171
6.1 Range Queries 172
6.2 Nearest Neighbors Queries 173
6.3 Global Considerations 176
Chapter 5 PARALLEL AND DISTRIBUTED INDEXES 178
1. Preliminaries 178
1.1 Parallel Computing 179
1.2 Distributed Computing 180
1.2.1 Scalable and Distributed Data Structures 180
1.2.2 Peer-to-Peer Data Networks 181
2. Processing M-trees with Parallel Resources 181
2.1 CPU Parallelism 182
2.2 I/O Parallelism 182
2.3 Object Declustering in M-trees 184
3. Scalable Distributed Similarity Search Structure 184
3.1 Architecture 185
3.2 Address Search Tree 186
3.3 Storage Management 186
3.3.1 Bucket Splitting 187
3.3.2 Choosing Pivots 188
3.4 Insertion of Objects 188
3.5 Range Search 189
3.6 Nearest Neighbor Search 190
3.7 Deletions and Updates of Objects 191
3.8 Image Adjustment 192
3.9 Logarithmic Replication Strategy 194
3.10 Joining the Peer-to-Peer Network 195
3.11 Leaving the Peer-to-Peer Network 195
4. Performance Trials 196
4.1 Datasets and Computing Infrastructure 197
4.2 Performance of Similarity Queries 197
4.2.1 Global Costs 198
4.2.3 Comparison of Search Algorithms 205
4.3 Data Volume Scalability 206
Concluding Summary 210
References 214
Author Index 228
Index 232
Abbreviations 236

Chapter 3 CENTRALIZED INDEX STRUCTURES (p. 105-106)

Most existing search structures have been designed to run on a single computer. Let us call them centralized structures. They are built with different assumptions about type of distance function (discrete vs. continuous), form of query (range, nearest neighbor, etc.), and temporal properties (static or dynamic) of the data to be organized. Although many index structures have been proposed as main memory structures, there are several indexes which organize data using disk storage to allow the processing of a large volume of data. In what follows, we focus on two basic approaches which store objects in secondary memories. Specifically, we discuss tree-based structures and methods which employ hashing (i.e., the key-to-address transformation) principles.

1. M-tree Family

[Ciaccia et al., 1997b] have proposed a dynamic organization, called the M-tree, which can handle data files that vary dynamically in size, i.e., in cases when insertion and deletion of objects is frequent. In contrast to other metric tree-based structures, the M-tree is built bottom-up and maintains the same length for all tree paths because the tree is balanced. This paradigm has become very popular and many researches have developed extensions of the M-tree storage structure with a main objective of increasing search efficiency, sometimes conditioned by specific application requirements. We start with the original idea of the M-tree, then describe its most important extensions.

1.1 The M-tree

Most of the indexing methods described in Chapter 2 are either static, unbalanced, or both. They are not very suitable for dynamic environments where data is subject to permanent alteration, nor for large data repositories where disk- based techniques are necessary. The M-tree is, by nature, designed as a dynamic and balanced index structure capable of organizing data stored on a disk. By building the tree in a bottom-up fashion from its leaves to its root, the M-tree shares some similarities with R-trees [Guttman, 1984] and B-trees [Comer, 1979].

This concept results in a balanced tree structure independent of the number of insertions or deletions and has a positive impact on query execution. In general, the M-tree behaves like the R-tree. All objects are stored in (or referenced from) leaf nodes while internal nodes keep pointers to nodes at the next level, together with additional information about their subtrees. Recall that R-trees store minimum bounding rectangles in non-leaf nodes that cover their subtrees. In general metric spaces, we cannot define such bounding rectangles because a coordinate system is lacking.
Thus M-trees use an object called a pivot, and a covering radius, to form a bounding ball region. In the M-tree, pivots play a role similar to that in the GNAT access structure [Brin, 1995], but unlike in GNAT, all objects are stored in leaves. Because pre-selected objects are used, the same object may be present several times in the M-tree - once in a leaf node, and once or several times in internal nodes as a pivot.

Erscheint lt. Verlag	7.6.2006
Reihe/Serie	Advances in Database Systems
Reihe/Serie	Advances in Database Systems
Zusatzinfo	XVII, 220 p.
Verlagsort	New York
Sprache	englisch
Themenwelt	Informatik ► Datenbanken ► Data Warehouse / Data Mining
	Informatik ► Theorie / Studium ► Algorithmen
	Naturwissenschaften
Schlagworte	Approximation • area • collections • Computation • Computer • data structures • Design • DEX • EXIST • Form • Information • Scala • Similarity • sorting • techniques • Tool
ISBN-10	0-387-29151-2 / 0387291512
ISBN-13	978-0-387-29151-2 / 9780387291512

Haben Sie eine Frage zum Produkt?

PDF (Wasserzeichen)
Größe: 12,6 MB

DRM: Digitales Wasserzeichen
Dieses eBook enthält ein digitales Wasserzeichen und ist damit für Sie personalisiert. Bei einer missbräuchlichen Weitergabe des eBooks an Dritte ist eine Rückverfolgung an die Quelle möglich.

Dateiformat: PDF (Portable Document Format)
Mit einem festen Seitenlayout eignet sich die PDF besonders für Fachbücher mit Spalten, Tabellen und Abbildungen. Eine PDF kann auf fast allen Geräten angezeigt werden, ist aber für kleine Displays (Smartphone, eReader) nur eingeschränkt geeignet.

Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen dafür einen PDF-Viewer - z.B. den Adobe Reader oder Adobe Digital Editions.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen dafür einen PDF-Viewer - z.B. die kostenlose Adobe Digital Editions-App.

Zusätzliches Feature: Online Lesen
Dieses eBook können Sie zusätzlich zum Download auch online im Webbrowser lesen.

Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.

Print-Ausgabe

Buch | Softcover

CHF 224,60