Similarity joins in relational database systems

State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity int...

Bibliographic Details
Main Authors: Augsten, Nikolaus (Author), Böhlen, Michael H. (Author)
Format: eBook
Language: English
Published: San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2014.
Series: Synthesis digital library of engineering and computer science.
Synthesis lectures on data management ; # 38.
Online Access: Abstract with links to full text
Table of Contents:
  • 1. Introduction
  • 1.1 Applications of similarity queries
  • 1.2 Edit-based similarity measures
  • 1.3 Token-based similarity measures
  • 2. Data types
  • 2.1 Strings
  • 2.2 Trees
  • 3. Edit-based distances
  • 3.1 String edit distance
  • 3.1.1 Definition of the string edit distance
  • 3.1.2 Computation of the string edit distance
  • 3.2 Tree edit distance
  • 3.2.1 Definition of the tree edit distance
  • 3.2.2 Computation of the tree edit distance
  • 3.2.3 Constrained tree edit distance
  • 3.2.4 Unordered tree edit distance
  • 3.3 Further readings
  • 4. Token-based distances
  • 4.1 Sets and bags
  • 4.1.1 Counting approach
  • 4.1.2 Frequency approach
  • 4.2 Similarity measures for sets and bags
  • 4.2.1 Overlap similarity
  • 4.2.2 Jaccard similarity
  • 4.2.3 Dice similarity
  • 4.2.4 Converting threshold constraints
  • 4.3 String tokens
  • 4.3.1 q-gram tokens
  • 4.4 Tokens for ordered trees
  • 4.4.1 Overview of ordered tree tokens
  • 4.4.2 The pq-gram distance
  • 4.4.3 An algorithm for the pq-gram index
  • 4.4.4 Relational implementation
  • 4.5 Tokens for unordered trees
  • 4.5.1 Overview of unordered tree tokens
  • 4.5.2 Desired properties for unordered tree decompositions
  • 4.5.3 The windowed pq-gram distance
  • 4.5.4 Properties of windowed pq-grams
  • 4.5.5 Building the windowed pq-gram index
  • 4.6 Discussion: properties of tree tokens
  • 4.7 Further readings
  • 5. Query processing techniques
  • 5.1 Filters
  • 5.2 Lower and upper bounds
  • 5.3 String distance bounds
  • 5.3.1 Length filter
  • 5.3.2 Count filter
  • 5.3.3 Positional count filter
  • 5.3.4 Using string filters in a relational database
  • 5.4 Tree distance bounds
  • 5.4.1 Size lower bound
  • 5.4.2 Intersection lower bound
  • 5.4.3 Traversal string lower bound
  • 5.4.4 pq-gram lower bound
  • 5.4.5 Binary branch lower bound
  • 5.4.6 Constrained edit distance upper bound
  • 5.5 Further readings
  • 6. Filters for token equality joins
  • 6.1 Token equality join, avoiding empty intersections
  • 6.2 Prefix filter, avoiding small intersections
  • 6.2.1 Prefix filter for overlap similarity
  • 6.2.2 Prefix filter for Jaccard similarity
  • 6.2.3 Effectiveness of prefix filtering
  • 6.3 Size filter
  • 6.4 Positional filter
  • 6.5 Partitioning filter
  • 6.6 Further readings
  • 7. Conclusion
  • Bibliography
  • Authors' biographies
  • Index.