
Similarity joins in relational database systems /

State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity int...


Bibliographic Details
Main Authors: Augsten, Nikolaus (Author), Böhlen, Michael H. (Author)
Format: eBook
Language: English
Published: San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2014.
Series: Synthesis digital library of engineering and computer science.
Synthesis lectures on data management ; # 38.
Subjects:
Online Access: Abstract with links to full text
LEADER 06523nam a2200757 i 4500
001 201310DTM038
005 20160320103534.0
006 m eo d
007 cr cn |||m|||a
008 131221s2014 caua foab 001 0 eng d
020 |a 9781627050296  |q (ebook) 
020 |z 9781627050289  |q (paperback) 
024 7 |a 10.2200/S00544ED1V01Y201310DTM038  |2 doi 
035 |a (CaBNVSL)swl00402968 
035 |a (OCoLC)866563916 
040 |a CaBNVSL  |b eng  |e rda  |c CaBNVSL  |d CaBNVSL 
050 4 |a QA76.9.D3  |b A938 2014 
082 0 4 |a 005.7565  |2 23 
100 1 |a Augsten, Nikolaus,  |e author. 
245 1 0 |a Similarity joins in relational database systems /  |c Nikolaus Augsten, Michael H. Böhlen. 
264 1 |a San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) :  |b Morgan & Claypool,  |c 2014. 
300 |a 1 PDF (xvii, 106 pages) :  |b illustrations. 
336 |a text  |2 rdacontent 
337 |a electronic  |2 isbdmedia 
338 |a online resource  |2 rdacarrier 
490 1 |a Synthesis lectures on data management,  |x 2153-5426 ;  |v # 38 
500 |a Part of: Synthesis digital library of engineering and computer science. 
500 |a Series from website. 
504 |a Includes bibliographical references (pages 93-101) and index. 
505 0 |a 1. Introduction -- 1.1 Applications of similarity queries -- 1.2 Edit-based similarity measures -- 1.3 Token-based similarity measures --  
505 8 |a 2. Data types -- 2.1 Strings -- 2.2 Trees --  
505 8 |a 3. Edit-based distances -- 3.1 String edit distance -- 3.1.1 Definition of the string edit distance -- 3.1.2 Computation of the string edit distance -- 3.2 Tree edit distance -- 3.2.1 Definition of the tree edit distance -- 3.2.2 Computation of the tree edit distance -- 3.2.3 Constrained tree edit distance -- 3.2.4 Unordered tree edit distance -- 3.3 Further readings --  
505 8 |a 4. Token-based distances -- 4.1 Sets and bags -- 4.1.1 Counting approach -- 4.1.2 Frequency approach -- 4.2 Similarity measures for sets and bags -- 4.2.1 Overlap similarity -- 4.2.2 Jaccard similarity -- 4.2.3 Dice similarity -- 4.2.4 Converting threshold constraints -- 4.3 String tokens -- 4.3.1 q-gram tokens -- 4.4 Tokens for ordered trees -- 4.4.1 Overview of ordered tree tokens -- 4.4.2 The pq-gram distance -- 4.4.3 An algorithm for the pq-gram index -- 4.4.4 Relational implementation -- 4.5 Tokens for unordered trees -- 4.5.1 Overview of unordered tree tokens -- 4.5.2 Desired properties for unordered tree decompositions -- 4.5.3 The windowed pq-gram distance -- 4.5.4 Properties of windowed pq-grams -- 4.5.5 Building the windowed pq-gram index -- 4.6 Discussion: properties of tree tokens -- 4.7 Further readings --  
505 8 |a 5. Query processing techniques -- 5.1 Filters -- 5.2 Lower and upper bounds -- 5.3 String distance bounds -- 5.3.1 Length filter -- 5.3.2 Count filter -- 5.3.3 Positional count filter -- 5.3.4 Using string filters in a relational database -- 5.4 Tree distance bounds -- 5.4.1 Size lower bound -- 5.4.2 Intersection lower bound -- 5.4.3 Traversal string lower bound -- 5.4.4 pq-gram lower bound -- 5.4.5 Binary branch lower bound -- 5.4.6 Constrained edit distance upper bound -- 5.5 Further readings --  
505 8 |a 6. Filters for token equality joins -- 6.1 Token equality join, avoiding empty intersections -- 6.2 Prefix filter, avoiding small intersections -- 6.2.1 Prefix filter for overlap similarity -- 6.2.2 Prefix filter for Jaccard similarity -- 6.2.3 Effectiveness of prefix filtering -- 6.3 Size filter -- 6.4 Positional filter -- 6.5 Partitioning filter -- 6.6 Further readings --  
505 8 |a 7. Conclusion -- Bibliography -- Authors' biographies -- Index. 
506 |a Abstract freely available; full-text restricted to subscribers or individual document purchasers. 
510 0 |a Compendex 
510 0 |a Google book search 
510 0 |a Google scholar 
510 0 |a INSPEC 
520 3 |a State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low. 
530 |a Also available in print. 
538 |a Mode of access: World Wide Web. 
538 |a System requirements: Adobe Acrobat Reader. 
588 |a Title from PDF title page (viewed on December 21, 2013). 
650 0 |a Relational databases. 
650 0 |a Similarity transformations. 
653 |a edit distance 
653 |a lower bound 
653 |a pq-grams 
653 |a q-grams 
653 |a similarity 
653 |a similarity join 
653 |a strings 
653 |a token-based distance 
653 |a trees 
653 |a upper bound 
700 1 |a Böhlen, Michael H.,  |e author. 
776 0 8 |i Print version:  |z 9781627050289 
830 0 |a Synthesis digital library of engineering and computer science. 
830 0 |a Synthesis lectures on data management ;  |v # 38.  |x 2153-5426 
856 4 8 |3 Abstract with links to full text  |u http://dx.doi.org/10.2200/S00544ED1V01Y201310DTM038 
942 |c EB 
999 |c 81069  |d 81069 
952 |0 0  |1 0  |4 0  |7 0  |9 73089  |a MGUL  |b MGUL  |d 2016-03-20  |l 0  |r 2016-03-20  |w 2016-03-20  |y EB