
Data cleaning: a practical perspective

Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merg...


Bibliographic Details
Main Authors: Ganti, Venkatesh (Author), Das Sarma, Anish (Author)
Format: eBook
Language: English
Published: San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2013.
Series: Synthesis digital library of engineering and computer science.
Synthesis lectures on data management ; # 36.
Subjects:
ETL
Online Access: Abstract with links to full text
Table of Contents:
  • 1. Introduction
  • 1.1 Enterprise data warehouse
  • 1.2 Comparison shopping database
  • 1.3 Data cleaning tasks
  • 1.4 Record matching
  • 1.5 Schema matching
  • 1.6 Deduplication
  • 1.7 Data standardization
  • 1.8 Data profiling
  • 1.9 Focus of this book
  • 2. Technological approaches
  • 2.1 Domain-specific verticals
  • 2.2 Generic platforms
  • 2.3 Operator-based approach
  • 2.4 Generic data cleaning operators
  • 2.4.1 Similarity join
  • 2.4.2 Clustering
  • 2.4.3 Parsing
  • 2.5 Bibliography
  • 3. Similarity functions
  • 3.1 Edit distance
  • 3.2 Jaccard similarity
  • 3.3 Cosine similarity
  • 3.4 Soundex
  • 3.5 Combinations and learning similarity functions
  • 3.6 Bibliography
  • 4. Operator: similarity join
  • 4.1 Set similarity join (SSJoin)
  • 4.2 Instantiations
  • 4.2.1 Edit distance
  • 4.2.2 Jaccard containment and similarity
  • 4.3 Implementing the SSJoin operator
  • 4.3.1 Basic SSJoin implementation
  • 4.3.2 Filtered SSJoin implementation
  • 4.4 Bibliography
  • 5. Operator: clustering
  • 5.1 Definitions
  • 5.2 Techniques
  • 5.2.1 Hash partition
  • 5.2.2 Graph-based clustering
  • 5.3 Bibliography
  • 6. Operator: parsing
  • 6.1 Regular expressions
  • 6.2 Hidden Markov models
  • 6.2.1 Training HMMs
  • 6.2.2 Use of HMMs for parsing
  • 6.3 Bibliography
  • 7. Task: record matching
  • 7.1 Schema matching
  • 7.2 Record matching
  • 7.2.1 Bipartite graph construction
  • 7.2.2 Weighted edges
  • 7.2.3 Graph matching
  • 7.3 Bibliography
  • 8. Task: deduplication
  • 8.1 Graph partitioning approach
  • 8.1.1 Graph construction
  • 8.1.2 Graph partitioning
  • 8.2 Merging
  • 8.3 Using constraints for deduplication
  • 8.3.1 Candidate sets of partitions
  • 8.3.2 Maximizing constraint satisfaction
  • 8.4 Blocking
  • 8.5 Bibliography
  • 9. Data cleaning scripts
  • 9.1 Record matching scripts
  • 9.2 Deduplication scripts
  • 9.3 Support for script development
  • 9.3.1 User interface for developing scripts
  • 9.3.2 Configurable data cleaning scripts
  • 9.4 Bibliography
  • 10. Conclusion
  • Bibliography
  • Authors' biographies.