Data cleaning: a practical perspective
Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merg...
Main Authors: | , |
---|---|
Format: | eBook |
Language: | English |
Published: | San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2013. |
Series: | Synthesis digital library of engineering and computer science. Synthesis lectures on data management ; # 36. |
Subjects: | |
Online Access: | Abstract with links to full text |
Table of Contents:
- 1. Introduction
- 1.1 Enterprise data warehouse
- 1.2 Comparison shopping database
- 1.3 Data cleaning tasks
- 1.4 Record matching
- 1.5 Schema matching
- 1.6 Deduplication
- 1.7 Data standardization
- 1.8 Data profiling
- 1.9 Focus of this book
- 2. Technological approaches
- 2.1 Domain-specific verticals
- 2.2 Generic platforms
- 2.3 Operator-based approach
- 2.4 Generic data cleaning operators
- 2.4.1 Similarity join
- 2.4.2 Clustering
- 2.4.3 Parsing
- 2.5 Bibliography
- 3. Similarity functions
- 3.1 Edit distance
- 3.2 Jaccard similarity
- 3.3 Cosine similarity
- 3.4 Soundex
- 3.5 Combinations and learning similarity functions
- 3.6 Bibliography
- 4. Operator: similarity join
- 4.1 Set similarity join (SSJoin)
- 4.2 Instantiations
- 4.2.1 Edit distance
- 4.2.2 Jaccard containment and similarity
- 4.3 Implementing the SSJoin operator
- 4.3.1 Basic SSJoin implementation
- 4.3.2 Filtered SSJoin implementation
- 4.4 Bibliography
- 5. Operator: clustering
- 5.1 Definitions
- 5.2 Techniques
- 5.2.1 Hash partition
- 5.2.2 Graph-based clustering
- 5.3 Bibliography
- 6. Operator: parsing
- 6.1 Regular expressions
- 6.2 Hidden Markov models
- 6.2.1 Training HMMs
- 6.2.2 Use of HMMs for parsing
- 6.3 Bibliography
- 7. Task: record matching
- 7.1 Schema matching
- 7.2 Record matching
- 7.2.1 Bipartite graph construction
- 7.2.2 Weighted edges
- 7.2.3 Graph matching
- 7.3 Bibliography
- 8. Task: deduplication
- 8.1 Graph partitioning approach
- 8.1.1 Graph construction
- 8.1.2 Graph partitioning
- 8.2 Merging
- 8.3 Using constraints for deduplication
- 8.3.1 Candidate sets of partitions
- 8.3.2 Maximizing constraint satisfaction
- 8.4 Blocking
- 8.5 Bibliography
- 9. Data cleaning scripts
- 9.1 Record matching scripts
- 9.2 Deduplication scripts
- 9.3 Support for script development
- 9.3.1 User interface for developing scripts
- 9.3.2 Configurable data cleaning scripts
- 9.4 Bibliography
- 10. Conclusion
- Bibliography
- Authors' biographies