Data cleaning : a practical perspective

Bibliographic Details
Main Authors: Ganti, Venkatesh (Author), Das Sarma, Anish (Author)
Format: eBook
Language: English
Published: San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2013.
Series: Synthesis digital library of engineering and computer science.
Synthesis lectures on data management ; # 36.
Subjects:
ETL
Online Access: Abstract with links to full text
LEADER 06921nam a2200925 i 4500
001 201307DTM036
005 20160320103534.0
006 m eo d
007 cr cn |||m|||a
008 131016s2013 caua foab 000 0 eng d
020 |a 9781608456789  |q (electronic bk.) 
020 |z 9781608456772  |q (pbk.) 
024 7 |a 10.2200/S00523ED1V01Y201307DTM036  |2 doi 
035 |a (CaBNVSL)swl00402795 
035 |a (OCoLC)860909369 
040 |a CaBNVSL  |b eng  |e rda  |c CaBNVSL  |d CaBNVSL 
050 4 |a QA76.9.D3  |b G253 2013 
082 0 4 |a 005.7565  |2 23 
100 1 |a Ganti, Venkatesh,  |e author. 
245 1 0 |a Data cleaning :  |b a practical perspective /  |c Venkatesh Ganti, Anish Das Sarma. 
264 1 |a San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) :  |b Morgan & Claypool,  |c 2013. 
300 |a 1 PDF (xv, 69 pages) :  |b illustrations. 
336 |a text  |2 rdacontent 
337 |a electronic  |2 isbdmedia 
338 |a online resource  |2 rdacarrier 
490 1 |a Synthesis lectures on data management,  |x 2153-5426 ;  |v # 36 
500 |a Part of: Synthesis digital library of engineering and computer science. 
500 |a Series from website. 
504 |a Includes bibliographical references (pages 65-67). 
505 0 |a 1. Introduction -- 1.1 Enterprise data warehouse -- 1.2 Comparison shopping database -- 1.3 Data cleaning tasks -- 1.4 Record matching -- 1.5 Schema matching -- 1.6 Deduplication -- 1.7 Data standardization -- 1.8 Data profiling -- 1.9 Focus of this book --  
505 8 |a 10. Conclusion -- Bibliography -- Authors' biographies. 
505 8 |a 2. Technological approaches -- 2.1 Domain-specific verticals -- 2.2 Generic platforms -- 2.3 Operator-based approach -- 2.4 Generic data cleaning operators -- 2.4.1 Similarity join -- 2.4.2 Clustering -- 2.4.3 Parsing -- 2.5 Bibliography --  
505 8 |a 3. Similarity functions -- 3.1 Edit distance -- 3.2 Jaccard similarity -- 3.3 Cosine similarity -- 3.4 Soundex -- 3.5 Combinations and learning similarity functions -- 3.6 Bibliography --  
505 8 |a 4. Operator: similarity join -- 4.1 Set similarity join (SSJoin) -- 4.2 Instantiations -- 4.2.1 Edit distance -- 4.2.2 Jaccard containment and similarity -- 4.3 Implementing the SSJoin operator -- 4.3.1 Basic SSJoin implementation -- 4.3.2 Filtered SSJoin implementation -- 4.4 Bibliography --  
505 8 |a 5. Operator: clustering -- 5.1 Definitions -- 5.2 Techniques -- 5.2.1 Hash partition -- 5.2.2 Graph-based clustering -- 5.3 Bibliography --  
505 8 |a 6. Operator: parsing -- 6.1 Regular expressions -- 6.2 Hidden Markov models -- 6.2.1 Training HMMs -- 6.2.2 Use of HMMs for parsing -- 6.3 Bibliography --  
505 8 |a 7. Task: record matching -- 7.1 Schema matching -- 7.2 Record matching -- 7.2.1 Bipartite graph construction -- 7.2.2 Weighted edges -- 7.2.3 Graph matching -- 7.3 Bibliography --  
505 8 |a 8. Task: deduplication -- 8.1 Graph partitioning approach -- 8.1.1 Graph construction -- 8.1.2 Graph partitioning -- 8.2 Merging -- 8.3 Using constraints for deduplication -- 8.3.1 Candidate sets of partitions -- 8.3.2 Maximizing constraint satisfaction -- 8.4 Blocking -- 8.5 Bibliography --  
505 8 |a 9. Data cleaning scripts -- 9.1 Record matching scripts -- 9.2 Deduplication scripts -- 9.3 Support for script development -- 9.3.1 User interface for developing scripts -- 9.3.2 Configurable data cleaning scripts -- 9.4 Bibliography --  
506 |a Abstract freely available; full-text restricted to subscribers or individual document purchasers. 
510 0 |a Compendex 
510 0 |a Google book search 
510 0 |a Google scholar 
510 0 |a INSPEC 
520 3 |a Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous upstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality of data in the data warehouse remains high. The process of maintaining high data quality is commonly referred to as data cleaning. In this book, we first discuss the goals of data cleaning. Often, the goals of data cleaning are not well defined and could mean different solutions in different scenarios. Toward clarifying these goals, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common data cleaning tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that could be used as building blocks for developing common solutions. This is similar to the approach of relational algebra for query processing. The basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts which leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks. 
530 |a Also available in print. 
538 |a Mode of access: World Wide Web. 
538 |a System requirements: Adobe Acrobat Reader. 
588 |a Title from PDF title page (viewed on October 16, 2013). 
650 0 |a Data warehousing  |x Quality control. 
650 0 |a Database management. 
650 0 |a Electronic data processing  |x Data preparation. 
653 |a blocking 
653 |a clustering 
653 |a constrained deduplication 
653 |a cosine similarity 
653 |a data cleaning 
653 |a data cleaning scripts 
653 |a data standardization 
653 |a deduplication 
653 |a edit distance 
653 |a edit similarity 
653 |a ETL 
653 |a ETL data flows 
653 |a jaccard similarity 
653 |a parsing 
653 |a record matching 
653 |a schema matching 
653 |a segmentation 
653 |a set similarity join 
653 |a soundex 
653 |a string similarity functions 
700 1 |a Das Sarma, Anish,  |e author. 
776 0 8 |i Print version:  |z 9781608456772 
830 0 |a Synthesis digital library of engineering and computer science. 
830 0 |a Synthesis lectures on data management ;  |v # 36.  |x 2153-5426 
856 4 8 |3 Abstract with links to full text  |u http://dx.doi.org/10.2200/S00523ED1V01Y201307DTM036 
942 |c EB 
999 |c 81071  |d 81071 
952 |0 0  |1 0  |4 0  |7 0  |9 73091  |a MGUL  |b MGUL  |d 2016-03-20  |l 0  |r 2016-03-20  |w 2016-03-20  |y EB