Data cleaning : a practical perspective /

Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merg...

Bibliographic Details
Main Authors:	Ganti, Venkatesh (Author), Das Sarma, Anish (Author)
Format:	eBook
Language:	English
Published:	San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2013.
Series:	Synthesis digital library of engineering and computer science. Synthesis lectures on data management ; # 36.
Subjects:	Data warehousing > Quality control. Database management. Electronic data processing > Data preparation. blocking; clustering; constrained deduplication; cosine similarity; data cleaning; data cleaning scripts; data standardization; deduplication; edit distance; edit similarity; ETL; ETL data flows; jaccard similarity; parsing; record matching; schema matching; segmentation; set similarity join; soundex; string similarity functions
Online Access:	Abstract with links to full text


LEADER	06921nam a2200925 i 4500
001	201307DTM036
005	20160320103534.0
006	m eo d
007	cr cn \|\|\|m\|\|\|a
008	131016s2013 caua foab 000 0 eng d
020			\|a 9781608456789 \|q (electronic bk.)
020			\|z 9781608456772 \|q (pbk.)
024	7		\|a 10.2200/S00523ED1V01Y201307DTM036 \|2 doi
035			\|a (CaBNVSL)swl00402795
035			\|a (OCoLC)860909369
040			\|a CaBNVSL \|b eng \|e rda \|c CaBNVSL \|d CaBNVSL
050		4	\|a QA76.9.D3 \|b G253 2013
082	0	4	\|a 005.7565 \|2 23
100	1		\|a Ganti, Venkatesh., \|e author.
245	1	0	\|a Data cleaning : \|b a practical perspective / \|c Venkatesh Ganti, Anish Das Sarma.
264		1	\|a San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : \|b Morgan & Claypool, \|c 2013.
300			\|a 1 PDF (xv, 69 pages) : \|b illustrations.
336			\|a text \|2 rdacontent
337			\|a electronic \|2 isbdmedia
338			\|a online resource \|2 rdacarrier
490	1		\|a Synthesis lectures on data management, \|x 2153-5426 ; \|v # 36
500			\|a Part of: Synthesis digital library of engineering and computer science.
500			\|a Series from website.
504			\|a Includes bibliographical references (pages 65-67).
505	0		\|a 1. Introduction -- 1.1 Enterprise data warehouse -- 1.2 Comparison shopping database -- 1.3 Data cleaning tasks -- 1.4 Record matching -- 1.5 Schema matching -- 1.6 Deduplication -- 1.7 Data standardization -- 1.8 Data profiling -- 1.9 Focus of this book --
505	8		\|a 2. Technological approaches -- 2.1 Domain-specific verticals -- 2.2 Generic platforms -- 2.3 Operator-based approach -- 2.4 Generic data cleaning operators -- 2.4.1 Similarity join -- 2.4.2 Clustering -- 2.4.3 Parsing -- 2.5 Bibliography --
505	8		\|a 3. Similarity functions -- 3.1 Edit distance -- 3.2 Jaccard similarity -- 3.3 Cosine similarity -- 3.4 Soundex -- 3.5 Combinations and learning similarity functions -- 3.6 Bibliography --
505	8		\|a 4. Operator: similarity join -- 4.1 Set similarity join (SSJoin) -- 4.2 Instantiations -- 4.2.1 Edit distance -- 4.2.2 Jaccard containment and similarity -- 4.3 Implementing the SSJoin operator -- 4.3.1 Basic SSJoin implementation -- 4.3.2 Filtered SSJoin implementation -- 4.4 Bibliography --
505	8		\|a 5. Operator: clustering -- 5.1 Definitions -- 5.2 Techniques -- 5.2.1 Hash partition -- 5.2.2 Graph-based clustering -- 5.3 Bibliography --
505	8		\|a 6. Operator: parsing -- 6.1 Regular expressions -- 6.2 Hidden Markov models -- 6.2.1 Training HMMs -- 6.2.2 Use of HMMs for parsing -- 6.3 Bibliography --
505	8		\|a 7. Task: record matching -- 7.1 Schema matching -- 7.2 Record matching -- 7.2.1 Bipartite graph construction -- 7.2.2 Weighted edges -- 7.2.3 Graph matching -- 7.3 Bibliography --
505	8		\|a 8. Task: deduplication -- 8.1 Graph partitioning approach -- 8.1.1 Graph construction -- 8.1.2 Graph partitioning -- 8.2 Merging -- 8.3 Using constraints for deduplication -- 8.3.1 Candidate sets of partitions -- 8.3.2 Maximizing constraint satisfaction -- 8.4 Blocking -- 8.5 Bibliography --
505	8		\|a 9. Data cleaning scripts -- 9.1 Record matching scripts -- 9.2 Deduplication scripts -- 9.3 Support for script development -- 9.3.1 User interface for developing scripts -- 9.3.2 Configurable data cleaning scripts -- 9.4 Bibliography --
505	8		\|a 10. Conclusion -- Bibliography -- Authors' biographies.
506			\|a Abstract freely available; full-text restricted to subscribers or individual document purchasers.
510	0		\|a Compendex
510	0		\|a Google book search
510	0		\|a Google scholar
510	0		\|a INSPEC
520	3		\|a Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous upstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality of data in the data warehouse remains high. The process of maintaining high data quality is commonly referred to as data cleaning. In this book, we first discuss the goals of data cleaning. Often, the goals of data cleaning are not well defined and could mean different solutions in different scenarios. Toward clarifying these goals, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common data cleaning tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that could be used as building blocks for developing common solutions. This is similar to the approach of relational algebra for query processing. The basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts which leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks.
530			\|a Also available in print.
538			\|a Mode of access: World Wide Web.
538			\|a System requirements: Adobe Acrobat Reader.
588			\|a Title from PDF title page (viewed on October 16, 2013).
650		0	\|a Data warehousing \|x Quality control.
650		0	\|a Database management.
650		0	\|a Electronic data processing \|x Data preparation.
653			\|a blocking
653			\|a clustering
653			\|a constrained deduplication
653			\|a cosine similarity
653			\|a data cleaning
653			\|a data cleaning scripts
653			\|a data standardization
653			\|a deduplication
653			\|a edit distance
653			\|a edit similarity
653			\|a ETL
653			\|a ETL data flows
653			\|a jaccard similarity
653			\|a parsing
653			\|a record matching
653			\|a schema matching
653			\|a segmentation
653			\|a set similarity join
653			\|a soundex
653			\|a string similarity functions
700	1		\|a Das Sarma, Anish., \|e author.
776	0	8	\|i Print version: \|z 9781608456772
830		0	\|a Synthesis digital library of engineering and computer science.
830		0	\|a Synthesis lectures on data management ; \|v # 36. \|x 2153-5426
856	4	8	\|3 Abstract with links to full text \|u http://dx.doi.org/10.2200/S00523ED1V01Y201307DTM036
942			\|c EB
999			\|c 81071 \|d 81071
952			\|0 0 \|1 0 \|4 0 \|7 0 \|9 73091 \|a MGUL \|b MGUL \|d 2016-03-20 \|l 0 \|r 2016-03-20 \|w 2016-03-20 \|y EB
