Data Cleaning for Effective Data Science
2000
Addison Wesley (publisher)
978-0-13-675335-3 (ISBN)
Most machine learning guides cover data cleaning briefly or skip it entirely. However, many data scientists and analysts spend most of their time on data cleaning and data quality tasks, and their effectiveness can make or break project success. In Data Cleaning for Effective Data Science, leading data science trainer David Mertz provides the most systematic guide to cleaning data for any project, using any library or toolset.
Mertz introduces many powerful techniques for analyzing, manipulating, and pre-processing data sources. He offers best practices for working with leading data formats such as JSON, CSV, SQL RDBMSes, HDF5, NoSQL databases, files in image formats, binary serialized data structures, and more.
Mertz also focuses on crucial issues within the data itself, including missing data, outliers, biasing trends, class imbalance, value imputation, over/under-sampling, normalization and/or randomization, and anomalies.
This guide is organized around downloadable datasets, each illuminating specific issues with data integrity or quality. Each chapter explores the best ways to diagnose, analyze, and remediate these issues, offering hands-on practice using tools such as Python, Pandas, sklearn.preprocessing, scipy.stats, R, and Tidyverse. While the examples are demonstrated with widely used tools, Mertz's concepts apply to any toolset. Each chapter also links to additional datasets with more problems, exercises, and solutions.
1. Introduction
2. Data Ingestion - Tabular Formats
3. Data Ingestion - Hierarchical Formats
4. Data Ingestion - Other Data Sources
5. Anomaly Detection
6. Data Quality
7. Feature Engineering
8. Value Imputation
9. Using Machine Learning to Clean Data
10. Additional Exercises
Appendix 1. Discussion of Problem/Dataset 1
Appendix 2. Discussion of Problem/Dataset 2
Publication date | 3 February 2021 |
---|---|
Series | Addison-Wesley Data & Analytics Series |
Place of publication | Boston |
Language | English |
Subject area | Computer Science ► Databases ► Data Warehouse / Data Mining |
ISBN-10 | 0-13-675335-3 / 0136753353 |
ISBN-13 | 978-0-13-675335-3 / 9780136753353 |
Condition | New |