Driving Data Quality with Data Contracts (eBook)
206 Seiten
Packt Publishing (Verlag)
978-1-83763-624-2 (ISBN)
Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value.
With Driving Data Quality with Data Contracts, you'll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You'll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability of the data to those who know it best-the data generators-and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products.
By the end of this book, you'll have gained a comprehensive understanding of how data contracts can revolutionize your organization's data culture and provide a competitive advantage by unlocking the real value within your data.
Everything you need to know to apply data contracts and build a truly data-driven organization that harnesses quality data to deliver tangible business valuePurchase of the print or Kindle book includes a free PDF eBookKey FeaturesUnderstand data contracts and their power to resolving the problems in contemporary data platformsLearn how to design and implement a cutting-edge data platform powered by data contractsAccess practical guidance from the pioneer of data contracts to get expert insights on effective utilizationBook DescriptionDespite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability of the data to those who know it best the data generators and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you ll have gained a comprehensive understanding of how data contracts can revolutionize your organization s data culture and provide a competitive advantage by unlocking the real value within your data.What you will learnGain insights into the intricacies and shortcomings of today s data architecturesUnderstand exactly how data contracts can solve prevalent data challengesDrive a fundamental transformation of your data culture by implementing data contractsDiscover what goes into a data contract and why it s importantDesign a modern data architecture that leverages the power of data contractsExplore sample implementations to get practical knowledge of using data contractsEmbrace best practices for the successful deployment of data contractsWho this book is forIf you re a data engineer, data leader, architect, or practitioner thinking about your data architecture and looking to design one that enables your organization to get the most value from your data, this book is for you. Additionally, staff engineers, product managers, and software engineering leaders and executives will also find valuable insights.]]>
1
A Brief History of Data Platforms
Before we can appreciate why we need to make a fundamental shift to a data contracts-backed data platform in order to improve the quality of our data, and ultimately the value we can get from that data, we need to understand the problems we are trying to solve. I’ve found the best way to do this is to look back at the recent generations of data architectures. By doing that, we’ll see that despite the vast improvements in the tooling available to us, we’ve been carrying through the same limitations in the architecture. That’s why we continue to struggle with the same old problems.
Despite these challenges, the importance of data continues to grow. As it is used in more and more business-critical applications, we can no longer accept data platforms that are unreliable, untrusted, and ineffective. We must find a better way.
By the end of this chapter, we’ll have explored the three most recent generations of data architectures at a high level, focusing on just the source and ingestion of upstream data, and the consumption of data downstream. We will gain an understanding of their limitations and bottlenecks and why we need to make a change. We’ll then be ready to learn about data contracts.
In this chapter, we’re going to cover the following main topics:
- The enterprise data warehouse
- The big data platform
- The modern data stack
- The state of today’s data platforms
- The ever-increasing use of data in business-critical applications
The enterprise data warehouse
We’ll start by looking at the data architecture that was prevalent in the late 1990s and early 2000s, which was centered around an enterprise data warehouse (EDW). As we discuss the architecture and its limitations, you’ll start to notice how many of those limitations continue to affect us today, despite over 20 years of advancement in tools and capabilities.
EDW is the collective term for a reporting and analytics solution. You’d typically engage with one or two big vendors who would provide these capabilities for you. It was expensive and only larger companies that could justify the investment.
The architecture was built around a large database in the center. This was likely an Oracle or MS SQL Server database, hosted on-premises (this was before the advent of cloud services). The extract, transform, and load (ETL) process was performed on data from source systems, or more accurately, the underlying database of those systems. That data could then be used to drive reporting and analytics.
The following diagram shows the EDW architecture:
Figure 1.1 – The EDW architecture
Because this ETL ran against the database of the source system, reliability was a problem. It created a load on the database that could negatively impact the performance of the upstream service. That, and the limitations of the technology we were using at the time, meant we could do few transforms on the data.
We also had to update the ETL process as the database schema and the data evolved over time, relying on the data generators to let us know when that happened. Otherwise, the pipeline would fail.
Those who owned databases were somewhat aware of the ETL work and the business value it drove. There were few barriers between the data generators and consumers and good communication.
However, the major limitation of this architecture was the database used for the data warehouse. It was very expensive and, as it was deployed on-premises, was of a fixed size and hard to scale. That created a limit on how much data could be stored there and made available for analytics.
It became the responsibility of the ETL developers to decide what data should be available, depending on the business needs, and to build and maintain that ETL process by getting access to the source systems and their underlying databases.
And so, this is where the bottleneck was. The ETL developers had to control what data went in, and they were the only ones who could make data available in the warehouse. Data would only be made available if it met a strong business need, and that typically meant the only data in the warehouse was data that drove the company KPIs. If you wanted some data to do some analysis and it wasn’t already in there, you had to put a ticket in their backlog and hope for the best. If it did ever get prioritized, it was probably too late for what you wanted it for.
Note
Let’s illustrate how different roles worked together with this architecture with an example.
Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She’s aware that some of the data from that database is extracted by a data analyst, Bukayo, and that is used to drive top-level business KPIs.
Bukayo can’t do much transformation on the data, due to the limitations of the technology and the cost of infrastructure, so the reporting he produces is largely on the raw data.
There are no defined expectations between Vivianne and Bukayo, and Bukayo relies on Vivianne telling him in advance whether there are any changes to the data or the schema.
The extraction is not reliable. The ETL process could affect the performance of the database, and so can be switched off when there is an incident. Schema and data changes are not always known in advance. The downstream database also has limited performance and cannot be easily scaled to deal with an increase in the data or usage.
Both Vivianne and Bukayo lack autonomy. Vivianne can’t change her database schema without getting approval from Bukayo. Bukayo can only get a subset of data, with little say over the format. Furthermore, any potential users downstream of Bukayo can only access the data he has extracted, severely limiting the accessibility of the organization’s data.
This won’t be the last time we see a bottleneck that prevents access to, and the use of, quality data. Let’s look now at the next generation of data architecture and the introduction of big data, which was made possible by the release of Apache Hadoop in 2006.
The big data platform
As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google wrote a paper describing their Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.
Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. Then the MapReduce engine gives us a model on which we could implement programs to process and transform this data, at scale, also on commodity hardware.
This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.
The following diagram shows the reference data platform architecture built upon Hadoop:
Figure 1.2 – The big data platform architecture
At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This still needed to be put into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.
Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL, where the idea was to extract and load the data into the data lake first without any processing, then apply schemas and transforms later as part of loading to the data warehouse or reading the data in other downstream processes.
We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.
However, in practice, much of that data...
Erscheint lt. Verlag | 30.6.2023 |
---|---|
Vorwort | Kevin Hu |
Sprache | englisch |
Themenwelt | Sachbuch/Ratgeber ► Freizeit / Hobby ► Sammeln / Sammlerkataloge |
Informatik ► Datenbanken ► Data Warehouse / Data Mining | |
Informatik ► Software Entwicklung ► User Interfaces (HCI) | |
Mathematik / Informatik ► Informatik ► Theorie / Studium | |
ISBN-10 | 1-83763-624-9 / 1837636249 |
ISBN-13 | 978-1-83763-624-2 / 9781837636242 |
Haben Sie eine Frage zum Produkt? |
Digital Rights Management: ohne DRM
Dieses eBook enthält kein DRM oder Kopierschutz. Eine Weitergabe an Dritte ist jedoch rechtlich nicht zulässig, weil Sie beim Kauf nur die Rechte an der persönlichen Nutzung erwerben.
Dateiformat: EPUB (Electronic Publication)
EPUB ist ein offener Standard für eBooks und eignet sich besonders zur Darstellung von Belletristik und Sachbüchern. Der Fließtext wird dynamisch an die Display- und Schriftgröße angepasst. Auch für mobile Lesegeräte ist EPUB daher gut geeignet.
Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen dafür die kostenlose Software Adobe Digital Editions.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen dafür eine kostenlose App.
Geräteliste und zusätzliche Hinweise
Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.
aus dem Bereich