
Machine Learning Upgrade: A Data Scientist's Guide to MLOps, LLMs, and ML Infrastructure (eBook)

eBook download: EPUB
2024
249 pages
Wiley (publisher)
978-1-394-24964-0 (ISBN)

Kristen Kehrer, Caleb Kaiser (authors)

A much-needed guide to implementing new technology in workspaces

From experts in the field comes Machine Learning Upgrade: A Data Scientist's Guide to MLOps, LLMs, and ML Infrastructure, a book that provides data scientists and managers with best practices at the intersection of management, large language models (LLMs), machine learning, and data science. This groundbreaking book will change the way that you view the pipeline of data science. The authors provide an introduction to modern machine learning, showing you how it can be viewed as a holistic, end-to-end system, not just a shiny new gadget in an otherwise unchanged operational structure. By adopting a data-centric view of the world, you can begin to see unstructured data and LLMs as the foundation upon which you can build countless applications and business solutions. This book explores a whole world of decision making that hasn't been codified yet, enabling you to forge the future using emerging best practices.

  • Gain an understanding of the intersection between large language models and unstructured data
  • Follow the process of building an LLM-powered application while leveraging MLOps techniques such as data versioning and experiment tracking
  • Discover best practices for training, fine tuning, and evaluating LLMs
  • Integrate LLM applications within larger systems, monitor their performance, and retrain them on new data

This book is indispensable for data professionals and business leaders looking to understand LLMs and the entire data science pipeline.

Kristen Kehrer has been providing innovative and practical statistical modeling solutions since 2010. In 2018, she achieved recognition as a LinkedIn Top Voice in Data Science & Analytics. Kristen is also the founder of Data Moves Me, LLC.

Caleb Kaiser is a Full Stack Engineer at Comet. Caleb was previously on the Founding Team at Cortex Labs. Caleb also worked at Scribe Media on the Author Platform Team.



Chapter 2
An End-to-End Approach


The focus of this book is on building end-to-end, production machine learning systems. With that in mind, we should begin by defining what these terms mean. We promise—this isn't just pedantry. Over the last 20 years, terms like end-to-end and production have been thrown around a lot in the world of data science, and depending on the time period, their definitions may vary wildly.

Imagine working on a business intelligence team at a shoe retailer in 2015, when working with data looked quite different than it does today. Your team is tasked with sales forecasting for the next quarter. What would your end-to-end system look like?

You begin by building your dataset. Being 2015, it's very likely that your company's data is a nightmare to access and ingest, but after considerable effort, your team is able to curate a clean dataset. Next, you focus on modeling. Your team will probably experiment with a variety of models, from ARIMA to random forests and maybe even gradient boosting (XGBoost was initially released in 2014, after all). After much tweaking and tuning, and ideally some robust validation, you finally have your model. Now, you can get to the business of predicting next quarter's sales and sharing your results. Sharing your results could mean many things here. You may have produced a dashboard for the chief revenue officer (CRO) or manually generated predictions each day using a tool like the Statistical Package for the Social Sciences (SPSS). Maybe you scheduled a job to run every day that would create the new day's actuals via a macro. Or you might start your day by inspecting the forecast and writing an email to share the results.

The hypothetical sales forecasting project is, in many ways, straightforward. This is not to say that it is easy. Any project that requires this much manual effort is difficult. Curating a dataset from years of unhygienic legacy data is arduous. Presenting your forecasts to a nontechnical audience without boring them is an art. There are a seemingly infinite number of experiments you might run in the modeling phase, and conducting a reliable validation process is rarely simple. But, if we break down the components of this project, there aren't that many architectural decisions to make.

  • Data ingestion: How your company has stored its data will ultimately guide this, but you will need to decide how you are going to ingest data and produce a dataset.
  • ML framework: In all likelihood, you will be using something like scikit-learn (a Python library for modeling) to build your model, but because it's 2015, you could also be using a statistical tool or some internal framework your team built.
  • Visualization library: If your company uses a particular dashboarding solution, you'll use that. Otherwise, you'll use whatever library you like or Excel to generate charts for your report.
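To make those decisions concrete, here is a minimal sketch of that 2015-style workflow in scikit-learn. The synthetic weekly sales data, the trend-plus-seasonality features, and the forecast horizon are all illustrative assumptions of ours, not a recipe from any particular retailer:

```python
# A minimal sketch of the 2015-style pipeline: ingest a dataset, fit a model
# with scikit-learn, validate it, and generate next-quarter predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def build_dataset(n_weeks=104, seed=0):
    """Stand-in for the painful ingestion step: two years of weekly sales
    with a trend, yearly seasonality, and noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_weeks)
    sales = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 2, n_weeks)
    # Features: week index plus simple seasonal encodings.
    X = np.column_stack([t, np.sin(2 * np.pi * t / 52), np.cos(2 * np.pi * t / 52)])
    return X, sales

def train_forecaster(X, y):
    """Hold out the most recent 20% of weeks (no shuffling for time series)."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, shuffle=False, test_size=0.2)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    return model, model.score(X_val, y_val)

X, y = build_dataset()
model, val_r2 = train_forecaster(X, y)

# "Deployment" in 2015 terms: run this script and share the numbers.
future_weeks = np.arange(len(y), len(y) + 13)  # next quarter, 13 weeks
X_future = np.column_stack([future_weeks,
                            np.sin(2 * np.pi * future_weeks / 52),
                            np.cos(2 * np.pi * future_weeks / 52)])
forecast = model.predict(X_future)
```

Note how little infrastructure this requires: no deployment target, no experiment tracker, no monitoring; the script and a scheduler are the whole system.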

And that's basically it. You don't need to make more architectural decisions. Because it's 2015, you probably aren't using any real experiment management solution outside of a spreadsheet. Data versioning isn't likely to be done in any official sort of way. Your model doesn't need to be “deployed” in any sense, although if you're using a macro, that might technically qualify. You will probably generate predictions by running a notebook or a local script, which might be stored with some kind of version control. But generally speaking, this is all that is encompassed by your end-to-end system, and it works—at least, for this particular system.

But what about a more complex machine learning system, like the YouTube search assistant we mentioned in Chapter 1? This system requires multiple models interacting in a pipeline. It involves a database with support for vectors to store embeddings.

A vector database is a database built for storing and querying high-dimensional data. Many popular techniques, such as retrieval augmented generation (RAG), rely on manipulating text embeddings, which are high-dimensional vectors stored in vector databases.
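The core operation a vector database performs can be sketched in a few lines of NumPy. The random vectors below are stand-ins for real embedding-model output, and production systems layer approximate nearest-neighbor indexing on top of this so it scales:

```python
# What a vector database does at its core: store embeddings and return the
# ones nearest to a query, here ranked by cosine similarity.
import numpy as np

def cosine_top_k(query, vectors, k=2):
    """Return indices of the k stored vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity of query vs. every row
    return np.argsort(scores)[::-1][:k] # highest similarity first

rng = np.random.default_rng(0)
stored = rng.normal(size=(100, 384))   # 100 transcript chunks, 384-dim embeddings
query = stored[42] + rng.normal(scale=0.01, size=384)  # a query very close to chunk 42
top = cosine_top_k(query, stored, k=3)  # chunk 42 should rank first
```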

The application must implement retrieval logic to get relevant videos and excerpts, a way to generate transcripts from videos, and a way to create embeddings for that text. Your inference pipeline needs to be accessible for real-time generation, and your front end can't simply be a chart in some report; you need a full-blown application. And, of course, your system needs to be able to scale to handle many concurrent users.

Beyond any individual technical difference, the most important difference to note is that this system is not a one-off bit of data analysis. It is an ongoing software project, one that needs to be maintained, monitored, and, ideally, improved.

In this chapter, we are going to introduce a framework for designing such a machine learning system. We will begin by examining our YouTube search assistant in a bit more detail.

Components of a YouTube Search Agent


First, let's describe our system generally. When a user inputs their question, the system searches YouTube for relevant videos and adds their transcripts to our ever-growing database. Then the system creates an embedding for the user's search query, retrieves the most relevant embeddings and their associated text from our entire database, and passes that text to our language model. In practice, the end result is shown in Figure 2.1.

Figure 2.1 YouTube search query

The model used here wasn't trained on data from 2023, so this is an example of using retrieval augmented generation (RAG) to share entirely new information with a language model. We'll talk more about RAG in a later chapter.
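As a sketch of the RAG pattern just described, retrieved transcript excerpts can be spliced into the prompt so the model can answer about events outside its training data. The prompt template below is an illustrative assumption, not the book's exact implementation:

```python
def build_rag_prompt(question, excerpts):
    """Assemble an LLM prompt that grounds the answer in retrieved excerpts."""
    context = "\n\n".join(f"[Excerpt {i + 1}] {text}" for i, text in enumerate(excerpts))
    return (
        "Answer the question using only the transcript excerpts below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What happened at the 2023 developer conference?",
    ["The 2023 keynote announced a new model.", "Speakers demoed new agents."],
)
```

The model never needs to have seen 2023 data; everything it must know arrives in the prompt at inference time.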

Let's think through the different components of this project. At a very high level, our components fit into a few discrete categories:

  • YouTube retrieval: We have a system for running YouTube searches and fetching the relevant videos. We then generate transcripts from these videos.
  • Embeddings storage: We use an embedding model to convert chunks of our transcripts into embeddings and then store them in a vector database. We then convert our user's initial question into an embedding, using the same embedding model, and perform a similarity search to retrieve the most relevant excerpts from our videos. Finally, we return the text associated with those embeddings as input context for our final LLM inference.
  • Large language models: Throughout our system, we use LLMs in key places. We use a model to convert our users’ questions into relevant YouTube searches, to conduct our final question-answering task, and to generate our embeddings.
  • User interface: We take user inputs and display outputs inside our application.
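As a rough illustration of how those four categories compose into a single request path, here is a skeleton in Python. Every function name and return value is a hypothetical placeholder standing in for a real YouTube API call, transcription model, embedding model, vector store, or LLM:

```python
# Hypothetical skeleton of the YouTube search agent's request path.

def search_youtube(question):
    """YouTube retrieval: turn the user's question into searches, fetch videos,
    and (in a real system) transcribe them."""
    return [{"video_id": "abc123", "transcript": "…transcript text…"}]

def embed_and_store(videos, store):
    """Embeddings storage: chunk transcripts, embed them, write to the store.
    The modulo hash below is a fake embedding, not a real model."""
    for v in videos:
        store.append((v["video_id"], hash(v["transcript"]) % 1000))
    return store

def retrieve_context(question, store):
    """Similarity-search stand-in: return text tied to the best-matching embeddings."""
    return [video_id for video_id, _ in store[:3]]

def answer(question, context):
    """Large language models: question plus retrieved excerpts in, final answer out."""
    return f"Answer to {question!r} grounded in {len(context)} excerpt(s)."

def handle_request(question, store):
    """User interface entry point: one question flows through every component."""
    videos = search_youtube(question)
    store = embed_and_store(videos, store)
    context = retrieve_context(question, store)
    return answer(question, context)
```

Even at this toy scale, the interdependence is visible: if `search_youtube` returns poor videos, nothing downstream can compensate.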

Within each of those categories, we have many individual components that need to be designed and implemented. In Figure 2.2, we've laid out a diagram of the major components.

It's important to understand how interdependent the different components in this system are. Your embeddings are only as good as the text you are embedding, which means your transcription system is essential. At the same time, a perfect transcription system is useless if you are unable to find relevant videos, which means your YouTube retrieval system must be great.

Figure 2.2 Components of a YouTube search

Many of your design decisions will be made for you, based on the needs of your project. For example, if you need to fine-tune your models, then the field of potential LLM architectures you might use will narrow considerably, as many of the most popular hosted APIs don't allow fine-tuning and many popular model architectures would be prohibitively expensive to fine-tune on your own. At the same time, many of the decisions regarding infrastructure in a system like this might not be at all obvious to you—at least, not until something breaks.

What does it take to handle many concurrent users in a system like this? If the quality of your system's outputs starts to deteriorate, how would you even know? And once you did know, where would you begin to debug? If you have an entire team of data scientists and engineers working on improving this application, how can you attribute a change in your application's outputs to a particular change in the system?

Principles of a Production Machine Learning System


Architecture is a tricky subject in software engineering—mostly because no one is really sure what it is. Loosely, people tend to say “architecture” when they are discussing the fundamental logic of a system separate from the actual code that implements it. Unsurprisingly, these discussions have a tendency toward pageantry, producing lots of diagrams, taxonomies, and “methodologies” that are promptly ignored by the people who actually build things.

However, this is not to say that architecture is an ignorable concept, just that it needs to be understood through a more practical lens. We quite like one particular definition of architecture from Ralph Johnson, shared by Martin Fowler: “Architecture is about the important stuff. Whatever that is.”

In that spirit, we want to focus on what we consider “the important stuff” in designing a machine learning system. Our goal is not to outline a...

Publication date (per publisher): 29 July 2024
Language: English
Subject areas: Computer Science – Databases; Computer Science – Theory / Studies
Keywords: BI • Business Intelligence • Data Science Book • Large Language Models • llm applications • llm book • llm coding • llm development • llm engineering • LLMS • machine learning • machine learning development • ML • Prompt Engineering • training ai
File format: EPUB (Adobe DRM), 5.1 MB

