
Building Generative AI

The document discusses strategies for loading and serving models in FastAPI, highlighting the trade-offs between dynamically swapping models and preloading them for efficiency. It emphasizes that while model swapping may be suitable for prototyping with limited memory, preloading during application startup is recommended for production to improve response times. Additionally, it introduces external model serving options, such as using cloud providers or BentoML, which can enhance performance and manage concurrent requests more effectively.

Uploaded by

xiaowang198808

As the model is unloaded after use, the memory is released to be used by another process or model. With this approach, you can dynamically swap between various models in a single request if processing time isn’t a concern. This means other concurrent requests must wait before the server responds to them.

When serving requests, FastAPI will queue incoming requests and process them in first-in, first-out (FIFO) order. This behavior will lead to long waiting times, as a model needs to be loaded and unloaded on every request. In most cases, this strategy is not recommended, but if you need to swap between multiple large models and you don’t have sufficient RAM, then you can adopt it for prototyping. In production scenarios, however, you should avoid this strategy because your users will not tolerate the long wait times.

Figure 3-28 shows this model service strategy.


Figure 3-28. Loading and using models on every request

If you need to use different models in each request and have limited memory, this method can work well for quickly trying things out on a less powerful machine with just a few users. The trade-off is significantly slower processing due to model swapping. In production scenarios, however, it is better to provision more RAM and use the model preloading strategy with the FastAPI application lifespan.
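
The load-use-unload cycle described above can be sketched in plain Python. This is only an illustration under stated assumptions: `load_model`, `swapped_in`, and `handle_request` are hypothetical names, and the "model" is a stand-in dictionary instead of real weights.

```python
import gc
from contextlib import contextmanager

# Tracks which (stand-in) models currently occupy memory
loaded_models: dict[str, object] = {}


def load_model(name: str) -> dict:
    """Hypothetical loader; a real one would read large weights from disk."""
    return {"name": name}


@contextmanager
def swapped_in(name: str):
    """Load a model for the duration of one request, then unload it."""
    loaded_models[name] = load_model(name)
    try:
        yield loaded_models[name]
    finally:
        # Drop the reference and collect so the memory can be reclaimed
        del loaded_models[name]
        gc.collect()


def handle_request(model_name: str, prompt: str) -> str:
    # Every request pays the full load and unload cost
    with swapped_in(model_name) as model:
        return f"{model['name']} output for: {prompt}"


result = handle_request("text2image", "a red bicycle")
```

Because every call to `handle_request` repeats the load and unload steps, the per-request latency includes the full model loading time, which is the cost this strategy trades for a lower memory footprint.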

Be Compute Efficient: Preload Models with the FastAPI Lifespan

The most compute-efficient strategy for loading models in FastAPI is to use the application lifespan. With this approach, you load models on application startup and unload them on shutdown. During shutdown, you can also run any required cleanup steps, such as filesystem cleanup or logging.

The main benefit of this strategy compared to the first one is that you avoid reloading heavy models on each request. You can load a heavy model once and then serve every incoming request using the preloaded model. As a result, you will save several minutes in processing time in exchange for a significant chunk of your RAM (or VRAM if using a GPU). In return, your application’s user experience will improve considerably due to shorter response times.

Figure 3-29 shows the model-serving strategy that uses the application lifespan.

Figure 3-29. Using the FastAPI application lifespan to preload models

You can implement model preloading using the application lifespan, as shown in Example 3-16.

Example 3-16. Model preloading with application lifespan

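# main.py (assumed filename; the source shows only a garbled link)
from contextlib import asynccontextmanager
from typing import AsyncIterator

from fastapi import FastAPI, Response, status

from models import load_image_model, generate_image
from utils import img_to_bytes

# Global registry that holds the preloaded model(s)
models = {}


@asynccontextmanager
async def lifespan(_: FastAPI) -> AsyncIterator[None]:
    # Startup: load the model once, before any request is handled
    models["text2image"] = load_image_model()

    yield  # The application serves requests while suspended here

    ...  # Run cleanup code here

    # Shutdown: drop the model references so memory can be released
    models.clear()


app = FastAPI(lifespan=lifespan)


# The GET method, the "image/png" media type, the `prompt: str` parameter,
# and the argument order of generate_image are assumed: these lines are
# garbled or truncated in the source.
@app.get(
    "/generate/image",
    responses={status.HTTP_200_OK: {"content": {"image/png": {}}}},
    response_class=Response,
)
def serve_text_to_image_model_controller(prompt: str):
    output = generate_image(models["text2image"], prompt)
    return Response(content=img_to_bytes(output), media_type="image/png")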

1. Initialize an empty mutable dictionary at the global application scope to hold one or multiple models.

2. Use the asynccontextmanager decorator to handle startup and shutdown events as part of an async context manager:

   - The context manager will run code before and after the yield keyword.
   - The yield keyword in the decorated lifespan function separates the startup and shutdown phases.
   - Code prior to the yield keyword runs at application startup before any requests are handled.
   - When you want to terminate the application, FastAPI will run the code after the yield keyword as part of the shutdown phase.

3. Preload the model on startup into the models dictionary.

4. Start handling requests as the startup phase is now finished.

5. Clear the model on application shutdown.

6. Create the FastAPI server and pass it the lifespan function to use.

7. Pass the global preloaded model instance to the generation function.

If you start the application now, you should immediately see the model pipelines being loaded into memory. Before you applied these changes, the model pipelines were loaded only when you made your first request.
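
To see how the yield keyword separates the startup and shutdown phases without running a server, the following minimal sketch drives a lifespan-style context manager with the standard library only. The `run_app` function and the `events` list are illustrative stand-ins for what the framework does, not FastAPI's actual internals.

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []
models: dict[str, object] = {}


@asynccontextmanager
async def lifespan(_app):
    # Startup phase: runs before anything is served
    models["demo"] = object()  # stand-in for a heavy model
    events.append("startup")
    yield
    # Shutdown phase: runs once serving has finished
    models.clear()
    events.append("shutdown")


async def run_app():
    # Stand-in for the server: enter the lifespan, serve, then exit
    async with lifespan(None):
        events.append(f"serving with {len(models)} preloaded model(s)")


asyncio.run(run_app())
print(events)
# → ['startup', 'serving with 1 preloaded model(s)', 'shutdown']
```

The model exists for the whole serving phase and is cleared only after the `async with` block exits, which mirrors the ordering described in the list above.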

WARNING

You can preload more than one model into memory using the lifespan model-serving strategy, but this isn’t practical with large GenAI models. Generative models can be resource hungry, and in most cases you’ll need GPUs to speed up the generation process. High-end consumer GPUs typically ship with around 24 GB of VRAM, and some models require 18 GB of memory to perform inference, so try to deploy models on separate application instances and GPUs instead.
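
The arithmetic behind this warning can be made explicit with a small check. This is a rough sketch: `fits_in_vram` is a hypothetical helper, the sizes are the illustrative figures from the warning, and real deployments also need headroom for activations and framework overhead.

```python
def fits_in_vram(model_sizes_gb: list[float], vram_gb: float = 24.0) -> bool:
    """Return True if the combined model footprint fits in the VRAM budget."""
    return sum(model_sizes_gb) <= vram_gb


one_model = fits_in_vram([18.0])         # 18 GB fits in 24 GB
two_models = fits_in_vram([18.0, 18.0])  # 36 GB does not fit in 24 GB
```

With these figures a single 18 GB model fits on a 24 GB card, but a second one does not, which is why separate instances and GPUs are the safer choice.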

STARTUP AND SHUTDOWN EVENTS

Before FastAPI 0.93.0 introduced lifespan async context managers for handling the application lifespan, separate startup and shutdown event handler functions were commonly used. Example 3-17 shows how they are used.

Example 3-17. Startup and shutdown events

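# main.py (assumed filename; the source shows only a garbled link)
from fastapi import FastAPI

from models import load_image_model

models = {}
app = FastAPI()


@app.on_event("startup")
def startup_event():
    # Load the model once when the application starts
    models["text2image"] = load_image_model()


@app.on_event("shutdown")
def shutdown_event():
    # "log.txt" is an assumed filename; it is garbled in the source
    with open("log.txt", mode="a") as logfile:
        logfile.write("Application shutdown")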

A few resources across the web may still use this legacy approach, so it is worth knowing.

Be Lean: Serve Models Externally

Another strategy for serving GenAI models is to package them as external services using other tools. You can then use your FastAPI application as the logical layer between your client and the external model server. In this layer, you can handle coordination between models, communication with APIs, user management, security measures, monitoring, content filtering, prompt enhancement, or any other required logic.
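
As a rough sketch of this logical layer, the function below applies a content filter and a prompt enhancement before building the HTTP request for an external model server. Everything specific here is an assumption for illustration: the `MODEL_SERVER_URL` endpoint, the `BLOCKED_TERMS` filter, the JSON payload shape, and the enhancement suffix are all hypothetical.

```python
import json
from urllib.request import Request

MODEL_SERVER_URL = "http://localhost:3000/generate"  # hypothetical endpoint
BLOCKED_TERMS = {"password"}  # hypothetical content filter


def build_model_request(prompt: str) -> Request:
    """Filter and enhance a prompt, then prepare the call to the model server."""
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        raise ValueError("prompt rejected by content filter")
    enhanced = f"{prompt.strip()}, high quality"  # hypothetical enhancement
    body = json.dumps({"prompt": enhanced}).encode("utf-8")
    return Request(
        MODEL_SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_model_request("  a red bicycle ")
```

A FastAPI route handler could call a function like this and then send the request, for example with `urllib.request.urlopen(request)` or an async HTTP client, so the heavy model never has to live in the API process.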

Cloud providers

Cloud providers are constantly releasing new serverless and dedicated compute solutions that you can use to serve your models externally. For instance, Azure Machine Learning Studio now provides a PromptFlow tool that you can use to deploy and customize OpenAI or open source language models. Upon deployment, you will receive a model endpoint running on your Azure compute, ready to use. However, PromptFlow and similar tools have a steep learning curve, as they may require particular dependencies and nonstandard steps.

BentoML

Another great contender for serving models external to FastAPI is BentoML. BentoML is inspired by FastAPI but implements a different serving strategy, purpose-built for AI models.

A huge improvement over FastAPI for handling concurrent model requests is BentoML’s ability to run different requests on different worker processes. It can parallelize CPU-bound requests without you having to deal with Python multiprocessing directly. On top of this, BentoML can also batch model inferences so that generation for multiple users can be done with a single model call.
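
The batching idea can be illustrated conceptually in plain Python. This sketch is not BentoML's API or implementation: `run_batched` and `fake_model` are hypothetical names that only show how grouping prompts reduces the number of model calls.

```python
from typing import Callable


def run_batched(
    prompts: list[str],
    model_fn: Callable[[list[str]], list[str]],
    max_batch_size: int = 4,
) -> tuple[list[str], int]:
    """Run prompts through model_fn in batches; return outputs and call count."""
    outputs: list[str] = []
    calls = 0
    for start in range(0, len(prompts), max_batch_size):
        batch = prompts[start:start + max_batch_size]
        outputs.extend(model_fn(batch))  # one model call serves the whole batch
        calls += 1
    return outputs, calls


def fake_model(batch: list[str]) -> list[str]:
    # Stand-in for a real model that accepts a list of inputs
    return [prompt.upper() for prompt in batch]


outputs, calls = run_batched(["a", "b", "c", "d", "e"], fake_model)
print(outputs, calls)
# → ['A', 'B', 'C', 'D', 'E'] 2
```

Five prompts are served with two model calls instead of five. A serving framework does this grouping across concurrent users within a short time window, which is considerably harder to get right by hand.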

I covered BentoML in detail in Chapter 2.

TIP

To run BentoML, you will need to install a few dependencies first:

$ pip install bentoml

You can see how to start a BentoML server in Example 3-18.
