Because the model is unloaded after use, its memory is released
for use by another process or model. With this approach, you
can dynamically swap between models within a single request if
processing time isn’t a concern. The drawback is that other
concurrent requests must wait before the server responds to them.
When serving requests, FastAPI will queue incoming requests
and process them in first-in, first-out (FIFO) order. This
behavior leads to long waiting times because a model needs to
be loaded and unloaded on every request. In most cases, this
strategy is not recommended, but if you need to swap between
multiple large models and you don’t have sufficient RAM, then
you can adopt it for prototyping. Avoid this strategy in
production, however, because your users will not tolerate the
long wait times.
Figure 3-28 shows this model service strategy.
Figure 3-28. Loading and using models on every request
If you need to use different models in each request and have
limited memory, this method can work well for quickly trying
things on a less powerful machine with just a few users. The
trade-off is significantly slower processing due to model
swapping. In production scenarios, it is better to provision
more RAM and use the model preloading strategy with the FastAPI
application lifespan.
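The load-on-every-request pattern described above can be
sketched with the standard library alone. This is a minimal
illustration, not the chapter's implementation: FakeModel,
LOADERS, loaded_model, and handle_request are hypothetical names,
and the loaders stand in for real pipelines that would take
seconds or minutes to load.

```python
import gc
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class FakeModel:
    """Hypothetical stand-in for a heavy model pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Hypothetical registry mapping model names to loader functions.
LOADERS: Dict[str, Callable[[], FakeModel]] = {
    "text": lambda: FakeModel("text"),
    "image": lambda: FakeModel("image"),
}


@contextmanager
def loaded_model(name: str) -> Iterator[FakeModel]:
    # With real models this is the slow step, and it runs on every request.
    model = LOADERS[name]()
    try:
        yield model
    finally:
        del model  # drop this function's reference to the model
        gc.collect()  # ask Python to reclaim unreachable objects now


def handle_request(model_name: str, prompt: str) -> str:
    # Each request pays the full load and unload cost.
    with loaded_model(model_name) as model:
        return model.generate(prompt)


print(handle_request("text", "hello"))
print(handle_request("image", "a cat"))
```

Every call to handle_request repeats the load step, which is why
this strategy only suits prototyping. With a GPU framework you
would typically also need to release cached device memory after
deleting the model.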
Be Compute Efficient: Preload Models with the FastAPI Lifespan
The most compute-efficient strategy for loading models in
FastAPI is to use the application lifespan. With this approach,
you load models on application startup and unload them on
shutdown. During shutdown, you can also undertake any
cleanup steps required, such as filesystem cleanup or logging.
The main benefit of this strategy compared to the first one
mentioned is that you avoid reloading heavy models on each
request. You can load a heavy model once and then serve every
incoming request from the preloaded model. As a result, you can
save several minutes in processing time in exchange for a
significant chunk of your RAM (or VRAM if using a GPU). In
return, your application’s user experience will improve
considerably thanks to shorter response times.
Figure 3-29 shows the model-serving strategy that uses
application lifespan.
Figure 3-29. Using the FastAPI application lifespan to preload models
You can implement model preloading using the application
lifespan, as shown in Example 3-16.
Example 3-16. Model preloading with application lifespan
# [Link]
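from contextlib import asynccontextmanager
from typing import AsyncIterator

from fastapi import FastAPI, Response, status

from models import load_image_model, generate_image
from utils import img_to_bytes

# Global registry that holds the preloaded models
models = {}


@asynccontextmanager
async def lifespan(_: FastAPI) -> AsyncIterator[None]:
    # Startup: preload the model before any request is handled
    models["text2image"] = load_image_model()
    yield
    ...  # Run cleanup code here
    # Shutdown: release the preloaded models
    models.clear()


app = FastAPI(lifespan=lifespan)


@app.get(
    "/generate/image",
    # Assumption: PNG output (the media type is cut off in the source listing)
    responses={status.HTTP_200_OK: {"content": {"image/png": {}}}},
    response_class=Response,
)
def serve_text_to_image_model_controller(prompt: str):
    # Reuse the preloaded model instead of loading it on each request
    output = generate_image(models["text2image"], prompt)
    return Response(content=img_to_bytes(output), media_type="image/png")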
- Initialize an empty mutable dictionary at the global
  application scope to hold one or multiple models.
- Use the asynccontextmanager decorator to handle
  startup and shutdown events as part of an async context
  manager:
    - The context manager will run code before and after
      the yield keyword.
    - The yield keyword in the decorated lifespan
      function separates the startup and shutdown phases.
    - Code prior to the yield keyword runs at application
      startup before any requests are handled.
    - When you want to terminate the application, FastAPI
      will run the code after the yield keyword as part of
      the shutdown phase.
- Preload the model on startup onto the models dictionary.
- Start handling requests as the startup phase is now
  finished.
- Clear the model on application shutdown.
- Create the FastAPI server and pass it the lifespan function
  to use.
- Pass the global preloaded model instance to the
  generation function.
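The ordering of the startup and shutdown phases around yield can
be reproduced without FastAPI. The following sketch uses only
the standard library; the events list, the run_app function, and
the "fake-model" placeholder are assumptions made for
illustration, and the async with block plays the role that the
server plays when it runs your lifespan function.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict, List

events: List[str] = []
models: Dict[str, str] = {}


@asynccontextmanager
async def lifespan(app: object) -> AsyncIterator[None]:
    # Startup phase: everything before yield
    events.append("startup")
    models["text2image"] = "fake-model"  # placeholder for a real loader
    yield  # the application would serve requests while suspended here
    # Shutdown phase: everything after yield
    models.clear()
    events.append("shutdown")


async def run_app() -> None:
    # Entering the block runs startup; leaving it runs shutdown.
    async with lifespan(None):
        events.append(f"request served with {models['text2image']}")


asyncio.run(run_app())
print(events)
# ['startup', 'request served with fake-model', 'shutdown']
```

The request is handled only after the model is in the dictionary,
and the dictionary is emptied only after the last request
finishes.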
If you start the application now, you should immediately see
model pipelines being loaded into memory. Before you applied
these changes, the model pipelines were loaded only when you
made your first request.
WARNING
You can preload more than one model into memory using the lifespan model-serving
strategy, but this is rarely practical with large GenAI models. Generative models can be
resource hungry, and in most cases you’ll need GPUs to speed up the generation
process. Even high-end consumer GPUs often ship with only around 24 GB of VRAM, and
some models require 18 GB of memory to perform inference, so try to deploy models on
separate application instances and GPUs instead.
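A quick budget check makes the constraint concrete. This is a
rough sketch using the 24 GB and 18 GB figures from the warning;
fits_in_vram is a hypothetical helper, and it ignores real-world
overheads such as activations and framework buffers.

```python
from typing import List


def fits_in_vram(
    model_sizes_gb: List[float], vram_gb: float, headroom_gb: float = 0.0
) -> bool:
    """Return True if all models plus optional headroom fit in VRAM."""
    return sum(model_sizes_gb) + headroom_gb <= vram_gb


print(fits_in_vram([18], vram_gb=24))      # True: one model fits
print(fits_in_vram([18, 18], vram_gb=24))  # False: 36 GB exceeds 24 GB
```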
STARTUP AND SHUTDOWN EVENTS
Before the introduction of lifespan async context managers in
FastAPI 0.93.0 for handling the application lifespan, separate
startup and shutdown event handler functions were commonly
used. Example 3-17 shows an example usage.
Example 3-17. Startup and shutdown events
# [Link]
from fastapi import FastAPI

from models import load_image_model

models = {}
app = FastAPI()


@app.on_event("startup")
def startup_event():
    # Preload the model once, before requests are served
    models["text2image"] = load_image_model()


@app.on_event("shutdown")
def shutdown_event():
    # Append a shutdown entry to the log file
    with open("[Link]", mode="a") as logfile:
        logfile.write("Application shutdown")
Some resources across the web still use this legacy approach,
which FastAPI has since deprecated in favor of lifespan
handlers, so it is worth knowing.
Be Lean: Serve Models Externally
Another strategy to serve GenAI models is to package them as
external services via other tools. You can then use your FastAPI
application as the logical layer between your client and the
external model server. In this logical layer, you can handle
coordination between models, communication with APIs,
management of users, security measures, monitoring activities,
content filtering, enhancing prompts, or any other required
logic.
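As a rough sketch of such a logical layer, the function below
filters and enhances a prompt before handing it to an external
model server. Everything here is hypothetical: BLOCKED_TERMS,
ContentRejected, enhance_prompt, and handle_generation are
illustrative names, and call_model_server is an injected
callable that stands in for whatever HTTP client you use to
reach the real server.

```python
from typing import Callable

BLOCKED_TERMS = {"password", "credit card"}  # hypothetical filter list


class ContentRejected(Exception):
    """Raised when a prompt fails the content filter."""


def enhance_prompt(prompt: str) -> str:
    # Illustrative prompt enhancement: tidy whitespace, append style hints.
    return f"{prompt.strip()}, high quality, detailed"


def handle_generation(
    prompt: str, call_model_server: Callable[[str], bytes]
) -> bytes:
    # 1. Content filtering happens in the FastAPI layer, not the model server.
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        raise ContentRejected("prompt rejected by content filter")
    # 2. Enhance the prompt, then delegate generation to the external server.
    return call_model_server(enhance_prompt(prompt))


def fake_model_server(prompt: str) -> bytes:
    # Stand-in for a network call to the external model server.
    return f"image for: {prompt}".encode()


print(handle_generation("  a red fox ", fake_model_server))
# b'image for: a red fox, high quality, detailed'
```

Injecting the model-server call keeps the layer easy to test and
lets you swap providers without touching the filtering or
prompt-enhancement logic.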
Cloud providers
Cloud providers are constantly innovating serverless and
dedicated compute solutions that you can use to serve your
models externally. For instance, Azure Machine Learning Studio
now provides a PromptFlow tool that you can use to deploy and
customize OpenAI or open source language models. Upon
deployment, you will receive a model endpoint that runs on your
Azure compute and is ready to use. However, PromptFlow and
similar tools have a steep learning curve, as they may require
particular dependencies and nonstandard steps.
BentoML
Another great contender for serving models external to FastAPI
is BentoML. BentoML is inspired by FastAPI but implements a
different serving strategy, purpose built for AI models.
A huge improvement over FastAPI for handling concurrent
model requests is BentoML’s ability to run different requests on
different worker processes. It can parallelize CPU-bound
requests without you having to directly deal with Python
multiprocessing. On top of this, BentoML can also batch model
inferences such that the generation process for multiple users
can be done with a single model call.
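To illustrate the batching idea, here is a minimal micro-batcher
written with asyncio. It is not BentoML's API or implementation:
MicroBatcher, fake_model, and batches_seen are hypothetical
names, the model is a stand-in that uppercases prompts, and
error handling is omitted for brevity.

```python
import asyncio
from typing import Callable, List, Optional, Tuple

batches_seen: List[List[str]] = []


def fake_model(prompts: List[str]) -> List[str]:
    """Hypothetical batched model call: one invocation, many prompts."""
    batches_seen.append(list(prompts))
    return [prompt.upper() for prompt in prompts]


class MicroBatcher:
    def __init__(
        self, model: Callable[[List[str]], List[str]], max_wait_s: float = 0.01
    ) -> None:
        self.model = model
        self.max_wait_s = max_wait_s
        self.pending: List[Tuple[str, asyncio.Future]] = []
        self.flush_task: Optional[asyncio.Task] = None

    async def submit(self, prompt: str) -> str:
        # Park the request and wait until the batch it joins is processed.
        future = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, future))
        if self.flush_task is None:
            self.flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self) -> None:
        # Wait briefly so that concurrent requests can join the batch.
        await asyncio.sleep(self.max_wait_s)
        batch, self.pending = self.pending, []
        self.flush_task = None
        outputs = self.model([prompt for prompt, _ in batch])  # single call
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


async def main() -> List[str]:
    batcher = MicroBatcher(fake_model)
    prompts = ["a cat", "a dog", "a fox"]
    return await asyncio.gather(*(batcher.submit(p) for p in prompts))


results = asyncio.run(main())
print(results)            # ['A CAT', 'A DOG', 'A FOX']
print(len(batches_seen))  # 1
```

Three concurrent requests are answered by one model call. The
wait window trades a small amount of latency for higher
throughput, which is the same trade-off a serving framework has
to tune.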
I covered BentoML in detail in Chapter 2.
TIP
To run BentoML, you will need to install a few dependencies first:
$ pip install bentoml
You can see how to start a BentoML server in Example 3-18.