Building Generative AI

The document discusses strategies for loading and serving models in FastAPI, highlighting the trade-offs between dynamically swapping models and preloading them for efficiency. It emphasizes that while model swapping may be suitable for prototyping with limited memory, preloading during application startup is recommended for production to improve response times. Additionally, it introduces external model serving options, such as using cloud providers or BentoML, which can enhance performance and manage concurrent requests more effectively.

Uploaded by

xiaowang198808

As the model is unloaded after use, the memory is released to be used by another process or model. With this approach, you can dynamically swap between various models, even within a single request, if processing time isn’t a concern. The cost is that other concurrent requests must wait before the server responds to them.

When serving requests, FastAPI will effectively queue incoming requests and process them in first in, first out (FIFO) order. This behavior leads to long waiting times because a model needs to be loaded and unloaded on every request. In most cases, this strategy is not recommended, but if you need to swap between multiple large models and don’t have sufficient RAM, you can adopt it for prototyping. In production scenarios, however, you should avoid this strategy: your users will not tolerate the long wait times.

Figure 3-28 shows this model service strategy.

Figure 3-28. Loading and using models on every request

If you need to use different models in each request and have limited memory, this method can work well for quickly trying things out on a less powerful machine with just a few users. The trade-off is significantly slower processing due to model swapping. In production scenarios, it is better to provision more RAM and use the model preloading strategy with the FastAPI application lifespan.
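As a rough illustration of the swapping strategy described above, the following sketch loads a model inside each request and releases it afterward. It uses only the standard library; `FakeModel`, `load_model`, `swapped_in`, `handle_request`, and the `stats` counter are hypothetical stand-ins, and a real loader would read gigabytes of weights instead of returning instantly.

```python
import gc
from contextlib import contextmanager
from typing import Iterator

stats = {"loads": 0}  # counts how many times a model has been loaded


class FakeModel:
    """Hypothetical stand-in for a heavy generative model."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def load_model(name: str) -> FakeModel:
    """Pretend to load weights; a real loader may take seconds to minutes."""
    stats["loads"] += 1
    return FakeModel(name)


@contextmanager
def swapped_in(name: str) -> Iterator[FakeModel]:
    """Load a model for the duration of one request, then release it."""
    model = load_model(name)
    try:
        yield model
    finally:
        del model     # drop this function's reference to the model
        gc.collect()  # ask Python to reclaim unreachable objects now


def handle_request(model_name: str, prompt: str) -> str:
    # Every request pays the full load and unload cost.
    with swapped_in(model_name) as model:
        return model.generate(prompt)


if __name__ == "__main__":
    print(handle_request("text2image", "a red fox"))
    print(handle_request("text2audio", "rain on a roof"))
    print("loads:", stats["loads"])
```

Two requests trigger two loads, which is exactly the overhead that preloading removes.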

Be Compute Efficient: Preload Models with the FastAPI Lifespan

The most compute-efficient strategy for loading models in FastAPI is to use the application lifespan. With this approach, you load models on application startup and unload them on shutdown. During shutdown, you can also undertake any cleanup steps required, such as filesystem cleanup or logging.

The main benefit of this strategy compared to the first one is that you avoid reloading heavy models on each request. You load a heavy model once and then generate outputs for every incoming request using the preloaded model. As a result, you can save up to several minutes of processing time per request in exchange for a significant chunk of your RAM (or VRAM if using a GPU). The user experience of your application improves considerably thanks to the shorter response times.

Figure 3-29 shows the model-serving strategy that uses the application lifespan.

Figure 3-29. Using the FastAPI application lifespan to preload models

You can implement model preloading using the application lifespan, as shown in Example 3-16.

Example 3-16. Model preloading with application lifespan

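# [Link]
from contextlib import asynccontextmanager
from typing import AsyncIterator

from fastapi import FastAPI, Response, status

from models import load_image_model, generate_image
from utils import img_to_bytes

# Global registry that holds the preloaded model(s)
models = {}

@asynccontextmanager
async def lifespan(_: FastAPI) -> AsyncIterator[None]:
    # Startup: load the model once, before any request is handled
    models["text2image"] = load_image_model()

    yield  # The application serves requests here

    ...  # Run cleanup code here

    # Shutdown: drop the model references
    models.clear()

app = FastAPI(lifespan=lifespan)

# Assumption: the HTTP method and the argument values clipped in the
# source (the "image/png" media type) are reconstructed from context.
@app.get(
    "/generate/image",
    responses={status.HTTP_200_OK: {"content": {"image/png": {}}}},
    response_class=Response,
)
def serve_text_to_image_model_controller(prompt: str):
    # Reuse the preloaded model instead of loading it per request
    output = generate_image(models["text2image"], prompt)
    return Response(content=img_to_bytes(output), media_type="image/png")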

1. Initialize an empty mutable dictionary at the global application scope to hold one or multiple models.

2. Use the asynccontextmanager decorator to handle startup and shutdown events as part of an async context manager:

   - The context manager will run code before and after the yield keyword.
   - The yield keyword in the decorated lifespan function separates the startup and shutdown phases.
   - Code prior to the yield keyword runs at application startup before any requests are handled.
   - When you want to terminate the application, FastAPI will run the code after the yield keyword as part of the shutdown phase.

3. Preload the model on startup into the models dictionary.

4. Start handling requests as the startup phase is now finished.

5. Clear the model on application shutdown.

6. Create the FastAPI server and pass it the lifespan function to use.

7. Pass the global preloaded model instance to the generation function.
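To see the ordering described above without starting a server, the sketch below drives a lifespan-style context manager by hand using only the standard library. `FakeApp`, `load_fake_model`, and the `events` list are hypothetical names introduced for illustration; in a real application, FastAPI enters and exits the lifespan for you.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator

models: dict = {}
events: list = []  # records the order in which the phases run


class FakeApp:
    """Hypothetical placeholder for the FastAPI application object."""


def load_fake_model() -> object:
    """Hypothetical loader standing in for load_image_model()."""
    return object()


@asynccontextmanager
async def lifespan(_: FakeApp) -> AsyncIterator[None]:
    # Startup phase: runs before the first request.
    models["text2image"] = load_fake_model()
    events.append("startup")

    yield  # the application would serve requests here

    # Shutdown phase: runs when the application terminates.
    models.clear()
    events.append("shutdown")


async def main() -> None:
    async with lifespan(FakeApp()):
        # Inside the block, the model is already available.
        assert "text2image" in models
        events.append("serving")


if __name__ == "__main__":
    asyncio.run(main())
    print(events)  # ['startup', 'serving', 'shutdown']
```

The code before `yield` always runs first, and the code after it runs only once the `async with` block exits, which mirrors the startup and shutdown phases.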

If you start the application now, you should immediately see the model pipelines being loaded into memory. Before you applied these changes, the model pipelines were loaded only when you made your first request.

WARNING

You can preload more than one model into memory using the lifespan model-serving strategy, but this isn’t practical with large GenAI models. Generative models can be resource hungry, and in most cases you’ll need GPUs to speed up the generation process. Many high-end consumer GPUs ship with only 24 GB of VRAM, and some models require 18 GB of memory to perform inference, so try to deploy models on separate application instances and GPUs instead.
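The memory arithmetic behind this warning can be sketched in a few lines. The `fits_on_gpu` helper is hypothetical, the 24 GB and 18 GB figures are the ones quoted above, and the check deliberately ignores runtime overhead such as activations and framework buffers, so treat it as an optimistic lower bound.

```python
from typing import List


def fits_on_gpu(model_sizes_gb: List[float], vram_gb: float = 24.0) -> bool:
    """Return True if all models fit in VRAM at once (ignores runtime overhead)."""
    return sum(model_sizes_gb) <= vram_gb


if __name__ == "__main__":
    print(fits_on_gpu([18.0]))        # True: one 18 GB model fits in 24 GB
    print(fits_on_gpu([18.0, 18.0]))  # False: 36 GB exceeds 24 GB
```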

STARTUP AND SHUTDOWN EVENTS

Before the introduction of lifespan async context managers in FastAPI 0.93.0 for handling the application lifespan, separate startup and shutdown event handler functions were commonly used. Example 3-17 shows how they are used.

Example 3-17. Startup and shutdown events

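# [Link]
from fastapi import FastAPI

from models import load_image_model

models = {}
app = FastAPI()

@app.on_event("startup")
def startup_event():
    # Runs once before the application starts handling requests
    models["text2image"] = load_image_model()

@app.on_event("shutdown")
def shutdown_event():
    # Runs once when the application is shutting down
    with open("[Link]", mode="a") as logfile:
        logfile.write("Application shutdown")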

Some resources across the web still use this legacy approach, so it is worth knowing.

Be Lean: Serve Models Externally

Another strategy for serving GenAI models is to package them as external services via other tools. You can then use your FastAPI application as the logic layer between your client and the external model server. In this layer, you can handle coordination between models, communication with APIs, user management, security measures, monitoring, content filtering, prompt enhancement, or any other required logic.
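A minimal sketch of such a logic layer, using only the standard library, might look like the following. Everything specific here is an assumption made for illustration: the `/generate` path, the `{"prompt": ...}` request body, the `{"text": ...}` response shape, the `BLOCKED_WORDS` filter, and the function names. A real model server defines its own API.

```python
import json
import urllib.request
from typing import Callable

BLOCKED_WORDS = {"forbidden"}  # hypothetical content filter list


def call_model_server(prompt: str, base_url: str) -> str:
    """POST the prompt to a hypothetical /generate endpoint and return its text."""
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["text"]


def handle_generation(
    prompt: str,
    base_url: str,
    send: Callable[[str, str], str] = call_model_server,
) -> str:
    """Logic layer: filter, enhance, then delegate generation to the model server."""
    if any(word in prompt.lower() for word in BLOCKED_WORDS):
        raise ValueError("prompt rejected by content filter")
    enhanced = f"{prompt.strip()}, highly detailed"  # naive prompt enhancement
    return send(enhanced, base_url)


if __name__ == "__main__":
    # A fake sender keeps the demo offline; use call_model_server against a real server.
    def fake_send(prompt: str, base_url: str) -> str:
        return f"generated for: {prompt}"

    print(handle_generation("a red fox", "http://localhost:3000", send=fake_send))
```

Injecting the `send` function keeps the filtering and enhancement logic testable without a running model server, and the same function body could sit inside a FastAPI route handler.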

Cloud providers

Cloud providers are constantly introducing new serverless and dedicated compute solutions that you can use to serve your models externally. For instance, Azure Machine Learning Studio now provides a PromptFlow tool that you can use to deploy and customize OpenAI or open source language models. Upon deployment, you will receive a model endpoint running on your Azure compute, ready for use. However, there is a steep learning curve in using PromptFlow or similar tools, as they may require particular dependencies and nontraditional steps.

BentoML

Another great contender for serving models external to FastAPI is BentoML. BentoML is inspired by FastAPI but implements a different serving strategy, purpose-built for AI models.

A huge improvement over FastAPI for handling concurrent model requests is BentoML’s ability to run different requests on different worker processes. It can parallelize CPU-bound requests without you having to deal directly with Python multiprocessing. On top of this, BentoML can also batch model inferences so that the generation process for multiple users can be done with a single model call.
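To make the batching idea concrete, here is a small standard-library sketch of micro-batching with asyncio. It is not BentoML's implementation and does not cover worker processes; `MicroBatcher`, `fake_model_batch`, and the 10 ms collection window are hypothetical choices made purely to show how several concurrent requests can share one model call.

```python
import asyncio
from typing import List, Optional

model_calls = {"count": 0}  # counts how many times the model is invoked


def fake_model_batch(prompts: List[str]) -> List[str]:
    """Hypothetical model that processes a whole batch in one call."""
    model_calls["count"] += 1
    return [prompt.upper() for prompt in prompts]


class MicroBatcher:
    """Collect requests for a short window, then run them as one batch."""

    def __init__(self, window_seconds: float = 0.01) -> None:
        self.window_seconds = window_seconds
        self.pending: list = []  # (prompt, future) pairs awaiting a flush
        self.flush_task: Optional[asyncio.Task] = None

    async def submit(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, future))
        if self.flush_task is None:
            # The first request in a window schedules the flush.
            self.flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self) -> None:
        await asyncio.sleep(self.window_seconds)
        batch, self.pending = self.pending, []
        self.flush_task = None
        outputs = fake_model_batch([prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


async def main() -> List[str]:
    batcher = MicroBatcher()
    # Three "users" submit concurrently and share a single model call.
    return await asyncio.gather(
        batcher.submit("cat"), batcher.submit("dog"), batcher.submit("owl")
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
    print("model calls:", model_calls["count"])
```

The trade-off is a small added latency (the collection window) in exchange for better throughput when many requests arrive together.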

I covered BentoML in detail in Chapter 2.

TIP

To run BentoML, you first need to install it:

$ pip install bentoml

You can see how to start a BentoML server in Example 3-18.
