0% found this document useful (0 votes)

5 views10 pages

Building Generative AI Services With FastAPI51

The document discusses techniques for throttling and traffic shaping to manage data transmission rates for GenAI services, emphasizing the importance of safeguarding against misuse. It outlines optimization strategies for AI services, including batch processing, caching, and prompt engineering to enhance performance and output quality. The chapter concludes with a focus on implementing these techniques to improve efficiency and reduce costs in AI service operations.

Uploaded by

xiaowang198808

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views10 pages

Building Generative AI Services With FastAPI51

Uploaded by

xiaowang198808

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

self.

throttle_rate = throttle_rate

async def chat_stream(

self, prompt: str, mode: str = "sse"
) -> AsyncGenerator[str, None]:
stream = ... # OpenAI chat completion st
async for chunk in stream:
await [Link](self.throttle_rat
if [Link][0].[Link] is
yield (
f"data: {[Link][0].del
if mode == "sse"
else [Link][0].delta.c
)
await [Link](0.05)

if mode == "sse":
yield f"data: [DONE]\n\n"

Set a fixed throttling rate or dynamically adjust based on

usage.

Slow down the streaming rate without blocking the event

loop.

You can then use the throttled stream within an SSE or

WebSocket endpoint. Or, you can limit the number of active
WebSocket connections per your own custom policies.

Alongside the application-level throttling for real-time streams,

you can also leverage traffic shaping at the infrastructure layer.
TRAFFIC SHAPING

While rate-limiting approaches can help you manage incoming

requests, you can also use throttling techniques like traffic
shaping to control the rate of data transmission.

Traffic shaping prioritizes certain types of data transfer to help

you prevent congestion, smooth out traffic bursts, and maintain
a consistent data flow to optimize application bandwidth usage,
as shown in Figure 9-3. This is especially useful for GenAI
services requiring real-time data transmission including chat
and video streaming.

To implement traffic shaping, you can use the tc command-

line tool within Linux to configure control rules on targets such
as network interfaces and Docker containers. These control
rules set bandwidth limits, intentional latency delays, packet
loss, and IP limits on specific targets to regulate the application
throughput.

While the traffic shaping technique can help you manage the
network bandwidth, bear in mind that it involves complex
algorithms and requires constant monitoring and dynamic
control of the packet flow. It may also introduce delays due to
its queuing methods when the network is heavily loaded.
Figure 9-3. Traffic shaping

Using safeguards, rate limits, and throttles should provide

enough barriers in protecting your services from abuse and
misuse.

In the next section, you’ll learn more about optimization

techniques that can help you reduce latency, increase response
quality, and throughput alongside reducing the costs of your
GenAI services.
Summary
This chapter provided a comprehensive summary of attack
vectors for GenAI services and how to safeguard them against
adversarial attempts, misuse, and abuse.

You learned to implement input and output guardrails

alongside evaluation and content filtering mechanisms to
moderate service usage. Alongside guardrails, you also
developed API rate-limiting and throttling protections to
manage server load and prevent abuse.

In the next chapter, we will learn about optimizing AI services

through various techniques such as caching, batch processing,
model quantizing, prompt engineering, and model fine-tuning.

1
Inspired by OpenAI Cookbook’s “How to Implement LLM Guardrails”.
Chapter 10. Optimizing AI Services
CHAPTER GOALS

In this chapter, you will learn about:

Optimization techniques such as keyword, semantic, and

context caching
Advanced prompt engineering techniques to maximize
model generation quality and coherence
Model quantization and the difference that quantized models
make in model serving
Using batch processing APIs for larger AI workloads
The benefits, drawbacks, and use cases of model fine-tuning

In this chapter, you’ll learn to further optimize your services via

prompt engineering, model quantization, and caching
mechanisms.

Optimization Techniques
The objectives of optimizing an AI service are to either improve
output quality or performance (latency, throughput, costs, etc.).

Performance-related optimizations include the following:

Using batch processing APIs

Caching (keyword, semantic, context, or prompt)
Model quantization

Quality-related optimizations include the following:

Using structured outputs

Prompt engineering
Model fine-tuning

Let’s review each in more detail.

Batch Processing

Often you want an LLM to process batches of entries at the

same time. The most obvious solution is to submit multiple API
calls per entry. However, the obvious approach can be costly
and slow and may lead to your model provider rate limiting
you.

In such cases, you can leverage two separate techniques for

batch processing your data through an LLM:

Updating your structured output schemas to return multiple

examples at the same time
Identifying and using model provider APIs that are designed
for batch processing
The first solution requires you to update your Pydantic models
or template prompts to request a list of outputs per request. In
this case, you can batch process your data within a handful of
requests instead of one per entry.

An implementation of the first solution is shown in Example 10-

Example 10-1. Updating structured output schema for

parsing multiple items

from pydantic import BaseModel

class BatchDocumentClassification(BaseModel):
class Category(BaseModel):
document_id: str
category: list[str]

categories: list[Category]

Update the Pydantic model to include a list of Category

models.

You can now pass the new schema alongside a list of document
titles to the OpenAI client to process multiple entries in a single
API call. However, an alternative and possibly the best solution
will be to use a batch API, if available.

Luckily, model providers such as OpenAI already supply

relevant APIs for such use cases. Under the hood, these
providers may run task queues to process any single batch job
in the background while providing you with status updates
until the batch is complete to retrieve the results.

Compared to using standard endpoints directly, you’ll be able to

send asynchronous groups of requests with lower costs (up to
1
50% with OpenAI ), enjoy higher rate limits, and guarantee
completion times. The batch job service is ideal for processing
jobs that don’t require immediate responses such as using
OpenAI LLMs to parse, classify, or translate large volumes of
documents in the background.

To submit a batch job, you’ll need a jsonl file where each line
contains the details of an individual request to the API, as
shown in Example 10-2. Also as seen in this example, to create
the JSONL file, you can iterate over your entries and
dynamically generate the file.

Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Data Science Projects Overview at Reliance
No ratings yet
Data Science Projects Overview at Reliance
27 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Optimize LLMs with OpenVINO Server
No ratings yet
Optimize LLMs with OpenVINO Server
22 pages
Accelerating Deep Learning for Recommendations
No ratings yet
Accelerating Deep Learning for Recommendations
5 pages
Python Code for Context Engineering
No ratings yet
Python Code for Context Engineering
11 pages
Building Generative AI Services With FastAPI51
No ratings yet
Building Generative AI Services With FastAPI51
10 pages
Model Optimization Techniques Overview
No ratings yet
Model Optimization Techniques Overview
17 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Effective Prompt Engineering Guide
No ratings yet
Effective Prompt Engineering Guide
25 pages
Architectural Enhancements for Assessment AI
No ratings yet
Architectural Enhancements for Assessment AI
8 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Enhancing LLM Reliability and Agency
No ratings yet
Enhancing LLM Reliability and Agency
4 pages
Building Generative AI Services With FastAPI51
No ratings yet
Building Generative AI Services With FastAPI51
10 pages
Deploying Transformer Models for NLP
No ratings yet
Deploying Transformer Models for NLP
4 pages
Techniques For Efficient LLM Deployment
No ratings yet
Techniques For Efficient LLM Deployment
15 pages
Fastapi Interview
No ratings yet
Fastapi Interview
11 pages
OpenAI API Guide for Python Developers
No ratings yet
OpenAI API Guide for Python Developers
7 pages
Improvement Roadmap - MD
No ratings yet
Improvement Roadmap - MD
22 pages
OpenAI Chat Completion Client Guide
No ratings yet
OpenAI Chat Completion Client Guide
12 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Xplain How You Would Design A Backend Service in P
No ratings yet
Xplain How You Would Design A Backend Service in P
23 pages
Mastering LangChain A Applications
No ratings yet
Mastering LangChain A Applications
5 pages
Learning Flask Framework
No ratings yet
Learning Flask Framework
3 pages
Learning Flask Framework
No ratings yet
Learning Flask Framework
3 pages
Learning Flask Framework
No ratings yet
Learning Flask Framework
3 pages
Learning Flask Framework
No ratings yet
Learning Flask Framework
3 pages
Learning Flask Framework
No ratings yet
Learning Flask Framework
3 pages
Building Generative AI
No ratings yet
Building Generative AI
10 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Building Generative AI Services With FastAPI51
No ratings yet
Building Generative AI Services With FastAPI51
10 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Serving Static Files: Deploying Your Application
No ratings yet
Serving Static Files: Deploying Your Application
10 pages
Test - Py, Exampletest: Testing Flask Apps
No ratings yet
Test - Py, Exampletest: Testing Flask Apps
10 pages
Login and Logout Views: Begin Building Our Actual Login View, Let's Start With The
No ratings yet
Login and Logout Views: Begin Building Our Actual Login View, Let's Start With The
10 pages
Sessions: Authenticating Users
No ratings yet
Sessions: Authenticating Users
10 pages
Mastering 2025
No ratings yet
Mastering 2025
5 pages
Preprocessors and Postprocessors: Ajax and Restful Apis
No ratings yet
Preprocessors and Postprocessors: Ajax and Restful Apis
10 pages
Creating A URL Scheme: Templates and Views
No ratings yet
Creating A URL Scheme: Templates and Views
10 pages
Cleaning Up: - Deleted Status - Deleted
No ratings yet
Cleaning Up: - Deleted Status - Deleted
10 pages
Reading Values From The Request: Specifying A Name? As You Can See, The Flask Development Server Will Return A
No ratings yet
Reading Values From The Request: Specifying A Name? As You Can See, The Flask Development Server Will Return A
10 pages
Creating The Entry Table 29 Working With The Entry Model 30
No ratings yet
Creating The Entry Table 29 Working With The Entry Model 30
10 pages
Creating Your First Flask Application: WWW - It-Ebooks - Info
No ratings yet
Creating Your First Flask Application: WWW - It-Ebooks - Info
10 pages
Com/submit Errata
No ratings yet
Com/submit Errata
10 pages
Generative AI: Transforming Industries
No ratings yet
Generative AI: Transforming Industries
43 pages
SAP QM Data Migration and Management Guide
No ratings yet
SAP QM Data Migration and Management Guide
4 pages
Azure Well Architected
No ratings yet
Azure Well Architected
91 pages
Xamarin Consultant Available in Melbourne
No ratings yet
Xamarin Consultant Available in Melbourne
6 pages
Understanding Interprocess Communication
No ratings yet
Understanding Interprocess Communication
30 pages
Database Administration Course Syllabus
No ratings yet
Database Administration Course Syllabus
4 pages
Define Order Number Ranges in SAP
No ratings yet
Define Order Number Ranges in SAP
5 pages
Migrating to Parcel Fabric in ArcGIS Pro
No ratings yet
Migrating to Parcel Fabric in ArcGIS Pro
42 pages
IISP Knowledge Framework v1.0 - 2017-August
No ratings yet
IISP Knowledge Framework v1.0 - 2017-August
195 pages
Understanding Network Attack Types
No ratings yet
Understanding Network Attack Types
9 pages
Gigastudio 4 Implementation Guide
No ratings yet
Gigastudio 4 Implementation Guide
52 pages
NSC March 2016 Exam MS Final
No ratings yet
NSC March 2016 Exam MS Final
13 pages
QA Job Email List for Recruiters
No ratings yet
QA Job Email List for Recruiters
61 pages
The Role of Commercial Banks in The Digital Economy in Uzbekistan
No ratings yet
The Role of Commercial Banks in The Digital Economy in Uzbekistan
5 pages
معوقات التجارة الإلكترونية في الجزائر
No ratings yet
معوقات التجارة الإلكترونية في الجزائر
21 pages
ZoomOn PremiseDeployment
No ratings yet
ZoomOn PremiseDeployment
18 pages
Introduction to Object-Oriented Programming
No ratings yet
Introduction to Object-Oriented Programming
14 pages
Ticket Writing Guide
No ratings yet
Ticket Writing Guide
4 pages
FortiAnalyzer ADOM Management Guide
No ratings yet
FortiAnalyzer ADOM Management Guide
5 pages
Experiment No 3 Hadoop MapReduce Application For Counting Frequency of Word
No ratings yet
Experiment No 3 Hadoop MapReduce Application For Counting Frequency of Word
9 pages
Image Encryption Project Report
No ratings yet
Image Encryption Project Report
40 pages
Spark RDD Operations and MongoDB Basics
No ratings yet
Spark RDD Operations and MongoDB Basics
40 pages
ArchiMate 3.0 Overview
0% (1)
ArchiMate 3.0 Overview
1 page
Universal Messaging Developer Guide: Innovation Release
No ratings yet
Universal Messaging Developer Guide: Innovation Release
287 pages
WGU Cybersecurity Transfer Credit Guide
100% (1)
WGU Cybersecurity Transfer Credit Guide
3 pages
Sunbeam Internship Program Overview
No ratings yet
Sunbeam Internship Program Overview
14 pages
Information Security Management Policy
No ratings yet
Information Security Management Policy
7 pages
X33FCON 2024: Maldev Packer Workshop
No ratings yet
X33FCON 2024: Maldev Packer Workshop
24 pages
Overview of Joint Regional Security Stacks
No ratings yet
Overview of Joint Regional Security Stacks
2 pages
Internship Review: SJC Institute IT
No ratings yet
Internship Review: SJC Institute IT
11 pages