        self.throttle_rate = throttle_rate

    async def chat_stream(
        self, prompt: str, mode: str = "sse"
    ) -> AsyncGenerator[str, None]:
        stream = ...  # OpenAI chat completion stream
        async for chunk in stream:
            # Pause between chunks without blocking the event loop
            await asyncio.sleep(self.throttle_rate)
            if chunk.choices[0].delta.content is not None:
                yield (
                    f"data: {chunk.choices[0].delta.content}\n\n"
                    if mode == "sse"
                    else chunk.choices[0].delta.content
                )
                await asyncio.sleep(0.05)

        if mode == "sse":
            yield f"data: [DONE]\n\n"
Set a fixed throttling rate or dynamically adjust based on
usage.
Slow down the streaming rate without blocking the event
loop.
You can then use the throttled stream within an SSE or
WebSocket endpoint. Or, you can limit the number of active
WebSocket connections per your own custom policies.
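As a minimal sketch of the connection-limiting idea, the snippet below caps the number of concurrently active sessions. The ConnectionLimiter class and handle_connection function are hypothetical helpers written for illustration, and the asyncio.sleep call stands in for a real streaming session.

```python
import asyncio


class ConnectionLimiter:
    """Caps the number of concurrently active connections (hypothetical helper)."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.active = 0

    def try_acquire(self) -> bool:
        # Reject immediately instead of queueing, so clients get fast feedback.
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        # Free the slot once the connection closes.
        self.active = max(0, self.active - 1)


async def handle_connection(limiter: ConnectionLimiter, work_seconds: float) -> str:
    if not limiter.try_acquire():
        return "rejected"
    try:
        await asyncio.sleep(work_seconds)  # Stand-in for the streaming session.
        return "served"
    finally:
        limiter.release()


async def demo() -> list[str]:
    limiter = ConnectionLimiter(max_connections=2)
    # Three clients connect at once, but only two slots are available.
    return await asyncio.gather(
        *(handle_connection(limiter, 0.01) for _ in range(3))
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a WebSocket endpoint, you would likely call try_acquire before accepting the connection and release the slot in a finally block once the client disconnects. A per-user or per-IP policy could keep one limiter per key instead of a single global one.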
Alongside the application-level throttling for real-time streams,
you can also leverage traffic shaping at the infrastructure layer.
TRAFFIC SHAPING
While rate-limiting approaches can help you manage incoming
requests, you can also use throttling techniques like traffic
shaping to control the rate of data transmission.
Traffic shaping prioritizes certain types of data transfer to help
you prevent congestion, smooth out traffic bursts, and maintain
a consistent data flow to optimize application bandwidth usage,
as shown in Figure 9-3. This is especially useful for GenAI
services requiring real-time data transmission including chat
and video streaming.
To implement traffic shaping, you can use the tc command-line
tool within Linux to configure control rules on targets such
as network interfaces and Docker containers. These control
rules set bandwidth limits, intentional latency delays, packet
loss, and IP limits on specific targets to regulate the application
throughput.
While the traffic shaping technique can help you manage the
network bandwidth, bear in mind that it involves complex
algorithms and requires constant monitoring and dynamic
control of the packet flow. It may also introduce delays due to
its queuing methods when the network is heavily loaded.
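To make the queuing delay concrete, here is an application-level sketch of a token bucket, one of the classic algorithms behind traffic shaping. It is an illustration only and not how tc itself is configured. The TokenBucket class, its parameter names, and the rates used are assumptions chosen for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket shaper (hypothetical helper, not a tc binding)."""

    def __init__(
        self,
        rate_bytes_per_sec: float,
        burst_bytes: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # Start full so an initial burst is allowed.
        self.clock = clock
        self.last_refill = clock()

    def _refill(self) -> None:
        # Add tokens for the elapsed time, never exceeding the burst capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def delay_for(self, size_bytes: int) -> float:
        """Consume tokens for a packet and return the seconds to wait before sending."""
        self._refill()
        self.tokens -= size_bytes
        if self.tokens >= 0:
            return 0.0
        # A negative balance is a debt that must be repaid at the refill rate.
        return -self.tokens / self.rate


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_sec=1000, burst_bytes=500)
    for size in (500, 250, 250):
        print(f"{size} bytes -> wait {bucket.delay_for(size):.2f}s")
```

The sender sleeps for the returned delay before transmitting, so bursts are smoothed out rather than dropped. This also shows why shaping adds latency under heavy load: the larger the token debt, the longer each packet waits in the queue.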
Figure 9-3. Traffic shaping
Together, safeguards, rate limits, and throttles should provide
enough barriers to protect your services from abuse and
misuse.
In the next chapter, you’ll learn more about optimization
techniques that can help you reduce latency, increase response
quality and throughput, and reduce the costs of your
GenAI services.
Summary
This chapter provided a comprehensive summary of attack
vectors for GenAI services and how to safeguard them against
adversarial attempts, misuse, and abuse.
You learned to implement input and output guardrails
together with evaluation and content filtering mechanisms to
moderate service usage. You also developed API rate-limiting
and throttling protections to manage server load and prevent
abuse.
In the next chapter, you’ll learn about optimizing AI services
through various techniques such as caching, batch processing,
model quantization, prompt engineering, and model fine-tuning.
¹ Inspired by OpenAI Cookbook’s “How to Implement LLM Guardrails”.
Chapter 10. Optimizing AI Services
CHAPTER GOALS
In this chapter, you will learn about:
Optimization techniques such as keyword, semantic, and context caching
Advanced prompt engineering techniques to maximize model generation quality and coherence
Model quantization and the difference that quantized models make in model serving
Using batch processing APIs for larger AI workloads
The benefits, drawbacks, and use cases of model fine-tuning
In this chapter, you’ll learn to further optimize your services via
prompt engineering, model quantization, and caching
mechanisms.
Optimization Techniques
The objective of optimizing an AI service is to improve either
output quality or performance (latency, throughput, costs, etc.).
Performance-related optimizations include the following:
Using batch processing APIs
Caching (keyword, semantic, context, or prompt)
Model quantization
Quality-related optimizations include the following:
Using structured outputs
Prompt engineering
Model fine-tuning
Let’s review each in more detail.
Batch Processing
Often you want an LLM to process batches of entries at the
same time. The most obvious solution is to submit one API
call per entry. However, this approach can be costly and slow,
and may lead to your model provider rate limiting
you.
In such cases, you can leverage two separate techniques for
batch processing your data through an LLM:
Updating your structured output schemas to return multiple examples at the same time
Identifying and using model provider APIs that are designed for batch processing
The first solution requires you to update your Pydantic models
or template prompts to request a list of outputs per request. In
this case, you can batch process your data within a handful of
requests instead of one per entry.
An implementation of the first solution is shown in
Example 10-1.
Example 10-1. Updating structured output schema for
parsing multiple items
from pydantic import BaseModel


class BatchDocumentClassification(BaseModel):
    class Category(BaseModel):
        document_id: str
        category: list[str]

    categories: list[Category]
Update the Pydantic model to include a list of Category
models.
You can now pass the new schema alongside a list of document
titles to the OpenAI client to process multiple entries in a single
API call. However, an alternative, and possibly the better solution,
is to use a batch API, if one is available.
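As a rough sketch of how the schema-based approach reduces the number of calls, the snippet below groups documents into chunks and renders one prompt per chunk. The chunked and build_batch_prompt helpers, the sample documents, and the chunk size of 10 are all assumptions made for illustration. The actual API call is omitted.

```python
import json
from typing import Iterator


def chunked(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_prompt(documents: list[dict]) -> str:
    """Render one prompt that asks for a category list per document."""
    payload = json.dumps(documents, ensure_ascii=False)
    return (
        "Classify every document below. Return one entry per document_id.\n"
        f"Documents: {payload}"
    )


# Hypothetical input: document IDs and titles to classify.
documents = [{"document_id": f"doc-{i}", "title": f"Title {i}"} for i in range(25)]

# 25 documents at 10 per request means 3 API calls instead of 25.
prompts = [build_batch_prompt(batch) for batch in chunked(documents, size=10)]
```

Each prompt would then be sent with the BatchDocumentClassification schema as the structured output format. Keeping chunks small is a judgment call: larger chunks mean fewer requests, but a single malformed response then affects more documents.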
Luckily, model providers such as OpenAI already supply
relevant APIs for such use cases. Under the hood, these
providers may run task queues to process each batch job
in the background while providing you with status updates
until the batch is complete and the results are ready to retrieve.
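The status-update workflow described above can be sketched as a simple polling loop. The wait_for_batch function, its parameters, and the status names are assumptions modeled on a typical batch API. The get_status callable is a placeholder for whatever call your provider's client exposes to fetch the job status.

```python
import time
from typing import Callable

# Assumed terminal states; check your provider's documentation for the real ones.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    get_status: Callable[[], str],
    poll_interval: float = 60.0,
    max_polls: int = 1440,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a batch job until it reaches a terminal state and return that state."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # The job is still queued or running, so wait before asking again.
        sleep(poll_interval)
    raise TimeoutError("Batch job did not finish within the polling budget")
```

Once the function returns "completed", you would download and parse the results. Any other terminal state should be treated as a failure to investigate or retry.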
Compared to using standard endpoints directly, you’ll be able to
send asynchronous groups of requests at lower cost (up to
50% with OpenAI¹), enjoy higher rate limits, and get guaranteed
completion times. The batch job service is ideal for processing
jobs that don’t require immediate responses, such as using
OpenAI LLMs to parse, classify, or translate large volumes of
documents in the background.
To submit a batch job, you’ll need a JSONL file where each line
contains the details of an individual request to the API, as
shown in Example 10-2. As the example also shows, you can create
the JSONL file dynamically by iterating over your entries and
writing one request per line.