0% found this document useful (0 votes)
5 views10 pages

Building Generative AI Services With FastAPI51

The document discusses techniques for throttling and traffic shaping to manage data transmission rates for GenAI services, emphasizing the importance of safeguarding against misuse. It outlines optimization strategies for AI services, including batch processing, caching, and prompt engineering to enhance performance and output quality. The chapter concludes with a focus on implementing these techniques to improve efficiency and reduce costs in AI service operations.

Uploaded by

xiaowang198808
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views10 pages

Building Generative AI Services With FastAPI51

The document discusses techniques for throttling and traffic shaping to manage data transmission rates for GenAI services, emphasizing the importance of safeguarding against misuse. It outlines optimization strategies for AI services, including batch processing, caching, and prompt engineering to enhance performance and output quality. The chapter concludes with a focus on implementing these techniques to improve efficiency and reduce costs in AI service operations.

Uploaded by

xiaowang198808
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

self.

throttle_rate = throttle_rate

async def chat_stream(


self, prompt: str, mode: str = "sse"
) -> AsyncGenerator[str, None]:
stream = ... # OpenAI chat completion st
async for chunk in stream:
await [Link](self.throttle_rat
if [Link][0].[Link] is
yield (
f"data: {[Link][0].del
if mode == "sse"
else [Link][0].delta.c
)
await [Link](0.05)

if mode == "sse":
yield f"data: [DONE]\n\n"

Set a fixed throttling rate or dynamically adjust based on


usage.

Slow down the streaming rate without blocking the event


loop.

You can then use the throttled stream within an SSE or


WebSocket endpoint. Or, you can limit the number of active
WebSocket connections per your own custom policies.

Alongside the application-level throttling for real-time streams,


you can also leverage traffic shaping at the infrastructure layer.
TRAFFIC SHAPING

While rate-limiting approaches can help you manage incoming


requests, you can also use throttling techniques like traffic
shaping to control the rate of data transmission.

Traffic shaping prioritizes certain types of data transfer to help


you prevent congestion, smooth out traffic bursts, and maintain
a consistent data flow to optimize application bandwidth usage,
as shown in Figure 9-3. This is especially useful for GenAI
services requiring real-time data transmission including chat
and video streaming.

To implement traffic shaping, you can use the tc command-


line tool within Linux to configure control rules on targets such
as network interfaces and Docker containers. These control
rules set bandwidth limits, intentional latency delays, packet
loss, and IP limits on specific targets to regulate the application
throughput.

While the traffic shaping technique can help you manage the
network bandwidth, bear in mind that it involves complex
algorithms and requires constant monitoring and dynamic
control of the packet flow. It may also introduce delays due to
its queuing methods when the network is heavily loaded.
Figure 9-3. Traffic shaping

Using safeguards, rate limits, and throttles should provide


enough barriers in protecting your services from abuse and
misuse.

In the next section, you’ll learn more about optimization


techniques that can help you reduce latency, increase response
quality, and throughput alongside reducing the costs of your
GenAI services.
Summary
This chapter provided a comprehensive summary of attack
vectors for GenAI services and how to safeguard them against
adversarial attempts, misuse, and abuse.

You learned to implement input and output guardrails


alongside evaluation and content filtering mechanisms to
moderate service usage. Alongside guardrails, you also
developed API rate-limiting and throttling protections to
manage server load and prevent abuse.

In the next chapter, we will learn about optimizing AI services


through various techniques such as caching, batch processing,
model quantizing, prompt engineering, and model fine-tuning.

1
Inspired by OpenAI Cookbook’s “How to Implement LLM Guardrails”.
Chapter 10. Optimizing AI Services
CHAPTER GOALS

In this chapter, you will learn about:

Optimization techniques such as keyword, semantic, and


context caching
Advanced prompt engineering techniques to maximize
model generation quality and coherence
Model quantization and the difference that quantized models
make in model serving
Using batch processing APIs for larger AI workloads
The benefits, drawbacks, and use cases of model fine-tuning

In this chapter, you’ll learn to further optimize your services via


prompt engineering, model quantization, and caching
mechanisms.

Optimization Techniques
The objectives of optimizing an AI service are to either improve
output quality or performance (latency, throughput, costs, etc.).

Performance-related optimizations include the following:

Using batch processing APIs


Caching (keyword, semantic, context, or prompt)
Model quantization

Quality-related optimizations include the following:

Using structured outputs


Prompt engineering
Model fine-tuning

Let’s review each in more detail.

Batch Processing

Often you want an LLM to process batches of entries at the


same time. The most obvious solution is to submit multiple API
calls per entry. However, the obvious approach can be costly
and slow and may lead to your model provider rate limiting
you.

In such cases, you can leverage two separate techniques for


batch processing your data through an LLM:

Updating your structured output schemas to return multiple


examples at the same time
Identifying and using model provider APIs that are designed
for batch processing
The first solution requires you to update your Pydantic models
or template prompts to request a list of outputs per request. In
this case, you can batch process your data within a handful of
requests instead of one per entry.

An implementation of the first solution is shown in Example 10-


1.

Example 10-1. Updating structured output schema for


parsing multiple items

from pydantic import BaseModel

class BatchDocumentClassification(BaseModel):
class Category(BaseModel):
document_id: str
category: list[str]

categories: list[Category]

Update the Pydantic model to include a list of Category


models.

You can now pass the new schema alongside a list of document
titles to the OpenAI client to process multiple entries in a single
API call. However, an alternative and possibly the best solution
will be to use a batch API, if available.

Luckily, model providers such as OpenAI already supply


relevant APIs for such use cases. Under the hood, these
providers may run task queues to process any single batch job
in the background while providing you with status updates
until the batch is complete to retrieve the results.

Compared to using standard endpoints directly, you’ll be able to


send asynchronous groups of requests with lower costs (up to
1
50% with OpenAI ), enjoy higher rate limits, and guarantee
completion times. The batch job service is ideal for processing
jobs that don’t require immediate responses such as using
OpenAI LLMs to parse, classify, or translate large volumes of
documents in the background.

To submit a batch job, you’ll need a jsonl file where each line
contains the details of an individual request to the API, as
shown in Example 10-2. Also as seen in this example, to create
the JSONL file, you can iterate over your entries and
dynamically generate the file.

You might also like