Building Generative AI

The document provides examples of how to serve HTML files as static assets using FastAPI, and explains the importance of CORS in managing cross-origin requests. It also discusses implementing server-sent events (SSE) for real-time streaming of model outputs, including how to set up a Hugging Face inference server and handle streaming with both GET and POST requests. The document emphasizes the need for careful CORS configuration in production environments to ensure security while developing applications that require real-time data streaming.

Uploaded by

xiaowang198808

Example 6-5. Mounting HTML files on the server as static assets

# [Link]

from fastapi.staticfiles import StaticFiles

# Serve every file in the local "pages" directory under the /pages path
app.mount("/pages", StaticFiles(directory="pages"))

Mount the pages directory onto the /pages path to serve its content as static assets. Once mounted, you can access each file by visiting <origin>/pages/<filename> .

By implementing Example 6-5, you serve the HTML from the same origin as your API server. This avoids triggering the browser’s CORS security mechanism, which can block outgoing requests from reaching your server.

You can now access the HTML page by visiting [Link] .

Cross-origin resource sharing

If you try to open the Example 6-4 HTML file in your browser
directly and click the Start Streaming button, you will notice
that nothing happens. You can check the browser’s network tab
to view what happened to the outgoing requests.

After some investigation, you should notice that your browser has blocked outgoing requests to your server because its preflight cross-origin resource sharing (CORS) checks with your server have failed.

CORS is a security mechanism implemented in browsers to control how resources on a web page can be requested from another domain, and is relevant only when sending requests directly from the browser instead of a server. Browsers use CORS to check whether they’re allowed to send requests to the server from a different origin (i.e., a different scheme, domain, or port) than the server.

For example, if your client is hosted on [Link] and it needs to fetch data from an API hosted on [Link] , the browser will block this request unless the API server has CORS enabled.
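
To make the notion of an origin concrete, here is a minimal sketch, using only Python's standard library, of the comparison a browser effectively performs: two URLs share an origin only when their scheme, host, and port all match. The helper names and the example URLs are illustrative assumptions and are not part of the book's examples.

```python
from urllib.parse import urlsplit

# Default ports used when a URL doesn't spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> tuple:
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def is_same_origin(url_a: str, url_b: str) -> bool:
    """True when both URLs share scheme, host, and port."""
    return origin_of(url_a) == origin_of(url_b)


# Hypothetical URLs: a page served by the API server itself...
print(is_same_origin(
    "http://localhost:8000/pages/client.html",
    "http://localhost:8000/generate/text/stream",
))  # True
# ...versus a page served from another port, which is a different origin.
print(is_same_origin(
    "http://localhost:3000/index.html",
    "http://localhost:8000/generate/text/stream",
))  # False
```

Only the second case is cross-origin, so only there does the browser apply its CORS checks.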

For now, you can bypass these CORS errors by adding a CORS middleware on your server, as you can see in Example 6-6, to allow any incoming requests from browsers.

Example 6-6. Apply CORS settings

# [Link]

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Allow incoming requests from any origin, method ( GET , POST , etc.), and header.

Streamlit avoids triggering the CORS mechanism by sending requests from its internal server, even though the generated UI runs in the browser.

On the other hand, the FastAPI documentation page makes requests from the same origin as the server (i.e., [Link] ), so requests by default don’t trigger the CORS security mechanism.

WARNING

In Example 6-6, you configure the CORS middleware to process any incoming
requests, effectively bypassing the CORS security mechanism for easier development.
In production, you should allow only a handful of origins, methods, and headers to
be processed by your server.
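
As a rough illustration of what a tighter configuration might look like, the sketch below keeps the settings in a plain dictionary so it runs without FastAPI installed. The origins, methods, and headers are hypothetical placeholders and should be replaced with the values your own clients need; the keys match the keyword arguments used in Example 6-6.

```python
# Hypothetical frontends that are allowed to call the API from a browser.
ALLOWED_ORIGINS = [
    "https://app.example.com",
    "https://admin.example.com",
]

# Same keyword arguments as Example 6-6, but with explicit allowlists
# instead of the "*" wildcard.
CORS_SETTINGS = {
    "allow_origins": ALLOWED_ORIGINS,
    "allow_credentials": True,
    "allow_methods": ["GET", "POST"],
    "allow_headers": ["Content-Type", "Authorization"],
}


def is_origin_allowed(origin: str) -> bool:
    """Mirror the allowlist check so it can be unit tested in isolation."""
    return origin in CORS_SETTINGS["allow_origins"]


# In the FastAPI application you would then apply the settings with:
#   app.add_middleware(CORSMiddleware, **CORS_SETTINGS)
print(is_origin_allowed("https://app.example.com"))   # True
print(is_origin_allowed("https://evil.example.net"))  # False
```

Keeping the allowlist in one place makes it easy to load different values per environment, for example a permissive list in development and a strict one in production.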

If you followed Example 6-5 or 6-6, you should now be able to view the incoming stream from your SSE endpoint (see Figure 6-8).

Figure 6-8. Incoming stream from the SSE endpoint

Congratulations! You now have a full working solution where model responses are directly streamed to your client as soon as generated data becomes available. By implementing this feature, your users will now have a more pleasant experience interacting with your chatbot as they receive responses to their queries in real time.

Your solution also implemented concurrency by using an asynchronous client to interact with the Azure OpenAI API, which lets you stream responses to your users faster. You can try using a synchronous client to compare the differences in generation speeds. With an asynchronous client, the generation speed can be so fast that you will receive a block of text at once even though it is actually being streamed to the browser.
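
The benefit of an asynchronous client can be sketched without any model provider at all. In the example below, fake_token_stream is a hypothetical stand-in for a real model stream, and asyncio.sleep simulates waiting on the network; two streams are consumed concurrently on a single event loop, so neither blocks the other while it waits.

```python
import asyncio
from typing import AsyncGenerator, List


async def fake_token_stream(
    tokens: List[str], delay: float = 0.01
) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a model stream: yields one token at a time."""
    for token in tokens:
        await asyncio.sleep(delay)  # simulate waiting on the network
        yield token


async def collect(tokens: List[str]) -> str:
    """Consume a stream to completion and join the tokens."""
    return "".join([token async for token in fake_token_stream(tokens)])


async def serve_two_users() -> List[str]:
    # Both streams make progress while the other one is waiting on I/O.
    return await asyncio.gather(
        collect(["Hello", ", ", "Ada"]),
        collect(["Hi", ", ", "Grace"]),
    )


if __name__ == "__main__":
    print(asyncio.run(serve_two_users()))
```

With a synchronous client, the second stream would not start until the first one had finished.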

Streaming LLM outputs from Hugging Face models

Now that you’ve learned how to implement SSE endpoints with model providers such as Azure OpenAI, you may be wondering if you can stream model outputs from open source models you’ve previously downloaded from Hugging Face.

Although Hugging Face’s transformers library implements a TextStreamer component that you can pass to your model pipeline, the easiest solution is to run a separate inference server such as HF Inference Server to implement model streaming.

Example 6-7 shows how to set up a simple model inference server using Docker by providing a model-id .

Example 6-7. Serving HF LLM models via HF Inference Server

$ docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1

Use Docker to download and run the latest vllm/vllm-openai container on all available NVIDIA GPUs.

Share a volume with the Docker container to avoid downloading weights on every run.

Set the secret environment variable to access gated models like mistralai/Mistral-7B-v0.1 .

Run the inference server on localhost port 8080 by mapping host port 8080 to the exposed Docker container port 8000 .

Enable inter-process communication (IPC) between the container and the host to allow the container to access the host’s shared memory.

The vLLM inference server uses the OpenAI API Specification for LLM serving.

Download and use the gated mistralai/Mistral-7B-v0.1 from Hugging Face Hub.
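
Because the server follows the OpenAI API specification, a streamed completion arrives as a series of data: lines that each carry a JSON chunk, followed by a final data: [DONE] marker. The sketch below, which uses only the standard library, builds a request body and extracts the generated text from such lines. The sample lines are hand-written for illustration rather than captured from a running server, and the helper names are assumptions.

```python
import json
from typing import Iterable, Iterator


def build_completion_request(prompt: str) -> dict:
    """Body for a streaming request to an OpenAI-style /v1/completions endpoint."""
    return {
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": prompt,
        "stream": True,  # ask the server to send chunks as they are generated
    }


def iter_completion_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each streamed chunk until the [DONE] marker."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        yield chunk["choices"][0]["text"]


# Hand-written sample of what the stream might look like on the wire.
sample_lines = [
    'data: {"choices": [{"index": 0, "text": "Hello"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": " world"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_completion_text(sample_lines)))  # Hello world
```

With the port mapping from Example 6-7, such a request would be sent to port 8080 on localhost.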

With the model server running, you can now use an AsyncInferenceClient to generate outputs in a streaming format, as shown in Example 6-8.

Example 6-8. Consuming the LLM output stream from HF Inference Stream

import asyncio
from typing import AsyncGenerator
from huggingface_hub import AsyncInferenceClient

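client = AsyncInferenceClient("[Link]")


async def chat_stream(prompt: str) -> AsyncGenerator[str, None]:
    # stream=True makes text_generation return an async iterator of tokens
    stream = await client.text_generation(prompt, stream=True)
    async for token in stream:
        yield token
        await asyncio.sleep(0.05)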

While Example 6-8 shows how to use the Hugging Face inference server, you can still use other model-serving frameworks such as vLLM that support streaming model responses.

Before we move on to talking about WebSocket, let’s look at consuming another variant of SSE endpoints using the POST method.

SSE with POST Request

The EventSource specification expects GET endpoints on the server to correctly consume the incoming SSE stream. This makes implementing real-time applications with SSE straightforward as the EventSource interface can handle issues such as connection drops and automatic reconnection.
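
For reference, each event in an SSE stream is plain text: optional field lines such as event:, one data: line per line of payload, and a blank line that ends the event. The following sketch shows one way a server-side helper could frame a message this way; format_sse is a hypothetical name and is not part of FastAPI or the book's code.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame a payload as a single server-sent event."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload needs one "data:" field per line.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # The trailing blank line tells the client the event is complete.
    return "\n".join(lines) + "\n\n"


print(repr(format_sse("Hello")))
# 'data: Hello\n\n'
print(repr(format_sse("line one\nline two", event="token")))
# 'event: token\ndata: line one\ndata: line two\n\n'
```

A streaming generator can yield the result of such a helper for every token so that EventSource clients receive well-formed events.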

However, using HTTP GET requests comes with its own limitations. GET requests are normally less secure than other request methods and more vulnerable to XSS attacks; their parameters also travel in the URL, where they can end up in browser history and server logs. In addition, since these GET requests can’t carry a request body, you can only transfer data to the server as part of the URL’s query parameters. The issue is that browsers and servers limit URL length, and any query parameters must be encoded correctly into the request URL. Therefore, you can’t just append the whole conversation history to the URL as a parameter. With GET SSE endpoints, your server must maintain the conversation history and keep track of the conversational context.
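
The encoding requirement and the length problem are both easy to see with the standard library. In this sketch the base URL and the helper names are assumptions chosen for illustration; the point is that special characters must be percent-encoded and that the URL grows with every character of the prompt.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed address of a GET SSE endpoint, for illustration only.
BASE_URL = "http://localhost:8000/generate/text/stream"


def build_stream_url(prompt: str, base_url: str = BASE_URL) -> str:
    """Encode the prompt as a query parameter of the stream URL."""
    return f"{base_url}?{urlencode({'prompt': prompt})}"


def read_prompt(url: str) -> str:
    """Decode the prompt back out of the URL, as the server would."""
    return parse_qs(urlsplit(url).query)["prompt"][0]


url = build_stream_url("What is SSE & why use it?")
print(url)               # spaces, "&" and "?" are escaped in the query string
print(read_prompt(url))  # What is SSE & why use it?

# A long conversation history quickly produces a URL of several thousand
# characters, which some browsers and servers may reject.
long_url = build_stream_url("previous message " * 500)
print(len(long_url))
```

Sending the same data in a POST body avoids both concerns, which is the motivation for the workaround that follows.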

A common workaround to the aforementioned limitation is to implement a POST SSE endpoint even though the SSE specification doesn’t support it. As a result, the implementation will be more complex because the client can no longer rely on the EventSource interface.

First, let’s implement the POST endpoint on the server in Example 6-9.

Example 6-9. Implementing SSE endpoint on the server

# [Link]

from typing import Annotated

from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse
from stream import azure_chat_client

app = FastAPI()


@app.post("/generate/text/stream")
async def serve_text_to_text_stream_controller(
    prompt: Annotated[str, Body()]
) -> StreamingResponse:
    # text/event-stream is the media type used for SSE responses
    return StreamingResponse(
        azure_chat_client.chat_stream(prompt), media_type="text/event-stream"
    )

With the POST endpoint for streaming chat outputs implemented, you can now develop the client logic to process the SSE stream.

You will have to process the incoming stream yourself using the browser’s fetch web interface, as shown in Example 6-10.
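
The core of that manual processing is the same in any language: buffer the incoming chunks, split the buffer on the blank line that ends each event, and read the data: fields. The sketch below expresses the idea in Python rather than browser JavaScript so that it can be run and tested on its own. It assumes events are separated by "\n\n" and ignores fields other than data:, which is a simplification of the full SSE format.

```python
from typing import Iterable, Iterator


def _field_value(line: str) -> str:
    """Return the value of a 'data:' line, minus one optional leading space."""
    value = line[len("data:"):]
    return value[1:] if value.startswith(" ") else value


def iter_sse_data(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event found in the chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A chunk may hold part of an event, one event, or several events.
        while "\n\n" in buffer:
            raw_event, buffer = buffer.split("\n\n", 1)
            data_lines = [
                _field_value(line)
                for line in raw_event.split("\n")
                if line.startswith("data:")
            ]
            if data_lines:
                yield "\n".join(data_lines)


# Network chunks rarely line up with event boundaries.
chunks = ["data: Hel", "lo\n\nda", "ta: World\n\n"]
print(list(iter_sse_data(chunks)))  # ['Hello', 'World']
```

The buffering step matters because a read from the network can end in the middle of an event, in which case the partial text has to wait for the next chunk.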

Example 6-10. Implementing SSE on the client using the browser EventSource API

{# pages/[Link] #}

<!DOCTYPE html>
<html lang="en">
<head>
<title>SSE With Post Request</title>
</head>
<body>
<button id="streambtn">Start Streaming</button>
