Building Generative AI

The document provides examples of how to serve HTML files as static assets using FastAPI, and explains the importance of CORS in managing cross-origin requests. It also discusses implementing server-sent events (SSE) for real-time streaming of model outputs, including how to set up a Hugging Face inference server and handle streaming with both GET and POST requests. The document emphasizes the need for careful CORS configuration in production environments to ensure security while developing applications that require real-time data streaming.

Uploaded by

xiaowang198808
Copyright
© All Rights Reserved

Example 6-5. Mounting HTML files on the server as static assets

# [Link]

from fastapi.staticfiles import StaticFiles

app.mount("/pages", StaticFiles(directory="pages"))

Mount the pages directory onto the /pages path to serve its
content as static assets. Once mounted, you can access
each file by visiting <origin>/pages/<filename> .

By implementing Example 6-5, you serve the HTML from the
same origin as your API server. This avoids triggering the
browser’s CORS security mechanism, which can block outgoing
requests from reaching your server.

You can now access the HTML page by visiting [Link] .

Cross-origin resource sharing

If you try to open the Example 6-4 HTML file in your browser
directly and click the Start Streaming button, you will notice
that nothing happens. You can check the browser’s network tab
to view what happened to the outgoing requests.

After some investigation, you should notice that your browser
has blocked outgoing requests to your server because its preflight
cross-origin resource sharing (CORS) checks with your server
have failed.

CORS is a security mechanism implemented in browsers to
control how resources on a web page can be requested from
another origin, and is relevant only when sending requests
directly from the browser instead of a server. Browsers use
CORS to check whether they’re allowed to send requests to the
server from a different origin than the server. An origin is the
combination of scheme, host, and port, so two URLs on the same
domain but on different ports count as different origins.
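
To make the notion of an origin concrete, here is a minimal Python sketch of the comparison a browser effectively performs. It is an illustration under stated assumptions, not how any browser is implemented: the helper names `origin_of` and `is_same_origin` and the example URLs are hypothetical, and only the standard library is used.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports implied by each scheme when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def is_same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin only if scheme, host, and port all match."""
    return origin_of(url_a) == origin_of(url_b)
```

Under this sketch, a page served from port 8000 calling an API on port 8000 of the same host is same-origin, while the same page served from port 3000 is not, which is the situation where the browser's CORS checks apply.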

For example, if your client is hosted on [Link] and it needs
to fetch data from an API hosted on [Link] , the browser will
block this request unless the API server has CORS enabled.

For now, you can bypass these CORS errors by adding a CORS
middleware on your server, as you can see in Example 6-6, to
allow any incoming requests from browsers.
Example 6-6. Apply CORS settings

# [Link]

from fastapi.middleware.cors import CORSMiddleware

# Development-only settings: accept any origin, method, and header.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Allow incoming requests from any origin, method ( GET ,
POST , etc.), and header.

Streamlit avoids triggering the CORS mechanism by sending
requests from its internal server even though the generated UI
runs in the browser.

On the other hand, the FastAPI documentation page makes
requests from the same origin as the server (i.e.,
[Link] ), so requests by default don’t
trigger the CORS security mechanism.
WARNING

In Example 6-6, you configure the CORS middleware to process any incoming
requests, effectively bypassing the CORS security mechanism for easier development.
In production, you should allow only a handful of origins, methods, and headers to
be processed by your server.
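
As a rough illustration of what a restrictive policy does, the following standard-library sketch decides which CORS response headers to send for a given request origin. The allowlist values, the chosen methods and headers, and the function name `cors_headers` are all hypothetical; with FastAPI you would normally express the same policy by passing explicit lists to the `allow_origins`, `allow_methods`, and `allow_headers` arguments of `CORSMiddleware` rather than writing this logic yourself.

```python
from typing import Dict

# Hypothetical production allowlist; replace with the origins you actually serve.
ALLOWED_ORIGINS = {"https://app.example.com", "https://admin.example.com"}
ALLOWED_METHODS = ("GET", "POST")
ALLOWED_HEADERS = ("Content-Type",)


def cors_headers(origin: str) -> Dict[str, str]:
    """Return CORS response headers for an allowed origin, or {} otherwise."""
    if origin not in ALLOWED_ORIGINS:
        # No CORS headers: the browser will refuse to expose the response.
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": ", ".join(ALLOWED_METHODS),
        "Access-Control-Allow-Headers": ", ".join(ALLOWED_HEADERS),
        # Tell caches that the response depends on the request's Origin header.
        "Vary": "Origin",
    }
```

Echoing back only known origins, instead of answering every request with a wildcard, is what keeps pages on unknown sites from reading your API's responses in a user's browser.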

If you followed Example 6-5 or 6-6, you should now be able to
view the incoming stream from your SSE endpoint (see
Figure 6-8).

Figure 6-8. Incoming stream from the SSE endpoint

Congratulations! You now have a fully working solution where
model responses are streamed directly to your client as soon as
generated data becomes available. With this feature, your users
will have a more pleasant experience interacting with your
chatbot because they receive responses to their queries in real
time.

Your solution also implemented concurrency using an
asynchronous client for interacting with the Azure OpenAI API
to stream faster responses to your users. You can try using a
synchronous client to compare the differences in generation
speeds. With an asynchronous client, the generation speed can
be so fast that you will receive a block of text at once even
though it is actually being streamed to the browser.

Streaming LLM outputs from Hugging Face models

Now that you’ve learned how to implement SSE endpoints with
model providers such as Azure OpenAI, you may be wondering
if you can stream model outputs from open source models
you’ve previously downloaded from Hugging Face.

Although Hugging Face’s transformers library implements a
TextStreamer component that you can pass to your model
pipeline, the easiest solution is to run a separate inference
server such as HF Inference Server to implement model
streaming.

Example 6-7 shows how to set up a simple model inference
server using Docker by providing a model-id .

Example 6-7. Serving HF LLM models via HF Inference Server

$ docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1

Use Docker to download and run the latest vllm/vllm-openai
container on all available NVIDIA GPUs.

Share a volume with the Docker container to avoid
downloading weights on every run.

Set the secret environment variable to access gated
models like mistralai/Mistral-7B-v0.1 .

Run the inference server on localhost port 8080 by
mapping host port 8080 to exposed Docker container
port 8000 .

Enable inter-process communication (IPC) between the
container and the host to allow the container to access the
host’s shared memory.

The vLLM inference server uses the OpenAI API
Specification for LLM serving.

Download and use the gated mistralai/Mistral-7B-v0.1
from Hugging Face Hub.

With the model server running, you can now use an
AsyncInferenceClient to generate outputs in a streaming
format, as shown in Example 6-8.

Example 6-8. Consuming the LLM output stream from HF Inference Stream

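import asyncio
from typing import AsyncGenerator
from huggingface_hub import AsyncInferenceClient

# The server URL is elided in the source; point this at your inference server.
client = AsyncInferenceClient("[Link]")

async def chat_stream(prompt: str) -> AsyncGenerator[str, None]:
    # stream=True makes the client return an async iterator of tokens.
    stream = await client.text_generation(prompt, stream=True)
    async for token in stream:
        yield token
        # Brief pause between tokens so the output arrives at a readable pace.
        await asyncio.sleep(0.05)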

While Example 6-8 shows how to use the Hugging Face
inference server, you can still use other model-serving
frameworks such as vLLM that support streaming model
responses.
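
Whichever serving framework you choose, the consuming side follows the same pattern: iterate over an async generator and handle each token as it arrives. The sketch below shows that pattern without a model server. `fake_chat_stream` is a hypothetical stand-in that simply yields the words of the prompt, so nothing here reflects a real client's behavior.

```python
import asyncio
from typing import AsyncGenerator, List


async def fake_chat_stream(prompt: str) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a model stream: yields one word at a time."""
    for token in prompt.split():
        yield token
        # Hand control back to the event loop, as a real network wait would.
        await asyncio.sleep(0)


async def collect(prompt: str) -> List[str]:
    """Consume the stream token by token, as an endpoint or client would."""
    return [token async for token in fake_chat_stream(prompt)]


tokens = asyncio.run(collect("streamed one token at a time"))
```

In a real endpoint you would forward each token to the client as it is produced instead of collecting them into a list.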

Before we move on to talking about WebSocket, let’s look at
consuming another variant of SSE endpoints using the POST
method.

SSE with POST Request

The browser’s EventSource interface only issues GET requests,
so it expects GET endpoints on the server to correctly consume
the incoming SSE stream. This makes implementing real-time
applications with SSE straightforward, as the EventSource
interface handles issues such as connection drops and
automatic reconnection for you.

However, using HTTP GET requests comes with its own
limitations. GET requests are normally less secure than other
request methods and more vulnerable to XSS attacks; their
parameters also travel in the URL, where they can end up in
browser history and server logs. In addition, since browsers
don’t send a request body with GET requests, you can only
transfer data to the server as part of the URL’s query
parameters. URLs have practical length limits, and every query
parameter must be encoded correctly into the request URL, so
you can’t just append the whole conversation history to the URL
as a parameter. With GET SSE endpoints, your server must
therefore maintain the history of the conversation and keep
track of conversational context.
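
The encoding and length constraints can be sketched in a few lines of Python. This is an illustration only: the base URL, the `prompt` parameter name, and the `MAX_URL_LENGTH` budget are assumptions made for the example (actual limits vary by browser, proxy, and server), and `build_stream_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed conservative budget for illustration; not a value from any standard.
MAX_URL_LENGTH = 2000


def build_stream_url(base_url: str, prompt: str) -> str:
    """Encode the prompt as a query parameter and reject over-long URLs."""
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    url = f"{base_url}?{urlencode({'prompt': prompt})}"
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(
            f"URL is {len(url)} characters; send long inputs in a request body"
        )
    return url
```

A short prompt fits comfortably, but a multi-turn conversation history quickly exceeds the budget, which is why a GET endpoint pushes the job of tracking history onto the server.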

A common workaround to the aforementioned limitation is to
implement a POST SSE endpoint, even though the EventSource
interface doesn’t support it. As a result, the client
implementation will be more complex, because you have to read
and parse the stream yourself.
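
Much of that extra complexity is parsing: network chunks do not line up with event boundaries, so the client has to buffer text and split it into events itself. The following Python sketch shows the idea under simplifying assumptions: it handles only `data:` fields and `\n` line endings, ignores the `event:`, `id:`, and `retry:` fields, and the function name `iter_sse_data` is hypothetical. A browser client would apply the same logic in JavaScript to the chunks it reads from the response body.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event found in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # An event ends at a blank line; a chunk may hold zero, one, or many events.
        while "\n\n" in buffer:
            raw_event, buffer = buffer.split("\n\n", 1)
            data_lines: List[str] = []
            for line in raw_event.split("\n"):
                if line.startswith("data:"):
                    value = line[len("data:"):]
                    # A single space after the colon is not part of the payload.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                # Multiple data lines in one event are joined with newlines.
                yield "\n".join(data_lines)
```

Note that an event split across two chunks is only emitted once its terminating blank line arrives, which is the behavior EventSource would otherwise give you for free.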

First, let’s implement the POST endpoint on the server in
Example 6-9.

Example 6-9. Implementing SSE endpoint on the server

# [Link]

from typing import Annotated

from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse
from stream import azure_chat_client

@app.post("/generate/text/stream")
async def serve_text_to_text_stream_controller(
    prompt: Annotated[str, Body()]
) -> StreamingResponse:
    # Stream the generator's output to the client as server-sent events.
    return StreamingResponse(
        azure_chat_client.chat_stream(prompt), media_type="text/event-stream"
    )
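
An SSE response body is a sequence of frames, each made of `data:` lines followed by a blank line. The source does not show whether `chat_stream` already emits tokens in that shape, so the sketch below is an assumption about how raw tokens could be wrapped before they are streamed. `format_sse`, `sse_frames`, and `demo_tokens` are hypothetical names, and `demo_tokens` stands in for a real model client.

```python
import asyncio
from typing import AsyncGenerator, AsyncIterator, List


def format_sse(data: str) -> str:
    """Wrap a payload in SSE framing: one data line per payload line, then a blank line."""
    return "".join(f"data: {line}\n" for line in data.split("\n")) + "\n"


async def sse_frames(tokens: AsyncIterator[str]) -> AsyncGenerator[str, None]:
    """Turn a stream of raw tokens into a stream of SSE frames."""
    async for token in tokens:
        yield format_sse(token)


async def demo_tokens() -> AsyncGenerator[str, None]:
    """Hypothetical token source standing in for a model client."""
    for token in ["Hi", "there"]:
        yield token


async def collect_frames() -> List[str]:
    return [frame async for frame in sse_frames(demo_tokens())]


frames = asyncio.run(collect_frames())
```

Splitting the payload on newlines matters because a bare newline inside a token would otherwise end the frame early on the client.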

With the POST endpoint for streaming chat outputs
implemented, you can now develop the client logic to process
the SSE stream.

You will have to process the incoming stream manually
using the browser’s fetch web interface, as shown in
Example 6-10.

Example 6-10. Implementing SSE on the client using the browser fetch API

{# pages/[Link] #}

<!DOCTYPE html>
<html lang="en">
<head>
<title>SSE With Post Request</title>
</head>
<body>
<button id="streambtn">Start Streaming</button>
