Example 6-5.
Mounting HTML files on the server as static
assets
# [Link]
from [Link] import StaticFiles
[Link]("/pages", StaticFiles(directory="pages
Mount the pages directory onto the /pages to serve its
content as static assets. Once mounted, you can access
each file by visiting <origin>/pages/<filename> .
By implementing Example 6-5, you serve the HTML from the
same origin as your API server. This avoids triggering the
browser’s CORS security mechanism, which can block outgoing
requests reaching your server.
You can now access the HTML page by visiting
[Link] .
Cross-origin resource sharing
If you try to open the Example 6-4 HTML file in your browser
directly and click the Start Streaming button, you will notice
that nothing happens. You can check the browser’s network tab
to view what happened to the outgoing requests.
After some investigations, you should notice that your browser
has blocked outgoing requests to your server as its preflight
cross-origin resource sharing (CORS) checks with your server
have failed.
CORS is a security mechanism implemented in browsers to
control how resources on a web page can be requested from
another domain, and is relevant only when sending requests
directly from the browser instead of a server. Browsers use
CORS to check whether they’re allowed to send requests to the
server from a different origin (i.e., domain) than the server.
For example, if your client is hosted on
[Link] and it needs to fetch data from an API
hosted on [Link] , the browser will block
this request unless the API server has CORS enabled.
For now, you can bypass these CORS errors by adding a CORS
middleware on your server, as you can see in Example 6-6, to
allow any incoming requests from browsers.
Example 6-6. Apply CORS settings
# [Link]
from [Link] import CORSMiddlewa
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
Allow incoming requests from any origins, methods ( GET ,
POST , etc.) and headers.
Streamlit avoids triggering the CORS mechanism by sending
requests on its internal server even though the generated UI
runs on the browser.
On the other hand, the FastAPI documentation page makes
requests from the same origin as the server (i.e.,
[Link] ), so requests by default don’t
trigger the CORS security mechanism.
WARNING
In Example 6-6, you configure the CORS middleware to process any incoming
requests, effectively bypassing the CORS security mechanism for easier development.
In production, you should allow only a handful of origins, methods, and headers to
be processed by your server.
If you followed Example 6-5 or 6-6, you should now be able to
view the incoming stream from your SSE endpoint (see
Figure 6-8).
Figure 6-8. Incoming stream from the SSE endpoint
Congratulations! You now have a full working solution where
model responses are directly streamed to your client as soon as
generated data becomes available. By implementing this
feature, your users will now have a more pleasant experience
interacting with your chatbot as they receive responses to their
queries in real time.
Your solution also implemented concurrency using an
asynchronous client for interacting with the Azure OpenAI API
to stream faster responses to your users. You can try using a
synchronous client to compare the differences in generation
speeds. With an asynchronous client, the generation speed can
be so fast that you will receive a block of text at once even
though it is actually being streamed to the browser.
Streaming LLM outputs from Hugging Face models
Now that you’ve learned how to implement SSE endpoints with
model providers such as Azure OpenAI, you may be wondering
if you can stream model outputs from open source models
you’ve previously downloaded from Hugging Face.
Although Hugging Face’s transformers library implements a
TextStreamer component that you can pass to your model
pipeline, the easiest solution is to run a separate inference
server such as HF Inference Server to implement model
streaming.
Example 6-7 shows how to set up a simple model inference
server using Docker by providing a model-id .
Example 6-7. Serving HF LLM models via HF Inference
Server
$ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingf
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8080:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
Use Docker to download and run the latest vllm/vllm-
openai container on all available NVIDIA GPUs.
Share a volume with the Docker container to avoid
downloading weights every run.
Set the secret environment variable to access gated
4
models like mistralai/Mistral-7B-v0.1 .
Run the inference server on localhost port 8080 by
mapping host port 8080 to exposed Docker container
port 8000 .
Enable inter-process communication (IPC) between the
container and the host to allow the container to access the
host’s shared memory.
The vLLM inference server uses the OpenAI API
Specification for LLM serving.
Download and use the gated mistralai/Mistral-7B-
v0.1 from Hugging Face Hub.
With the model server running, you can now use an
AsyncInferenceClient to generate outputs in a streaming
format, as shown in Example 6-8.
Example 6-8. Consuming the LLM output stream from HF
Inference Stream
import asyncio
from typing import AsyncGenerator
from huggingface_hub import AsyncInferenceClient
client = AsyncInferenceClient("[Link]
async def chat_stream(prompt: str) -> AsyncGenera
stream = await client.text_generation(prompt
async for token in stream:
yield token
await [Link](0.05)
While Example 6-8 shows how to use the Hugging Face
inference server, you can still use other model-serving
frameworks such as vLLM that support streaming model
responses.
Before we move on to talking about WebSocket, let’s look at
consuming another variant of SSE endpoints using the POST
method.
SSE with POST Request
The EventSource specification expects GET endpoints on the
server to correctly consume the incoming SSE stream. This
makes implementing real-time applications with SSE
straightforward as the EventSource interface can handle
issues such as connection drops and automatic reconnection.
However, using HTTP GET requests comes with its own
limitations. GET requests are normally less secure than other
5
request methods and more vulnerable to XSS attacks. In
addition, since GET requests can’t have any request body, you
can only transfer data as part of the URL’s query parameters to
the server. The issue is that there is a URL length limit you need
to consider and any query parameters must be encoded
correctly into the request URL. Therefore, you can’t just append
the whole conversation history to the URL as a parameter. Your
server must handle maintaining the history of the conversation
and keeping track of conversational context with GET SSE
endpoints.
A common workaround to the aforementioned limitation is to
implement a POST SSE endpoint even if the SSE specification
doesn’t support it. As a result, the implementation will be more
complex.
First let’s implement the POST endpoint on the server in
Example 6-9.
Example 6-9. Implementing SSE endpoint on the server
# [Link]
from typing import Annotated
from fastapi import Body, FastAPI
from [Link] import StreamingResponse
from stream import azure_chat_client
@[Link]("/generate/text/stream")
async def serve_text_to_text_stream_controller(
prompt: Annotated[str, Body()]
) -> StreamingResponse:
return StreamingResponse(
azure_chat_client.chat_stream(prompt), me
)
With the POST endpoint for streaming chat outputs
implemented, you can now develop the client logic to process
the SSE stream.
You will have to manually process the incoming streaming
yourself using the browser’s fetch web interface, as shown in
Example 6-10.
Example 6-10. Implementing SSE on the client using the
browser EventSource API
{# pages/[Link] #}
<!DOCTYPE html>
<html lang="en">
<head>
<title>SSE With Post Request</title>
</head>
<body>
<button id="streambtn">Start Streaming</button>