Vertical vs. horizontal
When to add GPUs per replica (vertical)
More GPUs per replica increase generation speed, lower time to first token, and raise the maximum requests per second a single replica can handle. Reach for vertical scaling when:- Compute-bound: Your workload is bottlenecked by GPU compute. Adding GPUs to one replica directly speeds up each request.
- Memory-intensive: The model or context window is large enough that one GPU can’t hold it. Adding GPUs to a replica gives you the memory headroom.
- Single-node parallelism works: Your workload benefits from data parallelism or model parallelism within a single node.
- Low-latency requirements: Each request needs to complete quickly. Multiple GPUs in one replica process the request faster than one GPU could.
endpoints create. List options for your model with together endpoints hardware --model <model_id>, then create the endpoint with the SKU you want:
Shell
When to add replicas (horizontal)
More replicas raise the maximum requests per second the endpoint can serve in aggregate. Reach for horizontal scaling when:- Concurrent requests: Your application receives a high volume of simultaneous requests. Replicas spread that load.
- I/O-bound workloads: Requests spend significant time waiting on data load or write. Replicas let you do more of that waiting in parallel.
- Fault tolerance: A second replica means a single hardware failure doesn’t take your endpoint offline.
- Multi-node parallelism works: Your workload scales well across nodes (data parallelism, distributed inference).
--min-replicas and --max-replicas at create time, or update them later on a running endpoint:
Shell