Get asynchronous, high-throughput, and cost-effective inference for your large-scale data processing needs with Gemini's batch inference (formerly known as batch prediction). This guide walks you through the value of batch inference, how it works, its limitations, and best practices for optimal results.
In many real-world scenarios, you don't need an immediate response from a language model. Instead, you might have a large dataset of prompts that you need to process efficiently and affordably. This is where batch inference shines.
Key benefits include:
Batch inference is optimized for large-scale processing tasks like:
The following base and tuned Gemini models support batch inference:
Batch inference supports using the global endpoint for base Gemini models. It doesn't support the global endpoint for tuned Gemini models.
The global endpoint helps improve overall availability by serving your requests from any region where the model you're using is supported. It doesn't satisfy data residency requirements. If you have data residency requirements, use a regional endpoint instead.
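The endpoint rules above can be captured in a small helper. The following is a minimal sketch, not part of any Google SDK: the `BatchJobPlan` class, the `choose_location` function, and the `"global"` location string are hypothetical names chosen for illustration, and the region names are only examples.

```python
from dataclasses import dataclass

# Assumed identifier for the global endpoint (illustrative, not taken from the source).
GLOBAL_LOCATION = "global"


@dataclass(frozen=True)
class BatchJobPlan:
    """Hypothetical description of a batch job, used only to pick an endpoint."""
    model_is_tuned: bool        # True for tuned Gemini models, False for base models
    needs_data_residency: bool  # True if data must stay in a specific region
    region: str                 # regional endpoint to fall back to, e.g. "us-central1"


def choose_location(plan: BatchJobPlan) -> str:
    """Return the endpoint location to use for a batch inference job.

    Tuned models can't use the global endpoint, and the global endpoint
    doesn't meet data residency requirements, so both cases use the region.
    """
    if plan.model_is_tuned or plan.needs_data_residency:
        return plan.region
    return GLOBAL_LOCATION


if __name__ == "__main__":
    base_job = BatchJobPlan(model_is_tuned=False, needs_data_residency=False,
                            region="us-central1")
    tuned_job = BatchJobPlan(model_is_tuned=True, needs_data_residency=False,
                             region="us-central1")
    print(choose_location(base_job))   # global
    print(choose_location(tuned_job))  # us-central1
```

Keeping this decision in one place makes it harder to accidentally send a tuned-model job or a residency-constrained job to the global endpoint.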
While batch inference is powerful, it's important to be aware of the following limitations.
To get the most out of batch inference with Gemini, we recommend the following best practices:
Last updated 2026-06-10 UTC.