Batch inference with Gemini

Gemini's batch inference (formerly known as batch prediction) gives you asynchronous, high-throughput, and cost-effective processing for large-scale datasets. This guide walks you through the value of batch inference, how it works, its limitations, and best practices for optimal results.

Why use batch inference?

In many real-world scenarios, you don't need an immediate response from a language model. Instead, you might have a large dataset of prompts that you need to process efficiently and affordably. This is where batch inference shines.

Key benefits include:

Batch inference is optimized for large-scale processing tasks like:

Gemini models that support batch inference

The following base and tuned Gemini models support batch inference:


Global endpoint model support

Batch inference supports using the global endpoint for base Gemini models. It doesn't support the global endpoint for tuned Gemini models.

The global endpoint helps improve overall availability by serving your requests from any region that's supported by the model that you're using. Because requests can be served from any supported region, the global endpoint can't satisfy data residency requirements. If you have data residency requirements, use a regional endpoint instead.
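The endpoint rules above can be summarized as a small decision helper. The following is a minimal sketch, not part of any Google SDK: the `BatchJobConfig` class, the `choose_location` function, and the default region `us-central1` are hypothetical names chosen for illustration, and the sketch assumes the global endpoint is selected by passing `"global"` as the location.

```python
from dataclasses import dataclass

# Assumption: the global endpoint is addressed with the location string "global".
GLOBAL_LOCATION = "global"


@dataclass(frozen=True)
class BatchJobConfig:
    """Hypothetical description of a planned batch inference job."""

    is_tuned_model: bool
    requires_data_residency: bool
    # Hypothetical fallback region, used when the global endpoint is not an option.
    preferred_region: str = "us-central1"


def choose_location(config: BatchJobConfig) -> str:
    """Return the location to use for a batch job.

    Tuned models cannot use the global endpoint, and the global endpoint
    does not meet data residency requirements, so both cases fall back
    to the regional endpoint.
    """
    if config.is_tuned_model or config.requires_data_residency:
        return config.preferred_region
    return GLOBAL_LOCATION


if __name__ == "__main__":
    base_job = BatchJobConfig(is_tuned_model=False, requires_data_residency=False)
    tuned_job = BatchJobConfig(is_tuned_model=True, requires_data_residency=False)
    print(choose_location(base_job))
    print(choose_location(tuned_job))
```

Keeping this check in one place means a job that later gains a data residency requirement, or switches to a tuned model, is routed to a regional endpoint without changes elsewhere in the pipeline.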

Quotas and limits

While batch inference is powerful, it's important to be aware of the following limitations.

Best practices

To get the most out of batch inference with Gemini, we recommend the following best practices:

What's next

  • Create a batch job with Cloud Storage
  • Create a batch job with BigQuery