This page documents Dataflow pipeline options. For information about how to use these options, see Setting pipeline options.
Basic options
This table describes basic pipeline options that are used by many jobs.

Java

| Field | Type | Description |
|---|---|---|
| dataflowServiceOptions | String | Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options. |
| enableStreamingEngine | boolean | Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources. The default value is false. When set to the default value, the steps of your streaming pipeline are run entirely on worker VMs. Supported in Flex Templates. |
| experiments | String | Enables experimental or pre-GA Dataflow features, using the following syntax: |
| gcpTempLocation | String | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with If not set, the value of Supported in Flex Templates. |
| jobName | String | The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. Also used when updating an existing pipeline. If not set, Dataflow generates a unique name automatically. |
| labels | String | User-defined labels, also known as Supported in Flex Templates. |
| project | String | The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service. If not set, defaults to the project that is configured in the gcloud CLI. |
| region | String | Specifies a region for deploying your Dataflow jobs. If not set, defaults to |
| runner | Class (NameOfRunner) | The The default value is |
| stagingLocation | String | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with If not set, defaults to the value of |
| tempLocation | String | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with If you set both Supported in Flex Templates. |


Python

| Field | Type | Description |
|---|---|---|
| dataflow_service_options | str | Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options. |
| experiments | str | Enables experimental or pre-GA Dataflow features, using the following syntax: `--experiments=experiment`. When setting multiple experiments programmatically, pass a comma-separated list. |
| enable_streaming_engine | bool | Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources. The default value depends on your pipeline configuration. For more information, see Use Streaming Engine. When set to false, the steps of your streaming pipeline are run entirely on worker VMs. Supported in Flex Templates. |
| job_name | str | The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. If not set, Dataflow generates a unique name automatically. |
| labels | str | User-defined labels, also known as additional-user-labels. User-specified labels are available in billing exports, which you can use for cost attribution. For each label, specify a "key=value" pair. Keys must conform to the regular expression: `[\p{Ll}\p{Lo}][\p{Ll}\p{Lo}\p{N}_-]{0,62}`. Values must conform to the regular expression: `[\p{Ll}\p{Lo}\p{N}_-]{0,63}`. For example, to define two user labels: `--labels "name=wrench" --labels "mass=1_3kg"`. Supported in Flex Templates. |
| no_wait_until_finish | bool | By default, the "with" statement waits for the job to complete. Set this flag to bypass this behavior and continue execution immediately. |
| pickle_library | str | The pickle library to use for data serialization. Supported values are dill, cloudpickle, and default. To use the cloudpickle option, set the option both at the start of the code and as a pipeline option. You must set the option in both places because pickling starts when PTransforms are constructed, which happens before pipeline construction. To include at the start of the code, add lines similar to the following: `from apache_beam.internal import pickler` followed by `pickler.set_library(pickler.USE_CLOUDPICKLE)`. If not set, defaults to dill. |
| project | str | The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service. If not set, throws an error. |
| region | str | Specifies a region for deploying your Dataflow jobs. If not set, defaults to us-central1. |
| runner | str | The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner. The default value is DirectRunner (local mode). |
| sdk_location | str | Path to the Apache Beam SDK. Must be a valid URL, Cloud Storage path, or local path to an Apache Beam SDK tar archive file. To install the Apache Beam SDK from within a container, use the value container. If not set, defaults to the current version of the Apache Beam SDK. Supported in Flex Templates. |
| staging_location | str | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. If not set, defaults to a staging directory within temp_location. You must specify at least one of temp_location or staging_location to run your pipeline on Google Cloud. |
| temp_location | str | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. In the temp_location filename, the at sign (@) can't be followed by a number or by an asterisk (*). You must specify either temp_location or staging_location (or both). If temp_location is not set, it defaults to the value for staging_location. Supported in Flex Templates. |

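
The label and experiment rules for the Python options can be checked before a job is submitted. The following is a minimal sketch, not part of the Apache Beam SDK: the helper names are hypothetical, the experiment names in the usage note are placeholders, and the Unicode classes from the documented regular expressions are approximated with the standard library's `unicodedata` categories, because Python's built-in `re` module doesn't support `\p{...}` classes.

```python
import unicodedata


def _is_label_char(ch):
    # Approximates [\p{Ll}\p{Lo}\p{N}_-] with Unicode general categories.
    category = unicodedata.category(ch)
    return category in ("Ll", "Lo") or category.startswith("N") or ch in "_-"


def is_valid_label_key(key):
    # Keys: one lowercase or "other" letter, then up to 62 more label characters.
    if not 1 <= len(key) <= 63:
        return False
    if unicodedata.category(key[0]) not in ("Ll", "Lo"):
        return False
    return all(_is_label_char(ch) for ch in key[1:])


def is_valid_label_value(value):
    # Values: zero to 63 label characters, so an empty value is allowed.
    return len(value) <= 63 and all(_is_label_char(ch) for ch in value)


def label_flags(labels):
    # Emits one --labels flag per "key=value" pair, as in the documented example.
    flags = []
    for key, value in labels.items():
        if not is_valid_label_key(key) or not is_valid_label_value(value):
            raise ValueError(f"invalid label: {key}={value}")
        flags += ["--labels", f"{key}={value}"]
    return flags


def experiments_flag(experiments):
    # Multiple experiments are passed as one comma-separated list.
    return "--experiments=" + ",".join(experiments)
```

For example, `label_flags({"name": "wrench", "mass": "1_3kg"})` produces the two `--labels` arguments shown in the table, and `experiments_flag(["experiment_a", "experiment_b"])` joins two placeholder experiment names into a single flag.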

Go

| Field | Type | Description |
|---|---|---|
| dataflow_service_options | str | Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.40.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options. |
| experiments | str | Enables experimental or pre-GA Dataflow features, using the following syntax: `--experiments=experiment`. When setting multiple experiments programmatically, pass a comma-separated list. |
| job_name | str | The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. If not set, Dataflow generates a unique name automatically. |
| project | str | The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service. If not set, returns an error. |
| region | str | Specifies a region for deploying your Dataflow jobs. If not set, returns an error. |
| runner | str | The The default value is |
| staging_location | str | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with If not set, returns an error. |
| temp_location | str | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with If |

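
The Python `temp_location` and `staging_location` rules described in this section can be sketched as a small resolver. This is an illustration only, not how the Apache Beam SDK implements the defaults. The function name is hypothetical, and the `staging` subdirectory name is an assumption, because the option description only says "a staging directory within temp_location".

```python
import re

# In temp_location, "@" can't be followed by a number or by an asterisk.
_BAD_AT_SIGN = re.compile(r"@[0-9*]")


def resolve_locations(temp_location=None, staging_location=None):
    """Return (temp_location, staging_location) after applying the defaults."""
    if not temp_location and not staging_location:
        raise ValueError("specify temp_location, staging_location, or both")
    if not temp_location:
        # temp_location falls back to the value for staging_location.
        temp_location = staging_location
    if not staging_location:
        # Assumption: "staging" is a hypothetical directory name.
        staging_location = temp_location.rstrip("/") + "/staging"
    for path in (temp_location, staging_location):
        if not path.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URL: {path}")
    if _BAD_AT_SIGN.search(temp_location):
        raise ValueError("'@' can't be followed by a number or '*'")
    return temp_location, staging_location
```

Checking the paths up front surfaces a missing or malformed location before the job is submitted rather than at launch time.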
This table describes pipeline options that you can set to manage resource utilization.

Java

| Field | Type | Description |
|---|---|---|
| autoscalingAlgorithm | String | The autoscaling mode for your Dataflow job. Possible values are Defaults to |
| flexRSGoal | String | Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the If unspecified, defaults to |
| maxNumWorkers | int | The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates. |
| no_use_multiple_sdk_containers | | Configures Dataflow worker VMs to start only one containerized Apache Beam SDK process. Doesn't decrease the total number of threads, therefore all threads run in a single Apache Beam SDK process. CPU utilization might be limited and performance might be reduced when you use this option. When you use this option with a worker machine type that has many vCPU cores, to prevent stuck workers, consider reducing the number of worker harness threads. If not specified, Dataflow starts one Apache Beam SDK process per VM core. Supported in Flex Templates. Can be set by the template or by using the Usage: |
| numberOfWorkerHarnessThreads | int | This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters. If unspecified, the Dataflow service determines an appropriate value. Supported in Flex Templates. |
| numWorkers | int | The initial number of Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins. If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates. |


Python

| Field | Type | Description |
|---|---|---|
| autoscaling_algorithm | str | The autoscaling mode for your Dataflow job. Possible values are Defaults to |
| flexrs_goal | str | Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the If unspecified, defaults to |
| max_num_workers | int | The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates. |
| number_of_worker_harness_threads | int | This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters. If unspecified, the Dataflow service determines an appropriate value. Supported in Flex Templates. |
| no_use_multiple_sdk_containers | | Configures Dataflow worker VMs to start only one containerized Apache Beam Python SDK process. Does not decrease the total number of threads, therefore all threads run in a single Apache Beam SDK process. Due to Python's global interpreter lock (GIL), CPU utilization might be limited and performance reduced. When using this option with a worker machine type that has many vCPU cores, to prevent stuck workers, consider reducing the number of worker harness threads. If not specified, Dataflow starts one Apache Beam SDK process per VM core. This experiment only affects Python pipelines that use Dataflow Runner V2. Supported in Flex Templates. Can be set by the template or by using the |
| num_workers | int | The number of Compute Engine instances to use when executing your pipeline. If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates. |


Go

| Field | Type | Description |
|---|---|---|
| autoscaling_algorithm | str | The autoscaling mode for your Dataflow job. Possible values are Defaults to |
| flexrs_goal | str | Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the If unspecified, defaults to |
| max_num_workers | int | The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by If unspecified, the Dataflow service determines an appropriate number of workers. |
| no_use_multiple_sdk_containers | | Configures Dataflow worker VMs to start only one containerized Apache Beam SDK process. Doesn't decrease the total number of threads, therefore all threads run in a single Apache Beam SDK process. CPU utilization might be limited and performance might be reduced when you use this option. When you use this option with a worker machine type that has many vCPU cores, to prevent stuck workers, consider reducing the number of worker harness threads. If not specified, Dataflow starts one Apache Beam SDK process per VM core. Supported in Flex Templates. Can be set by the template or by using the Usage: |
| number_of_worker_harness_threads | int | This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters. If unspecified, the Dataflow service determines an appropriate value. |
| num_workers | int | The number of Compute Engine instances to use when executing your pipeline. If unspecified, the Dataflow service determines an appropriate number of workers. |

This table describes pipeline options that you can use to debug your job.

Java

| Field | Type | Description |
|---|---|---|
| hotKeyLoggingEnabled | boolean | Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project. If not set, only the presence of a hot key is logged. Note: Hot key detection and logging is disabled for streaming pipelines as of March 2022. |


Python

| Field | Type | Description |
|---|---|---|
| enable_hot_key_logging | bool | Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project. Requires Dataflow Runner V2 and Apache Beam SDK 2.29.0 or later. Must be set as a service option, using the format If not set, only the presence of a hot key is logged. Note: Hot key detection and logging is disabled for streaming pipelines as of March 2022. |

Go

No debugging pipeline options are available.

This table describes pipeline options for controlling your account and networking.

Java

| Field | Type | Description |
|---|---|---|
| dataflowKmsKey | String | Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify tempLocation to use this feature. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. Supported in Flex Templates. |
| gcpOauthScopes | List | Specifies the OAuth scopes that will be requested when creating the default Google Cloud credentials. Might have no effect if you manually specify the Google Cloud credential or credential factory. If not set, the following scopes are used: "https://www.googleapis.com/auth/bigquery", "https://www.googleapis.com/auth/bigquery.insertdata", "https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/datastore", "https://www.googleapis.com/auth/devstorage.full_control", "https://www.googleapis.com/auth/pubsub", "https://www.googleapis.com/auth/userinfo.email" |
| impersonateServiceAccount | String | If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. Specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. This option is only used to submit Dataflow jobs. If not set, Application Default Credentials are used to submit Dataflow jobs. |
| serviceAccount | String | Specifies a user-managed worker service account, using the format `my-service-account-name@<project-id>.iam.gserviceaccount.com`. For more information, see the Worker service account section of the Dataflow security and permissions page. If not set, workers use the Compute Engine service account of your project as the worker service account. Supported in Flex Templates. |
| network | String | The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named default. Supported in Flex Templates. |
| subnetwork | String | The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value. Supported in Flex Templates. |
| usePublicIps | boolean | Specifies whether Dataflow workers use external IP addresses. If the value is set to false, Dataflow workers use internal IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. External IP addresses have an associated cost. You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines. If not set, the default value is true and Dataflow workers use external IP addresses. |


Python

| Field | Type | Description |
|---|---|---|
| dataflow_kms_key | str | Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. Supported in Flex Templates. |
| gcp_oauth_scopes | list[str] | Specifies the OAuth scopes that will be requested when creating Google Cloud credentials. If set programmatically, must be set as a list of strings. If not set, the following scopes are used: |
| impersonate_service_account | str | If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. Specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. This option is only used to submit Dataflow jobs. If not set, Application Default Credentials are used to submit Dataflow jobs. |
| service_account_email | str | Specifies a user-managed worker service account, using the format If not set, workers use the Compute Engine service account of your project as the worker service account. Supported in Flex Templates. |
| network | str | The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named Supported in Flex Templates. |
| subnetwork | str | The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value. Supported in Flex Templates. |
| use_public_ips | Optional [bool] | Specifies whether Dataflow workers must use external IP addresses. External IP addresses have an associated cost. If the option is not explicitly enabled or disabled, the Dataflow workers use external IP addresses. Supported in Flex Templates. |
| no_use_public_ips | | Command-line flag that sets Supported in Flex Templates. |


Go

| Field | Type | Description |
|---|---|---|
| dataflow_kms_key | str | Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. |
| network | str | The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named |
| service_account_email | str | Specifies a user-managed worker service account, using the format If not set, workers use the Compute Engine service account of your project as the worker service account. |
| subnetwork | str | The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value. |
| no_use_public_ips | bool | Specifies that Dataflow workers must not use external IP addresses. If the value is set to If not set, Dataflow workers use external IP addresses. |

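
The interaction between the public IP, network, and subnetwork options described in this section can be summarized in a few lines. The following sketch is illustrative only and assumes the Python-style option names. The function and the keys of the returned dictionary are hypothetical and are not part of any Dataflow or Apache Beam API.

```python
def resolve_worker_networking(use_public_ips=None, network=None, subnetwork=None):
    """Summarize the effective worker networking for the given option values."""
    # An unset option means that workers use external IP addresses.
    external = True if use_public_ips is None else bool(use_public_ips)
    # An unset network is treated as the network named "default".
    effective_network = network or "default"
    if not external and subnetwork:
        # With internal IP addresses only, a specified subnetwork wins
        # and the network option is ignored.
        effective_network = None
    return {
        "external_ips": external,
        "network": effective_network,
        "subnetwork": subnetwork,
        # Internal-only workers need Private Google Access on the network
        # or subnetwork that they use.
        "needs_private_google_access": not external,
    }
```

The `needs_private_google_access` entry is a reminder for the caller, not a setting that the sketch applies.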
This table describes pipeline options that let you manage the state of your Dataflow pipelines across job instances.

Java

| Field | Type | Description |
|---|---|---|
| createFromSnapshot | String | Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots. If not set, no snapshot is used to create a job. |
| enableStreamingEngine | boolean | Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources. The default value is false. This default means that the steps of your streaming pipeline are executed entirely on worker VMs. Supported in Flex Templates. |
| update | boolean | Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. The default value is |


Python

| Field | Type | Description |
|---|---|---|
| create_from_snapshot | String | Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots. If not set, no snapshot is used to create a job. |
| enable_streaming_engine | bool | Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources. The default value is false. This default means that the steps of your streaming pipeline are executed entirely on worker VMs. Supported in Flex Templates. |
| update | bool | Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. The default value is |


Go

| Field | Type | Description |
|---|---|---|
| update | bool | Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. Requires Apache Beam SDK 2.40.0 or later. The default value is |

This table describes pipeline options that apply to the Dataflow worker level.

Java

| Field | Type | Description |
|---|---|---|
| diskSizeGb | int | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size. Set to |
| filesToStage | List<String> | A non-empty list of local files, directories of files, or archives (such as JAR or zip files) to make available to each worker. If you set this option, then only those files you specify are uploaded (the Java classpath is ignored). You must specify all of your resources in the correct classpath order. Resources are not limited to code, but can also include configuration files and other resources to make available to all workers. Your code can access the listed resources using the standard Java resource lookup methods. Cautions: Specifying a directory path is suboptimal since Dataflow zips the files before uploading, which involves a higher startup time cost. Also, don't use this option to transfer data to workers that is meant to be processed by the pipeline since doing so is significantly slower than using built-in If |
| workerDiskType | String | The type of Persistent Disk to use. For more information, see Disk type. The Dataflow service determines the default value. |
| workerMachineType | String | The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type. If you don't set this option, Dataflow chooses the machine type based on your job. Supported in Flex Templates. |
| workerRegion | String | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the Note: This option cannot be combined with If not set, defaults to the value set for Supported in Flex Templates. |
| workerZone | String | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the Note: This option cannot be combined with If you specify either Supported in Flex Templates. |
| zone | String | (Deprecated) For Apache Beam SDK 2.17.0 or earlier, this option specifies the Compute Engine zone for launching worker instances to run your pipeline. If you specify Supported in Flex Templates. |
| workerCacheMb | int | Specifies the size of the cache for side inputs and user state. By default, Dataflow allocates 100 MB of memory for caching side inputs and user state. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory. Defaults to 100 MB. |
| maxCacheMemoryUsageMb | int | For jobs that use Dataflow Runner v2, specifies the cache size for side inputs and user state in the format Defaults to 100 MB. |
| maxCacheMemoryUsagePercent | int | For jobs that use Dataflow Runner v2, specifies the cache size as a percentage of total VM space in the format Defaults to 20%. |
| elementProcessingTimeoutMinutes | int | For jobs that use Dataflow Runner v2, this flag specifies the timeout for any PTransform to finish processing a single element in the format Defaults to 0 (no timeout). |


Python

| Field | Type | Description |
|---|---|---|
| disk_size_gb | int | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size. Set to 0 to use the default size defined in your Google Cloud project. |
| files_to_stage | list[str] | Specifies a list of local files to upload to the worker's staging location. This feature is available in Apache Beam SDK versions 2.63.0 and later. For Dataflow, the staging location is `/tmp/staged/`. |
| worker_disk_type | str | The type of Persistent Disk to use. For more information, see Disk type. The Dataflow service determines the default value. |
| machine_type | str | The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type. If you don't set this option, Dataflow chooses the machine type based on your job. Supported in Flex Templates. |
| worker_region | str | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned. Note: This option cannot be combined with worker_zone or zone. If not set, defaults to the value set for region. Supported in Flex Templates. |
| worker_zone | str | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. Note: This option cannot be combined with worker_region or zone. If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. Supported in Flex Templates. |
| zone | str | (Deprecated) For Apache Beam SDK 2.17.0 or earlier, this option specifies the Compute Engine zone for launching worker instances to run your pipeline. If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. Supported in Flex Templates. |
| max_cache_memory_usage_mb | int | Starting in Apache Beam Python SDK version 2.52.0, you can use this option to control the cache size for side inputs and for user state. Applies for each SDK process. Increasing the amount of memory allocated to workers might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory. To increase the side input cache value, use one of the following pipeline options: for SDK versions 2.52.0 and later, use `--max_cache_memory_usage_mb=N`; for SDK versions 2.42.0 to 2.51.0, use `--experiments=state_cache_size=N`. Replace N with the cache size, in MB. For SDK versions 2.52.0-2.54.0, defaults to 100 MB. For other SDK versions, defaults to 0 MB. |
| element_processing_timeout_minutes | int | For jobs that use Dataflow Runner v2, this flag specifies the timeout for any PTransform to finish processing a single element in the format `element_processing_timeout_minutes=N`, where N is the number of minutes. The minimum supported timeout is 1 minute. If the timeout is exceeded, Runner v2 restarts the SDK harness. When a single element fails to process, Runner v2 restarts the SDK harness a maximum of 4 times for batch jobs; there is no cap on restarts for streaming jobs. This feature is available in Apache Beam SDK versions 2.68.0 and later. Defaults to 0 (no timeout). |

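
Because the Python cache option changed between SDK releases, the version-dependent choice of flag can be sketched as follows. The helper names are hypothetical, and the sketch assumes that patch releases fall into the same range as the listed minor versions. The documentation doesn't describe SDK versions earlier than 2.42.0, so the sketch rejects them.

```python
def _parse_version(version):
    # "2.52.0" -> (2, 52, 0), so that versions compare numerically.
    return tuple(int(part) for part in version.split("."))


def cache_size_flag(sdk_version, size_mb):
    """Return the flag that sets the side input and user state cache size."""
    parsed = _parse_version(sdk_version)
    if parsed >= (2, 52, 0):
        return f"--max_cache_memory_usage_mb={size_mb}"
    if parsed >= (2, 42, 0):
        return f"--experiments=state_cache_size={size_mb}"
    # Assumption: no cache size flag is documented for older SDK versions.
    raise ValueError(f"no cache size flag documented for SDK {sdk_version}")


def default_cache_mb(sdk_version):
    """Return the documented default cache size, in MB."""
    parsed = _parse_version(sdk_version)
    return 100 if (2, 52, 0) <= parsed < (2, 55, 0) else 0
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating "2.9.0" as later than "2.52.0".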

Go

| Field | Type | Description |
|---|---|---|
| disk_size_gb | int | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size. Set to |
| disk_type | str | The type of Persistent Disk to use. For more information, see Disk type. The Dataflow service determines the default value. |
| worker_machine_type | str | The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type. If you don't set this option, Dataflow chooses the machine type based on your job. |
| worker_region | str | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the Note: This option cannot be combined with If not set, defaults to the value set for |
| worker_zone | str | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the Note: This option cannot be combined with If you specify either |
| element_processing_timeout | duration | For jobs that use Dataflow Runner v2, this flag specifies the timeout for any PTransform to finish processing a single element in the format Defaults to 0 (no timeout). |

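
The worker location options in this section exclude one another in ways that are easy to get wrong. The following sketch uses the Python option names and encodes only the rules stated in the option descriptions. The function name and the shape of the returned dictionary are hypothetical.

```python
def resolve_worker_location(region=None, worker_region=None,
                            worker_zone=None, zone=None):
    """Return where workers run: an explicit zone or a region."""
    if worker_region and (worker_zone or zone):
        raise ValueError("worker_region can't be combined with worker_zone or zone")
    if worker_zone and zone:
        raise ValueError("worker_zone can't be combined with zone")
    if worker_zone or zone:
        # An explicit zone overrides the zone the service would pick.
        return {"kind": "zone", "value": worker_zone or zone}
    # worker_region defaults to region, and the service assigns the zone.
    return {"kind": "region", "value": worker_region or region}
```

For example, passing only `region="us-central1"` resolves to that region, and adding `worker_zone="us-west1-a"` resolves to the explicit zone instead.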
When executing your pipeline locally, the default values for the properties in
PipelineOptions are usually sufficient.
You can find the default values for PipelineOptions in the Apache Beam SDK for Java
API reference; see the
PipelineOptions
class listing for complete details.
If your pipeline uses Google Cloud products such as BigQuery or
Cloud Storage for I/O, you might need to set certain
Google Cloud project and credential options. In such cases, you should
use GcpOptions.setProject to set your Google Cloud Project ID. You may also
need to set credentials explicitly. See the
GcpOptions
class for complete details.
You can find the default values for PipelineOptions in the Apache Beam SDK for
Python API reference; see the
PipelineOptions
module listing for complete details.
If your pipeline uses Google Cloud services such as
BigQuery or Cloud Storage for I/O, you might need to
set certain Google Cloud project and credential options. In such cases,
you should use options.view_as(GoogleCloudOptions).project to set your
Google Cloud Project ID. You may also need to set credentials
explicitly. See the
GoogleCloudOptions
class for complete details.
You can find the default values for PipelineOptions in the Apache Beam SDK for
Go API reference; see
jobopts
for more details.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-06-09 UTC.