I/O Standards

Overview

This Apache Beam I/O Standards document lays out prescriptive guidance for first-party and third-party (1P/3P) developers building an Apache Beam I/O connector. The guidelines aim to establish simple, concise best practices for documentation, development and testing.

What are built-in I/O Connectors?

An I/O connector (I/O) that lives in the Apache Beam GitHub repository is known as a built-in I/O connector. Built-in I/Os have their integration tests and performance tests routinely run by the Google Cloud Dataflow team using the Dataflow Runner, with metrics published publicly for reference. Unless explicitly stated otherwise, the following guidelines apply to both built-in and custom I/Os.

Guidance

Documentation

This section lays out the superset of all documentation that is expected to be made available with an I/O. The Apache Beam documentation referenced throughout this section can be found on the Apache Beam site. A good example to follow is the built-in Snowflake I/O.

Built-in I/O

Provide code docs for the relevant language of the I/O. These should also link to any external sources of information, whether on the Apache Beam site or in an external documentation location.

Add a new page under I/O connector guides that covers specific tips and configurations. Existing guides cover Parquet, Hadoop and others.

Formatting of the section headers in your Javadoc/Pythondoc should be consistent throughout such that programmatic information extraction for other pages can be enabled in the future.

I/O Connectors should include a table under a Supported Features subheader that indicates the Relational Features utilized.

Relational Features are concepts that can help improve efficiency and can optionally be implemented by an I/O Connector. Using end user supplied pipeline configuration (SchemaIO) and user query (FieldAccessDescriptor) data, relational theory is applied to derive improvements such as faster pipeline execution, lower operation costs and less data read/written.

<div class="table-container-wrapper">
<table class="table table-bordered table-io-standards-relational-features">
   <tr>
      <th>
         <p><strong>Relational Feature</strong>
      </th>
      <th>
         <p><strong>Supported</strong>
      </th>
      <th>
         <p><strong>Notes</strong>
      </th>
   </tr>
   <tr>
      <td>
         <p>Column Pruning
      </td>
      <td>
         <p>Yes/No
      </td>
      <td>
         <p>To Be Filled
      </td>
   </tr>
   <tr>
      <td>
         <p>Filter Pushdown
      </td>
      <td>
         <p>Yes/No
      </td>
      <td>
         <p>To Be Filled
      </td>
   </tr>
   <tr>
      <td>
         <p>Table Statistics
      </td>
      <td>
         <p>Yes/No
      </td>
      <td>
         <p>To Be Filled
      </td>
   </tr>
   <tr>
      <td>
         <p>Partition Metadata
      </td>
      <td>
         <p>Yes/No
      </td>
      <td>
         <p>To Be Filled
      </td>
   </tr>
   <tr>
      <td>
         <p>Metastore
      </td>
      <td>
         <p>Yes/No
      </td>
      <td>
         <p>To Be Filled
      </td>
   </tr>
</table>
</div>

Add a page under Common pipeline patterns, if necessary, outlining common usage patterns involving your I/O.

Include a canonical read/write code snippet after the initial description for each supported language. For example, the Hadoop guide includes examples for Java.

Indicate how timestamps for elements are assigned. This includes batch sources to allow for future I/Os which may provide more useful information than current_time().

Outline any temporary resources (for example, files) that the connector will create.

Provide, under an Authentication subheader, how to acquire partner authorization material to securely access the source/sink.

I/Os should provide links to the Source/Sink documentation within a Before you start header.

Indicate if there is native or X-language support in each language with a link to the docs.

Indicate known limitations under a Limitations header. If the limitation has a tracking issue, please link it inline.

I/O (not built-in)

Custom I/Os are not included in the Apache Beam GitHub repository. One example is SolaceIO.

This section outlines API syntax, semantics and recommendations for features that should be adopted for new as well as existing Apache Beam I/O Connectors.

The I/O Connector development guidelines are written with the following principles in mind:

All SDKs

Pipeline Configuration / Execution / Streaming / Windowing semantics guidelines

An I/O should rarely rely on a PipelineOptions subclass to tune internal parameters.

A source must return elements in the GlobalWindow unless explicitly parameterized in the API by the user.

A sink should be Window agnostic and handle elements sent with any Windowing method, unless explicitly parameterized or expressed in its API.

A sink may change windowing of a PCollection internally in any way. However, the metadata that it returns as part of its result object must be:

A streaming sink (or any transform accessing an external service) may throttle its requests to avoid overloading the external service.
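As a rough illustration of request throttling, the following is a minimal sketch of a client-side token bucket in plain Python. It is not a Beam API: the `Throttler` class, its parameters, and the injectable `clock` and `sleep` hooks are all assumptions made for this example. A connector would call `acquire()` before each request to the external service.

```python
import time
from typing import Callable


class Throttler:
    """Hypothetical token bucket allowing about `rate` calls per second."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        if rate <= 0 or burst < 1:
            raise ValueError("rate must be > 0 and burst must be >= 1")
        self._rate = rate
        self._burst = burst
        self._clock = clock
        self._sleep = sleep
        self._tokens = float(burst)  # start with a full bucket
        self._last = clock()

    def acquire(self) -> float:
        """Blocks until one token is available; returns seconds waited."""
        now = self._clock()
        # Refill tokens for the time elapsed since the previous call.
        elapsed = now - self._last
        self._tokens = min(self._burst, self._tokens + elapsed * self._rate)
        self._last = now
        waited = 0.0
        if self._tokens < 1.0:
            # Sleep just long enough for one full token to accrue.
            waited = (1.0 - self._tokens) / self._rate
            self._sleep(waited)
            self._tokens = 1.0
            self._last = self._clock()
        self._tokens -= 1.0
        return waited
```

Injecting the clock and sleep functions keeps the throttler testable without real waiting, which fits the unit-testing guidance later in this document.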

Java

General

The primary class used in working with the connector should be named {connector}IO

The class should be placed in the package org.apache.beam.sdk.io.{connector}

The integration and performance tests should live under the package org.apache.beam.sdk.io.{connector}.testing. This will cause the various tests to work with the standard user-facing interfaces of the connector.

Unit tests should reside in the same package (i.e. org.apache.beam.sdk.io.{connector}), as they may often test internals of the connector.

An I/O transform should avoid receiving user lambdas to map elements from a user type to a connector-specific type. Instead, it should interface with a connector-specific data type (with schema information when possible).

When necessary, an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform.

It is highly discouraged to directly expose third-party libraries in the public API part of the I/O Connector for the following reasons:

Instead, we highly recommend exposing Beam-native interfaces and an adaptor that holds mapping logic.

Sources and sinks should be abstracted with a PTransform wrapper, and internal classes should be declared protected or private. That way, implementation details can be added or changed without breaking code that depends on the connector.

Classes / Methods / Properties

Gives access to the class that represents reads within the I/O. The Read class should implement a fluent interface similar to the fluent builder pattern (e.g. withX(...).withY(...)). Together with default values, it provides a fail-fast experience (with immediate validation feedback after each .withX()) that is slightly less verbose than the builder pattern.

A few different sources implement runtime configuration for reading from a data source. This is a valuable pattern because it enables a purely batch source to become a more sophisticated streaming source.

As much as possible, this type of transform should have the type richness of a construction-time-configured transform:

Gives access to the class that represents writes within the I/O. The Write class should implement a fluent interface pattern (e.g. withX(...).withY(...)) as described further above for IO.Read.

Some data storage and external systems implement APIs that do not adjust easily to Read or Write semantics (e.g. FhirIO implements several different transforms that fetch or send data to Fhir).

These classes should be added only if it is impossible or prohibitively difficult to encapsulate their functionality as part of extra configuration of Read, Write and ReadAll transforms, to avoid increasing the cognitive load on users.

Some connectors rely on other user-facing classes to set configuration parameters.

The top-level I/O class will provide a static method to start constructing an I/O.Write transform. This returns a PTransform with a single input PCollection, and a Write.Result output.

The top-level I/O class will likewise provide a static method to start constructing an I/O.Read transform. This returns a PTransform with a single output PCollection.

The above should be specified via configuration parameters if possible. If not possible, then a new static method may be introduced, but this must be exceptional, and documented in the I/O header as part of the API.

A Read transform must provide a from method where users can specify where to read from. If a transform can read from different kinds of sources (e.g. tables, queries, topics, partitions), then multiple implementations of this from method can be provided to accommodate this:

The input type for these methods can reflect the external source’s API (e.g. Kafka TopicPartition should use a Beam-implemented TopicPartition object).

A Write transform must provide a to method where users can specify where to write data. If a transform can write to different kinds of destinations while still using the same input element type (e.g. tables, queries, topics, partitions), then multiple implementations of this to method can be provided to accommodate this:

The input type for these methods can reflect the external sink's API (e.g. Kafka TopicPartition should use a Beam-implemented TopicPartition object).

If different kinds of destinations require different types of input object types, then these should be done in separate I/O connectors.

A write transform may enable writing to more than one destination. This can be a complicated pattern that should be implemented carefully (it is the preferred pattern for connectors that will likely have multiple destinations in a single pipeline).

The preferred pattern for this is to define a DynamicDestinations interface (e.g. BigQueryIO.DynamicDestinations) that will allow the user to define all necessary parameters for the configuration of the destination.

withX provides a method for passing configuration to the Read transform, where X represents the configuration to be set. With the exception of generic with statements (defined below), the I/O should attempt to match the name of the configuration option with the option name used in the source.

These methods should return a new instance of the I/O rather than modifying the existing instance.
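The fluent, immutable configuration style described above can be sketched as follows. To keep every example in this document in one language, the sketch is written in plain Python rather than Java, and `ExampleRead`, `from_table` and `with_batch_size` are hypothetical names, not part of any Beam connector. Each setter validates its argument immediately and returns a new instance instead of mutating the receiver.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ExampleRead:
    """Hypothetical read configuration; not a real Beam transform."""
    table: Optional[str] = None
    batch_size: int = 1000  # default used when the option is never set

    def from_table(self, table: str) -> "ExampleRead":
        if not table:
            raise ValueError("table must be a non-empty string")
        # replace() copies the frozen instance with one field changed.
        return replace(self, table=table)

    def with_batch_size(self, batch_size: int) -> "ExampleRead":
        # Fail fast: reject bad values when the option is set,
        # not later when the pipeline runs.
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        return replace(self, batch_size=batch_size)


base = ExampleRead().from_table("orders")
tuned = base.with_batch_size(50)  # base itself is left unchanged
```

Because `base` is never modified, it can be safely reused as the starting point for several differently tuned reads in the same pipeline.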

Some connectors in Java receive a configuration object as part of their configuration. This pattern is encouraged only for particular cases. In most cases, a connector can hold all necessary configuration parameters at the top level.

To determine whether a multi-parameter configuration object is an appropriate parameter for a high level transform, the configuration object must:

For sinks that can receive Beam Row-typed PCollections, the format function should not be necessary, because Beam should be able to format the input data based on its schema.

For sinks providing Dynamic Destination functionality, elements may carry data that helps determine their destination. These data may need to be removed before writing to their final destination.

Sets the coder to use to encode/decode the element type of the output / input PCollection of this connector. In general, it is recommended that sources will:

Connector transforms should provide a method to override the interface between themselves and the external system that they communicate with. This can enable various uses:

Types

The expand method of a Read transform must return a PCollection object with a type. The type may be parameterized or fixed to a class.

The type of the PCollection will usually be one of the following four options. For each of these options, the recommended encoding / data is as follows:

The expand method of any write transform must return an object of type IO.Write.Result that extends PCollectionTuple. This object allows the transform to return metadata about the results of its write and allows the write to be followed by other PTransforms.

If the Write transform would not need to return any metadata, a Write.Result object is still preferable, because it will allow the transform to evolve its metadata over time.

Evolution

Over time, I/Os need to evolve to address new use cases or to use new APIs under the covers. Some examples of necessary evolution of an I/O:

In general, one should resist adding a completely new static method for functionality that can be captured as configuration within an existing method.

An example of too many top-level methods that could be supported via configuration is PubsubIO.

Python

General

If the I/O lives in Apache Beam, it should be placed in the package apache_beam.io.{connector} or apache_beam.io.{namespace}.{connector}.

There will be a module named {connector}.py, which is the primary entry point for working with the connector in a pipeline: apache_beam.io.{connector} or apache_beam.io.{namespace}.{connector}.

If the I/O implementation exists in a single module (a single file), then the file {connector}.py can hold it.

Otherwise, the connector code should be defined within a directory (connector package) with an __init__.py file that documents the public API.

If the connector defines other files containing utilities for its implementation, these files must clearly document the fact that they are not meant to be a public interface.

Classes / Methods / Properties

This gives access to the PTransform to read from a given data source. It allows you to configure it via the arguments that it receives. Long lists of optional parameters may be defined as parameters with default values.

As much as possible, this type of transform should have the type richness and safety of a construction-time-configured transform:

This gives access to the PTransform to write into a given data sink. It allows you to configure it via the arguments that it receives. Long lists of optional parameters may be defined as parameters with default values.

The first parameter in a Read or Write I/O connector must specify the source for readers or the destination for writers.

The preferred API pattern in Python is to pass callables (e.g. WriteToBigQuery) for all parameters that will need to be configured. In general, examples of callable parameters may be:

Using these callables also allows maintainers to add new parameterizable callables over time (with default values to avoid breaking existing users) that will define extra configuration parameters if necessary.
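A minimal sketch of a callable parameter follows, in plain Python with no Beam dependency. The helper `resolve_table` and the `TableSpec` alias are hypothetical names introduced for illustration; the idea is that a single destination parameter can accept either a fixed value or a callable that derives the value from each element.

```python
from typing import Any, Callable, Dict, Union

Element = Dict[str, Any]
# Either a fixed table name or a function computing one per element.
TableSpec = Union[str, Callable[[Element], str]]


def resolve_table(table: TableSpec, element: Element) -> str:
    """Returns the destination table for `element`."""
    return table(element) if callable(table) else table


rows = [{"country": "de", "n": 1}, {"country": "fr", "n": 2}]

# Fixed destination: every row goes to the same table.
static = [resolve_table("events", r) for r in rows]

# Callable destination: the table name depends on the row contents.
dynamic = [resolve_table(lambda r: "events_" + r["country"], r) for r in rows]
```

Accepting both forms through one parameter keeps the simple case simple while leaving room for per-element behavior without adding a second configuration option.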

Some connectors in Python may receive a complex configuration object as part of their configuration. This pattern is discouraged, because a connector can hold all necessary configuration parameters at the top level.

To determine whether a multi-parameter configuration object is an appropriate parameter for a high level transform, the configuration object must:

Types

The expand method of a Read transform must return a PCollection object with a type, and be annotated with the type. Preferred PCollection types in Python are (in order of preference):

The expand method of any write transform must return a Python object with a fixed class type. The recommended name for the class is WriteTo{IO}Result. This object allows transforms to return metadata about the results of its writing.

If the Write transform would not need to return any metadata, a Python object with a class type is still preferable, because it will allow the transform to evolve its metadata over time.
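The following sketch suggests why a dedicated result class helps a sink evolve. It is plain Python with hypothetical names (`WriteToExampleResult` for an imaginary ExampleIO), and plain lists stand in for whatever metadata collections a real connector would return. A field added later with a default value leaves existing callers untouched.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class WriteToExampleResult:
    """Hypothetical result object for an imaginary ExampleIO sink."""
    # Present since the (imaginary) first release.
    failed_rows: List[Any] = field(default_factory=list)
    # Added in a later (imaginary) release. The default keeps code
    # written against the older shape working unchanged.
    destinations_written: List[str] = field(default_factory=list)


# Code written against the first release still works.
old_style = WriteToExampleResult(failed_rows=[{"id": 7}])

# Newer code can populate and read the additional metadata.
new_style = WriteToExampleResult(destinations_written=["t1"])
```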

The expand method of a Write transform must return a PCollection object with a type, and be annotated with the type. Preferred PCollection types in Python are the same as the output types for a ReadFromIO referenced in T1.

GoLang

General

Integration and performance tests should live under the same package as the I/O itself.

TypeScript

Classes / Methods / Properties

Testing

An I/O should have unit tests, integration tests, and performance tests. The following guidance explains what each type of test aims to achieve and provides a baseline standard of test coverage. The actual test cases and test logic will vary with the specifics of each source/sink, but some suggested test cases are included as a baseline.

This guide complements the Apache Beam I/O transform testing guide by adding specific test cases and scenarios. For general information regarding testing Beam I/O connectors, please refer to that guide.

Integration and performance tests should live under the package org.apache.beam.sdk.io.{connector}.testing. This will cause the various tests to work with the standard user-facing interfaces of the connector.

Unit tests should reside in the same package (i.e. org.apache.beam.sdk.io.{connector}), as they may often test internals of the connector.

Unit Tests

I/O unit tests need to efficiently test the functionality of the code. Given that unit tests are expected to be executed many times over multiple test suites (for example, for each Python version) these tests should execute relatively fast and should not have side effects. We recommend trying to achieve 100% code coverage through unit tests.

When possible, unit tests are favored over integration tests due to faster execution time and lower resource usage. Additionally, unit tests can be easily included in pre-commit test suites (for example, Jenkins beam_PreCommit_* test suites) and hence have a better chance of discovering regressions early. Unit tests are also preferred for error conditions.

The unit testing class should be part of the same package as the IO and named {connector}IOTest.
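As one possible shape for such a test, the sketch below exercises both the normal path and an error condition against an in-memory fake instead of a real service. It uses only the Python standard library; `FakeClient`, `write_all` and `ExampleIOTest` are hypothetical names, and `write_all` stands in for whatever connector logic talks to the external system.

```python
import unittest


class FakeClient:
    """In-memory stand-in for an external service (hypothetical)."""

    def __init__(self, fail_on=None):
        self.rows = []
        self._fail_on = fail_on  # row value that triggers a failure

    def insert(self, row):
        if row == self._fail_on:
            raise ConnectionError("simulated service failure")
        self.rows.append(row)


def write_all(client, rows):
    """Writes rows through `client`; returns the rows that failed."""
    failed = []
    for row in rows:
        try:
            client.insert(row)
        except ConnectionError:
            failed.append(row)
    return failed


class ExampleIOTest(unittest.TestCase):
    def test_writes_every_row(self):
        client = FakeClient()
        self.assertEqual(write_all(client, [1, 2, 3]), [])
        self.assertEqual(client.rows, [1, 2, 3])

    def test_reports_failed_rows(self):
        # Error condition: the service rejects one row.
        client = FakeClient(fail_on=2)
        self.assertEqual(write_all(client, [1, 2, 3]), [2])
        self.assertEqual(client.rows, [1, 3])
```

Because the fake holds all state in memory, these tests run quickly, have no side effects, and can simulate failures that would be hard to trigger against a real service.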

Suggested Test Cases

Integration Tests

Integration tests exercise end-to-end interactions between the Beam runner and the data store a given I/O connects to. Since these usually involve remote procedure calls, integration tests take longer to execute. Additionally, Beam runners may use more than one worker when executing integration tests. Due to these costs, an integration test should only be implemented when a given scenario cannot be covered by a unit test.

The integration testing class should be part of the same package as the I/O and named {connector}IOIT.

Suggested Test Cases

Same as “write then read” but for sources that support reading a PCollection of source configs. All future (SDF) sources are expected to support this.

Performance Tests

Because the Performance testing framework is still in flux, performance tests can be a follow-up submission after the actual I/O code.

The Performance testing framework does not yet support GoLang or TypeScript.

Performance benchmarks are a critical part of best practices for I/Os as they effectively address several areas:

Dashboard

Google runs performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for Java and Python.

Guidance

Include a Resource Scalability section in your page under the Built-in I/O connector guides documentation, indicating the upper bounds for which the I/O has integration tests.

For example, an indication that KafkaIO has integration tests with xxxx topics. The documentation can state whether the connector authors believe that the connector can scale beyond the integration test number; either way, this makes the limits of the tested paths clear to the user.

Include expected performance characteristics of the I/O based on performance tests that the connector has in place.
