Stay organized with collections
Save and categorize content based on your preferences.
External tables for Cloud Storage
This document describes how to create a Cloud Storage BigLake table.
A BigLake table lets you use
access delegation to query structured data in Cloud Storage. Access
delegation decouples access to the BigLake table from access
to the underlying datastore.
Before you begin
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the
resourcemanager.projects.create permission. Learn how to grant
roles.
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant
roles.
Optional: For Terraform, terraform-provider-google version 4.25.0 or
later is required. terraform-provider-google releases are listed on
GitHub.
You can download the latest version of Terraform from
HashiCorp Terraform downloads.
Create a Cloud resource connection or set up a default connection to
your external data source. Connections require additional roles and
permissions. For more information, see Create a Cloud resource
connection and the
Default connection overview.
Required roles
To create a BigLake table, you need the following
BigQuery Identity and Access Management (IAM) permissions:
bigquery.tables.create
bigquery.connections.delegate
The BigQuery Admin (roles/bigquery.admin) predefined
Identity and Access Management role includes these permissions.
If you are not a principal in this role, ask your administrator
to grant you access or to create the BigLake table for you.
When you use Cloud Storage to store data files, you can improve
performance by using
Cloud Storage
single-region or
dual-region buckets instead of
multi-region buckets.
Create external tables on unpartitioned data
If you're familiar with creating tables in BigQuery, the process
of creating an external table is similar.
Your table can use any file format that external tables support. For
more information, see
Limitations.
If you don't see the left pane, click last_pageExpand left pane to open the pane.
In the Explorer pane, expand your project, click Datasets, and
then select a dataset.
Expand the
more_vertActions option and click Create table.
In the Source section, specify the following details:
For Create table from, select Google Cloud Storage
For Select file from GCS bucket or use a URI pattern, browse to
select a bucket and file to use, or type the path in the format
gs://bucket_name/[folder_name/]file_name.
You can't specify multiple URIs in the Google Cloud console, but
you can select multiple files by specifying one asterisk (*)
wildcard character. For example, gs://mybucket/file_name*. For more
information, see
Wildcard support for Cloud Storage URIs.
The Cloud Storage bucket
must be in the same location as the dataset that contains the table
you're creating.
For File format, select the format that matches your file.
In the Destination section, specify the following details:
For Project, choose the project in which to create the table.
For Dataset, choose the dataset in which to create the table.
For Table, enter the name of the table you are creating.
For Table type, select External table.
Select Create a BigLake table using a Cloud Resource connection.
For Connection ID, select the connection that you created earlier.
In the Schema section, you can either enable
schema auto-detection or manually specify
a schema if you have a source file. If you don't have a source file, you
must manually specify a schema.
To enable schema auto-detection, select the Auto-detect option.
To manually specify a schema, leave the Auto-detect option
unchecked. Enable Edit as text and enter the table schema as a
JSON array.
To ignore rows with extra column values that do not match the schema,
expand the Advanced options section and select Unknown values.
Click Create table.
After the permanent table is created, you can run a query against the table
as if it were a native BigQuery table. After your query
completes, you can export the results
as CSV or JSON files, save the results
as a table, or save the results to Google Sheets.
PROJECT_ID: the name of your
project in which you want to create the table—for example, myproject
DATASET: the name of the
BigQuery dataset that you want to create the table
in—for example, mydataset
EXTERNAL_TABLE_NAME: the name of the table
that you want to create—for example, mytable
REGION: the region that contains the
connection—for example, us
CONNECTION_ID: the connection ID—for
example, myconnection
When you
view the connection details in the Google Cloud console, the connection
ID is the value in the last section of the fully qualified connection ID
that's shown in Connection ID—for example
projects/myproject/locations/connection_location/connections/myconnection.
To use a
default connection, specify DEFAULT instead of the
connection string containing
PROJECT_ID.REGION.CONNECTION_ID.
TABLE_FORMAT: the format of the table that
you want to create—for example, PARQUET
For more information about supported formats, see
Limitations.
BUCKET_PATH: the path to the
Cloud Storage bucket that contains the data for the
external table, in the format
['gs://bucket_name/[folder_name/]file_name'].
You can select multiple files from the bucket by specifying one asterisk (*)
wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more
information, see
Wildcard support for Cloud Storage URIs.
You can specify multiple buckets for the uris option by providing multiple
paths.
STALENESS_INTERVAL: specifies whether
cached metadata is used by operations against the BigLake
table, and
how fresh the cached metadata must be in order for the operation to
use it. For more information about metadata caching considerations, see
Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an
interval literal
value between 30 minutes and 7 days. For example, specify
INTERVAL 4 HOUR for a 4 hour staleness interval.
With this value, operations against the table use cached metadata if
it has been refreshed within the past 4 hours. If the cached metadata
is older than that, the operation retrieves metadata from
Cloud Storage instead.
CACHE_MODE: specifies whether the metadata
cache is refreshed automatically or manually. For more information
on metadata caching considerations, see
Metadata caching for performance.
Set to AUTOMATIC for the metadata cache to be
refreshed at a system-defined interval, usually somewhere between 30 and
60 minutes.
CONNECTION_ID: the connection ID—for
example, myconnection
When you view the connection details
in the Google Cloud console, the connection ID is the value in the last section of the
fully qualified connection ID that is shown in Connection ID—for example
projects/myproject/locations/connection_location/connections/myconnection.
To use a default connection,
specify DEFAULT instead of the connection string containing
PROJECT_ID.REGION.CONNECTION_ID.
SOURCE_FORMAT: the format of the external data source.
For example, PARQUET.
BUCKET_PATH: the path to the
Cloud Storage bucket that contains the data for the
table, in the format gs://bucket_name/[folder_name/]file_pattern.
You can select multiple files from the bucket by specifying one asterisk (*)
wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more
information, see
Wildcard support for Cloud Storage URIs.
You can specify multiple buckets for the uris option by providing multiple
paths.
The following examples show valid uris values:
gs://bucket/path1/myfile.csv
gs://bucket/path1/*.parquet
gs://bucket/path1/file1*, gs://bucket1/path1/*
When you specify uris values that target multiple files, all of those
files must share a compatible schema.
STALENESS_INTERVAL: specifies whether
cached metadata is used by operations against the
BigLake table, and
how fresh the cached metadata must be in order for the operation to
use it. For more information about metadata caching considerations, see
Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an interval value between 30
minutes and 7 days, using the
Y-M D H:M:S format described in the
INTERVAL data type
documentation. For example, specify 0-0 0 4:0:0 for a 4
hour staleness interval.
With this value, operations against the table use cached metadata if
it has been refreshed within the past 4 hours. If the cached metadata
is older than that, the operation retrieves metadata from
Cloud Storage instead.
DATASET: the name of the
BigQuery dataset that you want to create a table
in—for example, mydataset
EXTERNAL_TABLE_NAME: the name of the table
that you want to create—for example, mytable
Instead of creating a table definition file, you can pass the table
definition directly to the bq mk command.
Use the @connection decorator to specify the connection to use at the end
of the
--external_table_definition flag.
SOURCE_FORMAT: the format of the external data source
For example, CSV.
BUCKET_PATH: the path to the
Cloud Storage bucket that contains the data for the
table, in the format gs://bucket_name/[folder_name/]file_pattern.
You can select multiple files from the bucket by specifying one asterisk (*)
wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more
information, see
Wildcard support for Cloud Storage URIs.
You can specify multiple buckets for the uris option by providing multiple
paths.
The following examples show valid uris values:
gs://bucket/path1/myfile.csv
gs://bucket/path1/*.parquet
gs://bucket/path1/file1*, gs://bucket1/path1/*
When you specify uris values that target multiple files, all of those
files must share a compatible schema.
PROJECT_ID: the name of your
project in which you want to create the table—for example, myproject
REGION: the region that contains the
connection, us
CONNECTION_ID: the connection ID—for
example, myconnection
When you view the connection details
in the Google Cloud console, the connection ID is the value in the last section of the
fully qualified connection ID that is shown in Connection ID—for example
projects/myproject/locations/connection_location/connections/myconnection.
To use a default connection,
specify DEFAULT instead of the connection string containing
PROJECT_ID.REGION.CONNECTION_ID.
DATASET_NAME: the name of the dataset where you
want to create the BigLake table
# This creates a bucket in the US region named "my-bucket" with a pseudorandom suffix.
resource "random_id" "default" {
byte_length = 8
}
resource "google_storage_bucket" "default" {
name = "my-bucket-${random_id.default.hex}"
location = "US"
force_destroy = true
uniform_bucket_level_access = true
}
# This queries the provider for project information.
data "google_project" "project" {}
# This creates a connection in the US region named "my-connection".
# This connection is used to access the bucket.
resource "google_bigquery_connection" "default" {
connection_id = "my-connection"
location = "US"
cloud_resource {}
}
# This grants the previous connection IAM role access to the bucket.
resource "google_project_iam_member" "default" {
role = "roles/storage.objectViewer"
project = data.google_project.project.id
member = "serviceAccount:${google_bigquery_connection.default.cloud_resource[0].service_account_id}"
}
# This makes the script wait for seven minutes before proceeding.
# This lets IAM permissions propagate.
resource "time_sleep" "default" {
create_duration = "7m"
depends_on = [google_project_iam_member.default]
}
# This defines a Google BigQuery dataset with
# default expiration times for partitions and tables, a
# description, a location, and a maximum time travel.
resource "google_bigquery_dataset" "default" {
dataset_id = "my_dataset"
default_partition_expiration_ms = 2592000000 # 30 days
default_table_expiration_ms = 31536000000 # 365 days
description = "My dataset description"
location = "US"
max_time_travel_hours = 96 # 4 days
# This defines a map of labels for the bucket resource,
# including the labels "billing_group" and "pii".
labels = {
billing_group = "accounting",
pii = "sensitive"
}
}
# This creates a BigQuery Table with automatic metadata caching.
resource "google_bigquery_table" "default" {
dataset_id = google_bigquery_dataset.default.dataset_id
table_id = "my_table"
schema = jsonencode([
{ "name" : "country", "type" : "STRING" },
{ "name" : "product", "type" : "STRING" },
{ "name" : "price", "type" : "INT64" }
])
external_data_configuration {
# This defines an external data configuration for the BigQuery table
# that reads Parquet data from the publish directory of the default
# Google Cloud Storage bucket.
autodetect = false
source_format = "PARQUET"
connection_id = google_bigquery_connection.default.name
source_uris = ["gs://${google_storage_bucket.default.name}/data/*"]
# This enables automatic metadata refresh.
metadata_cache_mode = "AUTOMATIC"
}
# This sets the maximum staleness of the metadata cache to 10 hours.
max_staleness = "0-0 0 10:0:0"
depends_on = [time_sleep.default]
}
To apply your Terraform configuration in a Google Cloud project, complete the steps in the
following sections.
Set the default Google Cloud project
where you want to apply your Terraform configurations.
You only need to run this command once per project, and you can run it in any directory.
export GOOGLE_CLOUD_PROJECT=PROJECT_ID
Environment variables are overridden if you set explicit values in the Terraform
configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also
called a root module).
In Cloud Shell, create a directory and a new
file within that directory. The filename must have the
.tf extension—for example main.tf. In this
tutorial, the file is referred to as main.tf.
mkdir DIRECTORY && cd DIRECTORY && touch main.tf
If you are following a tutorial, you can copy the sample code in each section or step.
Copy the sample code into the newly created main.tf.
Optionally, copy the code from GitHub. This is recommended
when the Terraform snippet is part of an end-to-end solution.
Review and modify the sample parameters to apply to your environment.
Save your changes.
Initialize Terraform. You only need to do this once per directory.
terraform init
Optionally, to use the latest Google provider version, include the -upgrade
option:
terraform init -upgrade
Apply the changes
Review the configuration and verify that the resources that Terraform is going to create or
update match your expectations:
terraform plan
Make corrections to the configuration as necessary.
Apply the Terraform configuration by running the following command and entering yes
at the prompt:
terraform apply
Wait until Terraform displays the "Apply complete!" message.
Open your Google Cloud project to view
the results. In the Google Cloud console, navigate to your resources in the UI to make sure
that Terraform has created or updated them.
BigLake supports schema autodetection. However, if you did not
provide a schema and the service account was not granted access in the previous
steps, these steps fail with an access denied message if you try to autodetect
the schema.
Create BigLake tables on Apache Hive partitioned data
You can create a BigLake table for Hive partitioned data in
Cloud Storage. After you create an externally partitioned table, you
can't change the partition key. You need to recreate the table to change the
partition key.
You can update BigLake tables if necessary, for example to
change their
metadata caching.
To get table details such as source format and source URI, see
Get table information.
You can also use this same procedure to upgrade Cloud Storage-based
external tables to BigLake tables by associating the external
table to a connection. For more information, see
Upgrade external tables to BigLake tables.
To update a BigLake table, select one of the
following options:
PROJECT_ID: the name of the project that
contains the table
DATASET: the name of the dataset that
contains the table
EXTERNAL_TABLE_NAME: the name of the
table
REGION: the region that contains the
connection
CONNECTION_ID: the name of the connection
to use
To use a
default connection, specify DEFAULT instead of the
connection string containing
REGION.CONNECTION_ID.
TABLE_FORMAT: the format used by the table
You can't change this when updating the table.
BUCKET_PATH: the path to the
Cloud Storage bucket that contains the data for the
external table, in the format
['gs://bucket_name/[folder_name/]file_name'].
You can select multiple files from the bucket by specifying one asterisk (*)
wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more
information, see
Wildcard support for Cloud Storage URIs.
You can specify multiple buckets for the uris option by providing multiple
paths.
STALENESS_INTERVAL: specifies whether
cached metadata is used by operations against the table, and
how fresh the cached metadata must be in order for the operation to
use it
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an
interval literal
value between 30 minutes and 7 days. For example, specify
INTERVAL 4 HOUR for a 4 hour staleness interval.
With this value, operations against the table use cached metadata if
it has been refreshed within the past 4 hours. If the cached metadata
is older than that, the operation retrieves metadata from
Cloud Storage instead.
CACHE_MODE: specifies whether the metadata
cache is refreshed automatically or manually
PROJECT_ID: the name of the project that
contains the connection
REGION: the region that contains the
connection
CONNECTION_ID: the name of the connection
to use
TABLE_FORMAT: the format used by the
table. You can't change this when updating the table.
CACHE_MODE: specifies whether the metadata
cache is refreshed automatically or manually. For more information
on metadata caching considerations, see
Metadata caching for performance.
Set to AUTOMATIC for the metadata cache to be refreshed at a
system-defined interval, usually somewhere between 30 and
60 minutes.
You must set CACHE_MODE if
STALENESS_INTERVAL is set to a value
greater than 0.
BUCKET_PATH: the path to the
Cloud Storage bucket that contains the data for the
external table, in the format
gs://bucket_name/[folder_name/]file_name.
You can limit the files selected from the bucket by specifying one asterisk (*)
wildcard character in the path. For example, gs://mybucket/file_name*. For more
information, see
Wildcard support for Cloud Storage URIs.
You can specify multiple buckets for the uris option by providing multiple
paths.
The following examples show valid uris values:
gs://bucket/path1/myfile.csv
gs://bucket/path1/*.csv
gs://bucket/path1/*,gs://bucket/path2/file00*
When you specify uris values that target multiple files, all of those
files must share a compatible schema.
STALENESS_INTERVAL: specifies whether
cached metadata is used by operations against the
table, and how fresh the cached metadata must be in order for
the operation to use it. For more information about metadata
caching considerations, see
Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an interval value between 30
minutes and 7 days, using the
Y-M D H:M:S format described in the
INTERVAL data type
documentation. For example, specify 0-0 0 4:0:0 for a 4
hour staleness interval.
With this value, operations against the table use cached metadata if
it has been refreshed within the past 4 hours. If the cached metadata
is older than that, the operation retrieves metadata from
Cloud Storage instead.
DEFINITION_FILE: the name of the table
definition file that you created or updated.
PROJECT_ID: the name of the project
that contains the table
DATASET: the name of the dataset that
contains the table
EXTERNAL_TABLE_NAME: the name of the table
Example
The following example updates mytable to use cached metadata as long as
it has been refreshed in the last 4.5 hours, and also to refresh cached
metadata automatically:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2026-05-27 UTC."],[],[]]