Configure Storage Insights datasets
You can configure Storage Insights datasets to collect and analyze
metadata and activity data from your Cloud Storage buckets and objects.
Use the insights generated from these datasets to help identify opportunities
for cost optimization, perform security audits, and support operational
monitoring. This document shows you how to configure Storage Insights datasets.
Before you begin
Before you configure a dataset,
complete the following steps.
Get the required roles
To get the permissions that
you need to configure datasets,
ask your administrator to grant you the
following IAM roles on your source projects:
These predefined roles contain the permissions required to configure datasets.
The exact permissions are listed in the following Required permissions section.
Required permissions
The following permissions are required to configure datasets:
Configure a dataset:
storageinsights.datasetConfigs.create
storage.buckets.getObjectInsights
Link to BigQuery dataset:
storageinsights.datasetConfigs.linkDataset
Create a dataset configuration
To create a dataset configuration, follow these
steps. For more information about the fields you can specify
for the dataset configuration, see Dataset configuration properties.
Console
In the Google Cloud console, go to the Storage Insights page of Cloud Storage.
In the Name your dataset section, enter a name for your dataset.
Optionally, enter a description for the dataset. Names identify dataset configurations and are
immutable. The name can contain up to 128 characters, including letters,
numbers, and underscores, and must start with a letter.
In the Define dataset scope section, do the following:
Select one of the following options:
To get storage metadata for all projects in the current
organization, select Include the organization.
To get storage metadata for all projects in the selected folders,
select Include folders (Sub-organization/departments).
For information about getting folder IDs, see
Viewing or listing folders and projects.
To add folders:
In the Folder 1 field, enter the folder ID.
Optionally, to add multiple folder IDs, click
+ Add another folder.
To get storage metadata for the selected projects, select
Include projects by providing project numbers. To learn how
to find project numbers, see Find the project name, number,
and ID. To add projects, do the following:
In the Project 1 field, enter the project number.
Optionally, to add multiple project numbers, click
+ Add another project.
To add projects or folders in bulk, select
Upload a list of projects/folders via CSV file. The CSV file
must contain the project numbers or folder IDs to
include in the dataset. You can specify up to 10,000 projects
or folders in one dataset configuration.
Specify whether to automatically include future buckets in the
selected resource.
Optionally, to specify filters on buckets based
on regions and bucket prefixes, expand the Filters (optional)
section. Filters are applied additively on buckets.
You can include or exclude buckets from specific regions. For example,
you can exclude buckets in the me-central1 and me-central2
regions. You can also include or exclude buckets by prefix.
For example,
to exclude buckets that start with my-bucket, enter the
my-bucket* prefix.
Click Continue.
In the Select retention period section, select a retention period
for the data in the dataset.
Activity data is included in the dataset by default, and inherits the retention
period of the dataset. To override the dataset retention period, select Specify a
retention period for activity data, and then select the number of days to retain activity data for.
To disable activity data, set the retention period to 0 days.
In the Select location to store configured dataset section, select
a location to store the dataset. For example,
us-central1.
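The naming rules stated in the steps above (up to 128 characters; letters, numbers, and underscores; must start with a letter) can be checked before you submit a name. The following is a minimal sketch, not part of any Google Cloud library. The function name is hypothetical, and restricting letters to ASCII is an assumption, because the text does not say whether other alphabets are accepted.

```python
import re

# Rules from the text: 1 to 128 characters, only letters, numbers, and
# underscores, and the first character must be a letter.
# Assumption: "letters" means ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_dataset_config_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my_dataset", "1dataset", "a" * 129]:
        print(candidate[:16], is_valid_dataset_config_name(candidate))
```

Because names are immutable, a local check like this can help catch a rejected name before the configuration is created.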
Replace:
DATASET_CONFIG_ID with the name for your
dataset configuration. Names identify dataset configurations and are
immutable. The name can contain up to 128 characters, including letters,
numbers, and underscores, and must start with a letter.
LOCATION with the location
to store the dataset. For example,
us-central1.
DATASET_RETENTION_PERIOD_DAYS with the retention period
for the data in the dataset.
SCOPE_FLAG with any one
of the following flags that defines the scope of the data to
collect:
--enable-organization-scope: Enables the dataset to
collect insights from all buckets within the organization.
--source-folders=[SOURCE_FOLDER_NUMBERS,...]:
Specifies a list of folder numbers to include in the dataset.
To learn how to find a folder number, see Listing all projects
and folders in your hierarchy.
--source-folders-file=FILE_PATH:
Specifies multiple folder numbers by uploading a CSV file to a
bucket.
--source-projects=[SOURCE_PROJECT_NUMBERS,...]:
Specifies a list of project numbers to include in the dataset.
For example, 464036093014. To find your project
number, see Find the project name, number, and ID.
--source-projects-file=FILE_PATH:
Specifies multiple project numbers by uploading a CSV file to a
bucket.
Optionally, use the following additional flags to configure
the dataset:
Use --include-buckets=BUCKET_NAMES_OR_REGEX
to include specific buckets by name or regular expression. You can't use this flag with --exclude-buckets.
Use --exclude-buckets=BUCKET_NAMES_OR_REGEX
to exclude specific buckets by name or regular expression. You can't use this flag with --include-buckets.
Use --project=DESTINATION_PROJECT_ID to specify
a project for storing your dataset configuration and generated
dataset. If you don't use this flag, the destination project is your
active project. For more information about project IDs, see
Creating and managing projects.
Use --auto-add-new-buckets to automatically include any
buckets added to source projects in the future.
Use --skip-verification to skip checks and failures from
the verification process, which includes checks for required
IAM permissions. If you use this flag, some or all buckets might
be excluded from the dataset.
Use --identity=IDENTITY_TYPE to specify the
scope of the service agent created with the dataset
configuration. Values are IDENTITY_TYPE_PER_CONFIG or
IDENTITY_TYPE_PER_PROJECT. If unspecified, the default is
IDENTITY_TYPE_PER_CONFIG. For details, see Service agent type.
Use --description=DESCRIPTION to add
a description for the dataset configuration.
Use
--activity-data-retention-period-days=ACTIVITY_RETENTION_PERIOD_DAYS
to specify the retention period for the activity data in the
dataset. By default, activity data is included in the dataset, and
inherits the retention
period of the dataset. To override the dataset retention period,
specify the number of days to retain activity data for. To exclude
activity data, set the
ACTIVITY_RETENTION_PERIOD_DAYS to 0.
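The activity data retention rules described above (inherit the dataset retention period by default, override it with an explicit number of days, or exclude activity data with 0) can be summarized in a few lines. This is an illustrative sketch only. The function name is hypothetical, and rejecting negative values is an assumption that the text does not state.

```python
from typing import Optional


def effective_activity_retention_days(
    dataset_retention_days: int,
    activity_retention_days: Optional[int] = None,
) -> Optional[int]:
    """Return the number of days activity data is kept, or None if excluded."""
    if activity_retention_days is None:
        # Flag not set: activity data inherits the dataset retention period.
        return dataset_retention_days
    if activity_retention_days < 0:
        # Assumption: negative retention periods are not meaningful.
        raise ValueError("retention period must not be negative")
    if activity_retention_days == 0:
        # A value of 0 excludes activity data from the dataset.
        return None
    # Any other value overrides the dataset retention period.
    return activity_retention_days


if __name__ == "__main__":
    print(effective_activity_retention_days(30))
    print(effective_activity_retention_days(30, 7))
    print(effective_activity_retention_days(30, 0))
```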
The following example creates a dataset configuration named
my-dataset in the us-central1 region, for the organization with
the ID 123456789, with a retention period of 30 days, and a
scope limited to the projects 987654321 and 123123123:
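As a rough sketch, the argument list for this example might be assembled as follows. The --source-projects flag comes from the list above. The command group (gcloud storage insights dataset-configs create) and the --location, --organization, and --retention-period-days flag names do not appear in this document and are assumptions here, so check them against the gcloud reference before use. The helper function is hypothetical.

```python
import shlex
from typing import List, Sequence


def build_create_command(
    config_id: str,
    location: str,
    organization: str,
    retention_days: int,
    source_projects: Sequence[str],
) -> List[str]:
    """Build an argument list for creating a dataset configuration."""
    return [
        # Assumed command group; not stated in this document.
        "gcloud", "storage", "insights", "dataset-configs", "create",
        config_id,
        # Assumed flag names for location, organization, and retention.
        f"--location={location}",
        f"--organization={organization}",
        f"--retention-period-days={retention_days}",
        # Scope flag documented above: comma-separated project numbers.
        "--source-projects=" + ",".join(source_projects),
    ]


command = build_create_command(
    "my-dataset", "us-central1", "123456789", 30, ["987654321", "123123123"]
)

if __name__ == "__main__":
    # Print a shell-quoted form for inspection instead of running it.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if the list is later passed to subprocess.run.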
Alternatively, you can add an organization, or one or more folders
that contain buckets and objects for which you want to update the metadata.
To include folders or organizations, use the
sourceFolders or organizationScope fields.
For more information, see the DatasetConfig
reference.
RETENTION_PERIOD_DAYS with the
number of days of data to capture in the dataset snapshot. For
example, 90.
ACTIVITY_DATA_RETENTION_PERIOD_DAYS with
the number of days of activity data to capture in the
dataset snapshot. By default, activity data is included in the
dataset, and inherits the retention
period of the dataset. To override the dataset retention period,
specify the number of days to retain activity data for. To exclude
activity data, set ACTIVITY_DATA_RETENTION_PERIOD_DAYS to
0.
IDENTITY_TYPE with the type of service
agent that gets created alongside the dataset configuration.
Values are IDENTITY_TYPE_PER_CONFIG or
IDENTITY_TYPE_PER_PROJECT. For details, see Service agent type.
JSON_FILE_NAME with the path to the JSON
file you created in the previous step. Alternatively, you can
pass an instance of DatasetConfig in the request body.
PROJECT_ID with the
ID of the project that the dataset configuration and dataset
will belong to.
LOCATION with the location
where the dataset and dataset configuration will reside. For example,
us-central1.
DATASET_CONFIG_ID with the name of your
dataset configuration. Names identify dataset configurations and are
immutable. The name can contain up to 128 characters, including letters,
numbers, and underscores, and must start with a letter.
SERVICE_ACCOUNT with the service account. For example, test-service-account@test-project.iam.gserviceaccount.com.
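The JSON file described above can also be generated programmatically. The following is a minimal sketch under stated assumptions: the field names sourceProjects, projectNumbers, retentionPeriodDays, activityDataRetentionPeriodDays, and identity are assumed to follow the camelCase style of the sourceFolders and organizationScope fields mentioned above, and should be confirmed against the DatasetConfig reference. The helper function is hypothetical.

```python
import json
from typing import Iterable, Optional

# The two identity types named in this document.
VALID_IDENTITY_TYPES = {"IDENTITY_TYPE_PER_CONFIG", "IDENTITY_TYPE_PER_PROJECT"}


def build_dataset_config_body(
    project_numbers: Iterable[str],
    retention_period_days: int,
    identity_type: str = "IDENTITY_TYPE_PER_CONFIG",
    activity_data_retention_period_days: Optional[int] = None,
) -> dict:
    """Build a request body for a dataset configuration (assumed field names)."""
    if identity_type not in VALID_IDENTITY_TYPES:
        raise ValueError(f"unsupported identity type: {identity_type}")
    body = {
        # Assumed field names; verify against the DatasetConfig reference.
        "sourceProjects": {"projectNumbers": [str(n) for n in project_numbers]},
        "retentionPeriodDays": retention_period_days,
        "identity": {"type": identity_type},
    }
    if activity_data_retention_period_days is not None:
        # Omitting this field leaves activity data on the dataset's period.
        body["activityDataRetentionPeriodDays"] = activity_data_retention_period_days
    return body


if __name__ == "__main__":
    example = build_dataset_config_body(["987654321", "123123123"], 90)
    print(json.dumps(example, indent=2))
```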
To enable Storage Insights to generate and write datasets, ask your
administrator to grant the service agent the Storage Insights Collector
Service role (roles/storage.insightsCollectorService) on the organization
that contains the source projects.
If you use configuration-scoped service agents, you must grant this role to the
service agent of every dataset configuration from which you want data. If you use
a project-scoped service agent, you grant this role only once, and the service agent
can then read and write datasets for all dataset configurations within the project.
For instructions about granting roles for projects, see Manage access.
Link a dataset
To link a dataset to BigQuery, complete the following steps:
Console
In the Google Cloud console, go to the Storage Insights page of Cloud Storage.
Replace:
DESTINATION_PROJECT_ID with the ID of the
project that contains the dataset configuration. For more information
about project IDs, see Creating and managing projects.
DATASET_CONFIG_ID with the name of the
dataset configuration that generated the dataset to link.
LOCATION with the location of your
dataset and dataset configuration. For example, us-central1.
JSON_FILE_NAME with the path to the
JSON file you created.
PROJECT_ID with the
ID of the project to which the dataset configuration belongs.
LOCATION with the location where
the dataset and dataset configuration reside. For example,
us-central1.
DATASET_CONFIG_ID with the name
of the dataset configuration that generated the dataset to link.
SERVICE_ACCOUNT with the service account. For example, test-service-account@test-project.iam.gserviceaccount.com.
Analyze object data and metadata using BigQuery
To analyze object content or view object metadata, use the ref column, which is returned as part of a Storage Insights dataset, to run BigQuery ObjectRef
functions. Complete the steps in the following sections.
Create a Cloud resource connection in BigQuery
In BigQuery, create a Cloud resource connection that accesses Cloud Storage.
The Cloud resource connection lets BigQuery access Cloud Storage object data
and metadata using its own service account. For details, see Create a Cloud resource connection.
Use the Cloud resource connection with a Storage Insights dataset
To analyze data referenced in the ref column that is returned as part of a Storage Insights dataset, use the OBJ.MAKE_REF function to combine the URI from the ref column
with the connection that you created:
CONNECTION_ID: the ID of the Cloud resource connection
you created.
PROJECT_ID: the ID of the project that contains the
Storage Insights dataset.
INSIGHTS_DATASET: the name of the Storage Insights dataset.
For example, storageinsights_dataset.
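One way to fill in these placeholders is to generate the query text from a small helper. The sketch below composes a query that calls OBJ.MAKE_REF on ref.uri against object_attributes_view, the view used by the other queries in this document. The helper name, the object_ref alias, the LIMIT clause, and the example values are assumptions made for illustration.

```python
def build_make_ref_query(
    project_id: str,
    insights_dataset: str,
    connection_id: str,
    limit: int = 10,
) -> str:
    """Return a query that pairs each object URI with the connection."""
    # Values are interpolated directly, so pass only trusted identifiers.
    return (
        "SELECT\n"
        "  name,\n"
        f"  OBJ.MAKE_REF(ref.uri, '{connection_id}') AS object_ref\n"
        f"FROM `{project_id}.{insights_dataset}.object_attributes_view`\n"
        f"LIMIT {int(limit)};"
    )


if __name__ == "__main__":
    # Hypothetical project, dataset, and connection names.
    print(build_make_ref_query("my-project", "storageinsights_dataset", "us.my_connection"))
```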
Analyze a Storage Insights dataset using a custom model
BigQuery doesn't support creating models directly within a linked
dataset. To analyze your Storage Insights data with a custom model, you
must create and store the model in a standard BigQuery dataset. You can then
reference that model in your queries while targeting the linked dataset for
analysis:
Create a model in a BigQuery dataset (MODEL_DATASET)
in the same project as your linked dataset (INSIGHTS_DATASET):
MODEL_DATASET: the name of the dataset where you want to
create the model.
CONNECTION_ID: the ID of the Cloud resource connection
you created.
Run a query that references the model to analyze the data in your Storage Insights dataset.
The following example adds a description for images in the dataset:
SELECT
  name,
  result AS ai_description
FROM
  AI.GENERATE_TEXT(
    MODEL `MODEL_DATASET.gemini_model`,
    (
      SELECT
        name,
        (
          'Describe this image',
          OBJ.GET_ACCESS_URL(OBJ.FETCH_METADATA(OBJ.MAKE_REF(ref.uri, 'CONNECTION_ID')), 'r')
        ) AS prompt
      FROM `INSIGHTS_DATASET.object_attributes_view`
      WHERE contentType LIKE 'image/%'
        AND NOT name LIKE '%/'
      LIMIT 3
    )
  );
Replace:
INSIGHTS_DATASET: the name of the Storage Insights dataset.
MODEL_DATASET: the name of the dataset where you created the model.
CONNECTION_ID: the ID of the Cloud resource connection
you created.
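The model creation statement for the first step above might look like the output of the following helper. This is a sketch only: it assumes BigQuery's CREATE MODEL syntax for remote models with a connection, uses the gemini_model name referenced by the query above, and takes the endpoint as a required argument because this document does not name one. The helper and the example values are hypothetical.

```python
def build_create_model_statement(
    model_dataset: str,
    connection_id: str,
    endpoint: str,
    model_name: str = "gemini_model",
) -> str:
    """Return a CREATE MODEL statement for a remote model (assumed syntax)."""
    # The model lives in a standard dataset, not in the linked dataset.
    return (
        f"CREATE OR REPLACE MODEL `{model_dataset}.{model_name}`\n"
        f"  REMOTE WITH CONNECTION `{connection_id}`\n"
        f"  OPTIONS (ENDPOINT = '{endpoint}');"
    )


if __name__ == "__main__":
    # Hypothetical dataset, connection, and endpoint values.
    print(build_create_model_statement("my_models", "us.my_connection", "gemini-2.5-flash"))
```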
Analyze a Storage Insights dataset using a default model
You can use a default model to generate insights from unstructured data and help detect sensitive
information.
Generate insights from unstructured data
The following example query generates descriptions for JPEG images:
SELECT
  AI.GENERATE(
    (
      'Return a JSON object with fields: "description" (max 20 words)',
      OBJ.GET_ACCESS_URL(OBJ.MAKE_REF(ref.uri, 'CONNECTION_ID'), 'r')
    )
  )
FROM `PROJECT_ID.INSIGHTS_DATASET.object_attributes_view`
WHERE name LIKE 'returns/electronics/%'
  AND contentType = 'image/jpeg';
Replace:
CONNECTION_ID: the ID of the Cloud resource connection
you created.
PROJECT_ID: the ID of the project that contains the
Storage Insights dataset.
INSIGHTS_DATASET: the name of the Storage Insights dataset.
Automated sensitive data detection
You can use multimodal models to help detect sensitive data, such as
personally identifiable information (PII), in your documents.
The following query shows how you can scan PDF documents to check
for sensitive information:
SELECT
  AI.GENERATE(
    (
      'Does this document contain any credit card numbers or home addresses? Answer "SAFE" or "SENSITIVE".',
      OBJ.GET_ACCESS_URL(OBJ.MAKE_REF(ref.uri, 'CONNECTION_ID'), 'r')
    )
  )
FROM `PROJECT_ID.INSIGHTS_DATASET.object_attributes_view`
WHERE contentType = 'application/pdf';
Replace:
CONNECTION_ID: the ID of the Cloud resource connection
you created.
PROJECT_ID: the ID of the project that contains the
Storage Insights dataset.
INSIGHTS_DATASET: the name of the Storage Insights dataset.
Last updated 2026-06-10 UTC.