Configure Storage Insights datasets

You can configure Storage Insights datasets to collect and analyze metadata and activity data from your Cloud Storage buckets and objects. Use the insights generated from these datasets to help identify opportunities for cost optimization, perform security audits, and support operational monitoring. This document shows you how to configure Storage Insights datasets.

Before you begin

Before you configure a dataset, complete the following steps.

Get the required roles

To get the permissions that you need to configure datasets, ask your administrator to grant you the following IAM roles on your source projects:

- To configure a dataset: Storage Insights Admin (roles/storageinsights.admin)
- To link a dataset:
  - Storage Insights Analyst (roles/storageinsights.analyst)
  - BigQuery Admin (roles/bigquery.admin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to configure datasets. The exact permissions that are required are listed in the following Required permissions section.

Required permissions

The following permissions are required to configure datasets:

- To configure a dataset:
  - storageinsights.datasetConfigs.create
  - storage.buckets.getObjectInsights
- To link a dataset to BigQuery: storageinsights.datasetConfigs.linkDataset

You might also be able to get these permissions with custom roles or other predefined roles.

Enable the Storage Insights API

Console

Enable the storageinsights.googleapis.com API

Command line

To enable the Storage Insights API in your current project, run the gcloud services enable command:

gcloud services enable storageinsights.googleapis.com

For more information about enabling services for a Google Cloud project, see Enabling and disabling services.
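
After enabling the API, you might want to confirm that it is active before you create a dataset configuration. The following is a minimal sketch, not an official procedure: `api_enabled` is a hypothetical helper name, and the filter and format expressions assume the standard output fields of `gcloud services list`.

```shell
# Hypothetical helper: succeeds only if the Storage Insights API
# appears in the list of enabled services for the current project.
api_enabled() {
  gcloud services list --enabled \
    --filter="config.name=storageinsights.googleapis.com" \
    --format="value(config.name)" |
    grep -qx "storageinsights.googleapis.com"
}

# Example usage (requires an authenticated gcloud installation):
#   if api_enabled; then echo "enabled"; else echo "not enabled"; fi
```

The helper returns a shell exit status rather than printing anything, so it can gate later steps in a setup script.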

Configure Storage Intelligence

Ensure that Storage Intelligence is configured for the project, folder, or organization that you want to analyze with datasets.

Create a dataset configuration

To create a dataset configuration, follow these steps. For more information about the fields you can specify for the dataset configuration, see Dataset configuration properties.

Console

1. In the Google Cloud console, go to the Cloud Storage Storage Insights page.
2. Click Configure dataset.
3. In the Name your dataset section, enter a name for your dataset. Optionally, enter a description for the dataset. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.
4. In the Define dataset scope section, do the following:
   - Select one of the following options:
     - To get storage metadata for all projects in the current organization, select Include the organization.
     - To get storage metadata for all projects in the selected folders, select Include folders (Sub-organization/departments). For information about getting folder IDs, see Viewing or listing folders and projects. To add folders:
       1. In the Folder 1 field, enter the folder ID.
       2. Optionally, to add multiple folder IDs, click + Add another folder.
     - To get storage metadata for the selected projects, select Include projects by providing project numbers. To learn how to find project numbers, see Find the project name, number, and ID. To add projects, do the following:
       1. In the Project 1 field, enter the project number.
       2. Optionally, to add multiple project numbers, click + Add another project.
     - To add projects or folders in bulk, select Upload a list of projects/folders via CSV file. The CSV file must contain the project numbers or folder IDs to include in the dataset. You can specify up to 10,000 projects or folders in one dataset configuration.
   - Specify whether to automatically include future buckets in the selected resource.
   - Optionally, to specify filters on buckets based on regions and bucket prefixes, expand the Filters (optional) section. Filters are applied additively on buckets.

     You can include or exclude buckets from specific regions. For example, you can exclude buckets in the me-central1 and me-central2 regions. You can also include or exclude buckets by prefix. For example, to exclude buckets that start with my-bucket, enter the my-bucket* prefix.
5. Click Continue.
6. In the Select retention period section, select a retention period for the data in the dataset.

   Activity data is included in the dataset by default and inherits the retention period of the dataset. To override the dataset retention period, select Specify a retention period for activity data, and then select the number of days to retain activity data for. To disable activity data, set the retention period to 0 days.
7. In the Select location to store configured dataset section, select a location to store the dataset. For example, us-central1.
8. In the Select service account type section, select either a configuration-scoped or a project-scoped service agent for your dataset.
9. Click Configure.

Command line

To create a dataset configuration, run the gcloud storage insights dataset-configs create command with the required flags:
```
gcloud storage insights dataset-configs create DATASET_CONFIG_ID \
  --location=LOCATION \
  --organization=SOURCE_ORG_NUMBER \
  --retention-period-days=DATASET_RETENTION_PERIOD_DAYS \
  SCOPE_FLAG
```
Replace:
- DATASET_CONFIG_ID with the name for your dataset configuration. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.
- LOCATION with the location to store the dataset. For example, us-central1.
- SOURCE_ORG_NUMBER with the ID of the organization to which the source projects belong. To find your organization ID, see Getting your organization resource ID.
- DATASET_RETENTION_PERIOD_DAYS with the retention period for the data in the dataset.
- SCOPE_FLAG with any one of the following flags that defines the scope of the data to collect:
  - --enable-organization-scope: Enables the dataset to collect insights from all buckets within the organization.
  - --source-folders=[SOURCE_FOLDER_NUMBERS,...]: Specifies a list of folder numbers to include in the dataset. To learn how to find a folder number, see Listing all projects and folders in your hierarchy.
  - --source-folders-file=FILE_PATH: Specifies multiple folder numbers by uploading a CSV file to a bucket.
  - --source-projects=[SOURCE_PROJECT_NUMBERS,...]: Specifies a list of project numbers to include in the dataset. For example, 464036093014. To find your project number, see Find the project name, number, and ID.
  - --source-projects-file=FILE_PATH: Specifies multiple project numbers by uploading a CSV file to a bucket.
Optionally, use the following additional flags to configure the dataset:
- Use --include-buckets=BUCKET_NAMES_OR_REGEX to include specific buckets by name or regular expression. You can't use this flag with --exclude-buckets.
- Use --exclude-buckets=BUCKET_NAMES_OR_REGEX to exclude specific buckets by name or regular expression. You can't use this flag with --include-buckets.
- Use --project=DESTINATION_PROJECT_ID to specify a project for storing your dataset configuration and generated dataset. If you don't use this flag, the destination project is your active project. For more information about project IDs, see Creating and managing projects.
- Use --auto-add-new-buckets to automatically include any buckets added to source projects in the future.
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If you use this flag, some or all buckets might be excluded from the dataset.
- Use --identity=IDENTITY_TYPE to specify the scope of the service agent created with the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. If unspecified, the default is IDENTITY_TYPE_PER_CONFIG. For details, see Service agent type.
- Use --description=DESCRIPTION to add a description for the dataset configuration.
- Use --activity-data-retention-period-days=ACTIVITY_RETENTION_PERIOD_DAYS to specify the retention period for the activity data in the dataset. By default, activity data is included in the dataset, and inherits the retention period of the dataset. To override the dataset retention period, specify the number of days to retain activity data for. To exclude activity data, set the ACTIVITY_RETENTION_PERIOD_DAYS to 0.
The following example creates a dataset configuration named my_dataset in the us-central1 region, for the organization with the ID 123456789, with a retention period of 30 days, and a scope limited to the projects 987654321 and 123123123:
```
gcloud storage insights dataset-configs create my_dataset \
  --location=us-central1 \
  --organization=123456789 \
  --retention-period-days=30 \
  --source-projects=987654321,123123123
```
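
Because dataset configuration names are immutable, it can help to check a name against the naming rules before you run the create command. The following is a minimal sketch under that assumption: `validate_config_name` and `build_create_command` are hypothetical helper names, and the second helper only prints the command for review instead of running it.

```shell
# Hypothetical helper: succeeds if the name has 1 to 128 characters,
# starts with a letter, and contains only letters, numbers, and underscores.
validate_config_name() {
  [ "${#1}" -ge 1 ] && [ "${#1}" -le 128 ] || return 1
  case $1 in
    [A-Za-z]*) ;;          # first character is a letter
    *) return 1 ;;
  esac
  case $1 in
    *[!A-Za-z0-9_]*) return 1 ;;   # reject any other character
  esac
  return 0
}

# Hypothetical helper: prints the create command (a dry run).
# Arguments: name, location, organization ID, retention days, project numbers.
build_create_command() {
  validate_config_name "$1" || { echo "invalid name: $1" >&2; return 1; }
  printf 'gcloud storage insights dataset-configs create %s --location=%s --organization=%s --retention-period-days=%s --source-projects=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

build_create_command my_dataset us-central1 123456789 30 987654321,123123123
```

Printing the command first keeps the check separate from the API call, so a rejected name never reaches gcloud.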

Grant the required permissions to the service agent

To enable Storage Insights to generate and write datasets, ask your administrator to grant the service agent the Storage Insights Collector Service role (roles/storage.insightsCollectorService) on the organization that contains the source projects. You must grant this role to every configuration-scoped service agent created for each dataset configuration from which you want data. If you use a project-scoped service agent, you must grant this role only once on the service agent to read and write datasets for all dataset configurations within the project.
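
With configuration-scoped service agents, the same role binding has to be repeated for each agent. The following sketch prints one `gcloud organizations add-iam-policy-binding` command per agent for review. `print_grant_commands` is a hypothetical helper name, and SERVICE_AGENT_EMAIL is a placeholder for the service agent email shown in your dataset configuration.

```shell
# Hypothetical helper: prints one role-binding command per service agent.
# Arguments: organization ID, followed by one or more service agent emails.
print_grant_commands() {
  org_id=$1
  shift
  for agent in "$@"; do
    printf 'gcloud organizations add-iam-policy-binding %s --member=serviceAccount:%s --role=roles/storage.insightsCollectorService\n' \
      "$org_id" "$agent"
  done
}

print_grant_commands 123456789 SERVICE_AGENT_EMAIL
```

With a project-scoped service agent, a single email is passed, which matches the one-time grant described above.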

Link a dataset
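
As a rough sketch of the command-line path, the following helper prints a link command for an existing dataset configuration. It assumes that your gcloud version provides the `create-link` subcommand under `gcloud storage insights dataset-configs`. Confirm this with `gcloud storage insights dataset-configs --help` before relying on it. `print_link_command` is a hypothetical helper name.

```shell
# Hypothetical helper: prints the link command for review (a dry run).
# Arguments: dataset configuration name, location.
# Assumption: the create-link subcommand exists in your gcloud version.
print_link_command() {
  printf 'gcloud storage insights dataset-configs create-link %s --location=%s\n' \
    "$1" "$2"
}

print_link_command my_dataset us-central1
```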

Analyze object data and metadata using BigQuery

To analyze object content or view object metadata, use the ref column, which is returned as part of a Storage Insights dataset, to run BigQuery ObjectRef functions. Complete the steps in the following sections.

Create a Cloud resource connection in BigQuery

In BigQuery, create a Cloud resource connection that accesses Cloud Storage. The Cloud resource connection lets BigQuery access Cloud Storage object data and metadata using its own service account. For details, see Create a Cloud resource connection.

Use the Cloud resource connection with Storage Insights dataset

To analyze data referenced in the ref column that is returned as part of a Storage Insights dataset, use the OBJ.MAKE_REF function to combine the URI from the ref column with the connection that you created:
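
One way this could look is sketched below. The helper prints a GoogleSQL query that passes the URI from the ref column and a connection ID to OBJ.MAKE_REF. Several parts are assumptions rather than details from this document: `make_ref_query` is a hypothetical helper name, the URI is assumed to be reachable as `ref.uri`, the table name is a placeholder, and the connection ID is assumed to use the LOCATION.CONNECTION_NAME form.

```shell
# Hypothetical helper: prints a query that builds an ObjectRef per row.
# Arguments:
#   $1: fully qualified table or view in the linked dataset (placeholder)
#   $2: Cloud resource connection ID (assumed form: LOCATION.CONNECTION_NAME)
make_ref_query() {
  cat <<EOF
SELECT
  OBJ.MAKE_REF(ref.uri, '$2') AS object_ref
FROM \`$1\`
LIMIT 10;
EOF
}

make_ref_query PROJECT_ID.LINKED_DATASET.TABLE_NAME us.my_connection

# To run the printed query (requires the bq command-line tool):
#   make_ref_query PROJECT_ID.LINKED_DATASET.TABLE_NAME us.my_connection |
#     bq query --use_legacy_sql=false
```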

Analyze a Storage Insights dataset using a custom model

BigQuery doesn't support creating models directly within a linked dataset. To analyze your Storage Insights data with a custom model, you must create and store the model in a standard BigQuery dataset. You can then reference that model in your queries while targeting the linked dataset for analysis:

Analyze a Storage Insights dataset using a default model

You can use a default model to generate insights from unstructured data and help detect sensitive information.

Generate insights from unstructured data

Automated sensitive data detection

You can use multimodal models to help detect sensitive data, such as personally identifiable information (PII), in your documents.

The following query shows how you can scan PDF documents to check for sensitive information:
