Refresh structured and unstructured data

This page describes how to refresh structured and unstructured data in a data store.

Refresh structured data

You can refresh the data in a structured data store as long as you use a schema that is the same or backward compatible with the schema in the data store. For example, adding only new fields to an existing schema is backward compatible.
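
The compatibility rule above can be pictured with a short check. This is a simplified sketch, not the service's actual validation: it assumes a schema can be modeled as a flat mapping of field name to type, and the helper name `is_backward_compatible` is hypothetical.

```python
def is_backward_compatible(old_schema: dict[str, str], new_schema: dict[str, str]) -> bool:
    """Return True if every existing field is still present with the same type.

    Assumption: fields that appear only in new_schema are allowed, because
    adding new fields is the backward-compatible case described above.
    """
    return all(
        new_schema.get(name) == field_type
        for name, field_type in old_schema.items()
    )


existing = {"title": "string", "price": "number"}
added_field = {"title": "string", "price": "number", "brand": "string"}
changed_type = {"title": "string", "price": "string"}

print(is_backward_compatible(existing, added_field))   # True
print(is_backward_compatible(existing, changed_type))  # False
```

Real schemas can be nested and may have additional rules, so treat this only as an illustration of "same or additive".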

You can refresh structured data in the Google Cloud console or using the API.

Console

To use the Google Cloud console to refresh structured data from a branch of a data store, follow these steps:

  1. In the Google Cloud console, go to the Gemini Enterprise page.

  2. In the navigation menu, click Data Stores.

  3. In the Name column, click the data store that you want to edit.

  4. On the Documents tab, click Import data.

  5. To refresh from Cloud Storage:

    1. In the Select a data source pane, select Cloud Storage.
    2. In the Import data from Cloud Storage pane, click Browse, select the bucket that contains your refreshed data, and then click Select. Alternatively, enter the bucket location directly in the gs:// field.
    3. Under Data Import Options, select an import option.
    4. Click Import.
  6. To refresh from BigQuery:

    1. In the Select a data source pane, select BigQuery.
    2. In the Import data from BigQuery pane, click Browse, select a table that contains your refreshed data, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.
    3. Under Data Import Options, select an import option.
    4. Click Import.

REST

Use the documents.import method to refresh your data, specifying the appropriate reconciliationMode value.
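
As a rough illustration of what a documents.import call needs, the following sketch assembles the URL and a BigQuery request body in Python without sending anything. The URL pattern and field names are taken from the curl commands on this page; the helper names `build_import_url` and `build_bigquery_body` and the sample IDs are hypothetical.

```python
import json

API_ROOT = "https://discoveryengine.googleapis.com/v1beta"
VALID_MODES = ("INCREMENTAL", "FULL")


def build_import_url(project_id: str, data_store_id: str) -> str:
    """Return the documents:import URL for branch 0 of a data store."""
    return (
        f"{API_ROOT}/projects/{project_id}/locations/global"
        f"/collections/default_collection/dataStores/{data_store_id}"
        "/branches/0/documents:import"
    )


def build_bigquery_body(
    project_id: str,
    dataset_id: str,
    table_id: str,
    data_schema: str,
    reconciliation_mode: str = "INCREMENTAL",
) -> dict:
    """Return a JSON-serializable request body for a BigQuery import."""
    if reconciliation_mode not in VALID_MODES:
        raise ValueError(f"reconciliation_mode must be one of {VALID_MODES}")
    return {
        "bigquerySource": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
            "dataSchema": data_schema,
        },
        "reconciliationMode": reconciliation_mode,
    }


# Placeholder values for illustration only.
url = build_import_url("my-project", "my-data-store")
body = build_bigquery_body("my-project", "my_dataset", "my_table", "custom", "FULL")
print(url)
print(json.dumps(body, indent=2))
```

Building the body as a dictionary and serializing it with `json.dumps` avoids hand-written JSON mistakes such as trailing commas.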

To refresh structured data from BigQuery or Cloud Storage using the command line, follow these steps:

  • Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the Gemini Enterprise page and in the navigation menu, click Data Stores.

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  • To import your structured data from BigQuery, call the following method. You can import from either BigQuery or Cloud Storage. To import from Cloud Storage, skip to the next step.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
    -d '{
      "bigquerySource": {
        "projectId": "PROJECT_ID",
        "datasetId":"DATASET_ID",
        "tableId": "TABLE_ID",
        "dataSchema": "DATA_SCHEMA_BQ"
      },
      "reconciliationMode": "RECONCILIATION_MODE",
      "autoGenerateIds": AUTO_GENERATE_IDS,
      "idField": "ID_FIELD",
      "errorConfig": {
        "gcsPrefix": "ERROR_DIRECTORY"
      }
    }'
    

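    Replace the placeholders with your own values. The placeholders that this request shares with the Cloud Storage request are described in the list after that request.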

    Here is the default BigQuery schema. Your BigQuery table must conform to this schema when you set dataSchema to document.

    [
     {
       "name": "id",
       "mode": "REQUIRED",
       "type": "STRING",
       "fields": []
     },
     {
       "name": "jsonData",
       "mode": "NULLABLE",
       "type": "STRING",
       "fields": []
     }
    ]
    
  • To import your structured data from Cloud Storage, call the following method. You can import from either BigQuery or Cloud Storage. To import from BigQuery, go to the previous step.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
    -d '{
      "gcsSource": {
        "inputUris": ["GCS_PATHS"],
        "dataSchema": "DATA_SCHEMA_GCS"
      },
      "reconciliationMode": "RECONCILIATION_MODE",
      "idField": "ID_FIELD",
      "errorConfig": {
        "gcsPrefix": "ERROR_DIRECTORY"
      }
    }'
    

    Replace the following:

  • PROJECT_ID: the ID of your project.
  • DATA_STORE_ID: the ID of the data store.
  • GCS_PATHS: a list of comma-separated URIs to the Cloud Storage locations that you want to import from. Each URI can be up to 2,000 characters long. A URI can match the full path of a storage object or a pattern that matches one or more objects. For example, gs://bucket/directory/*.json is a valid path.
  • DATA_SCHEMA_GCS: an optional field to specify the schema to use when parsing data from the Cloud Storage source. Possible values include document, custom, and csv.
  • ERROR_DIRECTORY: an optional field to specify a Cloud Storage directory for error information about the import, for example, gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty to let Gemini Enterprise automatically create a temporary directory.
  • RECONCILIATION_MODE: an optional field to specify how the imported documents are reconciled with the existing documents in the destination data store. Can have the following values:
    • INCREMENTAL: the default value. Causes an incremental refresh of data from the source to your data store. This performs an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID.
    • FULL: causes a full rebase of the documents in your data store. New and updated documents are added to your data store, and documents that are not in the source are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.
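
The difference between the two reconciliation modes can be sketched with plain dictionaries keyed by document ID. This is only a local model of the behavior described above, not how the service is implemented; the function name `reconcile` and the sample documents are hypothetical.

```python
def reconcile(existing: dict[str, dict], incoming: dict[str, dict], mode: str = "INCREMENTAL") -> dict[str, dict]:
    """Model the effect of an import on the documents in a data store."""
    if mode == "INCREMENTAL":
        # Upsert: keep existing documents, add new ones, and replace
        # documents whose ID also appears in the incoming data.
        merged = dict(existing)
        merged.update(incoming)
        return merged
    if mode == "FULL":
        # Rebase: the store ends up with exactly the incoming documents,
        # so anything missing from the source is removed.
        return dict(incoming)
    raise ValueError(f"Unknown reconciliation mode: {mode}")


store = {"a": {"title": "Old A"}, "b": {"title": "B"}}
source = {"a": {"title": "New A"}, "c": {"title": "C"}}

print(sorted(reconcile(store, source, "INCREMENTAL")))  # ['a', 'b', 'c']
print(sorted(reconcile(store, source, "FULL")))         # ['a', 'c']
```

In both modes document "a" ends up with the updated title; only FULL drops document "b", which no longer exists in the source.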

    Python

    Before trying this sample, follow the Python setup instructions in the Gemini Enterprise quickstart using client libraries. For more information, see the Gemini Enterprise Python API reference documentation.

    To authenticate to Gemini Enterprise, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

    
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine
    
    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = "YOUR_PROJECT_ID"
    # location = "YOUR_LOCATION" # Values: "global"
    # data_store_id = "YOUR_DATA_STORE_ID"
    # bigquery_dataset = "YOUR_BIGQUERY_DATASET"
    # bigquery_table = "YOUR_BIGQUERY_TABLE"
    
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    
    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    
    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        bigquery_source=discoveryengine.BigQuerySource(
            project_id=project_id,
            dataset_id=bigquery_dataset,
            table_id=bigquery_table,
            data_schema="custom",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
    
    # Make the request
    operation = client.import_documents(request=request)
    
    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()
    
    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)
    
    # Handle the response
    print(response)
    print(metadata)

Refresh unstructured data

    You can refresh unstructured data in the Google Cloud console or using the API.

    Console

    To use the Google Cloud console to refresh unstructured data from a branch of a data store, follow these steps:

    1. In the Google Cloud console, go to the Gemini Enterprise page.

    2. In the navigation menu, click Data Stores.

    3. In the Name column, click the data store that you want to edit.

    4. On the Documents tab, click Import data.

    5. To ingest from a Cloud Storage bucket (with or without metadata):

      1. In the Select a data source pane, select Cloud Storage.
      2. In the Import data from Cloud Storage pane, click Browse, select the bucket that contains your refreshed data, and then click Select. Alternatively, enter the bucket location directly in the gs:// field.
      3. Under Data Import Options, select an import option.
      4. Click Import.
    6. To ingest from BigQuery:

      1. In the Select a data source pane, select BigQuery.
      2. In the Import data from BigQuery pane, click Browse, select a table that contains your refreshed data, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.
      3. Under Data Import Options, select an import option.
      4. Click Import.

    REST

    To refresh unstructured data using the API, re-import it using the documents.import method, specifying the appropriate reconciliationMode value. For more information about importing unstructured data, see Unstructured data.
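
A minimal sketch of the request body for such a re-import from Cloud Storage follows. It assumes the `gcsSource` field names shown in the structured-data curl command on this page and the `content` schema value from the Python sample; the helper names and the sample URI are hypothetical, and the checks only mirror the URI constraints stated on this page (a gs:// path of at most 2,000 characters).

```python
import json

MAX_URI_LENGTH = 2000  # Per-URI length limit stated on this page.


def validate_gcs_uris(uris: list[str]) -> list[str]:
    """Check that each URI is a gs:// path within the length limit."""
    if not uris:
        raise ValueError("At least one Cloud Storage URI is required")
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"Not a Cloud Storage URI: {uri}")
        if len(uri) > MAX_URI_LENGTH:
            raise ValueError(f"URI is longer than {MAX_URI_LENGTH} characters")
    return list(uris)


def build_gcs_body(uris: list[str], data_schema: str = "content", reconciliation_mode: str = "INCREMENTAL") -> dict:
    """Return a JSON-serializable request body for a Cloud Storage import."""
    return {
        "gcsSource": {
            "inputUris": validate_gcs_uris(uris),
            "dataSchema": data_schema,
        },
        "reconciliationMode": reconciliation_mode,
    }


# Placeholder bucket and path for illustration only.
gcs_body = build_gcs_body(["gs://bucket/directory/*.pdf"], reconciliation_mode="FULL")
print(json.dumps(gcs_body, indent=2))
```

The resulting dictionary could be serialized and posted to the documents:import endpoint in the same way as the structured-data requests.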

    Python

    Before trying this sample, follow the Python setup instructions in the Gemini Enterprise quickstart using client libraries. For more information, see the Gemini Enterprise Python API reference documentation.

    To authenticate to Gemini Enterprise, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine
    
    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = "YOUR_PROJECT_ID"
    # location = "YOUR_LOCATION" # Values: "global"
    # data_store_id = "YOUR_DATA_STORE_ID"
    
    # Examples:
    # - Unstructured documents
    #   - `gs://bucket/directory/file.pdf`
    #   - `gs://bucket/directory/*.pdf`
    # - Unstructured documents with JSONL Metadata
    #   - `gs://bucket/directory/file.json`
    # - Unstructured documents with CSV Metadata
    #   - `gs://bucket/directory/file.csv`
    # gcs_uri = "YOUR_GCS_PATH"
    
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    
    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    
    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            # Multiple URIs are supported
            input_uris=[gcs_uri],
            # Options:
            # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
            # - `custom` - Unstructured documents with custom JSONL metadata
            # - `document` - Structured documents in the discoveryengine.Document format.
            # - `csv` - Unstructured documents with CSV metadata
            data_schema="content",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
    
    # Make the request
    operation = client.import_documents(request=request)
    
    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()
    
    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)
    
    # Handle the response
    print(response)
    print(metadata)