Parse and chunk documents

This page describes how to use Agent Search to parse and chunk your documents.

You can configure parsing or chunking settings in order to:

Specify how Agent Search parses content. You can specify how to parse unstructured content when you upload it to Agent Search. Agent Search provides a digital parser, an OCR parser for PDFs, and a layout parser. You can also bring your own parsed documents. The layout parser is recommended when your documents contain rich content and structural elements, such as sections, paragraphs, tables, images, and lists, that need to be extracted for search and answer generation. See Improve content detection with parsing.

Use Agent Search for retrieval-augmented generation (RAG). Improve the output of LLMs with relevant data that you've uploaded to your Agent Search app. To do this, you turn on document chunking, which indexes your data as chunks to improve relevance and decrease computational load for LLMs. You also turn on the layout parser, which detects document elements such as headings and lists, to improve how documents are chunked. For information about chunking for RAG and how to return chunks in search requests, see Chunk documents for RAG.

Parse documents

You can control content parsing in the following ways:

Specify parser type. You can specify the type of parsing to apply depending on file type:
- Digital parser. The digital parser is on by default for all file types unless a different parser type is specified. The digital parser processes ingested documents if no other default parser is specified for the data store or if the specified parser doesn't support the file type of an ingested document.
- OCR parser for PDFs. This parser can be used as a lower-cost alternative to the layout parser if you are uploading scanned PDFs or PDFs with text inside images. See the OCR parser for PDFs section of this document.
- Layout parser. If you plan to use Agent Search for RAG, turn on the layout parser for HTML, PDF, DOCX, PPTX, and XLSX files. See Chunk documents for RAG for information about this parser and how to turn it on. The layout parser can perform optical character recognition on images and scanned documents.
Bring your own parsed document. (Preview with allowlist) If you've already parsed your unstructured documents, you can import that pre-parsed content into Agent Search. See Bring your own parsed document.

Parser availability comparison

The following table lists the availability of each parser by document file types and shows which elements each parser can detect and parse.

File type	Digital parser	OCR parser for PDFs	Layout parser
HTML	Detects paragraph elements	Not applicable	Detects paragraph, table, image, list, title, and heading elements
PDF	Detects paragraph (digital text) elements	Detects paragraph elements	Detects paragraph, table, image, title, and heading elements
DOCX	Detects paragraph elements	Not applicable	Detects paragraph, table, image, list, title, and heading elements
PPTX	Detects paragraph elements	Not applicable	Detects paragraph, table, image, list, title, and heading elements
TXT	Detects paragraph elements	Not applicable	Not applicable
XLSX	Detects paragraph elements	Not applicable	Detects paragraph, table, title, and heading elements
XLSM	Detects paragraph elements	Not applicable	Detects paragraph, table, title, and heading elements

The maximum file size of an unstructured document that you can import is the same for all three parsers. See Prepare data for ingesting.

Digital parser

The digital parser extracts machine-readable text from documents. It detects text blocks, but not document elements such as tables, lists, and headings.

The digital parser is used as the default if you don't specify a different parser as the default during data store creation. Also, the digital parser is used if the specified parser doesn't support a file type that's being uploaded—for example, if the layout parser is specified, the digital parser is used to parse TXT files because the layout parser doesn't support text files.
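
The fallback rule can be pictured as a small lookup. The following Python sketch is illustrative only: `resolve_parser` and the parser labels are hypothetical names, not part of Agent Search, and the supported file types are copied from the availability table above.

```python
# File types each parser supports, copied from the availability table.
LAYOUT_TYPES = {"html", "pdf", "docx", "pptx", "xlsx", "xlsm"}
OCR_TYPES = {"pdf"}


def supports(parser: str, file_type: str) -> bool:
    """Return True if the named parser is available for the file type."""
    if parser == "layout":
        return file_type in LAYOUT_TYPES
    if parser == "ocr":
        return file_type in OCR_TYPES
    # The digital parser is listed for every file type in the table.
    return parser == "digital"


def resolve_parser(file_type: str, default: str = "digital") -> str:
    """Hypothetical helper: which parser handles a file of this type."""
    file_type = file_type.lower()
    # Fall back to the digital parser when the default can't handle the type.
    return default if supports(default, file_type) else "digital"


print(resolve_parser("txt", default="layout"))  # digital
print(resolve_parser("pdf", default="layout"))  # layout
```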

OCR parser for PDFs

For scanned PDFs and PDFs where the text is part of an image, such as screenshots that contain text, the OCR parser for PDFs can be a good choice. If you have PDFs that contain both non-searchable text (such as scanned text or infographics) and machine-readable text, you can set the field useNativeText to true when specifying the OCR parser. In this case, machine-readable text is merged with OCR parsing outputs to improve text extraction quality.

If your PDFs have a complex hierarchy or contain visual or table components, the layout parser might give better results.

If you have searchable PDFs or other digital formats that are mostly composed of machine-readable text, you typically don't need to use the OCR parser for PDFs.

OCR processing features are available for custom search apps with unstructured data stores. Because the OCR parser only applies to PDF files, only PDF files that are ingested are processed by the OCR parser; other file types are processed by the digital parser.

The OCR parser processes only the first 500 pages of a PDF file. Pages beyond that limit aren't processed.

Layout parser

Layout parsing lets Agent Search detect layouts for PDF, HTML, DOCX, PPTX, XLSX, and XLSM files. Agent Search can then identify content elements like text blocks, tables, lists, and structural elements such as titles and headings and use them to define the organization and hierarchy of a document.

You can either turn on layout parsing for all file types or specify which file types to turn it on for. The layout parser detects content elements like paragraphs, tables, lists, and structural elements like titles, headings, headers, and footnotes.

If you have complex non-searchable PDFs, such as scanned PDFs with complicated hierarchy or tables, or PDFs with text inside images, such as infographics, Google recommends the layout parser instead of the OCR parser.

The layout parser is available only when using document chunking for RAG. When document chunking is turned on, Agent Search breaks documents up into chunks at ingestion time and can return documents as chunks. Detecting document layout enables content-aware chunking and enhances search and answer generation related to document elements. For more information about chunking documents for RAG, see Chunk documents for RAG.

You can select one or more of the following layout parser add-ons when you create your data store:

- Image annotation
- Table annotation
- Gemini layout parsing (Public Preview)
- Exclude HTML content

Image annotation

If image annotation is enabled, when an image is detected in a source document, a description (annotation) of the image and the image itself are assigned to a chunk. The annotation determines if the chunk should be returned in a search result. If an answer is generated, the annotation can be a source for the answer.

The layout parser can detect the following image types: BMP, GIF, JPEG, PNG, and TIFF.

To enable image annotation in layout parsing, do the following when you create the data store:

Select Document processing options > Layout parser settings > Enable image annotation.

Table annotation

If table annotation is enabled, when a table is detected in a source document, a description (annotation) of the table and the table itself are assigned to a chunk. The annotation determines if the chunk should be returned in a search result. If an answer is generated, the annotation can be a source for the answer.

To enable table annotation in layout parsing, do the following when you create the data store:

Select Document processing options > Layout parser settings > Enable table annotation.

Gemini layout parsing (Public Preview)

If Gemini layout parsing is enabled, Gemini is used to provide layout analysis and content extraction on PDF files. This feature provides high-quality table recognition, improved reading order, and more accurate text recognition. It is available for data stores that have unstructured documents. You can use the Gemini layout parsing add-on along with the table annotation add-on.

To enable Gemini layout parsing, do the following when you create the data store:

Select Document processing options > Layout parser settings > Enable Gemini enhancement.

Exclude HTML content

When using the layout parser for HTML documents, you can exclude specific parts of the HTML content from being processed. To improve data quality for search applications and RAG applications, you can exclude boilerplate or sections such as navigation menus, headers, footers, or sidebars.

The layoutParsingConfig provides the following fields for this purpose:

excludeHtmlElements: List of HTML tags to be excluded. Content within these tags is excluded.
excludeHtmlClasses: List of HTML class attributes to be excluded. HTML elements containing these class attributes, along with their content, are excluded.
excludeHtmlIds: List of HTML element ID attributes to be excluded. HTML elements with these ID attributes, along with their content, are excluded.

Specify a default parser

By including the documentProcessingConfig object when you create a data store, you can specify a default parser for that data store. If you don't include documentProcessingConfig.defaultParsingConfig, the digital parser is used. The digital parser is also used if the specified parser is not available for a file type.

REST

When creating a search data store using the API, include documentProcessingConfig.defaultParsingConfig in the data store creation request. You can specify the OCR parser, the layout parser, or the digital parser.

Example

The following example specifies OCR parser as the default parser during data store creation.
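
As a hedged illustration, the Python sketch below builds a request body of this shape. Only documentProcessingConfig.defaultParsingConfig and useNativeText are named on this page; the `ocrParsingConfig` and `displayName` field names and the `ocr_default_body` helper are assumptions that should be checked against the API reference before use.

```python
import json


def ocr_default_body(display_name: str, use_native_text: bool = False) -> dict:
    """Build a data store creation body with the OCR parser as the default.

    Assumed field names: displayName, ocrParsingConfig.
    """
    return {
        "displayName": display_name,
        "documentProcessingConfig": {
            "defaultParsingConfig": {
                # useNativeText merges machine-readable text with OCR output.
                "ocrParsingConfig": {"useNativeText": use_native_text}
            }
        },
    }


# "my-data-store" is a placeholder name.
body = ocr_default_body("my-data-store", use_native_text=True)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the data store creation request.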

Specify parser overrides for file types

You can specify that a particular file type (PDF, HTML, DOCX, PPTX, XLSX, and XLSM) should be parsed by a different parser than the default parser. To do so, include the documentProcessingConfig field in your data store creation request and specify the override parser. If you don't specify a default parser, then the digital parser is the default.

REST

Example

The following example specifies during data store creation that PDF files should be processed by the OCR parser and that HTML files should be processed by the layout parser. In this case, any files other than PDF and HTML files would be processed by the digital parser.

Edit document parsing for existing data stores

If you already have a data store, you can change the default parser and add file format exceptions. However, the updated parser settings only apply to new documents imported to the data store. Documents already in the data store are not re-parsed with the new settings.

Configure layout parser to exclude HTML content

You can configure the layout parser to exclude HTML content by specifying excludeHtmlElements, excludeHtmlClasses, or excludeHtmlIds in documentProcessingConfig.defaultParsingConfig.layoutParsingConfig.

Example

This example specifies that when HTML files are processed by the layout parser, the following are skipped by the parser:
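
A configuration of this kind might be assembled as in the following Python sketch. The three field names and their location come from this page; the tag, class, and ID values are hypothetical placeholders, and `exclude_html_config` is an illustrative helper rather than part of any client library.

```python
import json


def exclude_html_config(elements=(), classes=(), ids=()) -> dict:
    """Build a layout parser config that skips the given HTML content."""
    layout = {}
    if elements:
        layout["excludeHtmlElements"] = list(elements)
    if classes:
        layout["excludeHtmlClasses"] = list(classes)
    if ids:
        layout["excludeHtmlIds"] = list(ids)
    return {
        "documentProcessingConfig": {
            "defaultParsingConfig": {"layoutParsingConfig": layout}
        }
    }


# Placeholder values: skip navigation and footer tags, a sidebar class,
# and one banner element identified by its ID.
body = exclude_html_config(
    elements=["nav", "footer"],
    classes=["sidebar"],
    ids=["cookie-banner"],
)
print(json.dumps(body, indent=2))
```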

Get parsed documents in JSON

You can get a parsed document in JSON format by calling the getProcessedDocument method and specifying PARSED_DOCUMENT as the processed document type. Getting parsed documents in JSON can be helpful if you need to upload the parsed document elsewhere or if you decide to re-import parsed documents to Agent Search using the bring your own parsed document feature.
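
As a rough sketch, the request URL for this call could be built as shown below. The host, the API version, and the `processedDocumentType` query parameter name are assumptions not given on this page, and the document resource name is a placeholder that follows the pattern visible in the sample search response later on this page.

```python
from urllib.parse import urlencode

# Assumed host and API version; confirm both in the API reference.
API_ROOT = "https://discoveryengine.googleapis.com/v1alpha"


def processed_document_url(document_name: str,
                           doc_type: str = "PARSED_DOCUMENT") -> str:
    """Build a getProcessedDocument URL for a full document resource name."""
    query = urlencode({"processedDocumentType": doc_type})
    return f"{API_ROOT}/{document_name}:getProcessedDocument?{query}"


# Placeholder resource name.
name = (
    "projects/my-project/locations/global/collections/default_collection/"
    "dataStores/my-data-store/branches/0/documents/my-document"
)
print(processed_document_url(name))
```

An authenticated GET request to the resulting URL would then return the parsed document as JSON.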

REST

Bring your own parsed document

You can import pre-parsed, unstructured documents into Agent Search data stores. For example, instead of importing a raw PDF document, you can parse the PDF yourself and import the parsing result instead. This lets you import your documents in a structured way, ensuring that search and answer generation have information about the document's layout and elements.

A parsed, unstructured document is represented by JSON that describes the unstructured document using a sequence of text, table, and list blocks. You import JSON files with your parsed unstructured document data in the same way that you import other types of unstructured documents, such as PDFs. When this feature is turned on, whenever a JSON file is uploaded and identified by either an application/json MIME type or a .JSON extension, it is treated as a parsed document.

To turn on this feature and for information about how to use it, contact your Google account team.

Chunk documents for RAG

By default, Agent Search is optimized for document retrieval, where your search app returns a document such as a PDF or web page with each search result.

Document chunking features are available for custom search apps with unstructured data stores.

Agent Search can instead be optimized for RAG, where your search app is primarily used to augment LLM output with your custom data. When document chunking is turned on, Agent Search breaks up your documents into chunks. In search results, your search app can return relevant chunks of data instead of full documents. Using chunked data for RAG increases relevance for LLM answers and reduces computational load for LLMs.

Limitations

Document chunking options

This section describes the options that you specify in order to turn on document chunking.

During data store creation, turn on the following options so that Agent Search can index your documents as chunks.

Turn on document chunking

You can turn on document chunking by including the documentProcessingConfig object in your data store creation request and turning on layout-aware document chunking and layout parsing.
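
The Python sketch below shows one plausible shape for that object. The `chunkingConfig`, `layoutBasedChunkingConfig`, `chunkSize`, and `includeAncestorHeadings` names are assumptions not stated on this page, the chunk size of 500 is an arbitrary placeholder, and `chunking_body` is a hypothetical helper.

```python
import json


def chunking_body(chunk_size: int, include_headings: bool = True) -> dict:
    """Build a documentProcessingConfig that enables layout-aware chunking."""
    return {
        "documentProcessingConfig": {
            # Assumed field names for layout-aware chunking.
            "chunkingConfig": {
                "layoutBasedChunkingConfig": {
                    "chunkSize": chunk_size,
                    "includeAncestorHeadings": include_headings,
                }
            },
            # Chunking relies on the layout parser being turned on.
            "defaultParsingConfig": {"layoutParsingConfig": {}},
        }
    }


body = chunking_body(chunk_size=500)
print(json.dumps(body, indent=2))
```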

Bring your own chunks (Preview with allowlist)

If you've already chunked your own documents, you can upload those to Agent Search instead of turning on document chunking options.

Bringing your own chunks is a Preview with allowlist feature. To use this feature, contact your Google account team.

List a document's chunks

Get chunks in JSON from a processed document

You can get all the chunks from a specific document in JSON format by calling the getProcessedDocument method. Getting chunks in JSON can be helpful if you need to upload chunks elsewhere or if you decide to re-import chunks to Agent Search using the bring your own chunks feature.

REST

Get specific chunks

Return chunks in search requests

After you've confirmed that your data has been chunked correctly, your Agent Search app can return chunked data in its search results.

The response returns a chunk that is relevant to the search query. In addition, you can choose to return adjacent chunks that appear before and after the relevant chunk in the source document. Adjacent chunks can add context and accuracy.

REST

Example

The following example of a search query request sets SearchResultMode to chunks, requests one previous chunk and one next chunk, and limits the number of results to a single relevant chunk using pageSize.
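
A request body along these lines could be assembled as in this Python sketch. SearchResultMode and pageSize are named above; the `contentSearchSpec`, `chunkSpec`, `numPreviousChunks`, and `numNextChunks` field names and the `CHUNKS` enum spelling are assumptions, and the query text is a placeholder.

```python
import json


def chunk_search_body(query: str, page_size: int = 1,
                      previous: int = 1, following: int = 1) -> dict:
    """Build a search request body that returns chunks with neighbors."""
    return {
        "query": query,
        "pageSize": page_size,  # number of relevant chunks to return
        "contentSearchSpec": {
            "searchResultMode": "CHUNKS",
            "chunkSpec": {
                "numPreviousChunks": previous,
                "numNextChunks": following,
            },
        },
    }


body = chunk_search_body("animal health risk assessment")
print(json.dumps(body, indent=2))
```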

The following example shows the response that is returned for the example query. The response contains the relevant chunks, the previous and next chunks, the original document's metadata, and the span of document pages that each chunk was derived from.

Response

{
  "results": [
    {
      "chunk": {
        "name": "projects/961309680810/locations/global/collections/default_collection/dataStores/allie-pdf-adjacent-chunks_1711394998841/branches/0/documents/0d8619f429d7f20b3575b14cd0ad0813/chunks/c17",
        "id": "c17",
        "content": "\n# ESS10: Stakeholder Engagement and Information Disclosure\nReaders should also refer to ESS10 and its guidance notes, plus the template available for a stakeholder engagement plan. More detail on stakeholder engagement in projects with risks related to animal health is contained in section 4 below. The type of stakeholders (men and women) that can be engaged by the Borrower as part of the project's environmental and social assessment and project design and implementation are diverse and vary based on the type of intervention. The stakeholders can include: Pastoralists, farmers, herders, women's groups, women farmers, community members, fishermen, youths, etc. Cooperatives members, farmer groups, women's livestock associations, water user associations, community councils, slaughterhouse workers, traders, etc. Veterinarians, para-veterinary professionals, animal health workers, community animal health workers, faculties and students in veterinary colleges, etc. 8 \n# 4. Good Practice in Animal Health Risk Assessment and Management\n\n# Approach\nRisk assessment provides the transparent, adequate and objective evaluation needed by interested parties to make decisions on health-related risks associated with project activities involving live animals. As the ESF requires, it is conducted throughout the project cycle, to provide or indicate likelihood and impact of a given hazard, identify factors that shape the risk, and find proportionate and appropriate management options. The level of risk may be reduced by mitigation measures, such as infrastructure (e.g., diagnostic laboratories, border control posts, quarantine stations), codes of practice (e.g., good animal husbandry practices, on-farm biosecurity, quarantine, vaccination), policies and regulations (e.g., rules for importing live animals, ban on growth hormones and promotors, feed standards, distance required between farms, vaccination), institutional capacity (e.g., veterinary services, surveillance and monitoring), changes in individual behavior (e.g., hygiene, hand washing, care for animals). Annex 2 provides examples of mitigation practices. This list is not an exhaustive one but a compendium of most practiced interventions and activities. The cited measures should take into account social, economic, as well as cultural, gender and occupational aspects, and other factors that may affect the acceptability of mitigation practices by project beneficiaries and other stakeholders. Risk assessment is reviewed and updated through the project cycle (for example to take into account increased trade and travel connectivity between rural and urban settings and how this may affect risks of disease occurrence and/or outbreak). Projects monitor changes in risks (likelihood and impact) by using data, triggers or indicators. ",
        "documentMetadata": {
          "uri": "gs://table_eval_set/pdf/worldbank/AnimalHealthGoodPracticeNote.pdf",
          "title": "AnimalHealthGoodPracticeNote"
        },
        "pageSpan": {
          "pageStart": 14,
          "pageEnd": 15
        },
        "chunkMetadata": {
          "previousChunks": [
            {
              "name": "projects/961309680810/locations/global/collections/default_collection/dataStores/allie-pdf-adjacent-chunks_1711394998841/branches/0/documents/0d8619f429d7f20b3575b14cd0ad0813/chunks/c16",
              "id": "c16",
              "content": "\n# ESS6: Biodiversity Conservation and Sustainable Management of Living Natural Resources\nThe risks associated with livestock interventions under ESS6 include animal welfare (in relation to housing, transport, and slaughter); diffusion of pathogens from domestic animals to wildlife, with risks for endemic species and biodiversity (e.g., sheep and goat plague in Mongolia affecting the saiga, an endemic species of wild antelope); the introduction of new breeds with potential risk of introducing exotic or new diseases; and the release of new species that are not endemic with competitive advantage, potentially putting endemic species at risk of extinction. Animal welfare relates to how an animal is coping with the conditions in which it lives. An animal is in a good state of welfare if it is healthy, comfortable, well nourished, safe, able to express innate behavior, 7 Good Practice Note - Animal Health and related risks and is not suffering from unpleasant states such as pain, fear or distress. Good animal welfare requires appropriate animal care, disease prevention and veterinary treatment; appropriate shelter, management and nutrition; humane handling, slaughter or culling. The OIE provides standards for animal welfare on farms, during transport and at the time of slaughter, for their welfare and for purposes of disease control, in its Terrestrial and Aquatic Codes. The 2014 IFC Good Practice Note: Improving Animal Welfare in Livestock Operations is another example of practical guidance provided to development practitioners for implementation in investments and operations. Pastoralists rely heavily on livestock as a source of food, income and social status. Emergency projects to restock the herds of pastoralists affected by drought, disease or other natural disaster should pay particular attention to animal welfare (in terms of transport, access to water, feed, and animal health) to avoid potential disease transmission and ensure humane treatment of animals. Restocking also entails assessing the assets of pastoralists and their ability to maintain livestock in good conditions (access to pasture and water, social relationship, technical knowledge, etc.). Pastoralist communities also need to be engaged by the project to determine the type of animals and breed and the minimum herd size to be considered for restocking. \n# Box 5. Safeguarding the welfare of animals and related risks in project activities\nIn Haiti, the RESEPAG project (Relaunching Agriculture: Strengthening Agriculture Public Services) financed housing for goats and provided technical recommendations for improving their welfare, which is critical to avoid the respiratory infections, including pneumonia, that are serious diseases for goats. To prevent these diseases, requires optimal sanitation and air quality in herd housing. This involves ensuring that buildings have adequate ventilation and dust levels are reduced to minimize the opportunity for infection. Good nutrition, water and minerals are also needed to support the goats immune function. The project paid particular attention to: (i) housing design to ensure good ventilation; (ii) locating housing close to water sources and away from human habitation and noisy areas; (iii) providing mineral blocks for micronutrients; (iv) ensuring availability of drinking water and clean food troughs. ",
              "documentMetadata": {
                "uri": "gs://table_eval_set/pdf/worldbank/AnimalHealthGoodPracticeNote.pdf",
                "title": "AnimalHealthGoodPracticeNote"
              },
              "pageSpan": {
                "pageStart": 13,
                "pageEnd": 14
              }
            }
          ],
          "nextChunks": [
            {
              "name": "projects/961309680810/locations/global/collections/default_collection/dataStores/allie-pdf-adjacent-chunks_1711394998841/branches/0/documents/0d8619f429d7f20b3575b14cd0ad0813/chunks/c18",
              "id": "c18",
              "content": "\n# Scoping of risks\nEarly scoping of risks related to animal health informs decisions to initiate more comprehensive risk assessment according to the type of livestock interventions and activities. It can be based on the following considerations: • • • • Type of livestock interventions supported by the project (such as expansion of feed resources, improvement of animal genetics, construction/upgrading and management of post-farm-gate facilities, etc. See also Annex 2); Geographic scope and scale of the livestock interventions; Human and animal populations that are likely to be affected (farmers, women, children, domestic animals, wildlife, etc.); and Changes in the project or project context (such as emerging disease outbreak, extreme weather or climatic conditions) that would require a re-assessment of risk levels, mitigation measures and their likely effect on risk reduction. Scenario planning can also help to identify project-specific vulnerabilities, country-wide or locally, and help shape pragmatic analyses that address single or multiple hazards. In this process, some populations may be identified as having disproportionate exposure or vulnerability to certain risks because of occupation, gender, age, cultural or religious affiliation, socio-economic or health status. For example, women and children may be the main caretakers of livestock in the case of 9 Good Practice Note - Animal Health and related risks household farming, which puts them into close contact with animals and animal products. In farms and slaughterhouses, workers and veterinarians are particularly exposed, as they may be in direct contact with sick animals (see Box 2 for an illustration). Fragility, conflict, and violence (FCV) can exacerbate risk, in terms of likelihood and impact. Migrants new to a geographic area may be immunologically naïve to endemic zoonotic diseases or they may inadvertently introduce exotic diseases; and refugees or internally displaced populations may have high population density with limited infrastructure, leaving them vulnerable to disease exposure. Factors such as lack of access to sanitation, hygiene, housing, and health and veterinary services may also affect disease prevalence, contributing to perpetuation of poverty in some populations. Risk assessment should identify populations at risk and prioritize vulnerable populations and circumstances where risks may be increased. It should be noted that activities that seem minor can still have major consequences. See Box 6 for an example illustrating how such small interventions in a project may have large-scale consequences. It highlights the need for risk assessment, even for simple livestock interventions and activities, and how this can help during the project cycle (from concept to implementation). ",
              "documentMetadata": {
                "uri": "gs://table_eval_set/pdf/worldbank/AnimalHealthGoodPracticeNote.pdf",
                "title": "AnimalHealthGoodPracticeNote"
              },
              "pageSpan": {
                "pageStart": 15,
                "pageEnd": 16
              }
            }
          ]
        }
      }
    }
  ],
  "totalSize": 61,
  "attributionToken": "jwHwjgoMCICPjbAGEISp2J0BEiQ2NjAzMmZhYS0wMDAwLTJjYzEtYWQxYS1hYzNlYjE0Mzc2MTQiB0dFTkVSSUMqUMLwnhXb7Ygtq8SKLa3Eii3d7Ygtj_enIqOAlyLm7Ygtt7eMLduPmiKN96cijr6dFcXL8xfdj5oi9-yILdSynRWCspoi-eyILYCymiLk7Ygt",
  "nextPageToken": "ANxYzNzQTMiV2MjFWLhFDZh1SMjNmMtADMwATL5EmZyMDM2YDJaMQv3yagQYAsciPgIwgExEgC",
  "guidedSearchResult": {},
  "summary": {}
}
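
One way to consume a response of this shape is to stitch each relevant chunk together with its neighbors before passing the text to an LLM. The Python sketch below relies only on fields visible in the response above; the helper names are hypothetical, the sample data is a shortened stand-in, and the sketch assumes that previousChunks and nextChunks are listed in document order.

```python
def ordered_chunks(result: dict) -> list:
    """Return previous chunks, the relevant chunk, then next chunks."""
    chunk = result["chunk"]
    meta = chunk.get("chunkMetadata", {})
    return [*meta.get("previousChunks", []), chunk, *meta.get("nextChunks", [])]


def context_text(result: dict) -> str:
    """Join the chunk contents into one block of context text."""
    return "\n".join(c["content"].strip() for c in ordered_chunks(result))


def page_range(result: dict) -> tuple:
    """Return the first and last source page covered by the chunks."""
    spans = [c["pageSpan"] for c in ordered_chunks(result)]
    return min(s["pageStart"] for s in spans), max(s["pageEnd"] for s in spans)


# Shortened stand-in for the response above.
sample = {"results": [{"chunk": {
    "id": "c17", "content": "\n# Approach\nRisk assessment ... ",
    "pageSpan": {"pageStart": 14, "pageEnd": 15},
    "chunkMetadata": {
        "previousChunks": [{"id": "c16", "content": "\n# ESS6 ... ",
                            "pageSpan": {"pageStart": 13, "pageEnd": 14}}],
        "nextChunks": [{"id": "c18", "content": "\n# Scoping of risks ... ",
                        "pageSpan": {"pageStart": 15, "pageEnd": 16}}],
    },
}}]}

for result in sample["results"]:
    print(page_range(result))  # (13, 16)
    print(context_text(result))
```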
