Parsing and chunking

R2R supports different parsing and chunking providers to extract text from various document formats and break it down into manageable pieces for efficient processing and retrieval.

To configure the parsing and chunking settings, update the [ingestion] section in your r2r.toml file:

[ingestion]
provider = "r2r" # or "unstructured_local" or "unstructured_api"
# ... provider-specific settings ...

Runtime Configuration

Besides the r2r.toml file, you can override parsing and chunking settings at runtime when ingesting files with the Python SDK. This gives you control over ingestion on a per-file or per-request basis.

Some of the configurable options include:

  • Chunking strategy (e.g., “recursive”, “by_title”, “basic”)
  • Chunk size and overlap
  • Excluded parsers
  • Provider-specific settings (e.g., max characters, overlap, languages)

An exhaustive list of runtime ingestion inputs to the ingest-files endpoint is shown below:

file_paths
list[str], required

A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.

metadatas
Optional[list[dict]]

An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.

document_ids
Optional[list[Union[UUID, str]]]

An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.

versions
Optional[list[str]]

An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.

ingestion_config
Optional[Union[dict, IngestionConfig]]

The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.

For a comprehensive list of available runtime configuration options and examples of how to use them, refer to the Python SDK Ingestion Documentation.
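
The inputs listed above can be assembled as in the following minimal sketch. The helper names `expand_paths` and `build_ingest_kwargs` are hypothetical and not part of R2R; they only illustrate recursive directory expansion and the rule that each per-file list must match the number of files. The commented-out call at the end assumes the Python SDK exposes an `R2RClient` with an `ingest_files` method that accepts these keyword arguments, and the server URL is a placeholder.

```python
from pathlib import Path
from typing import Any, Optional


def expand_paths(file_paths: list[str]) -> list[str]:
    """Expand directories into the files they contain, recursively."""
    expanded: list[str] = []
    for raw in file_paths:
        path = Path(raw)
        if path.is_dir():
            # sorted() keeps the order deterministic across runs
            expanded.extend(str(p) for p in sorted(path.rglob("*")) if p.is_file())
        else:
            expanded.append(str(path))
    return expanded


def build_ingest_kwargs(
    file_paths: list[str],
    metadatas: Optional[list[dict]] = None,
    document_ids: Optional[list[str]] = None,
    versions: Optional[list[str]] = None,
    ingestion_config: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate per-file lists and collect the arguments."""
    files = expand_paths(file_paths)
    per_file = {"metadatas": metadatas, "document_ids": document_ids, "versions": versions}
    for name, values in per_file.items():
        # Each optional list must have exactly one entry per ingested file.
        if values is not None and len(values) != len(files):
            raise ValueError(
                f"{name} has {len(values)} entries but {len(files)} files were given"
            )
    kwargs: dict[str, Any] = {"file_paths": files}
    kwargs.update({k: v for k, v in per_file.items() if v is not None})
    if ingestion_config is not None:
        kwargs["ingestion_config"] = ingestion_config
    return kwargs


kwargs = build_ingest_kwargs(
    file_paths=["report.pdf", "notes.txt"],
    metadatas=[{"title": "Report"}, {"title": "Notes"}],
    ingestion_config={
        "chunking_strategy": "recursive",
        "chunk_size": 2048,
        "chunk_overlap": 256,
    },
)

# Assumed SDK call (needs a running R2R server; the URL is a placeholder):
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# client.ingest_files(**kwargs)
```

Validating list lengths before the request is a design choice of this sketch: it surfaces a mismatch between files and metadata locally rather than after the upload has started.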

Supported Providers

R2R offers two main parsing and chunking providers:

  1. R2R (default for ‘light’ installation):

    • Uses R2R’s built-in parsing and chunking logic.
    • Supports a wide range of file types, including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video.
    • Configuration options:
      [ingestion]
      provider = "r2r"
      chunking_strategy = "recursive"
      chunk_size = 1_024
      chunk_overlap = 512
      excluded_parsers = ["mp4"]
    • chunking_strategy: The chunking method (“recursive”).
    • chunk_size: The target size for each chunk.
    • chunk_overlap: The number of characters to overlap between chunks.
    • excluded_parsers: List of parsers to exclude (e.g., [“mp4”]).
  2. Unstructured (default for ‘full’ installation):

    • Leverages Unstructured’s open-source ingestion platform.
    • Provides more advanced parsing capabilities.
    • Configuration options:
      [ingestion]
      provider = "unstructured_local"
      strategy = "auto"
      chunking_strategy = "by_title"
      new_after_n_chars = 512
      max_characters = 1_024
      combine_under_n_chars = 128
      overlap = 20
    • strategy: The partitioning strategy Unstructured uses to parse documents (“auto”, “fast”, or “hi_res”).
    • chunking_strategy: The specific chunking method (“by_title” or “basic”).
    • new_after_n_chars: Soft maximum size for a chunk.
    • max_characters: Hard maximum size for a chunk.
    • combine_under_n_chars: Size threshold below which small sections are combined with neighboring sections.
    • overlap: Number of characters to overlap between chunks.
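
To make the interaction between chunk size and overlap concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not R2R's recursive splitter or Unstructured's chunker, both of which also take document structure into account; the function name `chunk_text` is hypothetical and exists only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

In this simplified model each window advances by chunk_size minus the overlap, so a chunk_size of 1_024 with a chunk_overlap of 512 would place most characters in two consecutive chunks. Larger overlaps preserve more context across chunk boundaries at the cost of more chunks to embed and store.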

Supported File Types

Both R2R and Unstructured providers support parsing a wide range of file types, including:

  • TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images (BMP, GIF, HEIC, JPEG, JPG, PNG, SVG, TIFF), audio (MP3), video (MP4), and more.

Refer to the Unstructured documentation for more details on their ingestion capabilities and limitations.

Configuring Parsing & Chunking

To configure parsing and chunking settings, update the [ingestion] section in your r2r.toml file with the desired provider and its specific settings.

For example, to use the R2R provider with custom chunk size and overlap:

[ingestion]
provider = "r2r"
chunking_strategy = "recursive"
chunk_size = 2_048
chunk_overlap = 256
excluded_parsers = ["mp4"]

Or, to use the Unstructured provider with a specific chunking strategy and character limits:

[ingestion]
provider = "unstructured_local"
strategy = "hi_res"
chunking_strategy = "basic"
new_after_n_chars = 1_000
max_characters = 2_000
combine_under_n_chars = 256
overlap = 50

Adjust the settings based on your specific requirements and the characteristics of your input documents.

Next Steps