Parsing & Chunking

R2R supports different parsing and chunking providers to extract text from various document formats and break it down into manageable pieces for efficient processing and retrieval.

To configure the parsing and chunking settings, update the [ingestion] section in your r2r.toml file:

[ingestion]
provider = "r2r" # or "unstructured_local" or "unstructured_api"
# ... provider-specific settings ...

Runtime Configuration

Besides the r2r.toml file, you can override parsing and chunking settings at runtime when ingesting files with the Python SDK. This gives you control over the ingestion process on a per-file or per-request basis.

Some of the configurable options include:

Chunking strategy (e.g., “recursive”, “by_title”, “basic”)
Chunk size and overlap
Excluded parsers
Provider-specific settings (e.g., max characters, overlap, languages)
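
One way to picture a runtime override is as a small dictionary layered on top of the file-level settings. The sketch below assumes a simple shallow merge in which request values win. The helper `apply_overrides` is hypothetical, and R2R's actual merge semantics may differ.

```python
from typing import Optional


def apply_overrides(base: dict, overrides: Optional[dict]) -> dict:
    """Return a new dict in which per-request overrides replace file-level settings.

    Assumption: a shallow merge. This is an illustration, not R2R's implementation.
    """
    merged = dict(base)            # copy so the file-level config is left untouched
    merged.update(overrides or {})
    return merged


# File-level settings, as they might appear in the [ingestion] section.
file_config = {
    "provider": "r2r",
    "chunking_strategy": "recursive",
    "chunk_size": 1024,
    "chunk_overlap": 512,
}

# Settings supplied for a single ingestion request.
request_overrides = {"chunk_size": 2048, "chunk_overlap": 256}

effective = apply_overrides(file_config, request_overrides)
print(effective)
```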

An exhaustive list of runtime ingestion inputs to the ingest-files endpoint is shown below:

file_paths

list[str], required

A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.

metadatas

Optional[list[dict]]

An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.

document_ids

Optional[list[Union[UUID, str]]]

An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.

versions

Optional[list[str]]

An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.

ingestion_config

Optional[Union[dict, IngestionConfig]]

An optional override of the configured ingestion settings that lets developers customize their R2R parsing and chunking strategy at runtime. See the configuration documentation to learn more.
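
The directory expansion and the length requirement on the per-file lists can be sketched client-side as follows. Both helpers (`expand_paths`, `check_per_file_list`) are hypothetical names used for illustration. The SDK performs its own handling, which may differ in ordering and error reporting.

```python
from pathlib import Path
from typing import Optional, Sequence


def expand_paths(file_paths: Sequence[str]) -> list[str]:
    """Expand directories into the files they contain, including subdirectories."""
    files: list[str] = []
    for raw in file_paths:
        path = Path(raw)
        if path.is_dir():
            # rglob("*") walks the directory tree; keep regular files only.
            files.extend(sorted(str(p) for p in path.rglob("*") if p.is_file()))
        else:
            files.append(str(path))
    return files


def check_per_file_list(name: str, values: Optional[Sequence], n_files: int) -> None:
    """Raise ValueError if an optional per-file list does not match the file count."""
    if values is not None and len(values) != n_files:
        raise ValueError(
            f"{name} has {len(values)} entries but {n_files} files are being ingested"
        )
```

A caller would expand `file_paths` first, then run the check once each for `metadatas`, `document_ids`, and `versions` against the expanded file count.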

For a comprehensive list of available runtime configuration options and examples of how to use them, refer to the Python SDK Ingestion Documentation.

Supported Providers

R2R offers two main parsing and chunking providers:

R2R (default for ‘light’ installation):
- Uses R2R’s built-in parsing and chunking logic.
- Supports a wide range of file types, including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video.
- Configuration options:
```
[ingestion]
provider = "r2r"
chunking_strategy = "recursive"
chunk_size = 1_024
chunk_overlap = 512
excluded_parsers = ["mp4"]
```
- chunking_strategy: The chunking method (“recursive”).
- chunk_size: The target size for each chunk.
- chunk_overlap: The number of characters to overlap between chunks.
- excluded_parsers: List of parsers to exclude (e.g., [“mp4”]).
Unstructured (default for ‘full’ installation):
- Leverages Unstructured’s open-source ingestion platform.
- Provides more advanced parsing capabilities.
- Configuration options:
```
[ingestion]
provider = "unstructured_local"
strategy = "auto"
chunking_strategy = "by_title"
new_after_n_chars = 512
max_characters = 1_024
combine_under_n_chars = 128
overlap = 20
```
- strategy: The partitioning strategy Unstructured uses to parse the document (“auto”, “fast”, or “hi_res”).
- chunking_strategy: The chunking method (“by_title” or “basic”).
- new_after_n_chars: Soft maximum size for a chunk.
- max_characters: Hard maximum size for a chunk.
- combine_under_n_chars: Minimum size for combining small sections.
- overlap: Number of characters to overlap between chunks.
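
The size-related options above relate to one another, so a few sanity checks before ingesting can catch inconsistent values early. The rules below are assumptions inferred from the option descriptions (an overlap smaller than the chunk size, a soft maximum no larger than the hard maximum), not checks that either provider is documented here to enforce. `find_config_problems` is a hypothetical helper.

```python
def find_config_problems(cfg: dict) -> list[str]:
    """Return human-readable warnings for an [ingestion]-style settings dict.

    Illustrative only: the rules are assumptions, not provider-enforced limits.
    """
    problems: list[str] = []
    if cfg.get("provider") == "r2r":
        size, overlap = cfg.get("chunk_size"), cfg.get("chunk_overlap")
        if size is not None and overlap is not None and overlap >= size:
            problems.append("chunk_overlap should be smaller than chunk_size")
    else:
        soft, hard = cfg.get("new_after_n_chars"), cfg.get("max_characters")
        overlap = cfg.get("overlap")
        if soft is not None and hard is not None and soft > hard:
            problems.append("new_after_n_chars should not exceed max_characters")
        if overlap is not None and hard is not None and overlap >= hard:
            problems.append("overlap should be smaller than max_characters")
    return problems


r2r_cfg = {"provider": "r2r", "chunk_size": 1024, "chunk_overlap": 512}
unstructured_cfg = {
    "provider": "unstructured_local",
    "new_after_n_chars": 512,
    "max_characters": 1024,
    "overlap": 20,
}
print(find_config_problems(r2r_cfg), find_config_problems(unstructured_cfg))
```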

Supported File Types

Both R2R and Unstructured providers support parsing a wide range of file types, including:

TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images (BMP, GIF, HEIC, JPEG, JPG, PNG, SVG, TIFF), audio (MP3), video (MP4), and more.

Refer to the Unstructured documentation for more details on their ingestion capabilities and limitations.

Configuring Parsing & Chunking

To configure parsing and chunking settings, update the [ingestion] section in your r2r.toml file with the desired provider and its specific settings.

For example, to use the R2R provider with custom chunk size and overlap:

[ingestion]
provider = "r2r"
chunking_strategy = "recursive"
chunk_size = 2_048
chunk_overlap = 256
excluded_parsers = ["mp4"]
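
To build intuition for how chunk_size and chunk_overlap interact, here is a simplified fixed-window sketch. It is not R2R's "recursive" strategy, which is not reproduced here. It assumes plain character windows and only shows that consecutive chunks share `chunk_overlap` characters. The function name is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Simplified illustration; a recursive splitter would also respect separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings above (chunk_size = 2_048, chunk_overlap = 256), each window advances by 1,792 characters, so a larger overlap means more chunks and more repeated text per document.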

Or, to use the Unstructured provider with a specific chunking strategy and character limits:

[ingestion]
provider = "unstructured_local"
strategy = "hi_res"
chunking_strategy = "basic"
new_after_n_chars = 1_000
max_characters = 2_000
combine_under_n_chars = 256
overlap = 50

Adjust the settings based on your specific requirements and the characteristics of your input documents.
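
When tuning the Unstructured limits, it helps to see how a soft maximum (new_after_n_chars) and a hard maximum (max_characters) can interact. The sketch below is a greedy packer written for illustration only. It is not Unstructured's implementation, it ignores combine_under_n_chars and overlap, and `pack_sections` is a hypothetical name.

```python
def pack_sections(sections: list[str], new_after_n_chars: int, max_characters: int) -> list[str]:
    """Greedily pack text sections into chunks under a soft and a hard size limit.

    A chunk is closed once it reaches new_after_n_chars (soft limit) and is never
    allowed to exceed max_characters (hard limit). Illustrative assumption only.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")

    chunks: list[str] = []
    current = ""
    for section in sections:
        # Sections longer than the hard limit are cut into hard-limit-sized pieces.
        pieces = [section[i:i + max_characters] for i in range(0, len(section), max_characters)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if current and (len(current) >= new_after_n_chars or len(candidate) > max_characters):
                chunks.append(current)  # close the current chunk and start a new one
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


sections = ["a" * 600, "b" * 600, "c" * 300]
print([len(c) for c in pack_sections(sections, new_after_n_chars=1000, max_characters=2000)])
# → [1201, 300]
```

In this run the first two sections are joined (600 + 1 newline + 600 characters), which pushes the chunk past the soft limit, so the third section starts a new chunk.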

Next Steps

Learn more about Embedding Configuration.
Explore Knowledge Graph Configuration.
Check out Retrieval Configuration.
