Parsing and chunking
R2R supports different parsing and chunking providers to extract text from various document formats and break it down into manageable pieces for efficient processing and retrieval.
To configure parsing and chunking, update the [ingestion] section in your r2r.toml file, as described under Configuring Parsing & Chunking below.
Runtime Configuration
In addition to configuring parsing and chunking settings in the r2r.toml file, you can customize these settings at runtime when ingesting files with the Python SDK. This gives you control over the ingestion process on a per-file or per-request basis.
Some of the configurable options include:
- Chunking strategy (e.g., “recursive”, “by_title”, “basic”)
- Chunk size and overlap
- Excluded parsers
- Provider-specific settings (e.g., max characters, overlap, languages)
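To make the idea of a per-request override concrete, here is a minimal sketch in plain Python. It assumes a shallow, key-by-key override in which runtime values replace file-level ones; R2R's actual merge behavior may differ. The key names are the options listed on this page, while the "r2r" provider value, the numbers, and the helper name effective_config are illustrative assumptions.

```python
from typing import Any, Optional

# Stand-in for values loaded from the [ingestion] section of r2r.toml.
# Key names come from this page; the values are illustrative assumptions.
BASE_INGESTION_CONFIG: dict[str, Any] = {
    "provider": "r2r",
    "chunking_strategy": "recursive",
    "chunk_size": 1024,
    "chunk_overlap": 512,
    "excluded_parsers": ["mp4"],
}


def effective_config(
    base: dict[str, Any], override: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Return a new dict where per-request keys replace file-level keys.

    Hypothetical helper: a shallow merge, not R2R's own implementation.
    """
    merged = dict(base)  # copy so the file-level defaults stay untouched
    merged.update(override or {})
    return merged


# A request that only wants smaller chunks keeps every other default.
per_request_override = {"chunk_size": 256, "chunk_overlap": 32}
config = effective_config(BASE_INGESTION_CONFIG, per_request_override)
print(config)
```

Returning a fresh dictionary keeps the file-level defaults intact, so one request's override cannot leak into the next.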
An exhaustive list of runtime ingestion inputs to the ingest-files endpoint is shown below:
- File paths: A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.
- Metadata: An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.
- Document IDs: An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.
- Versions: An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.
- Ingestion config override: Enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.
For a comprehensive list of available runtime configuration options and examples of how to use them, refer to the Python SDK Ingestion Documentation.
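The input rules above (directories are expanded recursively, and every optional per-file list must match the number of files) can be sketched as a small client-side check. This is not the SDK's code: the function names and the dictionary keys file_paths, metadatas, document_ids, versions, and ingestion_config are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Any, Optional


def expand_paths(paths: list[str]) -> list[str]:
    """Replace each directory with all files found beneath it, recursively."""
    files: list[str] = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            files.extend(str(f) for f in sorted(path.rglob("*")) if f.is_file())
        else:
            files.append(str(path))
    return files


def build_ingest_request(
    paths: list[str],
    metadatas: Optional[list[dict[str, Any]]] = None,
    document_ids: Optional[list[str]] = None,
    versions: Optional[list[str]] = None,
    ingestion_config: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Validate the per-file lists and assemble a request payload (hypothetical shape)."""
    files = expand_paths(paths)
    per_file = {"metadatas": metadatas, "document_ids": document_ids, "versions": versions}
    for name, values in per_file.items():
        # Optional lists may be omitted, but if given they must line up with the files.
        if values is not None and len(values) != len(files):
            raise ValueError(f"{name} has {len(values)} entries, expected {len(files)}")
    return {"file_paths": files, **per_file, "ingestion_config": ingestion_config}


# Demo: a directory holding one file at the top level and one in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("alpha")
    (root / "sub" / "b.txt").write_text("beta")
    request = build_ingest_request(
        [tmp],
        metadatas=[{"title": "A"}, {"title": "B"}],
        ingestion_config={"chunk_size": 256},
    )
    print([Path(p).name for p in request["file_paths"]])
```

Checking lengths after directory expansion matters: a single directory argument can stand for many files, so the per-file lists have to be sized against the expanded list rather than the raw input.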
Supported Providers
R2R offers two main parsing and chunking providers:
- R2R (default for the ‘light’ installation):
  - Uses R2R’s built-in parsing and chunking logic.
  - Supports a wide range of file types, including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video.
  - Configuration options:
    - chunking_strategy: The chunking method (“recursive”).
    - chunk_size: The target size for each chunk.
    - chunk_overlap: The number of characters to overlap between chunks.
    - excluded_parsers: List of parsers to exclude (e.g., ["mp4"]).
- Unstructured (default for the ‘full’ installation):
  - Leverages Unstructured’s open-source ingestion platform.
  - Provides more advanced parsing capabilities.
  - Configuration options:
    - strategy: The overall parsing (partitioning) strategy (“auto”, “fast”, or “hi_res”).
    - chunking_strategy: The specific chunking method (“by_title” or “basic”).
    - new_after_n_chars: Soft maximum size for a chunk.
    - max_characters: Hard maximum size for a chunk.
    - combine_under_n_chars: Minimum size for combining small sections.
    - overlap: Number of characters to overlap between chunks.
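To illustrate how a chunk size and a chunk overlap interact, the sketch below uses a plain fixed-size sliding window. It is deliberately simpler than the “recursive” strategy and is not R2R's or Unstructured's implementation; the function name chunk_text is hypothetical, and sizes are counted in characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window repeats the last chunk_overlap characters of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store, which is why the overlap must stay below the chunk size.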
Supported File Types
Both R2R and Unstructured providers support parsing a wide range of file types, including:
- TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images (BMP, GIF, HEIC, JPEG, JPG, PNG, SVG, TIFF), audio (MP3), video (MP4), and more.
Refer to the Unstructured documentation for more details on their ingestion capabilities and limitations.
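As a rough illustration of how an exclusion list such as ["mp4"] narrows the set of files that get parsed, the sketch below filters paths by extension. It assumes that parser names correspond to file extensions, which the page does not state explicitly, and the helper name filter_parseable is hypothetical.

```python
from pathlib import Path


def filter_parseable(paths: list[str], excluded_parsers: list[str]) -> list[str]:
    """Drop paths whose extension appears in excluded_parsers (case-insensitive).

    Assumption: a parser is identified by the file extension it handles.
    """
    excluded = {name.lower().lstrip(".") for name in excluded_parsers}
    return [
        p for p in paths
        if Path(p).suffix.lower().lstrip(".") not in excluded
    ]


candidates = ["report.pdf", "talk.MP4", "notes.md", "podcast.mp3"]
kept = filter_parseable(candidates, excluded_parsers=["mp4"])
print(kept)  # ['report.pdf', 'notes.md', 'podcast.mp3']
```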
Configuring Parsing & Chunking
To configure parsing and chunking settings, update the [ingestion] section in your r2r.toml file with the desired provider and its specific settings.
For example, to use the R2R provider with custom chunk size and overlap:
Or, to use the Unstructured provider with a specific chunking strategy and character limits:
Adjust the settings based on your specific requirements and the characteristics of your input documents.
Next Steps
- Learn more about Embedding Configuration.
- Explore Knowledge Graph Configuration.
- Check out Retrieval Configuration.