Deep Dive
Config Setup

Default Configuration

During the example pipeline creation, a default config.json is loaded and passed to the pipeline. It provides settings for the following services:

  • Embedding settings
  • Vector Database provider
  • LLM settings
  • Parsing logic
  • Evaluation provider
  • Logging
  • Application settings

The default values for the config are shown below:

{
  "app": {
    "max_logs": 100,
    "max_file_size_in_mb": 32
  },
  "completions": {
    "provider": "litellm"
  },
  "embedding": {
    "provider": "openai",
    "search_model": "text-embedding-3-small",
    "search_dimension": 512,
    "batch_size": 128,
    "text_splitter": {
      "type": "recursive_character",
      "chunk_size": 512,
      "chunk_overlap": 20
    },
    "rerank_model": "None"
  },  
  "eval": {
    "provider": "local",
    "llm": {
      "model": "gpt-4o",
      "provider": "openai"
    },
    "sampling_fraction": 1.0
  },
  "ingestion": {
    "selected_parsers": {
      "csv": "default",
      "docx": "default",
      "html": "default",
      "json": "default",
      "md": "default",
      "pdf": "default",
      "pptx": "default",
      "txt": "default",
      "xlsx": "default",
      "gif": "default",
      "png": "default",
      "jpg": "default",
      "jpeg": "default",
      "svg": "default"
    }
  },
  "logging": {
    "provider": "local",
    "log_table": "logs",
    "log_info_table": "log_info"
  },
  "prompt": {
    "provider": "local"
  },
  "vector_database": {
    "provider": "local",
    "collection_name": "demo_vecs"
  }
}

The available options for each section are:

  • app:

    • max_logs: Maximum number of logs to retrieve.
    • max_file_size_in_mb: Maximum file size allowed for ingestion (in MB).
  • completions:

    • provider: "litellm", "openai", "llama-cpp" (see LLM Providers for more info).
  • embedding:

    • provider: "openai", "sentence-transformers" (see Embedding Providers for more info).
    • search_model: Embedding model to use (e.g., "text-embedding-3-small" for OpenAI).
    • search_dimension: Embedding dimension.
    • batch_size: Number of texts to embed in each batch.
    • text_splitter: Configuration for text splitting.
      • type: "recursive_character".
      • chunk_size: Target chunk size in characters.
      • chunk_overlap: Number of overlapping characters between chunks.
    • rerank_model: Model used for re-ranking (optional).
  • eval:

    • provider: "local", "deepeval", "parea" (see Evaluation Providers for more info).
    • llm: LLM configuration for evaluation.
      • model: Model to use (e.g., "gpt-4o").
      • provider: Provider to use (e.g., "openai").
    • sampling_fraction: Fraction of data to sample for evaluation (e.g., 1.0 for 100% of requests).
  • ingestion:

    • selected_parsers: Specifies the parsers for different file types.
      • csv: "default".
      • docx: "default".
      • html: "default".
      • json: "default".
      • md: "default".
      • pdf: "default".
      • pptx: "default".
      • txt: "default".
      • xlsx: "default".
      • gif: "default".
      • png: "default".
      • jpg: "default".
      • jpeg: "default".
      • svg: "default".
  • logging:

    • provider: "local", "postgres", "redis".
    • log_table: Name of the table to store logs.
    • log_info_table: Name of the table to store log info.
  • prompt:

    • provider: "local".
  • vector_database:

    • provider: "local", "pgvector", "qdrant" (see Vector Database Providers for more info).
    • collection_name: Name of the collection to store vector embeddings.

For more detailed information on configuring specific providers, please refer to the relevant documentation:

These documentation files provide in-depth guides on setting up and using each provider, including any required environment variables and additional configuration options.

To launch the default pipeline with your own config, you may run the following code:

...
 
config_path = PATH_TO_YOUR_CONFIG
config = R2RConfig.from_json(config_path=config_path)
 
providers = R2RProviderFactory(config).create_providers()
pipes = R2RPipeFactory(config, providers).create_pipes()
pipelines = R2RPipelineFactory(config, pipes).create_pipelines()
r2r = R2RApp(config, providers, pipelines)