Introduction

R2R’s DatabaseProvider offers a unified interface for both relational and vector database operations. R2R only provides database support through Postgres with the pgvector extension.

Postgres was selected to power R2R because it is free, open source, and widely considered be a stable state of the art database. Further, the Postgres community has implemented pgvector for efficient storage and retrieval of vector embeddings alongside traditional relational data.

Architecture Overview

The Database Provider in R2R is built on two primary components:

  1. Vector Database Provider: Handles vector-based operations such as similarity search and hybrid search.
  2. Relational Database Provider: Manages traditional relational database operations, including user management and document metadata storage.

These providers work in tandem to ensure efficient data management and retrieval.

Configuration

Update the database section in your r2r.toml file:

[database]
provider = "postgres"
user = "your_postgres_user"
password = "your_postgres_password"
host = "your_postgres_host"
port = "your_postgres_port"
db_name = "your_database_name"
your_project_name = "your_project_name"

Alternatively, you can set these values using environment variables:

export POSTGRES_USER=your_postgres_user
export POSTGRES_PASSWORD=your_postgres_password
export POSTGRES_HOST=your_postgres_host
export POSTGRES_PORT=your_postgres_port
export POSTGRES_DBNAME=your_database_name
export POSTGRES_PROJECT_NAME=your_collection_name

Environment variables take precedence over the config settings in case of conflicts. The R2R Docker includes configuration options that facilitate integration with a combined Postgres+pgvector database setup.

Vector Database Operations

Initialization

The vector database is automatically initialized with dimensions that correspond to your selected embedding model when the Database Provider is first created.

Upsert Vector Entries

from r2r import R2R, VectorEntry

app = R2R()

vector_entry = VectorEntry(id="unique_id", vector=[0.1, 0.2, 0.3], metadata={"key": "value"})
app.providers.database.vector.upsert(vector_entry)
results = app.providers.database.vector.search(query_vector=[0.1, 0.2, 0.3], limit=10)
results = app.providers.database.vector.hybrid_search(
    query_text="search query",
    query_vector=[0.1, 0.2, 0.3],
    limit=10,
    full_text_weight=1.0,
    semantic_weight=1.0
)

Delete by Metadata

deleted_ids = app.providers.database.vector.delete_by_metadata(
    metadata_fields=["key1", "key2"],
    metadata_values=["value1", "value2"]
)

Relational Database Operations

User Management

Create User

from r2r import UserCreate

new_user = UserCreate(email="[email protected]", password="secure_password")
created_user = app.providers.database.relational.create_user(new_user)

Get User by Email

user = app.providers.database.relational.get_user_by_email("[email protected]")

Update User

user.name = "New Name"
updated_user = app.providers.database.relational.update_user(user)

Document Management

Upsert Document Overview

from r2r import DocumentInfo

doc_info = DocumentInfo(
    document_id="doc_id",
    title="Document Title",
    user_id="user_id",
    version="1.0",
    size_in_bytes=1024,
    metadata={"key": "value"},
    status="processed"
)
app.providers.database.relational.upsert_documents_overview([doc_info])

Get Documents Overview

docs = app.providers.database.relational.get_documents_overview(
    filter_document_ids=["doc_id1", "doc_id2"],
    filter_user_ids=["user_id1", "user_id2"]
)

Advanced Features

Hybrid Search Function

The Database Provider includes a custom Postgres function for hybrid search, combining full-text and vector similarity search.

Token Blacklisting

For enhanced security, the provider supports token blacklisting:

app.providers.database.relational.blacklist_token("token_to_blacklist")
is_blacklisted = app.providers.database.relational.is_token_blacklisted("token_to_check")

Security Best Practices

  1. Environment Variables: Use environment variables for sensitive information like database credentials.
  2. Connection Pooling: The provider uses connection pooling for efficient database connections.
  3. Prepared Statements: SQL queries use prepared statements to prevent SQL injection.
  4. Password Hashing: User passwords are hashed before storage using the configured Crypto Provider.

Performance Considerations

  1. Indexing: The vector database automatically creates appropriate indexes for efficient similarity search.
  2. Batch Operations: Use batch operations like upsert_entries for better performance when dealing with multiple records.
  3. Connection Reuse: The provider reuses database connections to minimize overhead.

Troubleshooting

Common issues and solutions:

  1. Connection Errors: Ensure your database credentials and connection details are correct.
  2. Dimension Mismatch: Verify that the vector dimension in your configuration matches your embeddings.
  3. Performance Issues: Consider optimizing your queries or adding appropriate indexes.

Conclusion

R2R’s Database Provider offers a powerful and flexible solution for managing both vector and relational data. By leveraging Postgres with pgvector, it provides efficient storage and retrieval of embeddings alongside traditional relational data, enabling advanced search capabilities and robust user management.