Document Management

Introduction

This guide demonstrates how to use R2R’s powerful user and document management features. These capabilities allow you to track, organize, and manage documents on a per-user basis, providing granular control over your data.

Setup

Ensure you have R2R installed and configured as described in the installation guide. For this cookbook, we’ll use the default configuration.

Basic Usage

Ingest Document(s)

Let’s start by ingesting some sample documents for different users:

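# Sample document ingestion
# These names are assumed to be exported from the top-level r2r package
from r2r import R2R, Document, generate_id_from_label

app = R2R()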

# Generate two unique user ids
user1_id = generate_id_from_label("user1")
user2_id = generate_id_from_label("user2")

# Ingest sample documents with unique & deterministic ids
ingestion_result = app.ingest_documents([
    Document(
        id=generate_id_from_label("doc1"),
        type="txt",
        data="Artificial Intelligence is transforming industries.",
        metadata={"title": "AI Overview", "user_id": user1_id},
    ),
    Document(
        id=generate_id_from_label("doc2"),
        type="txt",
        data="Machine Learning is a subset of AI focused on data-driven algorithms.",
        metadata={"title": "ML Basics", "user_id": user1_id},
    ),
    Document(
        id=generate_id_from_label("doc3"),
        type="txt",
        data="Natural Language Processing enables computers to understand human language.",
        metadata={"title": "NLP Intro", "user_id": user2_id},
    ),
])
print(ingestion_result)

Expected Output:

{'processed_documents': ["Document 'AI Overview' processed successfully.", ...]}

ingest_documents (or ingest_files, for raw files) is the entry point for adding new documents to your R2R system. It processes multiple documents at once, converting their text into vector embeddings and storing them in the pgvector database. At the same time, it updates the document_info table with metadata such as title, user ID, and file size. Together, these steps make your documents both searchable and manageable.
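The sample above relies on generate_id_from_label returning the same ID every time it sees the same label. As a minimal sketch of how such deterministic IDs can be produced, the snippet below uses Python's uuid.uuid5. The helper label_to_id is a hypothetical stand-in, and the namespace is an assumption, so it is not guaranteed to reproduce the IDs R2R generates.

```python
import uuid


def label_to_id(label: str) -> uuid.UUID:
    """Hypothetical stand-in for generate_id_from_label (not R2R's code)."""
    # uuid5 hashes a namespace together with a name, so the same label
    # always maps to the same UUID. The namespace here is an assumption.
    return uuid.uuid5(uuid.NAMESPACE_DNS, label)


first = label_to_id("doc1")
second = label_to_id("doc1")
other = label_to_id("doc2")

print(first == second)  # True
print(first == other)   # False
```

Deterministic IDs are what make it possible to refer to "doc1" again later without storing the ID returned at ingestion time.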

Documents Overview

We can fetch a comprehensive overview of our documents at any time:

documents_overview = app.documents_overview()
print(documents_overview)

Expected Output:

[
    DocumentInfo(
        document_id=UUID('460ae2af-2a4b-58d5-b3e0-a142023d83bb'), 
        version='v0', 
        size_in_bytes=51,
        user_id=UUID('e063bb16-cc76-558b-9f94-afe212747cda'),
        title='AI Overview',
        ...
    ),
    ...
]

The command above queries the document_info table, pulling key metadata for each document. It’s particularly useful for auditing your document collection or getting a quick summary of what’s in your knowledge base.

Document Chunks

To get all chunks for a given document ID:

document_chunks = app.document_chunks(document_id=generate_id_from_label("doc1"))
print(document_chunks)

Expected Output:

[
    {
        'text': 'Artificial Intelligence is transforming industries.', 
        'title': 'AI Overview', 
        'user_id': 'e063bb16-cc76-558b-9f94-afe212747cda', 
        ...
    }
]

This operation fetches all the chunks of a specific document. R2R splits documents into smaller pieces for more efficient processing and retrieval. By examining these chunks, you can see how R2R has broken down your document for internal use.
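To make the idea of chunking concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is an illustration only: the function name split_into_chunks, the chunk_size and overlap parameters, and the character-based strategy are all assumptions, and R2R's own splitter may work differently.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "Artificial Intelligence is transforming industries."
for chunk in split_into_chunks(sample):
    print(repr(chunk))
```

A short document like the samples in this guide fits in a single chunk under most realistic settings, which is why the output above contains only one entry.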

User Management

Get Users Overview

To get an overview of all users in the system:

users_overview = app.users_overview()
print(users_overview)

Expected Output:

[
    UserStats(
        user_id=UUID('...'),
        num_files=2,
        total_size_in_bytes=120,
        document_ids=[UUID('...'), UUID('...')]
    ),
    UserStats(
        user_id=UUID('...'),
        num_files=1,
        total_size_in_bytes=75,
        document_ids=[UUID('...')]
    )
]

This command aggregates data from the document_info table, showing you how many documents each user has, the total size of their documents, and which specific documents belong to them.
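The aggregation described above can be sketched in plain Python. The rows below are hypothetical stand-ins for document_info records (with short string IDs instead of UUIDs), and summarize_by_user is an illustrative helper, not part of R2R.

```python
from collections import defaultdict

# Hypothetical rows mirroring the three sample documents in this guide.
rows = [
    {"document_id": "doc1", "user_id": "user1", "size_in_bytes": 51},
    {"document_id": "doc2", "user_id": "user1", "size_in_bytes": 69},
    {"document_id": "doc3", "user_id": "user2", "size_in_bytes": 75},
]


def summarize_by_user(rows: list[dict]) -> dict[str, dict]:
    """Group rows by user_id and accumulate per-user statistics."""
    stats = defaultdict(
        lambda: {"num_files": 0, "total_size_in_bytes": 0, "document_ids": []}
    )
    for row in rows:
        entry = stats[row["user_id"]]
        entry["num_files"] += 1
        entry["total_size_in_bytes"] += row["size_in_bytes"]
        entry["document_ids"].append(row["document_id"])
    return dict(stats)


for user_id, entry in summarize_by_user(rows).items():
    print(user_id, entry)
```

The totals line up with the expected output above: 51 + 69 = 120 bytes for the first user and 75 bytes for the second.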

Get User-Specific Documents

To retrieve documents for a specific user:

user_docs = app.documents_overview(user_ids=[user1_id])
print(user_docs)

Expected Output:

[
    DocumentInfo(
        document_id=UUID('...'),
        version='v0',
        size_in_bytes=51,
        # user id and title are elevated on DocumentInfo obj.
        user_id=UUID('...'),
        title="AI Overview",
        metadata={},
    ),
    DocumentInfo(
        document_id=UUID('...'),
        version='v0',
        size_in_bytes=69,
        user_id=UUID('...'),
        title="ML Basics",
        metadata={},
    )
]

When you need to focus on a particular user’s documents, this is your go-to command. It filters the document_info table to show only the documents associated with the specified user ID. This is especially useful in multi-user environments or when you need to manage documents on a per-user basis.

Search User-Specific Documents

To search documents associated with a specific user:

from r2r import VectorSearchSettings

vector_search_settings = VectorSearchSettings(
    search_filters={"user_id": user1_id}
)

search_results = app.search(
    query="What is AI?",
    vector_search_settings=vector_search_settings
)
print(search_results)

Expected Output:

{'results': [
    {
        'id': UUID('...'), 
        'score': 0.60, 
        'metadata': {
            'text': 'Artificial Intelligence is transforming industries.', 
            'title': 'AI Overview', 
            'user_id': '...', 
            ...
        }
    },
    ...
]}

This command performs a semantic search across a user’s documents. The user_id filter ensures you’re only searching within the specified user’s documents.
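Conceptually, a filtered semantic search narrows the candidates by metadata and then ranks what remains by vector similarity. The sketch below shows that idea with toy two-dimensional vectors and cosine similarity. The function filtered_search, the entry layout, and the vectors are assumptions made for illustration; R2R delegates this work to its vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(query_vector, entries, filters, limit=10):
    """Keep entries whose metadata matches every filter, then rank by score."""
    candidates = [
        e for e in entries
        if all(e["metadata"].get(k) == v for k, v in filters.items())
    ]
    scored = [
        {"score": cosine_similarity(query_vector, e["vector"]), "metadata": e["metadata"]}
        for e in candidates
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:limit]


# Toy embeddings; real embeddings have hundreds of dimensions.
entries = [
    {"vector": [1.0, 0.0], "metadata": {"title": "AI Overview", "user_id": "user1"}},
    {"vector": [0.6, 0.8], "metadata": {"title": "ML Basics", "user_id": "user1"}},
    {"vector": [1.0, 0.0], "metadata": {"title": "NLP Intro", "user_id": "user2"}},
]

for hit in filtered_search([1.0, 0.0], entries, {"user_id": "user1"}):
    print(hit["metadata"]["title"], round(hit["score"], 2))
```

Note that the third entry is an exact match for the query vector but never appears in the results, because the filter removes it before scoring.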

RAG on User-Specific Documents

To perform RAG on documents of a specific user:

from r2r import GenerationConfig

rag_results = app.rag(
    query="Explain AI briefly",
    vector_search_settings=vector_search_settings,
    rag_generation_config=GenerationConfig(model="gpt-3.5-turbo")
)
print(rag_results)

Expected Output:

{'results': [ChatCompletion(...)]}

This command combines the results from semantic search with the specified LLM to produce a completion which adheres to the OpenAI specification. Again, the user_id filter ensures you’re only searching within the specified user’s documents.
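One way to picture the retrieval-plus-generation step is as prompt assembly: the retrieved chunks become context that is placed in front of the query before it is sent to the LLM. The sketch below shows only that assembly step. The function build_rag_prompt and its template wording are hypothetical, and R2R's actual prompt is not shown in this guide.

```python
def build_rag_prompt(query: str, search_results: list[dict], max_chunks: int = 3) -> str:
    """Assemble a prompt from retrieved chunks (hypothetical template)."""
    context_lines = [
        f"[{i}] {result['metadata']['text']}"
        for i, result in enumerate(search_results[:max_chunks], start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the query using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
    )


# Search results shaped like the expected output of the search example.
results = [
    {
        "score": 0.60,
        "metadata": {"text": "Artificial Intelligence is transforming industries."},
    },
]
print(build_rag_prompt("Explain AI briefly", results))
```

Because the search results were already restricted by the user_id filter, only that user's text can end up in the context.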

Advanced Features

Delete Document(s) by ID

To delete a document by its ID:

document_deletion = app.delete(keys=["document_id"], values=[generate_id_from_label("doc3")])
print(document_deletion)

Expected Output:

r2r.main.services.management_service - INFO - Deleting entries with metadata: document_id=ad163263-9f1d-5107-880c-a04118efb87b
Document(s) ['ad163263-9f1d-5107-880c-a04118efb87b'] deleted successfully.

When you need to remove a specific document from your system, this is the command to use. It cleans up all traces of the document, removing its chunks from the vector store and its metadata from the document_info table. This ensures your R2R system stays clean and up-to-date.

Delete Document(s) by User ID

To delete all documents belonging to a user:

user_deletion = app.delete(keys=["user_id"], values=[user1_id])
print(user_deletion)

Expected Output:

r2r.main.services.management_service - INFO - Deleting entries with metadata: user_id=e063bb16-cc76-558b-9f94-afe212747cda
Documents ['fbe4ef3e-9402-5135-99fb-e5bf7ddcfa7a', '460ae2af-2a4b-58d5-b3e0-a142023d83bb'] deleted successfully.

This powerful command removes all documents associated with a particular user. R2R ensures a clean deletion, removing all related records from both the vector and metadata tables.
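The key point of both deletion commands is that chunks and metadata are removed together, so no orphaned records remain. Below is a minimal in-memory sketch of that behaviour. The helper delete_by_metadata and the list-of-dicts stores are assumptions for illustration and do not reflect R2R's storage layer.

```python
def delete_by_metadata(chunks, document_info, key, value):
    """Drop matching documents from both stores and report their IDs."""
    # Resolve the filter to a set of document IDs using the metadata rows.
    doomed = {row["document_id"] for row in document_info if row.get(key) == value}
    remaining_chunks = [c for c in chunks if c["document_id"] not in doomed]
    remaining_info = [r for r in document_info if r["document_id"] not in doomed]
    return remaining_chunks, remaining_info, sorted(doomed)


# Hypothetical stores: one chunk per sample document.
chunks = [
    {"document_id": "doc1", "text": "Artificial Intelligence is transforming industries."},
    {"document_id": "doc2", "text": "Machine Learning is a subset of AI focused on data-driven algorithms."},
    {"document_id": "doc3", "text": "Natural Language Processing enables computers to understand human language."},
]
document_info = [
    {"document_id": "doc1", "user_id": "user1"},
    {"document_id": "doc2", "user_id": "user1"},
    {"document_id": "doc3", "user_id": "user2"},
]

chunks, document_info, deleted = delete_by_metadata(chunks, document_info, "user_id", "user1")
print(deleted)
print([row["document_id"] for row in document_info])
```

The same helper handles deletion by document ID, since document_id is simply another key to match on.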

Update Document(s)

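To update an existing document (if you ran the deletion examples above, re-ingest doc1 first, since it was removed along with user1’s other documents):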

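updated_doc = Document(
    # Same ID as at ingestion, so R2R knows which document to replace
    id=generate_id_from_label("doc1"),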
    type="txt",
    data="Artificial Intelligence is revolutionizing industries worldwide.",
    metadata={"title": "AI Overview", "user_id": user1_id},
)

update_result = app.update_documents([updated_doc])
print(update_result)

updated_document_chunks = app.document_chunks(document_id=generate_id_from_label("doc1"))
print(updated_document_chunks)

Expected Output:

r2r.main.services.ingestion_service - INFO - Deleting documents which match on these keys and values: (['document_id', 'version'], ['460ae2af-2a4b-58d5-b3e0-a142023d83bb', 'v0']) ...
Document(s) 460ae2af-2a4b-58d5-b3e0-a142023d83bb updated.
[
    {
        'text': 'Artificial Intelligence is revolutionizing industries worldwide.', # updated text
        'title': 'AI Overview', 
        'user_id': 'e063bb16-cc76-558b-9f94-afe212747cda', 
        'version': 'v1', # updated version
        ...
    }
]

This command modifies existing documents in your R2R system. It does more than update metadata: it re-processes the entire document, recalculating embeddings and updating all associated records. R2R handles versioning for you (note the change from v0 to v1 in the output), so you can track changes over time.
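The version change from v0 to v1 in the expected output can be modelled with a small helper. This is a sketch under assumptions: next_version and apply_update are hypothetical names, the dictionary stands in for R2R's storage, and the "v" plus integer scheme is inferred only from the two version strings shown in this guide.

```python
def next_version(version: str) -> str:
    """Turn 'v0' into 'v1', 'v9' into 'v10', and so on."""
    if not version.startswith("v") or not version[1:].isdigit():
        raise ValueError(f"unexpected version format: {version!r}")
    return f"v{int(version[1:]) + 1}"


def apply_update(store: dict, document_id: str, new_text: str) -> dict:
    """Replace a stored document's text and bump its version."""
    current = store[document_id]  # raises KeyError if the document is missing
    updated = {"text": new_text, "version": next_version(current["version"])}
    store[document_id] = updated
    return updated


store = {
    "doc1": {
        "text": "Artificial Intelligence is transforming industries.",
        "version": "v0",
    }
}
result = apply_update(
    store, "doc1", "Artificial Intelligence is revolutionizing industries worldwide."
)
print(result["version"])  # v1
```

Looking the document up before replacing it mirrors why the update call needs the ID of an existing document.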

Batch Uploads

R2R supports batch ingestion and batch updates for efficient processing of multiple documents:

# Batch ingestion
batch_docs = [Document(...) for _ in range(100)]
app.ingest_documents(batch_docs)

# Batch update
updated_batch = [Document(...) for _ in range(100)]
app.update_documents(updated_batch)

Summary

R2R provides robust user and document management capabilities, allowing you to:

Track and manage documents on a per-user basis
Perform user-specific searches and RAG operations
Update and version documents, and inspect document chunks
Delete user-specific data

These features enable granular control over your data and support multi-user applications with ease.

For detailed setup and basic functionality, refer back to the R2R Quickstart. For more advanced usage and customization options, join the R2R Discord community.
