Knowledge Graphs

Building and managing graphs through collections

Overview

R2R allows you to build and analyze knowledge graphs from your documents through a collection-based architecture. The system extracts entities and relationships from documents, enabling richer search capabilities that understand connections between information.

The process works in several key stages:

1. Documents are first ingested and entities/relationships are extracted
2. Collections serve as containers for documents and their corresponding graphs
3. Extracted information is pulled into the collection’s graph
4. Communities can be built to identify higher-level concepts
5. The resulting graph enhances search with relationship-aware queries
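
The stages above can be sketched as one helper that chains the client calls used in the rest of this guide. This is a minimal sketch, not an official recipe: `build_collection_graph` is a hypothetical name, `client` is assumed to behave like the `R2RClient` shown below, the response is assumed to have the `["results"]["document_id"]` shape used in this guide, and each call is assumed to have finished before the next one runs, which may not hold if extraction runs in the background.

```python
from typing import Any


def build_collection_graph(client: Any, file_path: str, collection_id: str) -> str:
    """Run the stages in order and return the new document's ID.

    Assumes `client` exposes the same methods as the R2RClient in this guide
    and that the document ends up in the collection given by `collection_id`.
    """
    # Ingest the file, then extract entities and relationships from it.
    ingest_response = client.documents.create(file_path=file_path)
    document_id = ingest_response["results"]["document_id"]
    client.documents.extract(document_id)

    # Pull the extractions into the collection's graph.
    client.graphs.pull(collection_id=collection_id)

    # Build communities over the graph.
    client.graphs.build(collection_id=collection_id)

    return document_id
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real server.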

Collections in R2R are flexible containers that support multiple documents and provide features for access control and graph management. A document can belong to multiple collections, allowing for different organizational schemes and sharing patterns.

The resulting knowledge graphs improve search accuracy by understanding relationships between concepts rather than just performing traditional document search.

Ingestion and Extraction

Before we can extract entities and relationships from a document, we must ingest a file. After we’ve successfully ingested a file, we can extract the entities and relationships from the document.

In the following script, we fetch The Gift of the Magi by O. Henry and ingest it into our R2R server. We then begin the extraction process, which may take a few minutes to run.

Python

import os
import tempfile

import requests
from r2r import R2RClient

# Set up the client
client = R2RClient("http://localhost:7272")

# Fetch the text file, failing early on an HTTP error
url = "https://www.gutenberg.org/cache/epub/7256/pg7256.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Write the text to a temporary file
temp_dir = tempfile.gettempdir()
temp_file_path = os.path.join(temp_dir, "gift_of_the_magi.txt")
with open(temp_file_path, "w", encoding="utf-8") as temp_file:
    temp_file.write(response.text)

# Ingest the file
ingest_response = client.documents.create(file_path=temp_file_path)
document_id = ingest_response["results"]["document_id"]

# Extract entities and relationships
extract_response = client.documents.extract(document_id)

# View extracted knowledge
entities = client.documents.list_entities(document_id)
relationships = client.documents.list_relationships(document_id)

# Clean up the temporary file
os.unlink(temp_file_path)

As this script runs, we see indications of successful ingestion and extraction.

Figure (Ingestion and Entities tabs): Both ingestion and extraction were successful, as seen in the R2R dashboard.
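
Because extraction may take a few minutes, a script that depends on its results may need to wait before reading them. The helper below is a generic polling sketch and not part of the R2R client; `wait_until` and its parameters are hypothetical names, and how to detect completion is left to the caller.

```python
import time
from typing import Callable


def wait_until(
    is_done: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll `is_done` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

One possible condition is whether `client.documents.list_entities(document_id)` has started returning results, though the exact response shape depends on the R2R version in use.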

Deduplication

If you would like to deduplicate the extracted entities, you can run the following method. To learn more, see the deduplication documentation.

Python

from r2r import R2RClient

# Set up the client
client = R2RClient("http://localhost:7272")

# Deduplicate the entities extracted from a document, identified by its document ID
client.documents.deduplicate("20e29a97-c53c-506d-b89c-1f5346befc58")

While the exact number of extracted entities and relationships will differ across models, this particular document produces approximately 120 entities, of which only 20 are distinct.
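
To make the gap between extracted and distinct entities concrete, here is a toy sketch of one simple deduplication strategy: merging entities whose names match after normalization. It is an illustration only and is not R2R's deduplication method; the `name` and `description` fields, the sample records, and both function names are assumptions made for the example.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def deduplicate(entities: list[dict]) -> list[dict]:
    """Merge entities that share a normalized name, keeping every description."""
    merged: dict[str, dict] = {}
    for entity in entities:
        key = normalize(entity["name"])
        if key not in merged:
            # The first spelling seen becomes the canonical name.
            merged[key] = {"name": entity["name"], "descriptions": []}
        merged[key]["descriptions"].append(entity["description"])
    return list(merged.values())


# Hand-written sample records, not real extraction output.
extracted = [
    {"name": "Della", "description": "Wife of Jim"},
    {"name": "della", "description": "Sells her hair"},
    {"name": "Jim", "description": "Husband of Della"},
    {"name": "Jim ", "description": "Sells his watch"},
]
unique = deduplicate(extracted)
print(len(extracted), "->", len(unique))  # 4 -> 2
```

Exact-name matching is the simplest possible rule; it would miss aliases such as a nickname and a full name, which is one reason a real system may rely on a model instead.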

Managing Collections

Graphs are built within a collection, which lets us add many documents to a graph and share our graphs with other users. When we ingested the file above, it was added to our default collection.

Each collection has a description which is used in the graph creation process. This can be set by the user, or generated using an LLM.

Python

from r2r import R2RClient

# Set up the client
client = R2RClient("http://localhost:7272")

# Update the description of the default collection
collection_id = "122fdf6a-e116-546b-a8f6-e4cb2e2c0a09"
update_result = client.collections.update(
    id=collection_id,
    generate_description=True,  # LLM generated
)

Figure: The LLM-generated description for our collection.

Pulling Extractions into the Graph

Our graph will not contain the extractions from our documents until we pull them into the graph. This gives developers more granular control over the creation and management of graphs.

Recall that we already extracted the entities and relationships from the document itself; this means that we can pull a document into many graphs without having to rerun the extraction process.

Python

from r2r import R2RClient

# Set up the client
client = R2RClient("http://localhost:7272")

# Pull the extractions from all documents into the default collection
collection_id = "122fdf6a-e116-546b-a8f6-e4cb2e2c0a09"
client.graphs.pull(
    collection_id=collection_id
)

As soon as we pull the extractions into the graph, we can begin using the graph in our searches. We can confirm that the entities and relationships were pulled into the collection, as well.

Figure (Entities and Entity Visualization tabs): Entities are pulled in from the document to the collection.
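
Since extraction is decoupled from graph construction, the same pull can be repeated for several collections. The sketch below loops over collection IDs and issues the `graphs.pull` call used in this section. `pull_into_collections` is a hypothetical helper, `client` is assumed to behave like `R2RClient`, and the documents are assumed to already belong to each collection (adding them is not covered here).

```python
from typing import Any, Iterable


def pull_into_collections(client: Any, collection_ids: Iterable[str]) -> list[str]:
    """Call graphs.pull once per collection and return the IDs that were pulled."""
    pulled = []
    for collection_id in collection_ids:
        # Same call as used elsewhere in this guide; extraction is not rerun here.
        client.graphs.pull(collection_id=collection_id)
        pulled.append(collection_id)
    return pulled
```

A call such as `pull_into_collections(client, [default_id, shared_id])` would populate both graphs from one extraction run, where `default_id` and `shared_id` are placeholder variables holding your own collection IDs.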

Building Communities

To further enhance our graph, we can build communities, which cluster the entities and relationships inside our graph. This allows us to capture higher-level concepts that exist within our data.

Python

from r2r import R2RClient

# Set up the client
client = R2RClient("http://localhost:7272")

# Build the communities for the default collection
collection_id = "122fdf6a-e116-546b-a8f6-e4cb2e2c0a09"
client.graphs.build(
    collection_id=collection_id
)

We can see that the resulting communities capture overall themes and concepts within the story.

Figure: The resulting communities, generated from the clustering process.
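
To give a feel for what clustering over entities and relationships means, the toy sketch below groups entities into connected components, so that anything linked directly or indirectly lands in the same group. Connected components are a deliberately simple stand-in and not the clustering method R2R uses; the relationship pairs are hand-written for illustration rather than taken from real extraction output.

```python
from collections import defaultdict


def connected_components(relationships: list[tuple[str, str]]) -> list[set[str]]:
    """Group entities that are linked, directly or indirectly, by a relationship."""
    # Build an undirected adjacency map from the relationship pairs.
    neighbors: dict[str, set[str]] = defaultdict(set)
    for source, target in relationships:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen: set[str] = set()
    communities = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        communities.append(component)
    return communities


# Hand-written sample relationships (entity pairs only, labels omitted).
relationships = [
    ("Della", "Jim"),
    ("Della", "Hair"),
    ("Jim", "Watch"),
    ("Magi", "Babe in the Manger"),
]
for community in connected_components(relationships):
    print(sorted(community))
```

Here the couple and their possessions form one group and the Magi reference forms another. A production clustering step would typically also split large components into smaller, denser groups and summarize each one.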

Graph Search

Now that we have built our graph, we can query it. Good questions for graphs are those that require a deep understanding of relationships and ideas that span multiple documents.

Python

from r2r import R2RClient

# Set up the client
client = R2RClient("http://localhost:7272")

# Run a search with graph search enabled
results = client.retrieval.search(
    """
    What items did Della and Jim each originally own,
    what did they do with those items, and what did they
    ultimately give each other?
    """,
    search_settings={
        "graph_settings": {"enabled": True},
    },
)

Figure: Performing a multi-hop query over the graph.
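
The idea of a multi-hop query can be illustrated with a toy traversal over subject-predicate-object triples. This sketch does not reflect how R2R executes graph search; the triples are hand-written from the story, and `follow` and the predicate names are hypothetical.

```python
Triple = tuple[str, str, str]

# Hand-written facts from the story, not R2R extraction output.
triples: list[Triple] = [
    ("Della", "owned", "long hair"),
    ("Della", "sold", "long hair"),
    ("Della", "bought", "watch chain"),
    ("watch chain", "given_to", "Jim"),
    ("Jim", "owned", "gold watch"),
    ("Jim", "sold", "gold watch"),
    ("Jim", "bought", "combs"),
    ("combs", "given_to", "Della"),
]


def follow(facts: list[Triple], start: str, path: list[str]) -> set[str]:
    """Return the entities reached from `start` by following one predicate per hop."""
    frontier = {start}
    for predicate in path:
        # Advance every entity in the frontier along edges with this predicate.
        frontier = {
            obj for subj, pred, obj in facts
            if subj in frontier and pred == predicate
        }
    return frontier


# One hop: what did Della sell?
print(follow(triples, "Della", ["sold"]))                # {'long hair'}
# Two hops: who received what Della bought?
print(follow(triples, "Della", ["bought", "given_to"]))  # {'Jim'}
```

The second query cannot be answered from any single triple, which is the kind of question where relationship-aware search is expected to help most.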