Knowledge Graph — Build, scale, and manage user-facing Retrieval-Augmented Generation applications.

Introduction

R2R’s KGProvider handles the creation, management, and querying of knowledge graphs in your applications. This guide offers an in-depth look at the system’s architecture, configuration options, and best practices for implementation.

For a practical, step-by-step guide on implementing knowledge graphs in R2R, including code examples and common use cases, see our GraphRAG Cookbook.

Configuration

Knowledge Graph Configuration

Knowledge graph settings are located in the r2r.toml file, under the [database] section.

[database]
provider = "postgres"
batch_size = 256

  [database.kg_creation_settings]
    kg_triples_extraction_prompt = "graphrag_triples_extraction_few_shot"
    entity_types = ["Person", "Organization", "Location"] # if empty, all entities are extracted
    relation_types = ["works at", "founded by", "invested in"] # if empty, all relations are extracted
    max_knowledge_triples = 100
    fragment_merge_count = 4 # number of fragments to merge into a single extraction
    generation_config = { model = "openai/gpt-4o-mini" } # and other params, model used for triplet extraction

  [database.kg_enrichment_settings]
    max_description_input_length = 65536 # increase if you want more comprehensive descriptions
    max_summary_input_length = 65536 # increase if you want more comprehensive summaries
    generation_config = { model = "openai/gpt-4o-mini" } # and other params, model used for node description and graph clustering
    leiden_params = {}

Environment variables take precedence over the config settings in case of conflicts. The R2R Docker setup includes configuration options that facilitate integration with a combined Postgres+pgvector database.
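
The precedence rule can be pictured with a small lookup function. This is an illustration of the rule only, not R2R's implementation: the function resolve_setting and the variable name PROVIDER are made up for the example, since this guide does not list the environment variables R2R reads.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> object:
    """Return the environment value for `name` if set, else the config value.

    Illustrative only: the upper-cased lookup key is an assumption, not
    the naming scheme R2R uses.
    """
    env = os.environ if env is None else env
    env_key = name.upper()
    if env_key in env:
        return env[env_key]
    return config[name]


config = {"provider": "postgres", "batch_size": 256}
fake_env = {"PROVIDER": "custom"}  # hypothetical variable name

print(resolve_setting("provider", config, fake_env))    # environment wins
print(resolve_setting("batch_size", config, fake_env))  # falls back to config
```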

Implementation Guide

File Ingestion and Graph Construction

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Ingest one or more files
result = client.ingest_files(["path/to/your/file.txt"])

# Create the graph. An empty list creates a graph on all ingested files;
# otherwise, add the IDs of the documents you want to create a graph on.
document_ids = []
creation_result = client.create_graph(document_ids)
print(f"Creation Result: {creation_result}")
# Wait for the creation to complete

# Enrichment runs on all nodes in the graph
enrichment_result = client.enrich_graph()
print(f"Enrichment Result: {enrichment_result}")
# Wait for the enrichment to complete

Graph-based Search

There are two types of graph-based search: local and global.

Local search is faster and more precise, but less comprehensive than global search.
Global search is slower but more comprehensive. Note that global search may perform a large number of LLM calls.

search_result = client.search(
    query="Find founders who worked at Google",
    kg_search_settings={"use_kg_search": True, "kg_search_type": "local"},
)
print(f"Search Result: {search_result}")
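
Because the search type is a plain string, a typo can go unnoticed until the request is sent. A small validating helper is one way to guard against that. The helper make_kg_search_settings is hypothetical, and the value "global" is assumed from the two search types named above; only "local" appears in the example call.

```python
from typing import Dict, Union

VALID_KG_SEARCH_TYPES = ("local", "global")  # the two types described above


def make_kg_search_settings(search_type: str = "local") -> Dict[str, Union[bool, str]]:
    """Hypothetical helper: build the kg_search_settings dict used in the call above."""
    if search_type not in VALID_KG_SEARCH_TYPES:
        raise ValueError(
            f"kg_search_type must be one of {VALID_KG_SEARCH_TYPES}, got {search_type!r}"
        )
    return {"use_kg_search": True, "kg_search_type": search_type}


local_settings = make_kg_search_settings("local")
global_settings = make_kg_search_settings("global")
print(local_settings)
print(global_settings)

# With a running server, these would be passed as:
# client.search(query="...", kg_search_settings=global_settings)
```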

Retrieval-Augmented Generation

rag_result = client.rag(
    query="Summarize the achievements of founders who worked at Google",
    kg_search_settings={"use_kg_search": True, "kg_search_type": "local"},
)
print(f"RAG Result: {rag_result}")

Best Practices

Optimize Chunk Size: Adjust the chunk_size based on your data and model capabilities.
Use Domain-Specific Entity Types and Relations: Customize these for more accurate graph construction.
Balance Batch Size: Adjust batch_size for optimal performance and resource usage.
Implement Caching: Cache frequently accessed graph data for improved performance.
Regular Graph Maintenance: Periodically clean and optimize your knowledge graph.

Advanced Topics

Custom Knowledge Graph Providers

Extend the KGProvider class to implement custom knowledge graph providers:

from typing import List

from r2r.base import KGProvider, KGConfig


class CustomKGProvider(KGProvider):
    def __init__(self, config: KGConfig):
        super().__init__(config)
        # Custom initialization...

    def ingest_files(self, file_paths: List[str]):
        # Custom implementation...
        raise NotImplementedError

    def search(self, query: str, use_kg_search: bool = True):
        # Custom implementation...
        raise NotImplementedError

    # Implement other required methods...

Integrating External Graph Databases

To integrate with external graph databases:

Implement a custom KGProvider.
Handle data synchronization between R2R and the external database.
Implement custom querying methods to leverage the external database’s features.
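
For the synchronization step, a simple starting point is to compare the set of triples on each side and compute what must be added to or removed from the external store. The sketch below works on plain in-memory sets and makes no calls to R2R or to any graph database. The Triple type and the plan_sync function are hypothetical names, and the sample data is invented.

```python
from typing import NamedTuple, Set, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def plan_sync(local: Set[Triple], external: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (to_add, to_remove) so that applying both makes `external` equal `local`."""
    to_add = local - external
    to_remove = external - local
    return to_add, to_remove


r2r_triples = {
    Triple("Alice", "works at", "Acme"),
    Triple("Acme", "founded by", "Bob"),
}
external_triples = {
    Triple("Alice", "works at", "Acme"),
    Triple("Carol", "invested in", "Acme"),  # stale: no longer present locally
}

to_add, to_remove = plan_sync(r2r_triples, external_triples)
print("add:", to_add)
print("remove:", to_remove)
```

A full set comparison is easy to reason about but scales with graph size; for large graphs, an incremental change log is likely a better fit.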

Scaling Knowledge Graphs

For large-scale applications:

Implement graph partitioning for distributed storage and processing.
Use graph-specific indexing techniques for faster querying.
Consider using a graph computing framework for complex analytics.
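
One simple form of graph partitioning is to assign each triple to a partition by hashing its subject entity, so all outgoing relations of an entity live together. The following is a sketch of that idea on in-memory data, not something R2R provides. The function names are hypothetical, and hash partitioning ignores graph structure, so it may split closely connected communities across partitions.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def partition_for(entity: str, num_partitions: int) -> int:
    """Map an entity name to a partition using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so is unsuitable for persistent placement.
    """
    digest = hashlib.sha256(entity.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_triples(triples: Iterable[Triple], num_partitions: int) -> Dict[int, List[Triple]]:
    """Group triples by the partition of their subject entity."""
    partitions: Dict[int, List[Triple]] = defaultdict(list)
    for triple in triples:
        partitions[partition_for(triple[0], num_partitions)].append(triple)
    return dict(partitions)


triples = [
    ("Alice", "works at", "Acme"),
    ("Alice", "invested in", "Initech"),
    ("Bob", "works at", "Initech"),
]
partitions = partition_triples(triples, num_partitions=4)
for index, members in sorted(partitions.items()):
    print(index, members)
```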

Troubleshooting

Common issues and solutions:

Ingestion Errors: Check file formats and encoding.
Query Performance: Optimize graph structure and use appropriate indexes.
Memory Issues: Adjust batch sizes and implement pagination for large graphs.
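
For the memory advice, pagination can be as simple as processing results in fixed-size pages instead of holding everything at once. The generator below is a generic sketch; paginate is a hypothetical helper, and the entity list is placeholder data rather than output from R2R.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def paginate(items: Sequence[T], page_size: int) -> Iterator[List[T]]:
    """Yield successive pages of at most `page_size` items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(items), page_size):
        yield list(items[start:start + page_size])


entities = [f"entity-{i}" for i in range(10)]  # placeholder data
pages = list(paginate(entities, page_size=4))
print([len(page) for page in pages])
```

In practice you would consume the generator one page at a time instead of collecting all pages into a list, which is done here only to show the page sizes.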

Conclusion

R2R’s Knowledge Graph system provides a powerful foundation for building applications that require structured data representation and complex querying capabilities. By understanding its components, following best practices, and leveraging its flexibility, you can create sophisticated information retrieval and analysis systems tailored to your specific needs.

For further customization and advanced use cases, refer to the R2R API Documentation and the GraphRAG Cookbook.
