Introduction

This guide explains how to configure R2R to construct a knowledge graph automatically during file ingestion. The constructed graph can then optionally be used in downstream R2R RAG.

When running locally, knowledge graphs are constructed using the newly released Triplex model.

Setup

R2R uses Neo4j as the primary knowledge graph provider. To set up:

# Run `r2r docker-down` first to bring down any existing R2R Docker deployment.
r2r --config-name=local_neo4j_kg serve --docker --docker-ext-neo4j

Pass the flag `--docker-ext-ollama` if you would like to run with Ollama bundled into the R2R Docker deployment.

In addition, use the config `--config-name=neo4j_kg` to run with cloud LLM providers.

Local RAG Setup

When running with local RAG, you must have the Triplex model available locally. Pull it, along with the other relevant models, into the running Ollama container:

# Check the name of the ollama container and modify the command if it differs from r2r-ollama-1
docker exec -it r2r-ollama-1 ollama pull triplex
docker exec -it r2r-ollama-1 ollama pull llama3
docker exec -it r2r-ollama-1 ollama pull mxbai-embed-large

Basic Example

Ingestion

Ingest some sample data and visualize the resulting knowledge graph.

echo "John is a person that works at Google.\n\nPaul is a person that works at Microsoft that collaborates with John." \
>> test.txt
r2r ingest-files test.txt

Visualization

r2r inspect-knowledge-graph
== John ==
  IS_EMPLOYED_BY:
    - Google

== Paul ==
  IS_EMPLOYED_BY:
    - Microsoft
  COLLABORATES_WITH:
    - John

== Graph Statistics ==
Number of nodes: 4
Number of edges: 3
Number of connected components: 2

== Most Central Nodes ==
  Paul: 0.6667
  John: 0.3333
  Google: 0.0000
  Microsoft: 0.0000
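
The node and edge counts and the centrality scores above can be reproduced from the extracted triples with a few lines of plain Python. The sketch below assumes the centrality metric is out-degree divided by (n - 1), which matches the printed scores:

```python
# Triples mirroring the inspect-knowledge-graph output above.
triples = [
    ("John", "IS_EMPLOYED_BY", "Google"),
    ("Paul", "IS_EMPLOYED_BY", "Microsoft"),
    ("Paul", "COLLABORATES_WITH", "John"),
]

# Collect the distinct nodes and count outgoing edges per node.
nodes = {n for s, _, o in triples for n in (s, o)}
out_degree = {n: 0 for n in nodes}
for s, _, _ in triples:
    out_degree[s] += 1

# Assumed metric: out-degree centrality, i.e. out-degree / (n - 1).
centrality = {n: out_degree[n] / (len(nodes) - 1) for n in nodes}

print(f"Number of nodes: {len(nodes)}")    # 4
print(f"Number of edges: {len(triples)}")  # 3
for n, c in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"  {n}: {c:.4f}")
```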

Visualizing the created knowledge graph with Neo4j

R2R also performs traditional chunking and embedding for semantic search during ingestion. Verify this:

r2r search --query="who is john?"
Terminal Output
{'id': '5c7731be-81d6-5ab1-ae88-458fee8c462b', 'score': 0.5679587721418247, 'metadata': {'text': 'John is a person that works at Google.\n\nPaul is a person that works at Microsoft that collaborates with John.', 'title': 'test.txt', 'version': 'v0', 'chunk_order': 0, 'document_id': '56f1fdc0-df48-5245-9910-75a0cfb5c641', 'extraction_id': 'e7db5809-c9e0-529d-85bb-5a78c5d21a94', 'associatedQuery': 'who is john?'}}
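
Each hit is returned as a plain dict. A small helper (hypothetical, not part of the R2R client) makes it easy to pull out the fields you usually want:

```python
# A search hit, abbreviated from the terminal output above.
hit = {
    "id": "5c7731be-81d6-5ab1-ae88-458fee8c462b",
    "score": 0.5679587721418247,
    "metadata": {
        "text": "John is a person that works at Google.",
        "title": "test.txt",
        "chunk_order": 0,
    },
}

def summarize_hit(hit):
    # Compact "source (score): snippet" line for logging or display.
    meta = hit["metadata"]
    snippet = meta["text"][:40]
    return f"{meta['title']} (score {hit['score']:.3f}): {snippet}"

print(summarize_hit(hit))
```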

Scaling up

Scale up your knowledge graph creation efforts by ingesting the got.txt sample file distributed with R2R. This file can take several minutes to process when running locally, due to the computational cost of local graph construction.

r2r ingest-sample-file --option=1

We can then inspect the output graph with the CLI:

r2r inspect-knowledge-graph --limit=10000
== Stannis ==
  MEMBER_OF:
    - House Baratheon

== Queen of the realm ==
  MARRIED_TO:
    - Robert Baratheon

== Jaime Lannister ==
  SIBLING_OF:
    - Tyrion
    - Cersei
  KILLED:
    - Aerys II Targaryen

...

== Robb Stark ==
  CHILD_OF:
    - Ned Stark

== Maester Luwin ==
  TEACHES_AT:
    - Westeros

== Graph Statistics ==
Number of nodes: 37
Number of edges: 32
Number of connected components: 15

== Most Central Nodes ==
  Jon Arryn: 0.1111
  Jaime Lannister: 0.0833
  Ned Stark: 0.0833
  Rhaegar Targaryen: 0.0833
  Prince Joffrey: 0.0833
Time taken to print relationships: 0.23 seconds

Performance improves significantly when specifying desired entities and relationships for extraction, as shown later. This example uses a default selection of 20 generic entity types and 50 generic relationship types, which limits performance.

Advanced Example

Customized Entities and Relations

For a more complex example, ingest startup company information from the YC company directory. First, specify entity types and relations for the constructed knowledge graph to improve performance:

r2r/examples/scripts/advanced_kg_cookbook.py
from r2r import EntityType, Relation

entity_types = [
    EntityType("ORGANIZATION"),
    EntityType("COMPANY"),
    EntityType("SCHOOL"),
    # ... more entity types
]

relations = [
    Relation("EDUCATED_AT"),
    Relation("WORKED_AT"),
    Relation("FOUNDED"),
    # ... more relations
]
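
To see why constraining the schema helps, the following sketch (illustrative only, not R2R's actual template) shows how such lists might be folded into an extraction prompt, so the model chooses from a fixed label set instead of inventing its own:

```python
# Illustrative only: R2R's real prompt templates live in its prompt provider.
entity_types = ["ORGANIZATION", "COMPANY", "SCHOOL"]
relations = ["EDUCATED_AT", "WORKED_AT", "FOUNDED"]

# Build a prompt that restricts extraction to the allowed labels.
prompt_template = (
    "Extract (entity, relation, entity) triples from the text below.\n"
    f"Allowed entity types: {', '.join(entity_types)}\n"
    f"Allowed relations: {', '.join(relations)}\n"
    "Text: {input}"
)

print(prompt_template)
```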

Next, submit a request to the R2R server to update the knowledge graph ingestion prompt to use your specified entity types:

r2r/examples/scripts/advanced_kg_cookbook.py
client = R2RClient(base_url=base_url)
r2r_prompts = R2RPromptProvider()

# use few-shot example to improve cloud provider performance
prompt_base = (
    "zero_shot_ner_kg_extraction"
    if local_mode
    else "few_shot_ner_kg_extraction"
)

update_kg_prompt(client, r2r_prompts, prompt_base, entity_types, relations)

Ingesting a single company

Test this approach on a single company by executing the following command:

# add --local_mode=False when using cloud providers
python -m r2r.examples.scripts.advanced_kg_cookbook --max_entries=1

The extracted relationships will then be printed, like those shown below:

Example Output
== Airbnb ==
  PRODUCT:
    - Book accommodations
  HAS:
    - 7M listings
    - Dublin
    - London
    - Barcelona
    - Paris
    - Milan
    - Copenhagen
    - Berlin
    - Moscow
    - São Paulo
    - Sydney
    - Singapore
  LOCATION:
    - 191+ countries
    - San Francisco
  FOUNDED:
    - 2008
    - Brian Chesky
    - August of 2008
  ANNOUNCED:
    - Airbnb Rooms listing category launch
  TEAM_SIZE:
    - 6132

== Joe Gebbia ==
  FOUNDED:
    - Airbnb
    - Samara
  PARTICIPATED:
    - Board of Directors at Airbnb

== Brian Chesky ==
  FOUNDED:
    - Airbnb
  ASSOCIATED:
    - New York
  EDUCATED_AT:
    - Rhode Island School of Design
  WORKED_AT:
    - Industrial Design

== Nathan Blecharczyk ==
  FOUNDED:
    - Airbnb China
  EDUCATED_AT:
    - Harvard University
  WORKED_AT:
    - Computer Science

== Brian Chesky's cofounder ==
  RAISED:
    - $25M

== Graph Statistics ==
Number of nodes: 33
Number of edges: 31
Number of connected components: 4

== Most Central Nodes ==
  Airbnb: 0.6250
  Brian Chesky: 0.1250
  Joe Gebbia: 0.0938
  Nathan Blecharczyk: 0.0938
  Brian Chesky's cofounder: 0.0312

Again, you can use the Neo4j browser to visualize the graph produced by this process.

Scaling Up

You are now ready to ingest a much larger dataset:

python -m r2r.examples.scripts.advanced_kg_cookbook --max_entries=100

The graph is now much richer after ingesting this data. Focusing on San Francisco with a limit of 250 nodes reveals the more complex structure that emerges. You can then run various queries against this graph:

# Find all founders
query = """
MATCH (p:PERSON)-[:FOUNDED]->(c)
RETURN p.id AS Founder, c.id AS Company
ORDER BY c.id
LIMIT 10;
"""
# [{'Founder': 'Nathan Blecharczyk', 'Company': 'Airbnb'}, {'Founder': 'Brian Chesky', 'Company': 'Airbnb'}, {'Founder': 'Joe Gebbia', 'Company': 'Airbnb'}, {'Founder': 'Tommy Guo', 'Company': 'Airfront'}, {'Founder': 'Joanne Wang', 'Company': 'Airfront'}, {'Founder': 'Adam Tilton', 'Company': 'Aktive'}, {'Founder': 'Abraham Heifets', 'Company': 'Atomwise'}, {'Founder': 'Nicholas Charriere', 'Company': 'Axilla'}, {'Founder': 'Caitlin', 'Company': 'B2B marketing software'}, {'Founder': 'Timmy', 'Company': 'B2B marketing software'}]

# Find 2-time founders
query = """
MATCH (p:PERSON)-[:FOUNDED]->(c:ORGANIZATION)
WITH p.id AS Person, COUNT(c) AS CompaniesCount
RETURN Person, CompaniesCount
ORDER BY CompaniesCount DESC
LIMIT 10;
"""
# [{'Person': 'Ilana Nasser', 'CompaniesCount': 3}, {'Person': 'Eric', 'CompaniesCount': 2}, {'Person': 'Kris Pahuja', 'CompaniesCount': 2}, {'Person': 'Sam', 'CompaniesCount': 2}, {'Person': 'Tom Blomfield', 'CompaniesCount': 2}, {'Person': 'Umur Cubukcu', 'CompaniesCount': 2}, {'Person': 'Jason', 'CompaniesCount': 2}, {'Person': 'Joe Gebbia', 'CompaniesCount': 2}, {'Person': 'Adam Tilton', 'CompaniesCount': 2}, {'Person': 'Alex', 'CompaniesCount': 2}]

# Find companies with AI products
query = """
MATCH (c:ORGANIZATION)-[r:PRODUCT]->(t)
WHERE t.id CONTAINS 'AI'
RETURN DISTINCT c.id AS Company, t.id AS Product
ORDER BY c.id
LIMIT 10;
"""
# [{'Company': 'AgentsForce', 'Product': 'AI support agents'}, {'Company': 'Airfront', 'Product': 'AI-first email platform'}, {'Company': 'Airfront', 'Product': 'AI-first email platform with built-in automations'}, {'Company': 'Airfront', 'Product': 'AI automation platform'}, {'Company': 'Axflow', 'Product': 'AI app'}, {'Company': 'Clarum', 'Product': 'AI-powered due diligence solutions'}, {'Company': 'Clarum', 'Product': 'AI-powered due diligence'}, {'Company': 'CommodityAI', 'Product': 'AI-automation platform'}, {'Company': 'Dawn', 'Product': 'Analytics for AI products'}, {'Company': 'Decipher', 'Product': 'AI-powered user impact summaries'}]
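
Query results come back as plain lists of dicts, so they compose naturally with client-side Python. As an illustration, the "2-time founders" aggregation can also be computed client-side over rows shaped like the first query's output (names taken from the sample outputs above):

```python
from collections import Counter

# Rows shaped like the FOUNDED query output above (a small sample).
rows = [
    {"Founder": "Joe Gebbia", "Company": "Airbnb"},
    {"Founder": "Joe Gebbia", "Company": "Samara"},
    {"Founder": "Brian Chesky", "Company": "Airbnb"},
    {"Founder": "Nathan Blecharczyk", "Company": "Airbnb China"},
]

# Count companies per founder, mirroring the Cypher WITH ... COUNT aggregation.
counts = Counter(r["Founder"] for r in rows)
repeat_founders = {p: c for p, c in counts.items() if c >= 2}
print(repeat_founders)  # {'Joe Gebbia': 2}
```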

Knowledge Graph Agents

Knowledge graph agents currently perform best with advanced models like GPT-4 and Claude 3.5. We’re actively working on enhancing local performance to match these capabilities.

To use a knowledge graph agent for complex queries, follow the example below. This code demonstrates how to update the agent’s prompt with custom entity types and relations, perform a search query, and execute a RAG (Retrieval-Augmented Generation) query:

r2r/examples/scripts/advanced_kg_cookbook.py
if not local_mode:

    update_kg_prompt(
        client, r2r_prompts, "kg_agent", entity_types, relations
    )

    result = client.search(
        query="Find up to 10 founders that worked at Google",
        use_kg_search=True,
    )["results"]

    print("result:\n", result)
    print("Search Result:\n", result["kg_search_results"])

    result = client.rag(
        query="Find up to 10 founders that worked at Google",
        use_kg_search=True,
    )
    print("RAG Result:\n", result)
Expected Output
Search Result:
# Search output structured as (agent_query, results)
("\nMATCH (p:PERSON)-[:FOUNDED]->(o:ORGANIZATION)\nMATCH (p)-[:WORKED_AT]->(g:ORGANIZATION)\nWHERE g.name = 'Google'\nRETURN p.name AS Founder, o.name AS Organization\nLIMIT 10;\n",
    [
        [
            {'Founder': 'Kris Pahuja', 'Organization': 'Piramidal'}, 
            {'Founder': 'Kris Pahuja', 'Organization': 'Gyftgo'}, 
            {'Founder': 'Ohad Navon', 'Organization': 'Octo'}, 
            {'Founder': 'Edrei', 'Organization': 'Stellar Sleep'}, 
            {'Founder': 'Pedro Saratscheff', 'Organization': 'Ruuf'}
        ]
    ]
)
RAG Result:
[
    ChatCompletion(
        choices=[
            Choice(
                finish_reason='stop', 
                index=0, 
                logprobs=None,
                message=ChatCompletionMessage(
                    content='Here are the founders that worked at Google:\n\n1. Kris Pahuja, founder of Piramidal [1]\n2. Kris Pahuja, founder of Gyftgo [2]\n3. Ohad Navon, founder of Octo [3]\n4. Edrei, founder of Stellar Sleep [4]', 
                    role='assistant', 
                    function_call=None, 
                    tool_calls=None
                )
            )
        ], 
        ...
    )
]

This approach allows for more flexible and complex querying of the knowledge graph.
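
Note that kg_search_results is a (generated_cypher, result_sets) pair, so it can be unpacked directly. The values below are abbreviated copies from the expected output above:

```python
# kg_search_results is structured as (agent_query, results); unpack it.
kg_search_results = (
    "MATCH (p:PERSON)-[:FOUNDED]->(o:ORGANIZATION) ... LIMIT 10;",
    [
        [
            {"Founder": "Kris Pahuja", "Organization": "Piramidal"},
            {"Founder": "Ohad Navon", "Organization": "Octo"},
        ]
    ],
)

cypher, result_sets = kg_search_results
for row in result_sets[0]:
    print(f"{row['Founder']} -> {row['Organization']}")
```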

Summary

Knowledge graphs in R2R provide a powerful way to structure and query information extracted from your documents. By combining vector search, semantic search, and structured queries, you can build sophisticated retrieval systems that leverage both unstructured and structured data.

Further, knowledge graphs are a core component of innovative new techniques like GraphRAG (Graphs + Retrieval-Augmented Generation), recently pioneered by Microsoft. GraphRAG provides a rich understanding of text datasets by combining text extraction, network analysis, and LLM prompting and summarization into a single end-to-end system.

For detailed setup and basic functionality, refer back to the R2R Quickstart. For more advanced usage and customization options, join the R2R Discord community.