Deduplication
Building and managing knowledge graphs through collections
In many cases, the chunks that make up a document contain duplicate elements. These duplicates add noise to a graph and degrade search results. Entity deduplication addresses this by condensing duplicate elements into a single, high-quality element.
Overview
Entity deduplication is the process of identifying and merging duplicate entities within a knowledge graph. R2R currently supports document-level deduplication, with graph-level deduplication planned for future releases.
Document-Level Deduplication
Document-level deduplication focuses on consolidating duplicate entities within a single document. This process:
- Identifies duplicate entities using configurable matching techniques
- Merges matched entities into a single high-quality entity
- Regenerates entity descriptions and embeddings using an LLM
- Updates related relationships to point to the merged entity
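The steps above can be sketched in a few lines of Python. This is a minimal illustration, not R2R's implementation: the `Entity` and `Relationship` classes, the `normalize` key, and the `deduplicate` function are all hypothetical names. It assumes exact matching on a normalized name, and it joins descriptions with plain string concatenation where R2R would regenerate them (and the embeddings) with an LLM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entity:  # hypothetical shape, for illustration only
    id: int
    name: str
    description: str


@dataclass
class Relationship:  # hypothetical shape, for illustration only
    subject_id: int
    predicate: str
    object_id: int


def normalize(name: str) -> str:
    """Matching key: lowercase the name and collapse whitespace."""
    return " ".join(name.lower().split())


def deduplicate(entities, relationships):
    # Identify duplicates: group entities that share a normalized name.
    groups = defaultdict(list)
    for entity in entities:
        groups[normalize(entity.name)].append(entity)

    merged, redirect = [], {}
    for group in groups.values():
        survivor = group[0]
        # Merge: keep each distinct description once, in first-seen order.
        # (Assumption: an LLM rewrite would replace this simple join.)
        descriptions = list(dict.fromkeys(e.description for e in group))
        merged.append(Entity(survivor.id, survivor.name, " ".join(descriptions)))
        for entity in group:
            redirect[entity.id] = survivor.id

    # Redirect relationships to the surviving entity and drop exact repeats.
    seen, updated = set(), []
    for rel in relationships:
        key = (redirect[rel.subject_id], rel.predicate, redirect[rel.object_id])
        if key not in seen:
            seen.add(key)
            updated.append(Relationship(*key))
    return merged, updated
```

Calling `deduplicate` with two entities named "Della" and "della " would return a single entity whose description combines both originals, with every relationship that referenced either one now pointing at the survivor.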
Following the process of creating a graph outlined in our graph cookbook, we can ingest a document. This process produces a number of entities and relationships, but many of them are duplicates.
When extracting elements from The Gift of the Magi by O. Henry, we find that there are 129 total entities, but only 20 of them are unique.
Extracted Entities Before Deduplication
After running the deduplication process, we are left with 20 entities. Duplicates have been merged, and their descriptions have been updated so that no context is lost in the merge.
Deduplication Techniques
R2R supports (or plans to support) several deduplication techniques, each with its own advantages:
Merging Strategy
When duplicates are identified, R2R merges them according to the following rules:
- Name Retention: Keeps the most common form of the entity name
- Description Consolidation: Combines descriptions from all duplicates and uses an LLM to generate a comprehensive, non-redundant description
- Category Resolution: Preserves the most specific category if categories differ
- Metadata Merging: Combines metadata from all duplicates, resolving conflicts through configurable rules
- Relationship Redirection: Updates all relationships to point to the merged entity
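To make these rules concrete, here is a rough sketch of how a group of duplicates might be collapsed into one record. Everything in it is an assumption made for illustration: the dictionary layout, the `merge_duplicates` function, the `category_depth` table used to rank category specificity, and the `summarize` and `resolve` hooks that stand in for the LLM rewrite and the configurable conflict rules. Relationship redirection is left out to keep the example short.

```python
from collections import Counter


def merge_duplicates(
    duplicates,
    category_depth,
    summarize=" ".join,                # stand-in for an LLM-generated summary
    resolve=lambda old, new: old,      # conflict rule: first value wins
):
    """Merge a list of duplicate entity dicts into a single dict.

    Each dict is assumed to carry "name", "description", "category",
    and optionally "metadata" (all hypothetical field names).
    """
    # Name retention: keep the most common surface form.
    name = Counter(d["name"] for d in duplicates).most_common(1)[0][0]

    # Description consolidation: drop verbatim repeats, then summarize.
    descriptions = list(dict.fromkeys(d["description"] for d in duplicates))
    description = summarize(descriptions)

    # Category resolution: a deeper category is treated as more specific.
    category = max(
        (d["category"] for d in duplicates),
        key=lambda c: category_depth.get(c, 0),
    )

    # Metadata merging: union of keys, conflicts settled by `resolve`.
    metadata = {}
    for d in duplicates:
        for key, value in d.get("metadata", {}).items():
            metadata[key] = resolve(metadata[key], value) if key in metadata else value

    return {
        "name": name,
        "description": description,
        "category": category,
        "metadata": metadata,
    }
```

Passing a different `resolve` callable (for example, one that keeps the newer value) changes how metadata conflicts are settled without touching the rest of the merge.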
Future Developments
Runtime Configurable Techniques
Runtime configurable deduplication techniques will allow for more advanced strategies. These include n-character block matching, semantic similarity matching, and fuzzy name matching.
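As a rough idea of how two of these techniques could combine, the sketch below groups names into blocks by their first n characters and then runs fuzzy matching only within each block. It uses the standard library's `difflib.SequenceMatcher`. The function names, the default block size, and the 0.85 threshold are assumptions chosen for illustration and say nothing about how R2R will expose these options. Semantic similarity matching is omitted because it needs an embedding model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name: str, n: int = 3) -> str:
    """n-character block: the first n characters of the lowercased name."""
    return name.strip().lower()[:n]


def fuzzy_duplicates(names, n=3, threshold=0.85):
    """Return candidate duplicate pairs found inside each block."""
    # Blocking limits comparisons to names that share a prefix,
    # avoiding a full pairwise scan over every entity.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name, n)].append(name)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            # Similarity ratio in [0, 1]; 1.0 means identical strings.
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs
```

Prefix blocking trades recall for speed: names that differ in their first few characters, such as "Jim" and "James", land in different blocks and are never compared.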
Graph-Level Deduplication
A major feature planned for R2R’s deduplication capabilities is graph-level deduplication. This will:
- Identify and merge duplicates across multiple documents within a graph
- Maintain provenance information for merged entities
- Provide configurable merging rules at the graph level
- Support cross-document relationship consolidation
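Since graph-level deduplication is not yet available, the following is only a speculative sketch of what cross-document merging with provenance could look like. It does not reflect R2R's planned design. The `merge_across_documents` function, the input layout (a mapping from document ID to entity names), and the exact-name matching rule are all assumptions.

```python
from collections import defaultdict


def merge_across_documents(entities_by_document):
    """Merge same-named entities from several documents.

    Returns a mapping from normalized name to a record that keeps the
    first-seen display name and the list of contributing document IDs.
    """
    merged = defaultdict(lambda: {"name": None, "provenance": []})
    for document_id, names in entities_by_document.items():
        for name in names:
            key = " ".join(name.lower().split())  # normalized matching key
            record = merged[key]
            record["name"] = record["name"] or name
            # Provenance: remember each source document once.
            if document_id not in record["provenance"]:
                record["provenance"].append(document_id)
    return dict(merged)
```

Keeping the list of source documents alongside each merged entity makes it possible to trace a merged entity back to its origins, or to undo a merge that turns out to be wrong.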
Entity deduplication is a critical step in maintaining graph quality. While automatic deduplication is powerful, it’s recommended to review results, especially in domains where entity disambiguation is crucial.