Design choices of the Context Graph
Why a knowledge graph?
When you’re trying to build an enterprise-scale “wiki” or memory system for an AI, the first question is how to store the data. I chose a knowledge graph. Why? Because linking many entities, names, and concepts together allows both humans and AI to traverse and discover related information naturally.
If you just dump documents into a vector database, you get semantic search, which is useful, but it lacks navigable links from one concept to another. A knowledge graph lets you explicitly state that “Person A is the CEO of Company B” and “Company B is headquartered in Region C”. It builds a web of context that an AI can navigate.
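The difference can be sketched in a few lines. Assuming a toy in-memory triple store (the names below are purely illustrative; a real RDF store uses URIs), an application or AI can hop from entity to entity:

```typescript
// A toy in-memory triple store. Subjects, predicates, and objects are
// plain strings here; a real RDF store would use URIs.
type Triple = { s: string; p: string; o: string };

const triples: Triple[] = [
  { s: "PersonA", p: "ceoOf", o: "CompanyB" },
  { s: "CompanyB", p: "headquarteredIn", o: "RegionC" },
];

// Follow outgoing links from an entity: the "navigable" part that a
// plain vector index lacks.
function neighbors(entity: string): string[] {
  return triples.filter(t => t.s === entity).map(t => t.o);
}

// Two hops: PersonA -> CompanyB -> RegionC
const hop1 = neighbors("PersonA"); // ["CompanyB"]
const hop2 = neighbors(hop1[0]);   // ["RegionC"]
console.log(hop1, hop2);
```

A vector index can tell you which chunks *sound* related to a query; this structure tells you which entities *are* related, and how.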
What is a Context Graph?
While a Knowledge Graph focuses on linking entities and concepts together to form a web of information, a Context Graph takes this a step further. The primary difference is that a Context Graph provides a strict audit trail, provenance, and data lineage on top of those semantic definitions. It doesn’t just know what a relationship is; it knows where it came from, who asserted it, and when.
We can visualize this as the intersection of three distinct domains:
- Circle A (Ontology): Semantic-oriented. Focuses on the meaning and definitions of concepts.
- Circle B (Provenance): Auditability and traceability-oriented. Focuses on the origin and history of data.
- Circle C (Network & Link Analysis): Linkage and traversal-oriented. Focuses on the connections between entities.
When these domains intersect, we get:
- A + B (Semantic Lineage): Understanding the history and origin of specific semantic definitions.
- B + C (Flow/Lineage Analysis): Tracing how data and relationships move or evolve across the network.
- A + C (Knowledge Graph): A semantic network of linked concepts (the traditional Knowledge Graph).
- A + B + C (Context Graph): The innermost intersection. A fully traceable, semantically rich, and deeply linked network of knowledge.
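In code terms, the jump from a plain Knowledge Graph (A + C) to a Context Graph (A + B + C) is the provenance envelope around every link. A sketch, with illustrative field names:

```typescript
// A knowledge-graph edge: just the semantic link (A + C).
type Edge = { subject: string; predicate: string; object: string };

// A context-graph edge: the same link plus provenance (A + B + C).
type ContextEdge = Edge & {
  assertedBy: string; // who asserted it
  assertedAt: string; // when (ISO timestamp)
  sourceDoc: string;  // which document it came from
};

const fact: ContextEdge = {
  subject: "CompanyB",
  predicate: "headquarteredIn",
  object: "RegionC",
  assertedBy: "analyst@example.com",
  assertedAt: "2024-05-01T12:00:00Z",
  sourceDoc: "http://my-wiki.com/doc/annual-report",
};

// A consumer can now answer not only "what is related?" but
// "who said so, when, and based on which document?".
console.log(`${fact.subject} ${fact.predicate} ${fact.object}` +
  ` (per ${fact.sourceDoc})`);
```

In RDF terms this envelope is typically carried by named graphs or reification rather than object fields, but the information content is the same.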
Why RDF and not LPG?
This was a difficult choice. I am actually much more familiar with Labeled Property Graph (LPG) technology like Neo4j. LPG is an easier system to understand, the querying (like Cypher) feels more intuitive, and the concepts map really nicely to object-oriented programming.
But ultimately, I went with RDF (Resource Description Framework). Why? Because of the richness of its ontology support, the ability to encode business rules directly into the ontology, and the foundational idea of linked data.
One massive implication of RDF’s ontology-driven development is that the ontology lives outside of the application code. This gives a beautiful separation of concerns: the application code just handles moving data around, while the business logic and schema live in the graph itself.
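As a concrete (hypothetical) illustration, a business rule like “being CEO of a company implies working for it” can live entirely in the ontology as a subproperty axiom; the prefixes and property names here are made up for the example:

```turtle
@prefix ex:   <http://my-wiki.com/ontology/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:ceoOf a owl:ObjectProperty ;
    rdfs:domain ex:Person ;
    rdfs:range  ex:Company ;
    # Business rule: being CEO of a company implies working for it.
    rdfs:subPropertyOf ex:worksFor .
```

With RDFS reasoning enabled, asserting `ex:alice ex:ceoOf ex:acme` lets the store infer `ex:alice ex:worksFor ex:acme` with zero application code, which is exactly the separation of concerns described above.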
I think some of the specific design decisions I make here could be applied to an LPG-styled graph, but it would require writing a lot more application-level code and ultimately expose a much larger maintenance surface.
RDF vs. LPG Comparison
Here is how I break down the differences and why RDF won out for this specific use case:
| Feature | RDF (Resource Description Framework) | LPG (Labeled Property Graph) |
|---|---|---|
| Identity | Globally unique URIs (e.g., http://my-wiki.com/entity/123) | Local, internal database IDs |
| Ontology / Schema | Formal (OWL/RDFS), shared, and decoupled from the application | Usually schema-on-read or database-specific constraints |
| Reasoning | Built-in logical inference (forward/backward chaining) | Relies on application code or graph algorithms |
| Query Language | SPARQL (W3C Standard) | Cypher, Gremlin (Vendor-specific) |
| Core Strengths | Data integration, provenance, stable identity, logical rules | Deep traversal (shortest path), graph data science (PageRank), developer ergonomics |
The Database Stack
To make this architecture work, we need a few different data stores playing nicely together:
```mermaid
graph TD
    subgraph Storage [Database Stack]
        MinIO[("MinIO (Object Store)")]
        PG[("Postgres (pgvector)")]
        Jena[("Apache Jena Fuseki (RDF)")]
        Ontop[("ONTOP (Virtual Graph)")]
    end
    subgraph Indexing [Indexing Pipeline]
        Docs["Raw Documents"] --> |"Upload"| MinIO
        Docs --> |"Chunk & Embed"| PG
        Docs --> |"Extract Entities"| Jena
    end
    subgraph Serving [Query & Serving]
        Backend["Java Backend"]
        UI["Astro Web UI"]
        AI["MCP AI Tools"]
    end
    Jena --> |"SPARQL"| Backend
    PG --> |"Vector Search"| Backend
    Ontop --> |"SPARQL-to-SQL"| Backend
    Backend --> UI
    Backend --> AI
```
- Apache Jena Fuseki: This is our core RDF triple store. It holds the graph, the ontology, and handles the reasoning.
- Postgres (with pgvector): Used for storing the raw document texts, the chunked text, and their vector embeddings.
- MinIO: An S3-compatible object store for keeping the original raw files (PDFs, text files).
- ONTOP: Used as a virtual knowledge graph layer to translate SPARQL queries into SQL for our structured relational data.
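Talking to Fuseki needs no special client: it speaks the standard SPARQL 1.1 Protocol over HTTP. A sketch of what a query call looks like (the endpoint URL assumes Fuseki's default dataset layout and is illustrative):

```typescript
// Sketch of querying Fuseki via the SPARQL 1.1 Protocol.
// The dataset name ("wiki") and port are assumptions.
const FUSEKI = "http://localhost:3030/wiki/sparql";

function buildSparqlRequest(query: string) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/sparql-query",
      "Accept": "application/sparql-results+json",
    },
    body: query,
  };
}

const query = `
  PREFIX ex: <http://my-wiki.com/ontology/>
  SELECT ?company WHERE { ?person ex:ceoOf ?company }
`;

// Actually issuing the request requires a running Fuseki instance:
// const res = await fetch(FUSEKI, buildSparqlRequest(query));
// const json = await res.json(); // SPARQL JSON results format
console.log(buildSparqlRequest(query).method); // "POST"
```

Because ONTOP exposes the same protocol, the backend can route a query to either store without the caller knowing whether the triples are materialized or virtual.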
The Indexing Pipeline
Getting unstructured data into the graph is handled by an indexing pipeline written in TypeScript. I use Temporal for workflow execution. Temporal is fantastic here because indexing involves slow, flaky steps (like OCRing a PDF or calling an LLM to extract entities). Temporal ensures that if a step fails, the workflow pauses and retries without losing state, and can cleanly roll back partial graph insertions if things completely break.
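Temporal provides these guarantees durably and out of the box; the semantics are roughly what this hand-rolled, in-memory sketch shows (illustrative only, not Temporal's API):

```typescript
// Illustrative sketch of retry-with-rollback semantics for a flaky
// indexing step (OCR, LLM entity extraction, ...). Temporal does this
// durably across process restarts; this version only shows the shape.
async function withRetries<T>(
  step: () => Promise<T>,
  rollback: () => Promise<void>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await step();
    } catch (err) {
      if (attempt === maxAttempts) {
        await rollback(); // e.g. undo partial graph insertions
        throw err;
      }
      // Exponential backoff before retrying.
      await new Promise(r => setTimeout(r, 100 * 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}
```

In the real pipeline, `step` would be a Temporal activity and `rollback` a compensating activity, so the retry state survives worker crashes instead of living in local memory.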
Querying
Querying is handled by a Java backend that routes requests. It exposes endpoints for raw SPARQL, reasoned SPARQL (which applies inference rules), text search (via Lucene), and vector search (via Postgres).
Ultimately, this is consumed by the AI via MCP (Model Context Protocol) tools or a CLI. This allows the LLM to query concepts or traverse entities from the graph without needing to know how to write perfect SPARQL. Because I have built MCPs elsewhere, and because the querying is largely dependent on the choices made during indexing, I’ll leave the deep implementation details of the MCPs as future work.
Serving
We envision every entity or document being served as a web page, similar to a Wikipedia page. The relationships between entities are literally links from page to page.
Crucially, this web page must be citable by external systems. Someone should be able to reference this piece of knowledge in a PDF report or a presentation, or send it as a web link in an email. This means each page must have a stable URL.
This requirement immediately imposes a hard constraint: our knowledge graph must either grow continuously or track entity identity over time. We cannot use a “nuke and rebuild” style where updates happen as a batch process and IDs don’t survive between builds. This constraint played a massive role in selecting RDF over LPG. In RDF, the URI is an inherent, foundational concept. In LPG, stable, globally resolvable identity is a matter of custom application-level code.
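One way to honor the constraint (a sketch; the base URL and sequential ID scheme are assumptions, not the system's actual minting strategy) is to mint a URI the first time an entity is seen and treat it as permanent, rather than deriving it from an internal ID that a rebuild would change:

```typescript
// Mint-once URI registry: an entity keeps the same URI across
// reindexing runs. The base URL is illustrative.
const BASE = "http://my-wiki.com/entity/";

const registry = new Map<string, string>(); // canonical name -> URI
let nextId = 0;

function uriFor(canonicalName: string): string {
  let uri = registry.get(canonicalName);
  if (uri === undefined) {
    uri = `${BASE}${nextId++}`;
    registry.set(canonicalName, uri); // persisted in a real system
  }
  return uri;
}

// The same entity always resolves to the same URI:
console.log(uriFor("Company B") === uriFor("Company B")); // true
```

The registry itself must of course be durable storage, not a `Map`; the point is that identity is assigned once and survives every rebuild, which is exactly what RDF's URI-first model encourages.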
I built this serving layer using Astro for server-side rendering, with React Islands for the interactive parts (like the graph visualization). It’s great for spot checks and sanity checks. It provides trust and verifiable grounding for the LLMs.
A System of Knowledge Record
Ultimately, what I have in mind is a system of knowledge record. Every entity or concept tracks:
- Who contributed it
- When it was recorded
- Which document it came from
- What other concepts are related to it, and how
When we feed this information to an LLM, we want it to be able to navigate this web of knowledge in place of us. We want it to make sense of the knowledge and use it as strict references when making decisions.
The reference here is crucial because we want to guarantee zero hallucination from the LLMs: every single assertion the AI makes must be backed by sources in the graph, and we can explicitly quantify the agreement between the AI’s assertion and the cited sources.
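“Explicitly quantify the agreement” can start as simply as measuring how much of an assertion's wording is actually supported by the cited source passage. This is a naive token-overlap sketch, not the system's grounding check; a real implementation would use an entailment model:

```typescript
// Naive grounding score: the fraction of the assertion's words that
// appear in the cited source passage. Only a sketch of the idea that
// agreement is a number you can threshold, not a real entailment check.
function groundingScore(assertion: string, source: string): number {
  const tokens = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const sourceSet = new Set(tokens(source));
  const assertionTokens = tokens(assertion);
  if (assertionTokens.length === 0) return 0;
  const hits = assertionTokens.filter(t => sourceSet.has(t)).length;
  return hits / assertionTokens.length;
}

const score = groundingScore(
  "Company B is headquartered in Region C",
  "Company B announced that its headquarters remain in Region C",
);
console.log(score.toFixed(2));
```

An assertion scoring below some threshold would be rejected or flagged for review instead of being presented to the user as grounded knowledge.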
Navigation:
- Previous: Part 1: Intro
- Next: Part 3: Introduction to ontology, RDF
