Codegraph MCP
A persistent code intelligence layer for AI coding agents — built in Rust, tested on 111K lines of production code.
Gives AI agents structured code navigation and cross-session memory so they stop reading entire files and start querying a graph. 23% cheaper, 32% faster than vanilla grep/read on multi-step tasks.
The Problem
Read entire files
To answer "who calls this function?", agents read thousands of tokens of irrelevant code.
Lose all context
When the context window compacts, task progress, decisions, and working files vanish.
Repeat mistakes
No memory of what worked or failed across sessions — every conversation starts from zero.
On a 111K-line codebase, a single “who calls X?” query costs ~42,000 tokens via grep+read. Multiply by 15-25 queries per session and you're burning hundreds of thousands of tokens just on orientation, not coding.
The Solution
Codegraph is an MCP server that runs over stdio and provides 26 tools across three systems:
Code Graph
tree-sitter parses 5 languages into a directed graph of symbols + relationships.
"Who calls X?" — 151 tokens instead of 42,478
Session Memory
Tracks task, subtasks, decisions, and working context across compaction.
"Where was I?" — 95 tokens instead of re-reading 5-10 files
Learning System
Records patterns, failures, and solution lineage that compound over sessions.
"What worked last time?" with confidence scoring & time decay
Key Results
The Benchmark
111K-line Python codebase, 5 tool configurations, isolated tasks + multi-step investigation.
Multi-Step Refactoring (10 steps)
Identical accuracy across all five configurations, with zero difference in correctness; the difference was purely in cost and speed.
Per-Query Savings
Key insight: Codegraph's cost scales with answer size, not file size. The bigger the codebase, the wider the gap.
How It Works
Code Graph
- tree-sitter extracts symbols & relationships from 5 languages
- Directed graph in petgraph with SQLite persistence
- Incremental indexing — only re-parses changed files
- Cross-file resolution: 47% of cross-file references resolved
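The core "who calls X?" query is just a reverse-edge lookup over the symbol graph. A stdlib-only sketch of the idea (Codegraph itself uses petgraph with SQLite persistence; the symbol names below are invented):

```rust
use std::collections::HashMap;

// Toy directed call graph: caller -> callees. This sketch only shows the
// reverse-edge lookup that answers "who calls X?" in a handful of tokens
// instead of grepping and reading whole files. Symbol names are made up.
struct CallGraph {
    calls: HashMap<&'static str, Vec<&'static str>>,
}

impl CallGraph {
    fn who_calls(&self, target: &str) -> Vec<&'static str> {
        let mut callers: Vec<&'static str> = self
            .calls
            .iter()
            .filter(|(_, callees)| callees.iter().any(|c| *c == target))
            .map(|(caller, _)| *caller)
            .collect();
        callers.sort(); // deterministic output
        callers
    }
}

fn main() {
    let mut calls = HashMap::new();
    calls.insert("main", vec!["load_config", "run_server"]);
    calls.insert("run_server", vec!["load_config", "handle_request"]);
    calls.insert("tests::setup", vec!["load_config"]);
    let graph = CallGraph { calls };

    println!("{:?}", graph.who_calls("load_config"));
}
```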
Session Memory
- Tracks task, subtasks, decisions, working context
- smart_context restores state in ~95 tokens
- Decisions persist with reasoning & symbol links
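To show why restoring state can fit in ~95 tokens, here is a sketch of the kind of compact summary a tool like smart_context might render after compaction. The field names and output format are assumptions for illustration, not Codegraph's actual schema:

```rust
// Sketch of a compact session-state summary. A few short lines replace
// re-reading 5-10 working files to re-orient after compaction. Field names
// and formatting are illustrative assumptions.
struct SessionState {
    task: String,
    done: Vec<String>,
    next: Vec<String>,
    working_files: Vec<String>,
}

impl SessionState {
    fn smart_context(&self) -> String {
        format!(
            "task: {}\ndone: {}\nnext: {}\nfiles: {}",
            self.task,
            self.done.join("; "),
            self.next.join("; "),
            self.working_files.join(", ")
        )
    }
}

fn main() {
    let state = SessionState {
        task: "migrate cache layer to redis".into(),
        done: vec!["audited call sites".into()],
        next: vec!["wrap redis.RedisError".into()],
        working_files: vec!["src/cache.rs".into()],
    };
    println!("{}", state.smart_context());
}
```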
Learning System
- Records patterns & failures with file/tag scoping
- Solution lineage tracks attempt chains
- Confidence scoring with 90-day time decay
- suggest_approach synthesizes into recommendations
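One way the 90-day time decay could work is simple exponential decay of confidence with pattern age. Only the 90-day window comes from the project description; the exponential shape is an assumption:

```rust
// Sketch of confidence scoring with time decay. Codegraph uses a 90-day
// decay window; the exponential curve here is an assumption.
const DECAY_WINDOW_DAYS: f64 = 90.0;

// A pattern observed `age_days` ago keeps exp(-age/90) of its base
// confidence: evidence from ~90 days back counts roughly a third as much as
// fresh evidence, and stale patterns fade toward zero instead of vanishing.
fn decayed_confidence(base_confidence: f64, age_days: f64) -> f64 {
    base_confidence * (-age_days / DECAY_WINDOW_DAYS).exp()
}

fn main() {
    println!("{:.3}", decayed_confidence(0.9, 0.0));   // fresh pattern
    println!("{:.3}", decayed_confidence(0.9, 90.0));  // one window old
    println!("{:.3}", decayed_confidence(0.9, 365.0)); // long stale
}
```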
Gets Smarter Over Time
The learning system compounds knowledge across sessions.
Early sessions: generic but correct approaches.
Later sessions: increasingly specific, mentioning exact class names and file paths.
Final sessions: composite suggestions such as "Leverage Session 1 + Session 2 + Session 4. Search for redis.RedisError catches." and "Combine Session 3 + Session 5. Verify with forgetting module."
By session 9, the system wasn't just recalling individual patterns — it was synthesizing across multiple earlier sessions to suggest composite strategies.
Architecture
src/
├── mcp/ Protocol layer (JSON-RPC 2.0, 26 tools)
├── code/ Code analysis (tree-sitter, indexer)
├── store/ Persistence (SQLite + petgraph)
├── session/ Session state machine
├── learning/ Patterns, failures, lineage
├── skill/ Skill distillation
└── compress/ Token-saving compression
The Default That Makes or Breaks It
Codegraph's value isn't in reading code through an extra protocol layer — it's in not reading code at all until you know exactly which symbol you need.
What I Learned
Compact mode is everything
The same tool used two different ways produces a 2x cost difference. API design matters more than the underlying engine.
Per-query savings compound
On isolated queries, fixed overhead buries the savings. On multi-step tasks with 18+ queries, the per-query advantage dominates.
Structured self-reflection works
The "When X, do Y because Z" format produces genuinely reusable knowledge. Unstructured reflections don't.
Honest benchmarking is hard
The first benchmark design made codegraph look mediocre. The multi-step test revealed the real value.
This was a self-use project — built to make my own AI coding sessions more efficient on large codebases. All benchmark data is from real measured runs on production code.