Open Source · Self-Use Tool

Codegraph MCP

A persistent code intelligence layer for AI coding agents — built in Rust, tested on 111K lines of production code.

Gives AI agents structured code navigation and cross-session memory so they stop reading entire files and start querying a graph. 23% cheaper, 32% faster than vanilla grep/read on multi-step tasks.

Rust · MCP Protocol · tree-sitter · SQLite · petgraph · Self-Use Tool
View on GitHub

The Problem

Read entire files

To answer "who calls this function?", agents read thousands of tokens of irrelevant code.

Lose all context

When the context window compacts, task progress, decisions, and working files vanish.

Repeat mistakes

No memory of what worked or failed across sessions — every conversation starts from zero.

On a 111K-line codebase, a single “who calls X?” query costs ~42,000 tokens via grep+read. Multiply by 15-25 queries per session and you're burning hundreds of thousands of tokens just on orientation, not coding.

The Solution

Codegraph is an MCP server that runs over stdio and provides 26 tools across three systems:

Code Graph

tree-sitter parses 5 languages into a directed graph of symbols + relationships.

"Who calls X?" — 151 tokens instead of 42,478

Session Memory

Tracks task, subtasks, decisions, and working context across compaction.

"Where was I?" — 95 tokens instead of re-reading 5-10 files

Learning System

Records patterns, failures, and solution lineage that compound over sessions.

"What worked last time?" with confidence scoring & time decay
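
The caller lookup at the heart of the code graph can be sketched with a plain HashMap standing in for the petgraph store. The names below are illustrative only, not Codegraph's actual API:

```rust
use std::collections::HashMap;

// Toy call graph: caller -> callees. A stand-in for the petgraph store.
fn callers_of(graph: &HashMap<&str, Vec<&str>>, target: &str) -> Vec<String> {
    let mut found: Vec<String> = graph
        .iter()
        .filter(|(_, callees)| callees.contains(&target))
        .map(|(caller, _)| caller.to_string())
        .collect();
    found.sort();
    found
}

fn main() {
    let mut graph = HashMap::new();
    graph.insert("main", vec!["load_config", "run"]);
    graph.insert("run", vec!["parse_file", "save"]);
    graph.insert("reload", vec!["load_config"]);

    // Answers "who calls load_config?" without opening a single file.
    println!("{:?}", callers_of(&graph, "load_config"));
}
```

A real store would keep a reverse-edge index so each query is a direct lookup; the linear scan just keeps the sketch short.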

Key Results

  • 281× fewer tokens on caller lookups (151 vs 42,478)
  • 23% cheaper than vanilla grep/read
  • 32% faster task completion (123s vs 180s)
  • 3/3 accuracy vs 2.5/3 for vanilla

The Benchmark

111K-line Python codebase, 5 tool configurations, isolated tasks + multi-step investigation.

Multi-Step Refactoring (10 steps)

  #1 Codegraph (compact): 36,790 tokens (-23%)
  #2 Serena + Codegraph: 45,424 tokens (-5%)
  #3 Vanilla (grep/read): 47,693 tokens (baseline)
  #4 Serena-only (LSP): 76,228 tokens (+60%)
  #5 Codegraph (bad usage): 78,051 tokens (+64%)

All five configurations were equally correct; the gap is purely cost and speed.

Per-Query Savings

  • "Who uses this class?": 42,478 → 151 tokens (99.6% saved)
  • "What does this call?": 9,431 → 366 tokens (96% saved)
  • "What's in this file?": 14,503 → 1,382 tokens (90% saved)
  • Resume after compaction: ~20,000 → 95 tokens (99.5% saved)

Key insight: Codegraph's cost scales with answer size, not file size. The bigger the codebase, the wider the gap.

How It Works

Source Code (Rust, Python, TS, JS, Go)
  → tree-sitter parsing: symbol + relationship extraction
  → Directed graph: petgraph + SQLite persistence
  → MCP tools over stdio: JSON-RPC 2.0
  → AI agent queries the graph instead of reading files

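The last hop can be pictured as a single JSON-RPC 2.0 line written to the server's stdin. The tool name `code_callers` and its argument shape are assumptions for illustration, not Codegraph's actual schema; only the `tools/call` method comes from the MCP spec:

```rust
// Minimal sketch of an MCP tool call as a client might frame it over stdio.
// Tool name and argument key are hypothetical, not Codegraph's real schema.
fn tool_call_request(id: u64, tool: &str, symbol: &str) -> String {
    format!(
        concat!(
            r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","#,
            r#""params":{{"name":"{}","arguments":{{"symbol":"{}"}}}}}}"#
        ),
        id, tool, symbol
    )
}

fn main() {
    // One request per line on stdin; the server answers on stdout.
    println!("{}", tool_call_request(1, "code_callers", "load_config"));
}
```
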
Code Graph

  • tree-sitter extracts symbols & relationships from 5 languages
  • Directed graph in petgraph with SQLite persistence
  • Incremental indexing — only re-parses changed files
  • Cross-file resolution at a 47% rate
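
Incremental indexing boils down to comparing content hashes. A sketch with std's DefaultHasher standing in for xxh3; function names are illustrative, not Codegraph's API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for xxh3: any stable content hash works for change detection.
fn content_hash(source: &str) -> u64 {
    let mut h = DefaultHasher::new();
    source.hash(&mut h);
    h.finish()
}

// Return the files whose content changed since the last index run, so only
// they get re-parsed. `index` maps path -> last seen hash.
fn files_to_reindex<'a>(
    index: &HashMap<&'a str, u64>,
    current: &[(&'a str, &str)],
) -> Vec<&'a str> {
    current
        .iter()
        .filter(|(path, src)| index.get(path) != Some(&content_hash(src)))
        .map(|(path, _)| *path)
        .collect()
}

fn main() {
    let mut index = HashMap::new();
    index.insert("lib.rs", content_hash("fn a() {}"));
    index.insert("main.rs", content_hash("fn main() {}"));

    // Only main.rs changed, so only it is re-parsed.
    let changed =
        files_to_reindex(&index, &[("lib.rs", "fn a() {}"), ("main.rs", "fn main() { a(); }")]);
    println!("{:?}", changed);
}
```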

Session Memory

  • Tracks task, subtasks, decisions, working context
  • smart_context restores state in ~95 tokens
  • Decisions persist with reasoning & symbol links
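
A hypothetical shape for that persisted state, with a smart_context-style summary; field names are assumptions, not Codegraph's schema:

```rust
// Illustrative session state; fields are guesses at the kind of data kept.
struct SessionState {
    task: String,
    subtasks_done: Vec<String>,
    decisions: Vec<String>, // "chose X because Y" entries with symbol links
}

impl SessionState {
    // A smart_context-style restore: a compact summary instead of
    // re-reading the files the agent had open before compaction.
    fn summary(&self) -> String {
        format!(
            "task: {} | done: {} | decisions: {}",
            self.task,
            self.subtasks_done.join(", "),
            self.decisions.join("; ")
        )
    }
}

fn main() {
    let s = SessionState {
        task: "migrate cache layer".into(),
        subtasks_done: vec!["audit call sites".into()],
        decisions: vec!["keep redis client because of pipelining".into()],
    };
    println!("{}", s.summary());
}
```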

Learning System

  • Records patterns & failures with file/tag scoping
  • Solution lineage tracks attempt chains
  • Confidence scoring with 90-day time decay
  • suggest_approach synthesizes into recommendations
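
The exact decay curve isn't stated, only the 90-day window; the sketch below assumes an exponential half-life of 90 days:

```rust
// Sketch of time-decayed confidence. The 90-day half-life is an assumption;
// the source only states a 90-day decay window.
fn decayed_confidence(base: f64, age_days: f64) -> f64 {
    base * 0.5_f64.powf(age_days / 90.0)
}

fn main() {
    // A pattern recorded today keeps full weight; a 90-day-old one
    // counts for half.
    println!("{:.2}", decayed_confidence(0.8, 0.0));  // 0.80
    println!("{:.2}", decayed_confidence(0.8, 90.0)); // 0.40
}
```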

Gets Smarter Over Time

The learning system compounds knowledge across sessions.

S1–3

Generic but correct approaches

S4–8

Increasingly specific — mentions exact class names, file paths

S9

"Leverage Session 1 + Session 2 + Session 4. Search for redis.RedisError catches."

S10

"Combine Session 3 + Session 5. Verify with forgetting module."

By session 9, the system wasn't just recalling individual patterns — it was synthesizing across multiple earlier sessions to suggest composite strategies.

Architecture

src/
├── mcp/       Protocol layer (JSON-RPC 2.0, 26 tools)
├── code/      Code analysis (tree-sitter, indexer)
├── store/     Persistence (SQLite + petgraph)
├── session/   Session state machine
├── learning/  Patterns, failures, lineage
├── skill/     Skill distillation
└── compress/  Token-saving compression
Rust (async/tokio) · tree-sitter · petgraph · libSQL/SQLite · MCP over stdio · xxh3 hashing · 87 tests · criterion benchmarks

The Default That Makes or Breaks It

compact=true (default): 36,790 tokens (-23% vs vanilla)
Overviews first, targeted source only when needed.

include_source=true: 78,051 tokens (+64% vs vanilla)
Dumps everything through the protocol.

Codegraph's value isn't in reading code through an extra protocol layer — it's in not reading code at all until you know exactly which symbol you need.
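
That difference fits in a few lines: with compact output the response size tracks the answer, while with full source it tracks the file. Function and parameter names here are hypothetical:

```rust
// Illustrative only: why the compact default matters. Names are made up.
fn symbol_response(name: &str, signature: &str, body: &str, compact: bool) -> String {
    if compact {
        // Overview: name + signature, a handful of tokens.
        format!("{}: {}", name, signature)
    } else {
        // include_source=true: the whole body rides through the protocol.
        format!("{}: {}\n{}", name, signature, body)
    }
}

fn main() {
    let body = "fn load_config() {\n    /* ...hundreds of lines... */\n}";
    let compact = symbol_response("load_config", "fn() -> Config", body, true);
    let full = symbol_response("load_config", "fn() -> Config", body, false);
    println!("compact: {} bytes, full: {} bytes", compact.len(), full.len());
}
```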

What I Learned

Compact mode is everything

The same tool used two different ways produces a 2x cost difference. API design matters more than the underlying engine.

Per-query savings compound

On isolated queries, fixed overhead buries the savings. On multi-step tasks with 18+ queries, the per-query advantage dominates.

Structured self-reflection works

The "When X, do Y because Z" format produces genuinely reusable knowledge. Unstructured reflections don't.

Honest benchmarking is hard

The first benchmark design made Codegraph look mediocre. The multi-step test revealed its real value.

This was a self-use project — built to make my own AI coding sessions more efficient on large codebases. All benchmark data is from real measured runs on production code.

GitHub · BENCHMARK.md · Built with Rust