Codegraph MCP
A persistent code intelligence layer for AI coding agents — built in Rust, tested on 111K lines of production code.
Gives AI agents structured code navigation and cross-session memory so they stop reading entire files and start querying a graph. 23% cheaper, 32% faster than vanilla grep/read on multi-step tasks.
The Problem
Read entire files
To answer "who calls this function?", agents read thousands of tokens of irrelevant code.
Lose all context
When the context window compacts, task progress, decisions, and working files vanish.
Repeat mistakes
No memory of what worked or failed across sessions — every conversation starts from zero.
On a 111K-line codebase, a single “who calls X?” query costs ~42,000 tokens via grep+read. Multiply by 15-25 queries per session and you're burning hundreds of thousands of tokens just on orientation, not coding.
The Solution
Codegraph is an MCP server that runs over stdio and provides 26 tools across three systems:
Code Graph
tree-sitter parses 5 languages into a directed graph of symbols + relationships.
"Who calls X?" — 151 tokens instead of 42,478
Session Memory
Tracks task, subtasks, decisions, and working context across compaction.
"Where was I?" — 95 tokens instead of re-reading 5-10 files
Learning System
Records patterns, failures, and solution lineage that compound over sessions.
"What worked last time?" with confidence scoring & time decay
Key Results
The Benchmark
111K-line Python codebase, 5 tool configurations, isolated tasks + multi-step investigation.
Multi-Step Refactoring (10 steps)
Identical accuracy across all five configurations, with zero difference in correctness; the difference was purely in cost and speed.
Per-Query Savings
Key insight: Codegraph's cost scales with answer size, not file size. The bigger the codebase, the wider the gap.
How It Works
Code Graph
- tree-sitter extracts symbols & relationships from 5 languages
- Directed graph in petgraph with SQLite persistence
- Incremental indexing — only re-parses changed files
- Cross-file resolution: 47% of cross-file references resolved
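The core "who calls X?" query is just a reverse-edge lookup over the symbol graph. A stdlib-only sketch of the idea (Codegraph itself uses petgraph with SQLite persistence; the symbol names below are invented):

```rust
use std::collections::HashMap;

// Toy directed call graph: caller -> callees. This sketch only shows the
// reverse-edge lookup that answers "who calls X?" in a handful of tokens
// instead of grepping and reading whole files. Symbol names are made up.
struct CallGraph {
    calls: HashMap<&'static str, Vec<&'static str>>,
}

impl CallGraph {
    fn who_calls(&self, target: &str) -> Vec<&'static str> {
        let mut callers: Vec<&'static str> = self
            .calls
            .iter()
            .filter(|(_, callees)| callees.iter().any(|c| *c == target))
            .map(|(caller, _)| *caller)
            .collect();
        callers.sort(); // deterministic output
        callers
    }
}

fn main() {
    let mut calls = HashMap::new();
    calls.insert("main", vec!["load_config", "run_server"]);
    calls.insert("run_server", vec!["load_config", "handle_request"]);
    calls.insert("tests::setup", vec!["load_config"]);
    let graph = CallGraph { calls };

    println!("{:?}", graph.who_calls("load_config"));
}
```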
Session Memory
- Tracks task, subtasks, decisions, working context
- smart_context restores state in ~95 tokens
- Decisions persist with reasoning & symbol links
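To show why restoring state can fit in ~95 tokens, here is a sketch of the kind of compact summary a tool like smart_context might render after compaction. The field names and output format are assumptions for illustration, not Codegraph's actual schema:

```rust
// Sketch of a compact session-state summary. A few short lines replace
// re-reading 5-10 working files to re-orient after compaction. Field names
// and formatting are illustrative assumptions.
struct SessionState {
    task: String,
    done: Vec<String>,
    next: Vec<String>,
    working_files: Vec<String>,
}

impl SessionState {
    fn smart_context(&self) -> String {
        format!(
            "task: {}\ndone: {}\nnext: {}\nfiles: {}",
            self.task,
            self.done.join("; "),
            self.next.join("; "),
            self.working_files.join(", ")
        )
    }
}

fn main() {
    let state = SessionState {
        task: "migrate cache layer to redis".into(),
        done: vec!["audited call sites".into()],
        next: vec!["wrap redis.RedisError".into()],
        working_files: vec!["src/cache.rs".into()],
    };
    println!("{}", state.smart_context());
}
```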
Learning System
- Records patterns & failures with file/tag scoping
- Solution lineage tracks attempt chains
- Confidence scoring with 90-day time decay
- suggest_approach synthesizes into recommendations
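One way the 90-day time decay could work is simple exponential decay of confidence with pattern age. Only the 90-day window comes from the project description; the exponential shape is an assumption:

```rust
// Sketch of confidence scoring with time decay. Codegraph uses a 90-day
// decay window; the exponential curve here is an assumption.
const DECAY_WINDOW_DAYS: f64 = 90.0;

// A pattern observed `age_days` ago keeps exp(-age/90) of its base
// confidence: evidence from ~90 days back counts roughly a third as much as
// fresh evidence, and stale patterns fade toward zero instead of vanishing.
fn decayed_confidence(base_confidence: f64, age_days: f64) -> f64 {
    base_confidence * (-age_days / DECAY_WINDOW_DAYS).exp()
}

fn main() {
    println!("{:.3}", decayed_confidence(0.9, 0.0));   // fresh pattern
    println!("{:.3}", decayed_confidence(0.9, 90.0));  // one window old
    println!("{:.3}", decayed_confidence(0.9, 365.0)); // long stale
}
```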
Gets Smarter Over Time
The learning system compounds knowledge across sessions.
Early sessions: generic but correct approaches.
Later sessions: increasingly specific, mentioning exact class names and file paths.
Final sessions: composite suggestions such as "Leverage Session 1 + Session 2 + Session 4. Search for redis.RedisError catches." and "Combine Session 3 + Session 5. Verify with forgetting module."
By session 9, the system wasn't just recalling individual patterns — it was synthesizing across multiple earlier sessions to suggest composite strategies.
Architecture
src/
├── mcp/ Protocol layer (JSON-RPC 2.0, 26 tools)
├── code/ Code analysis (tree-sitter, indexer)
├── store/ Persistence (SQLite + petgraph)
├── session/ Session state machine
├── learning/ Patterns, failures, lineage
├── skill/ Skill distillation
└── compress/ Token-saving compression
The Default That Makes or Breaks It
Codegraph's value isn't in reading code through an extra protocol layer — it's in not reading code at all until you know exactly which symbol you need.
What I Learned
Compact mode is everything
The same tool used two different ways produces a 2x cost difference. API design matters more than the underlying engine.
Per-query savings compound
On isolated queries, fixed overhead buries the savings. On multi-step tasks with 18+ queries, the per-query advantage dominates.
Structured self-reflection works
The "When X, do Y because Z" format produces genuinely reusable knowledge. Unstructured reflections don't.
Honest benchmarking is hard
The first benchmark design made codegraph look mediocre. The multi-step test revealed the real value.
This was a self-use project — built to make my own AI coding sessions more efficient on large codebases. All benchmark data is from real measured runs on production code.