ADR-006: Incremental Indexing with Content Hashing¶
Status: Implemented
Date: 2026-02-16
Decision Makers: Brian Moore, AI Team
Context¶
Re-indexing an entire codebase is expensive:

1. Embedding API costs: Voyage AI charges $0.06/M tokens
2. Time: Large codebases take minutes to index
3. Unnecessary work: Most files don't change between commits
We needed a system to detect which files/chunks have changed and only re-embed those.
Decision¶
Implement content hashing with embedding cache at both file and chunk levels.
Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Incremental Indexing Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. FILE LEVEL CHECK │
│ ┌──────────┐ SHA256 ┌───────────────┐ │
│ │ File │ ──────────▶ │ Content Hash │ │
│ │ Content │ │ (64 hex chars)│ │
│ └──────────┘ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ MetadataStore.GetFileHash(path) │ │
│ │ │ │
│ │ If hash matches → SKIP FILE (no changes) │ │
│ │ If hash differs → Proceed to chunk-level check │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ 2. CHUNK LEVEL CHECK (for changed files) │
│ ┌──────────┐ Chunker ┌───────────────┐ │
│ │ File │ ──────────▶ │ Chunks with │ │
│ │ Content │ │ Content Hash │ │
│ └──────────┘ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ For each chunk: │ │
│ │ 1. Check MetadataStore for existing embedding info │ │
│ │ 2. If content_hash matches & has embedding_hash: │ │
│ │ → Fetch embedding from VectorStore (cache hit) │ │
│ │ 3. If no match: │ │
│ │ → Add to "needs embedding" list (cache miss) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ 3. SELECTIVE EMBEDDING │
│ ┌───────────────────┐ Voyage AI ┌──────────────────┐ │
│ │ Changed Chunks │ ──────────▶ │ New Embeddings │ │
│ │ (cache misses) │ API Call │ (1024 dims) │ │
│ └───────────────────┘ └──────────────────┘ │
│ │
│ 4. MERGE & STORE │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Combine cached embeddings + new embeddings │ │
│ │ Upsert to VectorStore (Qdrant) │ │
│ │ Update MetadataStore with new hashes │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
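The file-level gate at the top of the pipeline can be sketched in Go. Here `metadataStore` is a hypothetical in-memory stand-in for the real MetadataStore, and `indexFile` compresses stages 1–4 into a single decision point:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// metadataStore is a toy stand-in for MetadataStore: path -> file content hash.
type metadataStore map[string]string

func contentHash(b []byte) string {
	h := sha256.Sum256(b)
	return hex.EncodeToString(h[:])
}

// indexFile reports whether the file was skipped (unchanged since last index).
func indexFile(store metadataStore, path string, content []byte) bool {
	h := contentHash(content)
	if store[path] == h {
		return true // file-level cache hit: skip chunking and embedding entirely
	}
	// ...chunk-level check and selective embedding would happen here...
	store[path] = h // record the new hash after successful indexing
	return false
}

func main() {
	store := metadataStore{}
	fmt.Println(indexFile(store, "main.go", []byte("package main"))) // first pass: indexed
	fmt.Println(indexFile(store, "main.go", []byte("package main"))) // second pass: skipped
}
```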
Implementation¶
File: cortex-context/internal/indexer/pipeline.go
Key Data Structures¶
// ChunkEmbeddingInfo for cache lookup
type ChunkEmbeddingInfo struct {
ContentHash string // SHA256 of chunk content
EmbeddingHash string // First 16 bytes of the SHA256 of the embedding vector
}
// FileRecord in MetadataStore
type FileRecord struct {
Path string
ContentHash string // SHA256 of entire file
LastIndexed time.Time
ModifiedAt time.Time
}
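The diagram's `MetadataStore.GetFileHash(path)` call implies a small contract around these records. The sketch below shows one plausible shape; method names other than `GetFileHash` are assumptions, and `memStore` is a toy in-memory implementation, not the actual PostgreSQL/file-based store:

```go
package main

import (
	"fmt"
	"time"
)

type ChunkEmbeddingInfo struct {
	ContentHash   string
	EmbeddingHash string
}

type FileRecord struct {
	Path        string
	ContentHash string
	LastIndexed time.Time
	ModifiedAt  time.Time
}

// MetadataStore sketches the contract implied by the pipeline diagram.
type MetadataStore interface {
	GetFileHash(path string) (hash string, ok bool)          // file-level check
	GetChunkInfo(id string) (info ChunkEmbeddingInfo, ok bool) // chunk-level check
	PutFileRecord(rec FileRecord) error                      // update after indexing
}

// memStore is a toy in-memory implementation for illustration.
type memStore struct {
	files  map[string]FileRecord
	chunks map[string]ChunkEmbeddingInfo
}

func (m *memStore) GetFileHash(path string) (string, bool) {
	rec, ok := m.files[path]
	return rec.ContentHash, ok
}

func (m *memStore) GetChunkInfo(id string) (ChunkEmbeddingInfo, bool) {
	info, ok := m.chunks[id]
	return info, ok
}

func (m *memStore) PutFileRecord(rec FileRecord) error {
	m.files[rec.Path] = rec
	return nil
}

func main() {
	var s MetadataStore = &memStore{
		files:  map[string]FileRecord{},
		chunks: map[string]ChunkEmbeddingInfo{},
	}
	s.PutFileRecord(FileRecord{Path: "a.go", ContentHash: "abc"})
	h, ok := s.GetFileHash("a.go")
	fmt.Println(h, ok) // abc true
}
```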
Hash Functions¶
// File content hash (full SHA256)
func computeHash(content []byte) string {
hash := sha256.Sum256(content)
return hex.EncodeToString(hash[:])
}
// Embedding hash (truncated SHA256 for storage efficiency)
func computeEmbeddingHash(embedding []float32) string {
data := make([]byte, len(embedding)*4)
for i, f := range embedding {
bits := math.Float32bits(f)
binary.LittleEndian.PutUint32(data[i*4:], bits)
}
hash := sha256.Sum256(data)
return hex.EncodeToString(hash[:16]) // 32 hex chars
}
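A standalone check of these two helpers (definitions repeated verbatim so the snippet runs on its own) makes the length difference concrete: the file hash is the full 64-hex-char SHA256, while the embedding hash is truncated to 16 bytes (32 hex chars):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math"
)

func computeHash(content []byte) string {
	hash := sha256.Sum256(content)
	return hex.EncodeToString(hash[:])
}

func computeEmbeddingHash(embedding []float32) string {
	data := make([]byte, len(embedding)*4)
	for i, f := range embedding {
		binary.LittleEndian.PutUint32(data[i*4:], math.Float32bits(f))
	}
	hash := sha256.Sum256(data)
	return hex.EncodeToString(hash[:16])
}

func main() {
	fmt.Println(len(computeHash([]byte("hello"))))                   // 64: full SHA256
	fmt.Println(len(computeEmbeddingHash([]float32{0.1, 0.2, 0.3}))) // 32: truncated
	// Hashing is deterministic: identical content yields identical hashes.
	fmt.Println(computeHash([]byte("hello")) == computeHash([]byte("hello"))) // true
}
```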
Cost Savings Example¶
| Scenario | Files | Changed | Chunks | API Calls | Savings |
|---|---|---|---|---|---|
| Full re-index | 1000 | 1000 | 10000 | 10000 | 0% |
| 1 file changed | 1000 | 1 | 10000 | 10 | 99.9% |
| 10 files changed | 1000 | 10 | 10000 | 100 | 99% |
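The savings column is simply the fraction of chunks that avoid an embedding API call. As a quick arithmetic check (assuming ~10 chunks per file, as in the table):

```go
package main

import "fmt"

// savings returns the fraction of embedding API calls avoided when only
// embeddedChunks out of totalChunks are cache misses.
func savings(totalChunks, embeddedChunks int) float64 {
	return 1 - float64(embeddedChunks)/float64(totalChunks)
}

func main() {
	fmt.Printf("%.1f%%\n", 100*savings(10000, 10))  // 1 file changed -> 99.9%
	fmt.Printf("%.0f%%\n", 100*savings(10000, 100)) // 10 files changed -> 99%
}
```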
Consequences¶
Positive¶
- 99%+ cost reduction for incremental updates
- Faster indexing (skip unchanged files entirely)
- Idempotent: Re-running produces same result
Negative¶
- Requires MetadataStore (PostgreSQL or file-based)
- Additional complexity in pipeline
- Cache invalidation on embedding model change
Future: Merkle Tree (Not Yet Implemented)¶
For very large repos, a Merkle tree could provide:

- O(log n) change detection for directories
- Efficient diff between index versions
- Proof of index integrity
Current content hashing is sufficient for repos <100k files.
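As a sketch of that future direction (not part of the current implementation), a directory-level hash could be derived from its children's hashes; sorting makes the result order-independent, so an unchanged subtree keeps its hash and can be skipped without visiting its files:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// merkleDirHash combines child hashes (files or subdirectories) into one
// directory-level hash. Sorting first makes the result independent of
// traversal order.
func merkleDirHash(childHashes []string) string {
	sorted := append([]string(nil), childHashes...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, c := range sorted {
		h.Write([]byte(c))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := merkleDirHash([]string{"h1", "h2"})
	b := merkleDirHash([]string{"h2", "h1"})
	fmt.Println(a == b) // true: traversal order does not matter
}
```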
Related¶
- cortex-context/internal/indexer/pipeline.go - Implementation
- cortex-context/internal/store/metadata.go - MetadataStore interface