
ADR-006: Incremental Indexing with Content Hashing

Status: Implemented
Date: 2026-02-16
Decision Makers: Brian Moore, AI Team

Context

Re-indexing an entire codebase is expensive:

  1. Embedding API costs: Voyage AI charges $0.06/M tokens
  2. Time: Large codebases take minutes to index
  3. Unnecessary work: Most files don't change between commits

We needed a system to detect which files/chunks have changed and only re-embed those.

Decision

Implement content hashing with embedding cache at both file and chunk levels.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Incremental Indexing Pipeline                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. FILE LEVEL CHECK                                            │
│     ┌──────────┐    SHA256     ┌───────────────┐                │
│     │ File     │ ──────────▶  │ Content Hash  │                │
│     │ Content  │              │ (64 hex chars)│                │
│     └──────────┘              └───────┬───────┘                │
│                                       │                         │
│                                       ▼                         │
│     ┌─────────────────────────────────────────────────────┐    │
│     │ MetadataStore.GetFileHash(path)                     │    │
│     │                                                      │    │
│     │ If hash matches → SKIP FILE (no changes)            │    │
│     │ If hash differs → Proceed to chunk-level check      │    │
│     └─────────────────────────────────────────────────────┘    │
│                                                                 │
│  2. CHUNK LEVEL CHECK (for changed files)                       │
│     ┌──────────┐    Chunker   ┌───────────────┐                │
│     │ File     │ ──────────▶ │ Chunks with   │                │
│     │ Content  │              │ Content Hash  │                │
│     └──────────┘              └───────┬───────┘                │
│                                       │                         │
│                                       ▼                         │
│     ┌─────────────────────────────────────────────────────┐    │
│     │ For each chunk:                                      │    │
│     │   1. Check MetadataStore for existing embedding info │    │
│     │   2. If content_hash matches & has embedding_hash:   │    │
│     │      → Fetch embedding from VectorStore (cache hit)  │    │
│     │   3. If no match:                                    │    │
│     │      → Add to "needs embedding" list (cache miss)    │    │
│     └─────────────────────────────────────────────────────┘    │
│                                                                 │
│  3. SELECTIVE EMBEDDING                                         │
│     ┌───────────────────┐  Voyage AI  ┌──────────────────┐     │
│     │ Changed Chunks    │ ──────────▶ │ New Embeddings   │     │
│     │ (cache misses)    │   API Call  │ (1024 dims)      │     │
│     └───────────────────┘             └──────────────────┘     │
│                                                                 │
│  4. MERGE & STORE                                               │
│     ┌─────────────────────────────────────────────────────┐    │
│     │ Combine cached embeddings + new embeddings          │    │
│     │ Upsert to VectorStore (Qdrant)                      │    │
│     │ Update MetadataStore with new hashes                │    │
│     └─────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation

File: cortex-context/internal/indexer/pipeline.go

Key Data Structures

// ChunkEmbeddingInfo for cache lookup
type ChunkEmbeddingInfo struct {
    ContentHash   string  // SHA256 of chunk content
    EmbeddingHash string  // Truncated SHA256 of embedding vector (first 16 bytes of digest)
}

// FileRecord in MetadataStore
type FileRecord struct {
    Path        string
    ContentHash string    // SHA256 of entire file
    LastIndexed time.Time
    ModifiedAt  time.Time
}
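Given these structures, the step-2 chunk decision reduces to a partition over content hashes. A minimal sketch, assuming a cache map keyed by content hash (the real lookup goes through the MetadataStore):

```go
package main

import "fmt"

type ChunkEmbeddingInfo struct {
	ContentHash   string
	EmbeddingHash string
}

// partitionChunks implements the step-2 decision: chunks whose content
// hash is cached with a non-empty embedding hash are cache hits (their
// embeddings can be fetched from the vector store); everything else
// lands on the "needs embedding" list.
func partitionChunks(hashes []string, cache map[string]ChunkEmbeddingInfo) (hits, misses []string) {
	for _, h := range hashes {
		if info, ok := cache[h]; ok && info.EmbeddingHash != "" {
			hits = append(hits, h)
		} else {
			misses = append(misses, h)
		}
	}
	return hits, misses
}

func main() {
	cache := map[string]ChunkEmbeddingInfo{
		"aaa": {ContentHash: "aaa", EmbeddingHash: "e1"},
		"bbb": {ContentHash: "bbb"}, // indexed but never embedded
	}
	hits, misses := partitionChunks([]string{"aaa", "bbb", "ccc"}, cache)
	fmt.Println(hits)   // cache hits
	fmt.Println(misses) // cache misses: need embedding
}
```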

Hash Functions

import (
    "crypto/sha256"
    "encoding/binary"
    "encoding/hex"
    "math"
)

// File content hash (full SHA256, 64 hex chars)
func computeHash(content []byte) string {
    hash := sha256.Sum256(content)
    return hex.EncodeToString(hash[:])
}

// Embedding hash (SHA256 truncated to 16 bytes for storage efficiency)
func computeEmbeddingHash(embedding []float32) string {
    data := make([]byte, len(embedding)*4)
    for i, f := range embedding {
        bits := math.Float32bits(f)
        binary.LittleEndian.PutUint32(data[i*4:], bits)
    }
    hash := sha256.Sum256(data)
    return hex.EncodeToString(hash[:16])  // 32 hex chars
}
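A self-contained sanity check of the two digest shapes (full 64-hex-char file hash vs. truncated 32-hex-char embedding hash), restating the functions above so it runs standalone:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math"
)

// computeHash returns the full SHA256 digest (64 hex chars).
func computeHash(content []byte) string {
	hash := sha256.Sum256(content)
	return hex.EncodeToString(hash[:])
}

// computeEmbeddingHash serializes the vector to little-endian bytes and
// truncates the digest to 16 bytes (32 hex chars) for storage efficiency.
func computeEmbeddingHash(embedding []float32) string {
	data := make([]byte, len(embedding)*4)
	for i, f := range embedding {
		binary.LittleEndian.PutUint32(data[i*4:], math.Float32bits(f))
	}
	hash := sha256.Sum256(data)
	return hex.EncodeToString(hash[:16])
}

func main() {
	fmt.Println(len(computeHash([]byte("hello"))))                   // 64
	fmt.Println(len(computeEmbeddingHash([]float32{0.1, 0.2, 0.3}))) // 32
}
```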

Cost Savings Example

Scenario           Total Files   Changed Files   Total Chunks   API Calls   Savings
Full re-index      1000          1000            10000          10000       0%
1 file changed     1000          1               10000          10          99.9%
10 files changed   1000          10              10000          100         99%
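The savings column is just the ratio of re-embedded chunks to total chunks. A rough cost model, assuming an average of 600 tokens per chunk (an illustrative figure, not measured):

```go
package main

import "fmt"

const (
	pricePerMTokens = 0.06 // Voyage AI, $ per million tokens (from the ADR)
	tokensPerChunk  = 600  // assumed average; not from the ADR
)

// embedCost estimates the dollar cost of embedding n chunks.
func embedCost(chunks int) float64 {
	return float64(chunks) * tokensPerChunk / 1e6 * pricePerMTokens
}

// savings is the percentage of API calls avoided relative to a full re-index.
func savings(apiCalls, totalChunks int) float64 {
	return 100 * (1 - float64(apiCalls)/float64(totalChunks))
}

func main() {
	fmt.Printf("full re-index:  $%.4f\n", embedCost(10000))
	fmt.Printf("1 file changed: $%.5f (%.1f%% savings)\n", embedCost(10), savings(10, 10000))
}
```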

Consequences

Positive

  • 99%+ cost reduction for incremental updates
  • Faster indexing (skip unchanged files entirely)
  • Idempotent: Re-running produces same result

Negative

  • Requires MetadataStore (PostgreSQL or file-based)
  • Additional complexity in pipeline
  • Cache invalidation on embedding model change
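One way to soften the last point (a sketch, not the current implementation): fold the embedding model identifier into the cache key, so a model upgrade becomes an automatic full cache miss rather than a manual flush. The model names below are illustrative:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives the lookup key from both the chunk content and the
// embedding model identifier, so switching models invalidates every
// cached embedding without an explicit flush.
func cacheKey(model string, content []byte) string {
	h := sha256.New()
	h.Write([]byte(model))
	h.Write([]byte{0}) // separator to avoid ambiguous concatenation
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	c := []byte("func main() {}")
	// Same chunk, different model: keys differ, forcing a re-embed.
	fmt.Println(cacheKey("voyage-code-2", c) != cacheKey("voyage-3", c)) // true
}
```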

Future: Merkle Tree (Not Yet Implemented)

For very large repos, a Merkle tree could provide:

  • O(log n) change detection for directories
  • Efficient diff between index versions
  • Proof of index integrity

Current content hashing is sufficient for repos <100k files.
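The core Merkle operation would be a directory hash derived from sorted child hashes; a minimal sketch of what the (not yet implemented) tree would compute:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// dirHash combines child hashes (file or subdirectory) into a single
// directory-level hash: if two directory hashes match, the entire
// subtree can be skipped during change detection.
func dirHash(children map[string]string) string {
	names := make([]string, 0, len(children))
	for name := range children {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order regardless of map iteration
	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte(children[name]))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	v1 := map[string]string{"a.go": "h1", "b.go": "h2"}
	v2 := map[string]string{"a.go": "h1", "b.go": "h2-changed"}
	fmt.Println(dirHash(v1) == dirHash(v2)) // false: descend into this subtree
}
```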

Related Files

  • cortex-context/internal/indexer/pipeline.go - Implementation
  • cortex-context/internal/store/metadata.go - MetadataStore interface