
ADR-001: Core Architecture Choices

Date: 2026-02-09
Status: Accepted
Deciders: Jeff Mosley, Brian Moore

Context

We're building Cortex, a self-hosted AI coding assistant that aims to replicate roughly 80% of the functionality of Augment Code and Cursor. We need to make foundational decisions about:

  1. LLM inference server
  2. Vector database
  3. Programming language
  4. AST parsing library
  5. Embedding and code models

These choices must align with our infrastructure constraints:

  • NO Docker (LXC containers only)
  • Must integrate with the existing Zitadel, Vault, and LGTM stack
  • HA from day one
  • Self-hosted, no external API calls

Decision

LLM Server: Ollama

Choice: Ollama over vLLM, llama.cpp, or text-generation-inference

Rationale:

  • Native binary, no Docker required
  • Simple REST API compatible with the OpenAI format
  • Easy model management (ollama pull, ollama list)
  • Supports both embedding and chat models
  • Active development and a good community

Vector Database: Qdrant

Choice: Qdrant over Milvus, Weaviate, pgvector, or Chroma

Rationale:

  • Native binary available (no Docker)
  • Built-in Raft clustering for HA
  • Excellent performance at our scale (< 1M vectors)
  • Rich filtering and payload support
  • gRPC and REST APIs

Programming Language: Go

Choice: Go over Rust, Python, or TypeScript

Rationale:

  • Matches the existing codebase (portal, graphql-gateway, infra-clients)
  • Fast compilation, single-binary deployment
  • Excellent HTTP server libraries (Chi)
  • Good concurrency primitives
  • Team familiarity

AST Parser: Tree-sitter

Choice: Tree-sitter over go/ast, babel, or regex

Rationale:

  • Language-agnostic (Go, Python, TypeScript, JavaScript, etc.)
  • Incremental parsing (fast re-parse on edits)
  • Mature Go bindings (go-tree-sitter)
  • Used by GitHub, Neovim, and Zed
  • Produces concrete syntax trees with position info

Embedding Model: nomic-embed-text

Choice: nomic-embed-text over text-embedding-ada-002, e5, or bge

Rationale:

  • Runs locally via Ollama (no API calls)
  • 768 dimensions (a good balance of quality vs. storage)
  • Trained on code and text
  • Fast inference on CPU

Code Model: deepseek-coder:6.7b

Choice: deepseek-coder:6.7b over codellama, starcoder, or wizardcoder

Rationale:

  • Best benchmark scores in the 6–7B parameter range
  • Good instruction following
  • Supports fill-in-the-middle (FIM) format
  • Runs on our hardware (32 GB RAM per Ollama node)

Consequences

Positive

  • All components run as native binaries in LXC containers
  • No external API dependencies (fully self-hosted)
  • HA achievable with standard patterns (HAProxy, Raft)
  • Consistent with existing infrastructure patterns
  • Team can maintain and extend the codebase
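The "standard patterns" point can be made concrete with a minimal HAProxy sketch for the Ollama tier; node names, addresses, and the health-check path are illustrative assumptions, not values from this ADR:

```
# Hypothetical HAProxy snippet: round-robin across Ollama nodes.
frontend ollama_front
    bind *:11434
    default_backend ollama_nodes

backend ollama_nodes
    balance roundrobin
    option httpchk GET /api/tags
    server ollama1 10.0.0.11:11434 check
    server ollama2 10.0.0.12:11434 check
```

Qdrant's Raft clustering handles its own failover, so only the stateless Ollama tier needs this kind of load balancing.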

Negative

  • Ollama is less optimized than vLLM for high-throughput inference
  • Qdrant requires more memory than pgvector
  • Go tree-sitter bindings are less mature than the Rust bindings
  • deepseek-coder:6.7b is slower than smaller models

Risks

  • Ollama may not scale if we need >10 concurrent users
    Mitigation: add more Ollama nodes behind HAProxy
  • Tree-sitter grammar updates may break parsing
    Mitigation: pin grammar versions and test before updating

Alternatives Considered

vLLM

  • Pros: Better throughput, PagedAttention
  • Cons: Requires Docker, more complex setup
  • Rejected: Docker constraint

pgvector

  • Pros: Uses existing PostgreSQL, simpler ops
  • Cons: Slower at scale, no native clustering
  • Rejected: Performance concerns at 100k+ vectors

Rust

  • Pros: Better performance, memory safety
  • Cons: Team unfamiliar, longer development time
  • Rejected: Velocity more important than peak performance

CodeLlama

  • Pros: Meta-backed, well-known
  • Cons: Worse benchmarks than deepseek at same size
  • Rejected: Quality difference noticeable in testing