# ADR-001: Core Architecture Choices

Date: 2026-02-09
Status: Accepted
Deciders: Jeff Mosley, Brian Moore
## Context
We're building Cortex, a self-hosted AI coding assistant that aims to replicate 80% of Augment Code/Cursor functionality. We need to make foundational decisions about:
- LLM inference server
- Vector database
- Programming language
- AST parsing library
- Embedding and code models
These choices must align with our infrastructure constraints:
- NO Docker (LXC containers only)
- Must integrate with the existing Zitadel, Vault, and LGTM stack
- HA from day one
- Self-hosted, no external API calls
## Decision
### LLM Server: Ollama
Choice: Ollama over vLLM, llama.cpp, or text-generation-inference
Rationale:
- Native binary, no Docker required
- Simple REST API compatible with OpenAI format
- Easy model management (ollama pull, ollama list)
- Supports both embedding and chat models
- Active development, good community
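Because Ollama exposes an OpenAI-compatible REST surface (`/v1/chat/completions`), standard client code works against it unchanged. A minimal Go sketch of the request payload; the HTTP call itself is omitted since it needs a running Ollama node:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatMessage and chatRequest mirror the OpenAI-style payload that
// Ollama accepts on its /v1/chat/completions compatibility endpoint.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

// buildChatRequest marshals a single-turn chat request for the given model.
func buildChatRequest(model, prompt string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
		Stream:   false,
	})
}

func main() {
	body, err := buildChatRequest("deepseek-coder:6.7b", "Explain this function.")
	if err != nil {
		panic(err)
	}
	// POST this to http://<ollama-node>:11434/v1/chat/completions
	// (omitted here: it requires a live Ollama instance).
	fmt.Println(string(body))
}
```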
### Vector Database: Qdrant
Choice: Qdrant over Milvus, Weaviate, pgvector, or Chroma
Rationale:
- Native binary available (no Docker)
- Built-in Raft clustering for HA
- Excellent performance for our scale (< 1M vectors)
- Rich filtering and payload support
- gRPC and REST APIs
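The filtering-plus-payload point is what makes Qdrant a good fit for code search: a query can combine vector similarity with metadata constraints. A sketch of the request body for Qdrant's REST search endpoint (`POST /collections/<name>/points/search`); the field names follow the Qdrant API, while the payload key `language` is a hypothetical example of how we might tag chunks:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These structs mirror the body of Qdrant's REST search endpoint.
type match struct {
	Value string `json:"value"`
}

type condition struct {
	Key   string `json:"key"`
	Match match  `json:"match"`
}

type filter struct {
	Must []condition `json:"must"`
}

type searchRequest struct {
	Vector      []float32 `json:"vector"`
	Filter      *filter   `json:"filter,omitempty"`
	Limit       int       `json:"limit"`
	WithPayload bool      `json:"with_payload"`
}

// buildSearchRequest builds a filtered nearest-neighbour query:
// top-k vectors whose payload field matches the given value.
func buildSearchRequest(vec []float32, key, value string, limit int) ([]byte, error) {
	return json.Marshal(searchRequest{
		Vector:      vec,
		Filter:      &filter{Must: []condition{{Key: key, Match: match{Value: value}}}},
		Limit:       limit,
		WithPayload: true,
	})
}

func main() {
	body, _ := buildSearchRequest([]float32{0.1, 0.2}, "language", "go", 5)
	fmt.Println(string(body))
}
```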
### Programming Language: Go
Choice: Go over Rust, Python, or TypeScript
Rationale:
- Matches existing codebase (portal, graphql-gateway, infra-clients)
- Fast compilation, single-binary deployment
- Excellent HTTP server libraries (Chi)
- Good concurrency primitives
- Team familiarity
### AST Parser: Tree-sitter
Choice: Tree-sitter over go/ast, babel, or regex
Rationale:
- Language-agnostic (Go, Python, TypeScript, JavaScript, etc.)
- Incremental parsing (fast re-parse on edits)
- Mature Go bindings (go-tree-sitter)
- Used by GitHub, Neovim, and Zed
- Produces concrete syntax trees with position info
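A sketch of the go-tree-sitter bindings in use (`github.com/smacker/go-tree-sitter`); API details can differ between binding versions, so treat this as illustrative rather than exact:

```go
package main

import (
	"context"
	"fmt"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/golang"
)

func main() {
	src := []byte("func add(a, b int) int { return a + b }")

	parser := sitter.NewParser()
	parser.SetLanguage(golang.GetLanguage())

	// ParseCtx returns a concrete syntax tree with byte/row/column
	// positions on every node; passing the previous tree instead of
	// nil is what enables incremental re-parsing after edits.
	tree, err := parser.ParseCtx(context.Background(), nil, src)
	if err != nil {
		panic(err)
	}
	defer tree.Close()

	root := tree.RootNode()
	fmt.Println(root.Type())       // e.g. "source_file"
	fmt.Println(root.ChildCount()) // top-level declarations
}
```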
### Embedding Model: nomic-embed-text
Choice: nomic-embed-text over text-embedding-ada-002, e5, or bge
Rationale:
- Runs locally via Ollama (no API calls)
- 768 dimensions (good balance of quality vs. storage)
- Trained on code and text
- Fast inference on CPU
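To ground the storage claim, a back-of-envelope calculation for raw float32 vector storage at 768 dimensions (index and payload overhead excluded):

```go
package main

import "fmt"

// vectorBytes returns the raw storage for n float32 vectors of the
// given dimensionality (4 bytes per component; index and payload
// overhead excluded).
func vectorBytes(n, dims int) int {
	return n * dims * 4
}

func main() {
	// One indexed chunk -> one 768-dim vector. At the "< 1M vectors"
	// scale from the Qdrant section, raw vectors fit comfortably in RAM.
	perVector := vectorBytes(1, 768) // 3072 bytes
	millionGiB := float64(vectorBytes(1_000_000, 768)) / (1 << 30)
	fmt.Printf("%d bytes/vector, %.2f GiB per 1M vectors\n", perVector, millionGiB)
}
```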
### Code Model: deepseek-coder:6.7b
Choice: deepseek-coder:6.7b over codellama, starcoder, or wizardcoder
Rationale:
- Best benchmark scores in the 6-7B parameter range
- Good instruction following
- Supports fill-in-the-middle (FIM) format
- Runs on our hardware (32GB RAM per Ollama node)
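FIM support is what enables inline completion at a cursor position: the model is given the code before and after the gap and fills the middle. A sketch of prompt assembly; the sentinel tokens below are the ones documented for the deepseek-coder family, but they vary between models, so verify them against the model card of the exact build we deploy:

```go
package main

import "fmt"

// DeepSeek-coder FIM sentinel tokens (verify against the model card
// for the deployed build; other model families use different tokens).
const (
	fimBegin = "<｜fim▁begin｜>"
	fimHole  = "<｜fim▁hole｜>"
	fimEnd   = "<｜fim▁end｜>"
)

// buildFIMPrompt asks the model to fill the gap between prefix and
// suffix, i.e. complete at a cursor position inside existing code.
func buildFIMPrompt(prefix, suffix string) string {
	return fimBegin + prefix + fimHole + suffix + fimEnd
}

func main() {
	prompt := buildFIMPrompt("func add(a, b int) int {\n\t", "\n}")
	fmt.Println(prompt)
}
```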
## Consequences
### Positive
- All components run as native binaries in LXC containers
- No external API dependencies (fully self-hosted)
- HA achievable with standard patterns (HAProxy, Raft)
- Consistent with existing infrastructure patterns
- Team can maintain and extend the codebase
### Negative
- Ollama is less optimized than vLLM for high-throughput inference
- Qdrant requires more memory than pgvector
- Go tree-sitter bindings are less mature than the Rust ones
- deepseek-coder:6.7b is slower than smaller models
### Risks
- Ollama may not scale if we need >10 concurrent users
  - Mitigation: add more Ollama nodes behind HAProxy
- Tree-sitter grammar updates may break parsing
  - Mitigation: pin grammar versions and test before updating
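The HAProxy mitigation above is a standard pattern; a minimal sketch (node names and addresses are hypothetical), health-checking each Ollama node via its real `/api/tags` endpoint:

```
frontend ollama
    bind *:11434
    mode http
    default_backend ollama_nodes

backend ollama_nodes
    mode http
    balance leastconn
    option httpchk GET /api/tags
    server ollama-1 10.0.10.11:11434 check
    server ollama-2 10.0.10.12:11434 check
```

`leastconn` suits LLM inference better than round-robin because request durations vary widely; a long generation should not block a node from being skipped for new work.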
## Alternatives Considered
### vLLM
- Pros: Better throughput, PagedAttention
- Cons: Requires Docker, more complex setup
- Rejected: Docker constraint
### pgvector
- Pros: Uses existing PostgreSQL, simpler ops
- Cons: Slower at scale, no native clustering
- Rejected: Performance concerns at 100k+ vectors
### Rust
- Pros: Better performance, memory safety
- Cons: Team unfamiliar, longer development time
- Rejected: Velocity more important than peak performance
### CodeLlama
- Pros: Meta-backed, well-known
- Cons: Worse benchmarks than deepseek at same size
- Rejected: Quality difference noticeable in testing