18 — Architecture Recommendations
What Was Researched
Synthesized architecture recommendations for building a modern, model-agnostic agent harness, based on findings from all 17 preceding research directories and 10 local codebase studies.
Which Sources Were Used
All previous research directories (01–15, 17) and all 10 local codebases.
Recommended Architecture: The Composite Approach
This is not a merge of reference codebases. Hermes, Codex, Pi, LangGraph, OpenClaw, LiteLLM, assistant-ui, LibreChat, and others were studied to extract patterns. The recommendation is to build one new harness: pick the best idea per layer and connect them via standard interfaces (OpenAI-compat API, MCP, SSE, SKILL.md). Do not fork and glue every project into a single repo.
Core Principle: Narrow Waist, Rich Edges
Adopt Hermes's "narrow core, capability at edges" principle as the foundational design constraint.
┌──────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Coding Agent │ Personal Assistant │ Research Agent │ Custom │
├──────────────────────────────────────────────────────────────┤
│ GATEWAY / ORCHESTRATION │
│ Multi-channel routing │ Session management │ Cron scheduler │
├──────────────────────────────────────────────────────────────┤
│ AGENT RUNTIME │
│ Agent loop │ Tool dispatch │ Memory │ Skills │ Subagents │
├──────────────────────────────────────────────────────────────┤
│ AI SDK (NARROW WAIST) │
│ Provider translation │ Streaming │ Tool calling │ Caching │
├──────────────────────────────────────────────────────────────┤
│ MODEL PROVIDERS │
│ OpenAI │ Anthropic │ Google │ xAI │ Ollama │ OpenRouter │ … │
└──────────────────────────────────────────────────────────────┘
Layer 1: AI SDK (Narrow Waist)
Recommendation: Standardize on an OpenAI-compatible client (base_url + api_key). LiteLLM is optional — not required.
| Backend | Typical base_url | Good for |
|---|---|---|
| Ollama (local) | http://localhost:11434/v1 | Dev, privacy, no cloud keys |
| OpenRouter (hosted) | https://openrouter.ai/api/v1 | One key, many models, fast setup |
| LiteLLM (self-hosted proxy) | Your proxy /v1 | Auth, budgets, 100+ providers at scale |
| Direct provider | api.openai.com, Anthropic, etc. | Single vendor, simplest path |
| Decision | Recommendation | Rationale |
|---|---|---|
| Wire format | OpenAI Chat Completions / Responses | Works with Ollama, OpenRouter, LiteLLM, and most SDKs unchanged |
| Provider translation | Only when you need it | OpenRouter and Ollama already speak OpenAI-compat; LiteLLM adds value for self-hosted multi-tenant routing |
| Model specification | provider/model or plain model id | Depends on backend (OpenRouter: anthropic/claude-3.5-sonnet; Ollama: llama3.2) |
| Streaming | SSE (Server-Sent Events) | Standard across all backends above |
| Type system | TypeScript with Zod | Pi/OpenRouter SDK pattern for type safety |
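The narrow-waist idea above (one wire format, backend chosen by base_url + api_key) can be sketched as a small request builder. This is a minimal sketch, not the harness's actual client: `Backend`, `ChatMessage`, `ChatRequest`, and `buildChatRequest` are hypothetical names, and only the base URLs and model ids come from the tables above. It builds the request without sending it, so the same shape can be handed to any HTTP client.

```typescript
// Hypothetical types for illustration; not taken from any referenced codebase.
interface Backend {
  baseUrl: string; // e.g. "http://localhost:11434/v1" for Ollama
  apiKey?: string; // local backends such as Ollama typically need none
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds an OpenAI-compatible Chat Completions request for any backend.
function buildChatRequest(
  backend: Backend,
  model: string,
  messages: ChatMessage[],
  stream = true,
): ChatRequest {
  // Tolerate a trailing slash on the configured base URL.
  const base = backend.baseUrl.endsWith("/")
    ? backend.baseUrl.slice(0, -1)
    : backend.baseUrl;
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (backend.apiKey !== undefined) {
    headers["Authorization"] = `Bearer ${backend.apiKey}`;
  }
  return {
    url: `${base}/chat/completions`,
    headers,
    // stream: true asks the backend for SSE chunks.
    body: JSON.stringify({ model, messages, stream }),
  };
}

const ollamaRequest = buildChatRequest(
  { baseUrl: "http://localhost:11434/v1" },
  "llama3.2",
  [{ role: "user", content: "Hello" }],
);
```

Switching to OpenRouter or a LiteLLM proxy would then only change the `Backend` value and the model id, which is the point of keeping the waist narrow.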
Layer 2: Agent Runtime
Recommendation: While-loop agent with LangGraph-inspired state management
| Decision | Recommendation | Rationale |
|---|---|---|
| Agent loop | While-loop with budget tracking | Simpler than graph, proven (Hermes/Pi/Codex) |
| Iteration cap | Default 90, configurable | Hermes's default, prevents runaway costs |
| Grace call | One extra turn after budget | Hermes pattern, improves completion quality |
| State persistence | SQLite + FTS5 checkpoints | Zero-dependency, fast, proven (Hermes) |
| Memory | MemoryProvider ABC | Hermes pattern, pluggable backends |
| Skills | SKILL.md format | Emerging standard |
| Subagents | Context-isolated, budget-shared | Hermes delegation pattern |
| Tool registry | Auto-discovery with register() | Hermes pattern |
| Tool extensibility | MCP first, plugins second | Codex + Hermes principle |
| Token Calibration | Cumulative billing ratio scaling | Corrects local tokenizer discrepancies against provider bills CLAIM-183 |
| Overhead Calibration | Dynamic tool schema ceiling (15% variance) | Corrects estimated schema overhead using real provider feedback CLAIM-184 |
| Multi-Agent Handoffs | LangGraph Command-based transfers | The outgoing agent returns a Command that targets the parent graph (Command.PARENT); the receiving agent filters the incoming context CLAIM-185, CLAIM-186 |
| Observation Masking | Character-limited ToolMessage previews (~300 chars) | Mask consumed tool results above 80% context pressure to keep system cache hits high CLAIM-187 |
| Summary Infiltration | HumanMessage injection on clean state | Mid-run summaries compete for message budget rather than inflating system instructions CLAIM-188 |
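The while-loop, iteration cap, and grace call from the table can be sketched as follows. This is a minimal sketch under stated assumptions: `callModel` and `runTool` are hypothetical stand-ins for the AI SDK layer and the tool registry, history is simplified to plain strings, and the grace call is assumed to be a final turn with tools disabled. The source only says "one extra turn after budget", so the exact semantics may differ in Hermes.

```typescript
// Hypothetical shapes; a real harness would use structured messages.
type ModelTurn =
  | { kind: "tool_call"; tool: string; args: string }
  | { kind: "final"; text: string };

type CallModel = (history: string[], allowTools: boolean) => Promise<ModelTurn>;
type RunTool = (tool: string, args: string) => Promise<string>;

interface LoopResult {
  text: string;
  iterations: number;
  usedGraceCall: boolean;
}

async function runAgentLoop(
  callModel: CallModel,
  runTool: RunTool,
  userInput: string,
  maxIterations = 90, // default cap from the table above
): Promise<LoopResult> {
  const history: string[] = [`user: ${userInput}`];
  let iterations = 0;

  while (iterations < maxIterations) {
    iterations += 1;
    const turn = await callModel(history, true);
    if (turn.kind === "final") {
      return { text: turn.text, iterations, usedGraceCall: false };
    }
    // Execute the requested tool and feed the result back to the model.
    const result = await runTool(turn.tool, turn.args);
    history.push(`assistant: call ${turn.tool}(${turn.args})`, `tool: ${result}`);
  }

  // Budget exhausted: one grace call (assumed: tools disabled) to wrap up.
  history.push("user: Iteration budget exhausted. Summarize your progress and stop.");
  const grace = await callModel(history, false);
  const text = grace.kind === "final" ? grace.text : "(no final answer within budget)";
  return { text, iterations, usedGraceCall: true };
}
```

Token and cost budgets would slot into the same loop condition; only the iteration budget is shown here to keep the sketch short.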
Layer 3: Gateway / Orchestration
Recommendation: OpenClaw-inspired gateway with Hermes features
| Decision | Recommendation | Rationale |
|---|---|---|
| Architecture | Gateway-centric (OpenClaw) | Scalable, multi-channel |
| Language | TypeScript (Node.js) | Full-stack, shared with frontend |
| Session management | Per-channel isolated sessions | OpenClaw pattern |
| Multi-agent | Route channels to agents | OpenClaw pattern |
| Cron scheduling | Natural language + cron syntax | Hermes pattern |
| Platform delivery | Webhook-based routing | Both Hermes and OpenClaw |
Layer 4: Application Layer
Recommendation: Modular applications built on the runtime
| Application | Stack | Reference |
|---|---|---|
| CLI | prompt_toolkit or Ink | Hermes/Pi |
| TUI | Ink (React in terminal) | Hermes |
| Desktop | Electron + assistant-ui | Hermes |
| Web dashboard | React + Vite + assistant-ui | OpenClaw |
| Mobile | React Native / WebSocket node | OpenClaw |
Technology Choices
Primary Stack
| Component | Technology | Why |
|---|---|---|
| Agent core | Python 3.11+ | LLM SDK ecosystem dominance |
| Gateway | TypeScript (Node 24+) | Full-stack, shared with frontend |
| Frontend | React + assistant-ui | Reference component library |
| Package manager (Python) | uv | Modern, fast, reproducible |
| Package manager (TS) | pnpm workspace | Efficient monorepo management |
| Local database | SQLite + FTS5 | Zero-dependency, fast |
| Scale database | PostgreSQL + Redis | When multi-user is needed |
| TTS | ElevenLabs (pluggable) | Most common choice among the open-source harnesses studied (the service itself is commercial) |
| Observability | OpenTelemetry | Standard, LangSmith-compatible |
Model Routing Strategy (5-Tier)
From 13_model_agnostic_harness_architecture/model_landscape_june_2026.md:
| Tier | Use Case | Recommended Default |
|---|---|---|
| Frontier reasoning | Complex analysis, planning | Claude Fable 5 or GPT-5.5 |
| Fast frontier | Standard agent tasks | Kimi K2.7 Code or Nemotron 3 Ultra |
| Flash | Real-time, high-volume | Gemini 3.5 Flash or GPT-5.4 Mini |
| Nano | Embeddings, classification | GPT-5.4 Nano |
| Voice | Audio I/O | GPT Audio or Grok Voice TTS |
Security Model
| Aspect | Recommendation | Reference |
|---|---|---|
| Sandboxing | Docker (default) + OS-native (option) | Codex |
| Dependency pinning | >=floor,<ceiling | Hermes |
| Supply-chain | Exact versions + shrinkwrap | Pi |
| Tool approval | User approval for destructive ops | Hermes |
| Skill validation | AST-level auditing | Hermes |
| Context files | Read-only from agent | All |
File Convention Standards
| File | Purpose | Standard |
|---|---|---|
| AGENTS.md | Project instructions | De facto universal |
| SKILL.md | Procedural knowledge | agentskills.io |
| SOUL.md | Agent persona | OpenClaw |
| config.yaml | User configuration | Hermes |
| .env | Secrets only | Hermes |
Critical Design Constraints
- Prompt Caching is Sacred (Design for Byte-Stability)
  - What/Why: Modern frontier models (Anthropic, DeepSeek, OpenAI) charge up to 90% less for cached input tokens. Prompt caching works by storing prefix spans in memory. Any modification to a prefix invalidates the cache for everything after the change.
  - When to Use: Essential for multi-turn cognitive loops where the system instruction and tool definitions remain static, but message histories accumulate.
  - How to Design:
    - Order Matters: Place volatile variables (such as the current date/time, user query, and dynamic database context) at the absolute tail of the message array. Keep the system prompt, guidelines, and tool schemas at the head.
    - Byte-Stability: Normalize whitespace, sort tools alphabetically by name, and standardize date strings to UTC-day-only formats.
    - Tailored Slicing: Ensure the prefix reaches the model-defined minimum cacheable length (e.g., 1024 tokens on most Anthropic models, with higher minimums on some others).
- Minimize Regex & Deterministic Logic (LLM-First and AST Parsing)
  - Constraint: Restrict the use of regular expressions to trivial boundary validation (e.g., prefix checks like `text.startsWith('/')`). Never use regex to parse tool call payloads, JSON configurations, or nested structures.
  - Rationale: Regex parsing is brittle to formatting variations, markdown wrapping, and escaping, and it risks catastrophic backtracking (ReDoS).
  - Alternatives:
    - Programmatic Parsers: Use standard JSON/JSON5 parsers or AST builders to parse code blocks or JSON payloads.
    - LLM-First Parsing: Use model-native structured output schemas (JSON schema validation) or delegate parsing to a fast, dedicated LLM helper.
- Tool footprint matters — Every core tool costs tokens on every API call.
- Context discipline — Implement both caps and compression.
- Budget tracking from day one — Token + cost + iteration budgets.
- Message role alternation — Never two same-role messages in a row.
- Skills as user messages — Don't inject into system prompt.
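The byte-stability constraint can be illustrated with a small prompt assembler. This is a sketch under assumptions: `ToolSchema`, `buildStablePrefix`, and `buildVolatileTail` are hypothetical names, and tool schemas are reduced to a name and description. It shows the three rules from the list: normalized whitespace, tools sorted by name, and a UTC-day-only date kept in the volatile tail rather than the cached head.

```typescript
// Hypothetical, simplified tool schema for illustration.
interface ToolSchema {
  name: string;
  description: string;
}

// Stable head: identical bytes for identical inputs, regardless of tool order.
function buildStablePrefix(systemPrompt: string, tools: ToolSchema[]): string {
  const sortedTools = [...tools].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );
  const toolLines = sortedTools.map((t) => `${t.name}: ${t.description.trim()}`);
  return [systemPrompt.trim(), ...toolLines].join("\n");
}

// Volatile tail: the date changes at most once per UTC day, the query every turn.
function buildVolatileTail(now: Date, userQuery: string): string {
  const utcDay = now.toISOString().slice(0, 10); // "YYYY-MM-DD", always UTC
  return `Date (UTC): ${utcDay}\n${userQuery}`;
}

const prefix = buildStablePrefix("You are a coding agent.", [
  { name: "write_file", description: "Write a file" },
  { name: "read_file", description: "Read a file" },
]);
```

Because the prefix depends only on the system prompt and the sorted tool list, registering tools in a different order at startup does not change the cached bytes.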
Tracing & Observability Architecture
To debug agent loops, analyze performance bottlenecks, and monitor production runs, the harness must support unified tracing.
1. Tracing Standard: OpenTelemetry (OTel)
- Recommendation: Use OpenTelemetry Semantic Conventions for GenAI to decouple tracing from specific backend vendors.
- Implementation: Instrument LLM clients and tool dispatch functions with OTel Spans. Trace attributes should follow standard keys:
  - `gen_ai.system` (e.g., `openai`, `anthropic`)
  - `gen_ai.request.model` and `gen_ai.response.model`
  - `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`
  - `gen_ai.request.temperature`
2. Tracing Tool Integrations
- LangSmith (Commercial/Enterprise):
  - Use Case: Deep run nesting visualizer, run feedback loops (attaching feedback scores to runs), and exporting failing agent runs directly into test datasets.
  - Integration: Seamlessly enabled via environment flags (`LANGCHAIN_TRACING_V2=true`) and standard callback handlers.
- Langfuse (Open-Source/Self-Hostable):
  - Use Case: Zero-dependency self-hosted tracing. Offers visual traces, cost estimation, prompt version management, and SDK middleware.
  - Integration: Connect via `@langfuse/typescript` or `langfuse` Python SDK using decorators or client wrappers.
- Arize Phoenix (Open-Source/Local-First):
  - Use Case: Local Jupyter notebook evaluations and zero-config local tracer servers. Excellent for checking prompt drift and conducting cosine-similarity evals on agent retrievals.
  - Integration: Run `phoenix.server.start()` locally and export OTel traces directly to the local collector endpoint (`http://localhost:6006/v1/traces`).
3. Decoupled Middleware Pattern (Callbacks)
- Do not hardcode vendor-specific tracing code directly inside the agent loop. Instead, implement a Lifecycle Callback Registry in the runtime:

```typescript
interface AgentLifecycleCallbacks {
  onLlmStart?: (runId: string, prompt: Message[]) => void;
  onLlmEnd?: (runId: string, response: LlmResponse) => void;
  onToolStart?: (runId: string, toolName: string, args: any) => void;
  onToolEnd?: (runId: string, toolName: string, result: any) => void;
  onException?: (runId: string, error: Error) => void;
}
```

This callback interface broadcasts events to active tracers (LangSmith, Langfuse, OTel) without leaking tracking libraries into cognitive reasoning files.
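One way such a registry could fan events out to several tracers is sketched below. The assumptions are loud ones: `CallbackRegistry` is a hypothetical class, `Message` and `LlmResponse` are placeholder shapes (the source does not define them), and the interface is restated with `unknown` in place of `any` so the block is self-contained. A failing tracer is isolated so that observability code cannot break the agent loop.

```typescript
// Placeholder shapes; the source does not define Message or LlmResponse.
type Message = { role: string; content: string };
type LlmResponse = { text: string };

interface AgentLifecycleCallbacks {
  onLlmStart?: (runId: string, prompt: Message[]) => void;
  onLlmEnd?: (runId: string, response: LlmResponse) => void;
  onToolStart?: (runId: string, toolName: string, args: unknown) => void;
  onToolEnd?: (runId: string, toolName: string, result: unknown) => void;
  onException?: (runId: string, error: Error) => void;
}

// Hypothetical registry: the agent loop calls it; tracers subscribe to it.
class CallbackRegistry implements Required<AgentLifecycleCallbacks> {
  private readonly tracers: AgentLifecycleCallbacks[] = [];

  register(tracer: AgentLifecycleCallbacks): void {
    this.tracers.push(tracer);
  }

  onLlmStart(runId: string, prompt: Message[]): void {
    this.broadcast((t) => t.onLlmStart?.(runId, prompt));
  }
  onLlmEnd(runId: string, response: LlmResponse): void {
    this.broadcast((t) => t.onLlmEnd?.(runId, response));
  }
  onToolStart(runId: string, toolName: string, args: unknown): void {
    this.broadcast((t) => t.onToolStart?.(runId, toolName, args));
  }
  onToolEnd(runId: string, toolName: string, result: unknown): void {
    this.broadcast((t) => t.onToolEnd?.(runId, toolName, result));
  }
  onException(runId: string, error: Error): void {
    this.broadcast((t) => t.onException?.(runId, error));
  }

  // Invoke every tracer; swallow tracer errors so the loop keeps running.
  private broadcast(call: (tracer: AgentLifecycleCallbacks) => void): void {
    for (const tracer of this.tracers) {
      try {
        call(tracer);
      } catch {
        // A broken tracer must never abort the agent run.
      }
    }
  }
}
```

The agent loop would hold a single `CallbackRegistry` and never import a vendor SDK; each vendor adapter implements only the hooks it needs.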
Multi-Model Deliberation & Fusion Recommendations
When to Use Multi-Model Deliberation
| Scenario | Recommended Pattern | Reasoning |
|---|---|---|
| Research reports requiring citations | Panel + Judge (Fusion) | Diverse perspectives improve factual accuracy and coverage |
| Code review / security audit | Council / Debate | Peer critique catches bugs that single-model consensus misses |
| Complex multi-file refactoring | Supervisor-Worker Swarm | Task decomposition across specialist workers |
| High-stakes legal/medical analysis | Council with HITL gate | Peer review + human sign-off for critical decisions |
| Simple extraction / summarization | Single model | Fusion adds unnecessary latency and cost — it is an escalation lane |
Recommended Implementation
- Gateway-Level Fusion Tool: Register `harness__fusion` as a gateway-injected tool. The primary model decides when to invoke deliberation; it is not forced on every request CLAIM-145.
- Budget Panels First: Start with 3× Flash-tier panel + 1 frontier judge. Budget panels outperform standalone frontier models on DRACO at ~50% cost CLAIM-157.
- Structured JSON Judge Output: Judge must produce structured analysis (consensus, contradictions, blind spots), not a freeform merge CLAIM-145.
- Anonymize Panel Responses: Strip model identifiers before judge evaluation to prevent lab-bias CLAIM-150.
- Recursion Protection: Depth headers prevent infinite nested fusion calls CLAIM-146.
- `Promise.allSettled()` for Panel Dispatch: Partial panel failures should not abort the entire deliberation.
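The panel-dispatch items above (tolerating partial failures, anonymizing responses, and guarding against recursion) can be sketched together. Everything named here is hypothetical: `AskModel`, `dispatchPanel`, `MAX_FUSION_DEPTH`, and the "Response A/B/C" labels are illustrative choices, and the depth counter stands in for whatever depth header the gateway actually uses.

```typescript
// Hypothetical model-call signature; a real one would go through the AI SDK layer.
type AskModel = (model: string, prompt: string) => Promise<string>;

interface AnonymizedResponse {
  label: string; // carries no model identifier
  text: string;
}

// Assumed limit: fusion may not be invoked from inside another fusion call.
const MAX_FUSION_DEPTH = 1;

async function dispatchPanel(
  models: string[],
  prompt: string,
  ask: AskModel,
  depth = 0,
): Promise<AnonymizedResponse[]> {
  if (depth >= MAX_FUSION_DEPTH) {
    throw new Error(`fusion depth ${depth} exceeds limit ${MAX_FUSION_DEPTH}`);
  }

  // allSettled waits for every panelist and never rejects on partial failure.
  const settled = await Promise.allSettled(models.map((m) => ask(m, prompt)));

  const texts: string[] = [];
  for (const outcome of settled) {
    if (outcome.status === "fulfilled") texts.push(outcome.value);
  }

  // Label by position only, so the judge cannot see which lab produced which answer.
  return texts.map((text, i) => ({
    label: `Response ${String.fromCharCode(65 + i)}`,
    text,
  }));
}
```

The judge would then receive only the labeled texts. Response bodies that name their own model would need additional scrubbing, which this sketch does not attempt.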
Framework Selection for Deliberation
| Framework | Best Pattern | Why |
|---|---|---|
| LangGraph | All patterns (via StateGraph) | Production-grade, auditable, supports cycles and conditional edges |
| CrewAI | Supervisor-Worker | Intuitive role-based setup, quick prototyping |
| OpenAI Agents SDK | Supervisor-Worker (handoffs) | Production-grade, built-in tracing |
| Microsoft Agent Framework | Enterprise graph workflows | Type-safe, Azure-native, successor to AutoGen |
| Custom Gateway Tool | Panel + Judge (Fusion) | Lightweight, no framework dependency; dispatches via configured model backend (OpenRouter, Ollama, LiteLLM, etc.) |
For detailed research including taxonomy, self-hosted implementation code, anti-patterns, benchmarks, and decision matrices, see multi_model_deliberation_and_swarms.md.
Generative UI & MCP UI Recommendations
Core Recommendations for Dynamic UIs
| Recommendation | Implementation Details | Rationale |
|---|---|---|
| Declarative Registry Gating | Client-side Component Registry mapping schemas to React components | Blocks arbitrary code execution, establishing a strict security perimeter CLAIM-171, CLAIM-175 |
| Isolated Sandbox Rendering | Load remote MCP widgets inside an isolated `<iframe>` with `sandbox="allow-scripts"` and a strict Content Security Policy (CSP) | Prevents remote server templates from stealing user cookies or parent window access CLAIM-175 |
| Bi-directional postMessage Sync | Establish JSON-RPC bridges over postMessage to sync iframe widget state to host agent variables | Keeps host-engine and visual views in lockstep, letting user interactions run subsequent tools CLAIM-174 |
| Stateless Core Tasks | Implement client-driven durable state machines (durable Tasks) storing checkpoint state in SQLite | Eliminates TCP/HTTP socket exhaustion on background loops CLAIM-177, CLAIM-178 |
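Declarative registry gating can be sketched as a lookup that refuses anything not explicitly registered. This is a minimal sketch with hypothetical names throughout: `COMPONENT_REGISTRY`, `resolveComponent`, and the `WeatherCard` entry are invented for illustration, and hand-written validators stand in for the schema library (such as Zod) a real client would likely use.

```typescript
type Validator = (props: Record<string, unknown>) => boolean;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical allowlist: component name -> props validator.
const COMPONENT_REGISTRY = new Map<string, Validator>([
  ["WeatherCard", (p) => typeof p["city"] === "string" && typeof p["tempC"] === "number"],
  ["Checklist", (p) => Array.isArray(p["items"])],
]);

type RenderDecision =
  | { ok: true; component: string; props: Record<string, unknown> }
  | { ok: false; reason: string };

// Accepts only registered components whose props pass validation.
function resolveComponent(payloadJson: string): RenderDecision {
  let payload: unknown;
  try {
    payload = JSON.parse(payloadJson); // standard parser, no regex
  } catch {
    return { ok: false, reason: "invalid JSON" };
  }
  if (!isRecord(payload)) return { ok: false, reason: "payload is not an object" };

  const name = payload["component"];
  if (typeof name !== "string") return { ok: false, reason: "missing component name" };

  const validate = COMPONENT_REGISTRY.get(name);
  if (validate === undefined) {
    return { ok: false, reason: `unregistered component: ${name}` };
  }

  const props = payload["props"];
  if (!isRecord(props) || !validate(props)) {
    return { ok: false, reason: "props failed validation" };
  }
  return { ok: true, component: name, props };
}
```

The model can only select among components the client already ships, so a payload asking for a script or raw HTML is rejected before anything is rendered.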
Framework Selection for Agent UIs
- Vercel AI SDK (AI SDK UI): Use for progressive JSON token stream rendering, letting client components render layouts immediately as parameter nodes materialize CLAIM-172.
- CopilotKit: Use for active state updates (AG-UI protocol) syncing client components directly with background agent states CLAIM-173.
- Mastra AI & `mcp-use`: Use to construct and expose rich visual tools on local/remote MCP servers CLAIM-176.
For detailed security guidelines, postMessage schemas, and Tasks lifecycles, see the dedicated document: mcp_apps_and_ui.md.
Human-in-the-Loop & Execution Control Recommendations
Core HITL Design Guidelines
| Aspect | Recommendation | Rationale |
|---|---|---|
| Conversation Steering | Implement dynamic user message injection between tool executions (mid-turn) rather than force-killing the session context. Enforce strict alternation constraints immediately post-injection CLAIM-189. | Lets users correct errors in real-time without wasting historical token context and compute CLAIM-189. |
| Request-Local Aborts | Wire up request-local cancellation tokens to worker exception handlers. When force-closing connections, check this token to bypass default connection retries CLAIM-191. | Prevents cascading retry hangs where cancelled workers persist and conflict with subsequent user messages (PR #6600 fix) CLAIM-191. |
| Granular Bypass Policies | Support 3 bypass policy levels: Ask (micro-confirmations for every tool execution), Session Bypass (auto-approve tools during current session), and Workspace Safe (auto-approve non-sensitive workspace paths) CLAIM-194. | Maximizes developer velocity while maintaining tight boundaries for critical resources. |
| Path Sensitivity Gates | Hardcode file edit exclusions for sensitive files (e.g. .env, .git/config, SSH keys) to always force interactive approvals, regardless of bypass policy settings CLAIM-195. | Mitigates prompt injection attacks targeting private credentials. |
| Headless Swarm Auto-Deny | Subagent loops triggered by webhooks or cron workers should default to auto-deny policies when prompting for dangerous tool execution CLAIM-196. | Prevents runaway resource consumption or automated workspace corruption in unattended loops CLAIM-196. |
For detailed code paradigms, cascading cancellation fixes, and framework implementations, see the dedicated document: human_in_the_loop_steering.md.
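The interaction between bypass policies, path sensitivity gates, and headless auto-deny can be sketched as one decision function. This is an illustration under assumptions: the type names, the `decide` function, and the specific sensitive-path patterns are hypothetical, and the sketch assumes that a headless run turns every would-be prompt into a denial.

```typescript
type BypassPolicy = "ask" | "session_bypass" | "workspace_safe";
type ApprovalDecision = "allow" | "ask_user" | "deny";

interface ToolRequest {
  path: string;             // file the tool wants to edit
  insideWorkspace: boolean; // resolved by the caller
  headless: boolean;        // true for webhook- or cron-triggered runs
}

// Hypothetical patterns covering the examples in the table above.
const SENSITIVE_FILES = [".env", ".git/config"];
const SENSITIVE_DIRS = [".ssh"];

function isSensitivePath(path: string): boolean {
  const normalized = path.split("\\").join("/");
  const isSensitiveFile = SENSITIVE_FILES.some(
    (f) => normalized === f || normalized.endsWith(`/${f}`),
  );
  const inSensitiveDir = normalized
    .split("/")
    .some((segment) => SENSITIVE_DIRS.includes(segment));
  return isSensitiveFile || inSensitiveDir;
}

function decide(policy: BypassPolicy, request: ToolRequest): ApprovalDecision {
  // Sensitive paths always need a human, regardless of bypass policy.
  const needsHuman =
    isSensitivePath(request.path) ||
    policy === "ask" ||
    (policy === "workspace_safe" && !request.insideWorkspace);

  if (!needsHuman) return "allow";
  // Nobody is available to answer a prompt in an unattended run.
  return request.headless ? "deny" : "ask_user";
}
```

Keeping the sensitivity check ahead of the policy check is the design point: no policy level can widen access to credentials.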
Agent Scratchpad & Graph-Based Session Memory Recommendations
Core Memory & Scratchpad Design Guidelines
| Aspect | Recommendation | Rationale |
|---|---|---|
| Private Sandbox Scratchpads | Isolate transient experimental scripts, temporary data caches, and test script variants to a private, conversation-locked directory outside the workspace tree CLAIM-203. | Prevents polluting the user's repository version history and avoids triggering project linting or staging tools CLAIM-203. |
| Active Task Re-Injection | Support in-memory or database-backed task lists. Upon context compression/compaction events, format and re-inject active items back into the prompt window CLAIM-199. | Preserves the agent's task state and active plan across history compactions, preventing lost focus CLAIM-199. |
| Completed Task Gating | Filter out completed and cancelled checklist items from the re-injection stream CLAIM-200. | Gating completed work prevents the agent from re-doing already finished sub-tasks CLAIM-200. |
| Workspace Rule Files | Standardize project instructions in CLAUDE.md and custom rules in .claude/rules/*.md files CLAIM-202. | Provides static, version-controlled coding guidelines that are easily parsed on session bootstrap CLAIM-202. |
| Graph-Based Memory Traversal | Utilize open-source Knowledge Graph memory solutions (such as Mem0, Graphiti, or Cognee) for multi-hop personalization CLAIM-205 and temporal fact resolution CLAIM-206. | Resolves contradictory facts dynamically by deprecating stale graph edges and allows reasoning over complex relationship networks CLAIM-206. |
For detailed research covering scratchpad patterns, auto-memory logs, and knowledge graphs, see: agent_scratchpads_and_session_memory.md.
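Active task re-injection with completed-task gating can be sketched as a formatter that runs after each compaction event. The `Task` shape, the status names, and the output format are hypothetical; the source specifies only that active items are re-injected and that completed and cancelled items are filtered out.

```typescript
type TaskStatus = "pending" | "in_progress" | "completed" | "cancelled";

interface Task {
  id: number;
  title: string;
  status: TaskStatus;
}

// Returns the text to re-inject after compaction, or null if nothing is active.
function formatActiveTasks(tasks: Task[]): string | null {
  // Gate out finished work so the agent does not redo it.
  const active = tasks.filter(
    (t) => t.status === "pending" || t.status === "in_progress",
  );
  if (active.length === 0) return null;

  const lines = active.map((t) => {
    const marker = t.status === "in_progress" ? "~" : " ";
    return `- [${marker}] #${t.id} ${t.title}`;
  });
  return ["Active tasks (restored after context compaction):", ...lines].join("\n");
}

const reinjected = formatActiveTasks([
  { id: 1, title: "Set up repo", status: "completed" },
  { id: 2, title: "Write tests", status: "in_progress" },
  { id: 3, title: "Update docs", status: "pending" },
  { id: 4, title: "Old idea", status: "cancelled" },
]);
```

Returning `null` when nothing is active lets the caller skip the injection entirely instead of adding an empty block to the message history.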
Agent Self-Improvement & Curation Recommendations
Core Self-Improvement & Curation Guidelines
| Aspect | Recommendation | Rationale |
|---|---|---|
| Inactivity Curation | Trigger background curation passes when the gateway is idle (e.g., 7 days elapsed, 2 hours user idle) CLAIM-208. Run passes on a dedicated background fork and cheaper auxiliary model (auxiliary.curator) to preserve prompt caches CLAIM-208. | Avoids interrupting active developer loops and controls model execution costs. |
| Telemetry Separation | Record skill views, uses, and patches inside an isolated JSON sidecar file (.usage.json) rather than raw file frontmatter CLAIM-209. | Keeps telemetry out of user-authored code trees and prevents VCS merge conflicts. |
| Lifecycle State Transitions | Automatically transition unused agent-created skills from active -> stale (30 days) -> archived (90 days, moved to .archive/ directory) CLAIM-211. | Eliminates skill catalog rot and prevents token budget leakage during index scans. |
| Umbrella Consolidation | Parse candidate skills to cluster groups, merging overlaps into existing or new class-level umbrella files CLAIM-213, and demoting narrow session bugfixes to references/templates/scripts CLAIM-213, while rewriting relative links to preserve package integrity CLAIM-214. | Restructures micro-skills into structured, high-signal procedural directories. |
| Vulnerability Gating | Run static AST checks and security scanners on newly generated skills before registering them to the harness CLAIM-215. | Mitigates prompt injection attacks injecting arbitrary code execution paths. |
| Pre-Curation Tarballs | Auto-generate tarball snapshots (skills.tar.gz) pre-run under .curator_backups/ alongside a manifest CLAIM-216. | Enables complete developer auditability and rollback of bad curation runs CLAIM-216. |
| Workspace Preference Logs | Extract developer style preferences to .claude/memory.md using a local auto-memory logger CLAIM-204. | Keeps preferences auditable and editable using simple CLI tools CLAIM-204. |
For detailed research covering curation engines, preferences, and RISE/TT-SI loops, see: self_improving_agents_and_learning_loops.md.
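The lifecycle transition rule from the table (active, then stale at 30 days, then archived at 90 days) can be sketched as a pure function over timestamps. One assumption is made explicit here: both thresholds are measured from the last recorded use, which the source does not state. The names `SkillState` and `nextSkillState` are hypothetical.

```typescript
type SkillState = "active" | "stale" | "archived";

const STALE_AFTER_DAYS = 30;
const ARCHIVE_AFTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumption: both thresholds count whole idle days since the last use.
function nextSkillState(lastUsed: Date, now: Date): SkillState {
  const idleDays = Math.floor((now.getTime() - lastUsed.getTime()) / MS_PER_DAY);
  if (idleDays >= ARCHIVE_AFTER_DAYS) return "archived";
  if (idleDays >= STALE_AFTER_DAYS) return "stale";
  return "active";
}

const lastUsed = new Date("2026-01-01T00:00:00Z");
const stateAfterOneMonth = nextSkillState(lastUsed, new Date("2026-01-31T00:00:00Z"));
```

A curator pass would read `lastUsed` from the usage sidecar, call this function, and move a skill into the archive directory only when the state actually changes to `archived`.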
Open Questions for Implementation
- Python or TypeScript for agent core? — Python has ecosystem, TS has full-stack advantage.
- Monolithic or distributed? — Hermes monolith is simpler, OpenClaw gateway scales better.
- Which model backend first? — Ollama for local dev, OpenRouter for quick multi-model access, LiteLLM when you need self-hosted auth/budgets at scale.
- Graph-based or while-loop? — While-loop for v1, graph for v2?
- Which memory providers to support first? — Start with SQLite + one cloud provider?