Embeddings, Rerankers, and Semantic Retrieval Pipelines
A modern AI agent harness must retrieve highly relevant context from large document databases or conversation logs to pack into the prompt. To do this efficiently, agents utilize a multi-stage retrieval pipeline combining Bi-encoders (Embeddings) and Cross-encoders (Rerankers).
graph TD
Query([User Query]) --> Exp[Query Expansion <br/> e.g., qmd-query-expansion-1.7B]
Exp --> Sparse[Sparse Retrieval <br/> BM25 / SQLite FTS5]
Exp --> Dense[Dense Vector Search <br/> embeddinggemma-300M]
Sparse --> Fusion{Hybrid Fusion <br/> Reciprocal Rank Fusion RRF}
Dense --> Fusion
Fusion --> Candidates[Top 30 Candidates]
Candidates --> Reranker[Cross-Encoder Reranker <br/> qwen3-reranker-0.6b]
Reranker --> Blend[Position-Aware Blending]
Blend --> Optimizer[Context Optimizer <br/> deduplication / lost-in-the-middle layout]
Optimizer --> Prompt[LLM Prompt Context]
1. Embeddings (Bi-encoders)
Embeddings represent text chunks as dense, continuous vectors in a high-dimensional vector space. They operate on a Bi-encoder architecture.
1.1 Architecture & Similarity Metrics
- Bi-encoder Design: The query and the documents are encoded independently into vectors by the embedding model. Because the document embeddings can be pre-computed and stored, similarity search reduces to simple vector math at query time.
- Similarity Metrics:
- Cosine Similarity: Measures the cosine of the angle between two vectors ($\cos(\theta) = \frac{A \cdot B}{|A| |B|}$). Best for normalized vectors where magnitude is irrelevant.
- Dot Product: Simple scalar product ($A \cdot B$). Extremely fast to compute, especially on hardware accelerators, but requires vectors of uniform length.
- L2 (Euclidean) Distance: Measures physical distance between endpoints ($|A - B|_2$).
[Query] -----> [Embedding Model] -----> [Query Vector] ──┐
├──> [Cosine Similarity]
[Document] --> [Embedding Model] -----> [Doc Vector] ────┘
1.2 Scaling & Vector Indexes
To search millions of vectors in sub-millisecond times, databases construct index graphs instead of running exhaustive scans ($O(N)$):
- HNSW (Hierarchical Navigable Small World): A multi-layer graph index. Top layers have long-range links for fast routing, while bottom layers contain dense, short-range clusters. Highly accurate, but requires holding the graph in RAM.
- IVF (Inverted File Index): Clusters vectors into voronoi cells. Queries only search vectors inside the nearest cell clusters. Minimizes memory footprints but can lower recall.
1.3 Embedding Model Selection
- API-based: OpenAI
text-embedding-3or Voyage-3 (which offers domain-specific models likevoyage-code-3). - Local Open-Source: Local Gemma or Qwen embeddings. For instance, the
qmdsearch engine in SKILL.md auto-downloadsembeddinggemma-300M-Q8_0(~300MB) for on-device vector generation.
[!TIP] Pros: Highly scalable. Can search millions of documents in milliseconds using Approximate Nearest Neighbors (ANN). Good at capturing semantic relationships. Cons: Poor at matching exact keywords, phone numbers, alphanumeric IDs, or rare spelling variations.
2. Rerankers (Cross-encoders)
Rerankers evaluate the relevance of document chunks against a query using a Cross-encoder architecture.
2.1 Architecture & Computational Complexity
- Cross-encoder Design: The query and a candidate document are fed together into the transformer model. Self-attention runs across both the query tokens and document tokens simultaneously.
- Relevance Scoring: Because the model performs joint attention, it captures high-granularity semantic alignments, syntax matches, and vocabulary overlaps that independent embeddings miss.
- Complexity: Evaluating a single pair requires a full transformer inference pass. For $N$ documents, running a cross-encoder requires $O(N)$ inference steps. Because of this high cost, rerankers cannot scale to search millions of documents directly.
[Query] + [Document] -----> [Reranker Model (Joint Attention)] -----> [Relevance Score (0.0 - 1.0)]
2.2 Reranker Model Selection
- API-based: Cohere
Rerank 4 Pro(high quality),Rerank 4 Fast(latency optimized), or Voyagererank-2.5. - Local Open-Source: local GGUF models. For example,
qmdindexes notes locally and usesqwen3-reranker-0.6b-q8_0(~640MB) to perform local cross-attention.
3. The Multi-Stage Semantic Retrieval Pipeline
To balance speed and quality, production agent systems implement a multi-stage pipeline that uses embeddings for high-throughput filtering, and rerankers for high-precision ordering.
Phase 1: Retrieval (Candidate Generation)
- Dual retrieval: Runs two parallel queries:
- Sparse Retrieval: BM25 keyword matching (e.g., using SQLite FTS5). Excellent for exact matches, code identifiers, and specific names.
- Dense Retrieval: Vector database search using embeddings. Excellent for natural language concepts.
Phase 2: Hybrid Fusion
The sparse and dense result lists are merged.
- Reciprocal Rank Fusion (RRF): Evaluates the rank of document $d$ in the sparse ($R{sparse}$) and dense ($R{dense}$) lists to generate a unified score: $$RRF(d) = \frac{1}{k + R{sparse}(d)} + \frac{1}{k + R{dense}(d)}$$ Where $k$ is a constant (commonly $60$).
- qmd Optimization: The local engine
qmdimplements RRF ($k=60$) and applies a top-rank weight boost: rank #1 gets $+0.05$, and ranks #2-3 get $+0.02$.
Phase 3: Reranking
- The top-k (e.g. top 30) fused candidates are submitted to the Cross-encoder Reranker.
- The reranker computes a precise score between $0.0$ and $1.0$ for each candidate.
Phase 4: Context Optimization & Blending
- Position-Aware Blending: The system blends the initial retrieval rank and the reranking score based on the candidate's rank tier. As documented in the
qmdpipeline:- Ranks 1–3: 75% retrieval weight / 25% reranker weight (trusts initial ranks).
- Ranks 4–10: 60% retrieval weight / 40% reranker weight.
- Ranks 11+: 40% retrieval weight / 60% reranker weight (trusts the reranker to elevate relevant items from the long tail).
- Deduplication & Truncation: Chunks from the same document are merged or pruned to maximize information density.
- Context Packing ("Lost in the Middle"): LLMs pay more attention to tokens at the absolute beginning and end of the prompt context window. The optimizer sorts the final $M$ document chunks in a "U-shape": placing the most relevant items first, the second most relevant last, and the remaining in the middle.
4. Architectural Integration in the Harness
To remain model-agnostic, the agent harness should define modular abstract interfaces for both embeddings and rerankers — the same pluggable-provider idea used across OpenRouter, LangChain, and optional proxies like LiteLLM:
4.1 Embedding Provider Interface
class EmbeddingProvider(ABC):
@abstractmethod
async def embed_query(self, query: str) -> List[float]:
"""Generate a single vector for search."""
pass
@abstractmethod
async def embed_documents(self, docs: List[str]) -> List[List[float]]:
"""Generate multiple vectors for indexing."""
pass
- Configuration: Supports routing to OpenAI (
/v1/embeddings), Voyage, or local llama.cpp endpoints via standard config maps.
4.2 Rerank Provider Interface
class RerankProvider(ABC):
@abstractmethod
async def rerank(
self, query: str, documents: List[str], top_n: int = 10
) -> List[Dict[str, Any]]:
"""Evaluate and score document candidates against the query."""
pass
- Implementation: Maps to Cohere
/v1/rerank, Jina, Voyage, or localqwen-rerankerexecution adapters. Allows the agent to trigger rerank operations natively as a utility function or local tool.