800M-Profile Vector Search at sub-second latency

Client: recruitRyte
Role: Chief Data Scientist
Period: July 2023 to May 2026
Stack: Qdrant, hybrid retrieval, cross-encoder reranking, fine-tuned embeddings, binary quantization, FastAPI, vLLM, Python

The problem

recruitRyte runs a recruiter-facing search product. The promise is simple: type a natural-language hiring query ("senior backend engineers in Berlin with FastAPI plus Postgres plus 5+ years"), get the right candidates back in under a second.

The hard part is the scale. The candidate index is 800M+ profiles. Naive cosine similarity over 800M dense vectors does not work. Off-the-shelf hosted vector DBs do not work at this price point. And the relevance bar is not "good enough for retrieval", it is "good enough for a recruiter to pay for".

So three constraints had to coexist: 800M-record scale, sub-second p95 latency, and recall good enough to win against keyword-only Boolean search.

What I built

A three-stage retrieval system tuned for the specific shape of recruiter queries.

Stage 1: hybrid candidate retrieval. Each profile is encoded into a dense vector (semantic meaning of skills, experience, seniority) and indexed in Qdrant. Alongside the dense vector, structured payload fields (location, years of experience, current role, tech stack) are stored on the same record. Qdrant runs payload-filtered HNSW so the filters apply DURING the search rather than after, which is what makes the latency budget work. A keyword search step runs in parallel on critical exact-match fields. The two retrieval paths get fused.

Stage 2: cross-encoder reranking. The top 100 from stage 1 go through a cross-encoder model that compares each profile against the original query with full bidirectional attention. This is roughly 1000x slower per pair than the bi-encoder used in stage 1, but on only 100 candidates the latency is acceptable. The reranker is what makes the difference between "this person is in tech" and "this person has shipped the exact stack you asked for".

Stage 3: structured re-ordering. Final ordering blends the reranker score with recruiter-relevant signals (recency of role change, response rate to past outreach, fit confidence). This is the boring layer that most demo systems skip, and it is where the production product actually wins.

The engineering work that made it ship

Binary quantization on the dense vectors. Raw f32 embeddings would be 2.4 TB of memory. Binary quantization compresses to roughly 75 GB at under 2% recall loss. Without this the infrastructure cost would have killed the unit economics.

Fine-tuned embeddings. Off-the-shelf embedding models score recruiter queries poorly because the domain language is so specific ("ex-FAANG", "10x", "into game design", "shipped at YC backed"). I fine-tuned a smaller embedding model on millions of triplet examples mined from recruiter-candidate matches and rejections. The fine-tuned model produces tighter clusters in the hiring-domain space, which is what made the dense retrieval stage usable.

Cross-encoder training. Same idea applied at the reranker layer. A general-purpose cross-encoder treats every text-text pair the same. A domain-tuned one understands that "senior" plus "5 years" plus "FastAPI" is a stronger match signal than the words individually would suggest.

Latency engineering. p95 under 900 ms across 800M profiles required Qdrant configuration tuning (HNSW M and ef_construct), payload-index ordering by selectivity, and aggressive use of Qdrant filters as pre-conditions rather than post-filters. The cross-encoder runs on a separate GPU-backed service via vLLM so the API path stays uncoupled from model inference latency.

The outcome

Metric	Result
Index size	800M+ candidate profiles
Storage (after quantization)	~75 GB (compressed from 2.4 TB)
p95 search latency	under 900 ms
Recall loss from quantization	under 2%
Daily new profiles processed	5M+
Recruiter NPS	significant lift vs the prior Boolean-only system

What this case is proof of

This is the build that anchors every Card 2 (Custom AI Agents + RAG) engagement. If you have a problem that looks like "retrieve from a huge corpus with semantic intent and structured filters at production latency", this is the architecture I will design for you. The exact same pattern works for legal-document search, internal knowledge bases, support-ticket triage, customer-feedback analysis, and a dozen other use cases.

The credential matters more than the slide-deck claim. Most "RAG developers" have shipped a demo over 10,000 documents. Shipping at 800M is a different category of engineering: you cannot brute-force it, you cannot just use the latest framework, you have to actually understand what is happening at every layer.

What I would do for you

A scoping engagement defining your retrieval goals, eval metrics, and corpus shape. Then an audit (vector DB choice, embedding strategy, reranker need, evaluation harness). Then a 30 to 60 day build to ship the first production system with eval-gated releases. Then optional ongoing retainer for optimization and new feature builds.

Book a 30-min discovery call if your retrieval problem is bigger than what an off-the-shelf RAG framework can handle. Calendly link in the header.