How does hybrid search with dense vectors and payload text matching work?

Hybrid search combines dense vector embeddings (capturing semantic meaning) with text payload fields for exact keyword matching. Qdrant applies payload filters during the HNSW search rather than after, enabling sub-second latency at 800M+ scale.

How do you search 800M+ profiles in under a second?

The architecture uses Qdrant vector search with binary quantization (compressing 2.4TB to 75GB with less than 2% recall loss), hybrid retrieval combining dense semantics with payload constraints, and cross-encoder reranking on top-100 results. The full pipeline achieves p95 latency under 900ms.

What is binary quantization and how much memory does it save?

Binary quantization compresses each high-dim float32 vector into a 1-bit-per-dim bitmap, cutting memory roughly 32x. For 800M profiles this dropped the index from 2.4TB to 75GB. Recall loss stays under 2% when paired with float rescoring on the top-K candidates.

How does Qdrant handle payload filtering at scale?

Qdrant applies payload filters during HNSW graph traversal rather than as a post-filter. Latency stays bounded even when filters select less than 1% of the corpus — critical for recruitment queries like '5+ years experience AND located in Bangalore' where post-filtering would scan too many vectors.

What infrastructure did the 800M-profile system run on?

A sharded Qdrant cluster with vectors split across nodes. A coordinator merges per-shard top-K results. Hot data sits in-memory; cold data on NVMe SSD with prefetching. Cross-encoder reranking runs on a separate GPU pool decoupled from the vector search nodes.

I Built a Search Engine for 800M+ Candidate Profiles

How we use Qdrant, hybrid retrieval, and cross-encoder reranking at recruitRyte

TL;DR

recruitRyte's search engine handles 800M+ candidate profiles with sub-second latency (p95 < 900ms). The architecture: Qdrant vector search with native payload filtering, hybrid retrieval combining dense vectors with text payload matching, and cross-encoder reranking on top-100 results. Binary quantization compresses the index 32x (2.4TB to 75GB) with less than 2% recall loss. Fine-tuning the bi-encoder on domain-specific data matters more than model size.

Key Takeaways

Keyword search fails for recruitment — natural language queries like "senior backend engineer with distributed systems experience" need semantic understanding, not string matching
Hybrid retrieval combines the best of both worlds — dense vectors capture meaning, text payload filters handle exact constraints like location, skills, and company names
Cross-encoder reranking is the quality layer — bi-encoders are fast but approximate; a cross-encoder that scores full query-profile pairs pushes the best matches to the top
Binary quantization makes billion-scale feasible — compressing float32 vectors to binary cuts memory by up to 32x with minimal recall loss, keeping the full index in RAM
The stack: Qdrant for vector search, Text Embeddings Inference (TEI) for model serving, AWS for infrastructure

At recruitRyte, recruiters need to search through over 800 million candidate profiles. Keyword search fell apart fast — a query like "senior backend engineer with distributed systems experience" returned noise because it couldn't understand what the recruiter actually meant. Traditional full-text search engines like Elasticsearch can match on keywords, but they miss the intent behind a query. A recruiter searching for "someone who's built data pipelines at scale" won't match profiles that say "designed ETL workflows processing 10TB daily" — even though that's exactly what they're looking for.

We needed semantic understanding. Here's how we built it.

The search pipeline

The search system has three layers, each solving a different part of the problem:

Vector search with Qdrant — fast approximate nearest neighbor search across 800M+ embeddings
Hybrid retrieval — combining semantic similarity with text payload matching and metadata filters
Cross-encoder reranking — a precision layer that re-scores the top candidates using full query-profile attention

Vector search with Qdrant

We evaluated several vector databases — Pinecone, Weaviate, and Milvus — before going with Qdrant. The deciding factors were:

Native payload filtering — Qdrant filters on metadata during the HNSW search (the hierarchical graph algorithm introduced by Malkov & Yashunin, 2016), not after. This is critical when a recruiter says "backend engineers in San Francisco with 5+ years experience." Post-filtering at 800M scale would return too few results or be too slow
Performance at scale — Qdrant's benchmarks show strong latency characteristics at high vector counts
Self-hosted deployment: we needed to run this on our own AWS infrastructure for data compliance reasons
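
To make the payload-filtering point concrete, here is a toy sketch of why post-filtering can return too few results. It uses brute-force sorting over a made-up corpus instead of an HNSW graph, and the `similarity` and `city` fields are hypothetical; it only illustrates the effect, not how Qdrant implements filtering internally.

```python
# Toy corpus: 10,000 profiles, of which exactly 1% are in "SF".
# Similarity is assigned deterministically so the example is reproducible.
profiles = [
    {"id": i, "similarity": 1.0 - i / 10_000, "city": "SF" if i % 100 == 0 else "other"}
    for i in range(10_000)
]

def post_filter(items, k, city):
    """Take the global top-k by similarity first, then drop non-matching profiles."""
    top_k = sorted(items, key=lambda p: p["similarity"], reverse=True)[:k]
    return [p for p in top_k if p["city"] == city]

def filtered_search(items, k, city):
    """Apply the filter as part of the search, so up to k matching profiles come back."""
    matching = (p for p in items if p["city"] == city)
    return sorted(matching, key=lambda p: p["similarity"], reverse=True)[:k]

post = post_filter(profiles, 100, "SF")
during = filtered_search(profiles, 100, "SF")
print(len(post), len(during))
```

With a filter that selects 1% of the corpus, the post-filtered top-100 keeps a single profile, while the filtered search still returns 100 matches. Fetching a much larger top-K to compensate is where the "too slow" half of the trade-off comes from.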

Every candidate profile gets converted into a dense embedding using our own fine-tuned bi-encoder model trained specifically for recruitment. We built a golden dataset by using a strong LLM to score query-profile relevance across millions of pairs, then trained our bi-encoder on that data. We did not use an off-the-shelf sentence-transformers model as-is; the bi-encoder was fine-tuned on recruitment-specific semantics.

The result is a model that understands this domain deeply: it knows that "full-stack developer" is close to "frontend + backend engineer," that "ML engineer" overlaps with "data scientist" in certain contexts, and that "10x engineer" is a personality signal, not a job title.

We deploy these models on Text Embeddings Inference (TEI) from Hugging Face. TEI handles batching, dynamic padding, and GPU memory management, keeping embedding generation under 10ms per query even under heavy load. Qdrant's HNSW index then handles similarity search across the full 800M+ vector space.

Hybrid retrieval: semantic + payload filtering

Semantic search handles intent well, but recruiters also need exact matches. "Must have worked at Google" or "needs Python 3.11" — those are keyword problems, not semantic ones. Pure vector search might find someone who worked at a Google-like company or knows Python generally, but the recruiter wanted an exact constraint.

We solve this with Qdrant's text payload fields. Each profile is indexed with nearly 30 payload fields — skills, job titles, company names, locations, seniority levels, certifications, education, and more. When a recruiter needs an exact keyword match, we use Qdrant's payload filtering to match directly against these text fields during the vector search. No separate sparse vector index needed.

The combination works like this:

Dense vectors for semantic meaning — capturing what a recruiter means, not just what they typed
Text payload matching for keyword constraints — exact matching on skill names, company names, and certifications using indexed text fields
Payload filters for hard constraints — location, years of experience, seniority level, visa status

The query pipeline builds a single request that combines dense similarity with payload text matches and hard filters. Qdrant applies these filters during the HNSW search, not after — which is what makes this fast at 800M+ scale.
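
As a rough sketch of what such a single request might look like, the snippet below builds a body in the shape of Qdrant's REST search endpoint (`POST /collections/{collection_name}/points/search`). The payload field names (`skills`, `location`, `years_experience`) and the helper `build_search_body` are assumptions for illustration, not the actual recruitRyte schema.

```python
import json

def build_search_body(query_vector, required_skill, city, min_years, limit=100):
    """Build a search body that pairs a dense vector with payload conditions."""
    return {
        "vector": query_vector,
        "filter": {
            "must": [
                # Full-text match against a text-indexed payload field.
                {"key": "skills", "match": {"text": required_skill}},
                # Exact value match on a keyword-indexed payload field.
                {"key": "location", "match": {"value": city}},
                # Numeric range condition for a hard constraint.
                {"key": "years_experience", "range": {"gte": min_years}},
            ]
        },
        "limit": limit,
        "with_payload": True,
    }

# A 3-dimensional vector stands in for a real 768-dimensional embedding.
body = build_search_body([0.1, 0.2, 0.3], "python", "Bangalore", 5)
print(json.dumps(body, indent=2))
```

All three conditions sit under `must`, so a profile has to satisfy every one of them to be returned, and the dense vector only ranks the profiles that pass.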

Cross-encoder reranking

The top-K results from Qdrant are relevant, but the ordering isn't perfect. Bi-encoders encode queries and documents independently — they're fast but can miss fine-grained relevance signals that only emerge when you look at the query and document together.

We pass the top 100 results through a fine-tuned cross-encoder reranker, also running on TEI. The cross-encoder processes each (query, candidate profile) pair through a single transformer, allowing full cross-attention between every query token and every profile token. This catches nuances that bi-encoders miss — like whether "distributed systems" in the query matches the candidate's specific experience building Kafka pipelines or just a passing mention in a course description.
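
The reranking step can be sketched as a function that scores every (query, profile) pair and sorts by that score. The scorer below is a toy token-overlap function standing in for the cross-encoder, and the profile texts are invented; in the real pipeline the pair scores would come from the fine-tuned model served on TEI.

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[Tuple[str, str]],
           score_pair: Callable[[str, str], float],
           top_n: int = 10) -> List[Tuple[str, float]]:
    """Score each (query, profile_text) pair jointly and return the best top_n."""
    scored = [(pid, score_pair(query, text)) for pid, text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the profile text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)

candidates = [
    ("p1", "took a course on distributed systems"),
    ("p2", "built kafka pipelines for distributed systems at scale"),
    ("p3", "frontend developer"),
]
ranked = rerank("distributed systems kafka", candidates, toy_overlap_score, top_n=2)
print(ranked)
```

The key structural point is that the scorer sees the query and the profile together, which is what separates this stage from the bi-encoder retrieval before it.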

Why 100 candidates? It's a latency-quality trade-off. The cross-encoder is significantly slower than the bi-encoder — roughly 5-10ms per pair on GPU. At 100 candidates, reranking adds around 500-800ms to the pipeline. We experimented with top-50 and top-200. Top-50 missed too many good candidates that the bi-encoder ranked low. Top-200 pushed latency past the one-second threshold without meaningfully improving the top-10 results.
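
A back-of-the-envelope check of that trade-off, assuming pairs are scored one at a time at the quoted 5-10ms each (a simplification, since batching on GPU usually amortizes some of that cost):

```python
def rerank_latency_ms(num_candidates: int, per_pair_ms: float) -> float:
    """Sequential estimate: every pair costs the same and nothing is batched."""
    return num_candidates * per_pair_ms

estimates = {
    k: (rerank_latency_ms(k, 5), rerank_latency_ms(k, 10))
    for k in (50, 100, 200)
}
for k, (low, high) in estimates.items():
    print(f"top-{k}: {low:.0f}-{high:.0f} ms")
```

Under this simple model, top-100 lands between 500ms and 1,000ms, which brackets the 500-800ms we observe, and top-200 starts at a full second before any other stage runs.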

The reranker assigns a final relevancy score, and the best matches float to the top. For more on when to use bi-encoders vs cross-encoders, see the SBERT documentation on cross-encoders.

Scaling to 800M+

Getting this to work at nearly a billion profiles took real engineering work. Here's what made it feasible:

Binary quantization

The single biggest unlock was binary quantization. Standard float32 vectors at 768 dimensions take 3KB each. At 800M profiles, that's roughly 2.4TB just for the vectors — far too much for in-memory search.
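
The arithmetic behind those numbers, with the 1-bit-per-dimension footprint added for comparison:

```python
DIMS = 768
NUM_PROFILES = 800_000_000

float32_bytes_per_vector = DIMS * 4   # 4 bytes per float32 dimension
binary_bytes_per_vector = DIMS // 8   # 1 bit per dimension, 8 bits per byte

float32_total = float32_bytes_per_vector * NUM_PROFILES
binary_total = binary_bytes_per_vector * NUM_PROFILES

print(f"float32: {float32_total / 1e12:.2f} TB")
print(f"binary:  {binary_total / 1e9:.1f} GB ({binary_total / 2**30:.1f} GiB)")
print(f"ratio:   {float32_total // binary_total}x")
```

The exact figures are about 2.46TB and 76.8GB (roughly 71.5 GiB); dividing the rounded 2.4TB by 32 gives the 75GB quoted throughout this post. These numbers cover raw vector storage only and ignore the HNSW graph and payload overhead.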

Binary quantization converts each float32 dimension to a single bit, reducing storage by up to 32x. As documented in Qdrant's binary quantization guide, this technique can also deliver up to 40x retrieval speed gains through native bitwise CPU operations. Our 2.4TB index compresses to under 75GB, which fits in memory on a single high-memory instance. The recall trade-off is minimal — we measured less than 2% recall loss on our benchmark dataset, and the cross-encoder reranker compensates for any ordering differences in the final results.
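
A minimal pure-Python sketch of the idea, assuming sign-based binarization, Hamming distance for the shortlist, and a float dot product for rescoring. The `oversampling` parameter and the tiny 4-dimensional vectors are illustrative assumptions; Qdrant's own implementation is far more optimized and is not reproduced here.

```python
from typing import List

def binarize(vector: List[float]) -> int:
    """Pack the sign of each dimension into one bit of a Python int."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Count differing bits; a lower count means more similar codes."""
    return bin(a ^ b).count("1")

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, codes, k, oversampling=2):
    """Shortlist by Hamming distance on binary codes, then rescore with floats."""
    query_code = binarize(query)
    shortlist_size = min(len(vectors), k * oversampling)
    by_hamming = sorted(range(len(vectors)), key=lambda i: hamming(query_code, codes[i]))
    shortlist = by_hamming[:shortlist_size]
    rescored = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:k]

vectors = [
    [0.9, 0.1, -0.2, 0.4],
    [0.1, 0.9, -0.8, 0.1],
    [-0.5, -0.4, 0.3, -0.9],
    [0.8, 0.2, -0.1, 0.5],
]
codes = [binarize(v) for v in vectors]
query = [1.0, 0.1, -0.1, 0.5]

print(search(query, vectors, codes, k=2, oversampling=1))
print(search(query, vectors, codes, k=2, oversampling=2))
```

In this toy data three vectors share the same binary code, so a shortlist of exactly `k` cannot tell them apart and misses the second-best float match. Doubling the shortlist before rescoring recovers it, which is the intuition behind pairing binary quantization with float rescoring.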

Clustered infrastructure

Qdrant runs in a clustered setup across multiple nodes:

Sharding — the 800M+ vector collection is split across shards distributed over multiple machines
Replication — each shard has replicas for fault tolerance and read throughput
AWS Auto Scaling Groups — nodes scale horizontally as data volume and query traffic grow
GPU inference via TEI — embedding generation and reranking run on GPU instances for the throughput we need in real-time
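
With the collection sharded, a query fans out to every shard and the per-shard top-K lists are merged into one global top-K. Qdrant does this internally; the snippet below is only a conceptual sketch of that merge, with made-up scores and ids.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (similarity score, profile id)

def merge_shard_results(shard_results: List[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard hit lists into a single global top-k by score."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

shard_results = [
    [(0.92, "a1"), (0.80, "a2")],
    [(0.95, "b1"), (0.70, "b2")],
    [(0.85, "c1")],
]
merged = merge_shard_results(shard_results, k=3)
print(merged)
```

Each shard only needs to return its own top-K, because no profile outside a shard's local top-K can appear in the global top-K.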

The result

A recruiter types a natural language query and gets ranked results across 800M+ candidate profiles in under a second. The p95 latency for the full pipeline — embedding, vector search, hybrid scoring, and cross-encoder reranking — stays under 900ms.

The stack works because each layer handles what it's good at: Qdrant does fast approximate retrieval, hybrid scoring balances precision and recall, and the cross-encoder provides the final quality layer that makes recruiters trust the results.

Lessons from building at this scale

Building a search system for 800M+ profiles taught us a few things that don't show up in tutorials:

Fine-tuning matters more than model size. Our fine-tuned bi-encoder outperforms general-purpose models with significantly more parameters on recruitment queries. The key was the training data — we used a strong LLM to generate a golden dataset of query-profile relevance scores, giving us high-quality supervision at scale. Domain-specific training data beats architecture size every time for narrow tasks.

Overscoring is real. The cross-encoder reranker solves retrieval quality, but it introduced a new problem: overscoring. Some profiles with dense keyword lists but shallow experience would score artificially high because the reranker sees many surface-level matches. We addressed this by including profile completeness and experience depth as additional features in the final scoring function, not just the reranker output.
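
One simple way to picture this is a weighted blend of the reranker score with the extra features. The weights, feature ranges, and example values below are hypothetical and chosen only to show the effect; they are not our production scoring function.

```python
def final_score(rerank_score: float,
                completeness: float,
                experience_depth: float,
                w_rerank: float = 0.7,
                w_complete: float = 0.1,
                w_depth: float = 0.2) -> float:
    """Blend the reranker output with profile-quality features (all in [0, 1])."""
    return (w_rerank * rerank_score
            + w_complete * completeness
            + w_depth * experience_depth)

# A keyword-heavy but shallow profile versus a slightly lower-scored, deeper one.
keyword_stuffed = final_score(rerank_score=0.95, completeness=0.4, experience_depth=0.2)
solid_profile = final_score(rerank_score=0.88, completeness=0.9, experience_depth=0.9)
print(round(keyword_stuffed, 3), round(solid_profile, 3))
```

With these illustrative weights the deeper profile overtakes the keyword-stuffed one even though its raw reranker score is lower.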

Monitor recall, not just latency. It's tempting to optimize purely for speed — and binary quantization does sacrifice some recall. We run weekly recall benchmarks against a golden dataset of 5,000 query-profile pairs where recruiters manually rated relevance. This catches drift before it affects user experience. So far, we've maintained above 95% recall@100 against the full-precision baseline.
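
For reference, recall@k against a full-precision baseline can be computed as below. The function names and the tiny id lists are illustrative; a real benchmark would run this over every query in the golden dataset.

```python
from typing import List, Sequence, Tuple

def recall_at_k(baseline_ids: Sequence[int], candidate_ids: Sequence[int], k: int) -> float:
    """Fraction of the baseline top-k that also appears in the candidate top-k."""
    baseline = set(baseline_ids[:k])
    if not baseline:
        return 0.0
    return len(baseline & set(candidate_ids[:k])) / len(baseline)

def mean_recall_at_k(runs: List[Tuple[Sequence[int], Sequence[int]]], k: int) -> float:
    """Average recall@k over (baseline, candidate) result pairs, one per query."""
    return sum(recall_at_k(b, c, k) for b, c in runs) / len(runs)

runs = [
    ([1, 2, 3, 4], [1, 2, 4, 9]),   # one baseline hit missing
    ([5, 6, 7, 8], [8, 7, 6, 5]),   # same set, different order
]
print(mean_recall_at_k(runs, k=4))
```

Recall@k ignores ordering inside the top-k, which fits this use case: the cross-encoder reorders the shortlist anyway, so what matters is whether the right profiles make it into the shortlist at all.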

Payload indexing is underrated. We index nearly 30 fields per profile as payloads — everything from skills and job titles to education, certifications, and company history. At our scale, the difference between a well-indexed payload filter and a naive one is the difference between 50ms and 5 seconds. Indexing these fields as keyword payloads in Qdrant allows them to participate in the HNSW search rather than being applied as a post-filter. Getting the payload schema right was one of the highest-leverage decisions in the whole architecture.
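
As an illustration, payload indexes in Qdrant are created per field through `PUT /collections/{collection_name}/index`. The snippet below builds request bodies in that shape; the field names and the choice of schema per field are assumptions, not our actual 30-field schema.

```python
def keyword_index(field: str) -> dict:
    """Index body for exact-match fields such as locations or company names."""
    return {"field_name": field, "field_schema": "keyword"}

def integer_index(field: str) -> dict:
    """Index body for numeric fields used in range filters."""
    return {"field_name": field, "field_schema": "integer"}

def text_index(field: str) -> dict:
    """Index body for tokenized full-text matching on free-form fields."""
    return {
        "field_name": field,
        "field_schema": {"type": "text", "tokenizer": "word", "lowercase": True},
    }

# Hypothetical subset of a profile schema.
index_requests = [
    keyword_index("location"),
    keyword_index("company"),
    integer_index("years_experience"),
    text_index("skills"),
]
for request in index_requests:
    print(request)
```

Each body would be sent once per field when the collection is set up, so the filterable fields are indexed before queries start relying on them.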

What's next

We're working on a few improvements to the search pipeline:

Query understanding with LLMs — using a small language model to parse recruiter queries into structured intents before searching, separating semantic requirements from hard filters automatically
Personalized ranking — incorporating recruiter-specific signals (past hiring patterns, role preferences) into the reranking step
Real-time index updates — moving from batch indexing to streaming updates so new profiles are searchable within minutes of being added

The core architecture — Qdrant, hybrid retrieval, cross-encoder reranking — has held up well as we've scaled from 100M to 800M+ profiles. The search quality that comes from this three-layer approach is what keeps recruiters using the product daily.