How we use Qdrant, hybrid retrieval, and cross-encoder reranking at recruitRyte

recruitRyte's search engine handles 800M+ candidate profiles with sub-second latency (p95 < 900ms). The architecture: Qdrant vector search with native payload filtering, hybrid retrieval combining dense vectors and sparse text matching, and cross-encoder reranking on top-100 results. Binary quantization compresses the index 32x (2.4TB to 75GB) with less than 2% recall loss. Fine-tuning the bi-encoder on domain-specific data matters more than model size.
At recruitRyte, recruiters need to search through over 800 million candidate profiles. Keyword search fell apart fast — a query like "senior backend engineer with distributed systems experience" returned noise because it couldn't understand what the recruiter actually meant. Traditional full-text search engines like Elasticsearch can match on keywords, but they miss the intent behind a query. A recruiter searching for "someone who's built data pipelines at scale" won't match profiles that say "designed ETL workflows processing 10TB daily" — even though that's exactly what they're looking for.
We needed semantic understanding. Here's how we built it.
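The search system has three layers, each solving a different part of the problem: dense vector search in Qdrant, hybrid retrieval that adds exact payload matching, and cross-encoder reranking of the top results.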

We evaluated several vector databases — Pinecone, Weaviate, and Milvus — before going with Qdrant. The deciding factors were:
Every candidate profile gets converted into a dense embedding using our own fine-tuned bi-encoder model trained specifically for recruitment. We built a golden dataset by using a strong LLM to score query-profile relevance across millions of pairs, then trained our bi-encoder on that data. We don't use an off-the-shelf sentence-transformers model as-is: ours is fine-tuned on recruitment-specific semantics.
The result is a model that understands this domain deeply: it knows that "full-stack developer" is close to "frontend + backend engineer," that "ML engineer" overlaps with "data scientist" in certain contexts, and that "10x engineer" is a personality signal, not a job title.
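One common way to turn LLM relevance scores into bi-encoder training data is to mine (query, positive, negative) triplets per query. The sketch below is only an illustration of that idea: the 0-to-1 score scale, the thresholds, and the `build_triplets` helper are assumptions, not a description of the pipeline we actually ran.

```python
from collections import defaultdict
from itertools import product


def build_triplets(scored_pairs, pos_min=0.8, neg_max=0.3):
    """Group LLM-scored rows into (query, positive, negative) triplets.

    scored_pairs: iterable of (query, profile_text, score), score in [0, 1].
    pos_min / neg_max are hypothetical thresholds; mid-range scores are
    dropped because they give an ambiguous training signal.
    """
    by_query = defaultdict(lambda: {"pos": [], "neg": []})
    for query, profile, score in scored_pairs:
        if score >= pos_min:
            by_query[query]["pos"].append(profile)
        elif score <= neg_max:
            by_query[query]["neg"].append(profile)

    triplets = []
    for query, groups in by_query.items():
        # Every positive is paired with every negative for the same query.
        for pos, neg in product(groups["pos"], groups["neg"]):
            triplets.append((query, pos, neg))
    return triplets


# Made-up rows for illustration only.
rows = [
    ("built data pipelines at scale", "designed ETL workflows processing 10TB daily", 0.92),
    ("built data pipelines at scale", "managed a retail store", 0.05),
    ("built data pipelines at scale", "wrote SQL reports for finance", 0.55),
]
triplets = build_triplets(rows)
```

Triplets in this shape can feed a contrastive or triplet loss; which loss and which base model to use are separate choices that this sketch does not make.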
We deploy these models on Text Embedding Inference (TEI) from Hugging Face. TEI handles batching, dynamic padding, and GPU memory management, keeping embedding generation under 10ms per query even under heavy load. Qdrant's HNSW index then handles similarity search across the full 800M+ vector space.
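As a rough sketch of the embedding step, the snippet below calls TEI's `/embed` route using only the standard library. The base URL, the timeout, and the helper names are assumptions for illustration; the request body (`{"inputs": ...}`) follows TEI's documented API.

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080"  # hypothetical address of the TEI server


def build_embed_request(text, base_url=TEI_URL):
    """Build a POST request for TEI's /embed route."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url=TEI_URL, timeout=1.0):
    """Return the embedding for one query string.

    TEI responds with a list of embeddings, one per input, so a single
    input yields a one-element list.
    """
    request = build_embed_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())[0]
```

Calling `embed("senior backend engineer with distributed systems experience")` against a running TEI instance would return the query vector that is then sent to Qdrant.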
Semantic search handles intent well, but recruiters also need exact matches. "Must have worked at Google" or "needs Python 3.11" — those are keyword problems, not semantic ones. Pure vector search might find someone who worked at a Google-like company or knows Python generally, but the recruiter wanted an exact constraint.
We solve this with Qdrant's text payload fields. Each profile is indexed with nearly 30 payload fields — skills, job titles, company names, locations, seniority levels, certifications, education, and more. When a recruiter needs an exact keyword match, we use Qdrant's payload filtering to match directly against these text fields during the vector search. No separate sparse vector index needed.
The combination works like this:
The query pipeline builds a single request that combines dense similarity with payload text matches and hard filters. Qdrant applies these filters during the HNSW search, not after — which is what makes this fast at 800M+ scale.
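A minimal sketch of such a request, written as the JSON body for Qdrant's `POST /collections/{collection}/points/query` endpoint. The field names (`company_names`, `skills`) and the `build_query` helper are hypothetical, and the `text` match assumes a full-text index exists on that field.

```python
def build_query(dense_vector, must_text=None, must_keyword=None, limit=100):
    """Combine a dense query vector with payload conditions in one request.

    must_text:    {field: phrase} matched with Qdrant's full-text `match.text`.
    must_keyword: {field: value} matched exactly with `match.value`.
    """
    conditions = []
    for field, phrase in (must_text or {}).items():
        conditions.append({"key": field, "match": {"text": phrase}})
    for field, value in (must_keyword or {}).items():
        conditions.append({"key": field, "match": {"value": value}})

    body = {"query": dense_vector, "limit": limit, "with_payload": True}
    if conditions:
        # All `must` conditions have to hold for a point to be returned.
        body["filter"] = {"must": conditions}
    return body


# Placeholder vector; the real one has 768 dimensions.
query_vector = [0.12, -0.07, 0.33]
body = build_query(
    query_vector,
    must_text={"skills": "Python 3.11"},
    must_keyword={"company_names": "Google"},
)
```

Because the filter travels in the same request as the vector, Qdrant can evaluate it while traversing the index instead of discarding results afterwards.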
The top-K results from Qdrant are relevant, but the ordering isn't perfect. Bi-encoders encode queries and documents independently — they're fast but can miss fine-grained relevance signals that only emerge when you look at the query and document together.
We pass the top 100 results through a fine-tuned cross-encoder reranker, also running on TEI. The cross-encoder processes each (query, candidate profile) pair through a single transformer, allowing full cross-attention between every query token and every profile token. This catches nuances that bi-encoders miss — like whether "distributed systems" in the query matches the candidate's specific experience building Kafka pipelines or just a passing mention in a course description.
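The reranking step can be sketched against TEI's `/rerank` route, which takes a query plus a list of texts and returns an index and score for each text. The profile dictionaries, the helper names, and the sample scores below are made up for illustration.

```python
def build_rerank_request(query, profiles):
    """Body for TEI's POST /rerank route: one query, many candidate texts."""
    return {"query": query, "texts": [p["text"] for p in profiles]}


def apply_rerank(profiles, response, top_k=10):
    """Reorder profiles using a /rerank response.

    response: list of {"index": int, "score": float}, where `index` points
    back into the `texts` list that was sent.
    """
    ranked = sorted(response, key=lambda r: r["score"], reverse=True)
    return [
        {**profiles[r["index"]], "rerank_score": r["score"]}
        for r in ranked[:top_k]
    ]


# Made-up candidates and a made-up response, for illustration only.
profiles = [
    {"id": 1, "text": "Took a university course on distributed systems"},
    {"id": 2, "text": "Built and operated Kafka pipelines in production"},
]
fake_response = [{"index": 0, "score": 0.21}, {"index": 1, "score": 0.93}]
reranked = apply_rerank(profiles, fake_response)
```

Sorting on the client side keeps the helper correct regardless of the order in which the server returns scores.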
Why 100 candidates? It's a latency-quality trade-off. The cross-encoder is significantly slower than the bi-encoder — roughly 5-10ms per pair on GPU. At 100 candidates, reranking adds around 500-800ms to the pipeline. We experimented with top-50 and top-200. Top-50 missed too many good candidates that the bi-encoder ranked low. Top-200 pushed latency past the one-second threshold without meaningfully improving the top-10 results.
The reranker assigns a final relevancy score, and the best matches float to the top. For more on when to use bi-encoders vs cross-encoders, see the SBERT documentation on cross-encoders.

Getting this to work at nearly a billion profiles took real engineering work. Here's what made it feasible:
The single biggest unlock was binary quantization. Standard float32 vectors at 768 dimensions take 3KB each. At 800M profiles, that's roughly 2.4TB just for the vectors — far too much for in-memory search.
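The arithmetic behind those figures, using decimal units (1 TB = 10^12 bytes) and counting raw vectors only, with no index or payload overhead:

```python
DIMS = 768
PROFILES = 800_000_000
BYTES_PER_FLOAT32 = 4

# One full-precision vector: 768 dimensions * 4 bytes each.
float32_bytes_per_vector = DIMS * BYTES_PER_FLOAT32

# Raw vector storage for the whole corpus, in terabytes.
total_float32_tb = float32_bytes_per_vector * PROFILES / 1e12

print(f"{float32_bytes_per_vector} bytes per vector")
print(f"{total_float32_tb:.2f} TB for {PROFILES:,} vectors")
```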
Binary quantization converts each float32 dimension to a single bit, reducing storage by up to 32x. As documented in Qdrant's binary quantization guide, this technique can also deliver up to 40x retrieval speed gains through native bitwise CPU operations. Our 2.4TB index compresses to roughly 75GB, which fits in memory on a single high-memory instance. The recall trade-off is minimal: we measured less than 2% recall loss on our benchmark dataset, and the cross-encoder reranker compensates for any ordering differences in the final results.
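To make the idea concrete, here is a toy pure-Python sketch of binary quantization: keep one sign bit per dimension, shortlist by Hamming distance, then rescore the shortlist with the original vectors. It illustrates the technique, not Qdrant's internals, and the oversampling-plus-rescoring step is an assumption about how such an index is typically queried.

```python
def binarize(vector):
    """Pack sign bits into an int: bit i is set when vector[i] > 0."""
    bits = 0
    for i, x in enumerate(vector):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits, computed with XOR plus a popcount."""
    return bin(a ^ b).count("1")


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def search(query, vectors, codes, limit=2, oversampling=2):
    """Two-stage search: cheap Hamming shortlist, then exact rescoring."""
    q_code = binarize(query)
    shortlist = sorted(
        range(len(codes)), key=lambda i: hamming(q_code, codes[i])
    )[: limit * oversampling]
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:limit]


# Tiny 4-dimensional example; real vectors have 768 dimensions.
vectors = [
    [0.9, 0.1, -0.2, 0.4],
    [-0.5, 0.3, 0.8, -0.1],
    [0.7, 0.2, -0.1, 0.5],
    [-0.9, -0.1, 0.2, -0.4],
]
codes = [binarize(v) for v in vectors]
query = [0.8, 0.2, -0.3, 0.3]
top = search(query, vectors, codes)

# Storage per 768-dim vector: 3072 bytes as float32 vs 96 bytes as bits.
compression_ratio = (768 * 4) / (768 // 8)
```

The Hamming stage touches only the compact codes, which is where the memory and speed savings come from; the full-precision vectors are needed only for the short rescoring list.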
Qdrant runs in a clustered setup across multiple nodes:
A recruiter types a natural language query and gets ranked results across 800M+ candidate profiles in under a second. The p95 latency for the full pipeline — embedding, vector search, hybrid scoring, and cross-encoder reranking — stays under 900ms.
The stack works because each layer handles what it's good at: Qdrant does fast approximate retrieval, hybrid scoring balances precision and recall, and the cross-encoder provides the final quality layer that makes recruiters trust the results.
Building a search system for 800M+ profiles taught us a few things that don't show up in tutorials:
Fine-tuning matters more than model size. Our fine-tuned bi-encoder outperforms general-purpose models with significantly more parameters on recruitment queries. The key was the training data — we used a strong LLM to generate a golden dataset of query-profile relevance scores, giving us high-quality supervision at scale. Domain-specific training data beats architecture size every time for narrow tasks.
Overscoring is real. The cross-encoder reranker solves retrieval quality, but it introduced a new problem: overscoring. Some profiles with dense keyword lists but shallow experience would score artificially high because the reranker sees many surface-level matches. We addressed this by including profile completeness and experience depth as additional features in the final scoring function, not just the reranker output.
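A simple form such a scoring function could take is a weighted blend. The weights, the 0-to-1 feature scaling, and the `final_score` name below are illustrative assumptions, not our production values.

```python
def final_score(rerank_score, completeness, experience_depth,
                weights=(0.7, 0.1, 0.2)):
    """Blend the reranker output with profile-quality features.

    All inputs are assumed to be scaled to [0, 1]. The weights are
    hypothetical and would need tuning against rated data.
    """
    w_rerank, w_complete, w_depth = weights
    return (
        w_rerank * rerank_score
        + w_complete * completeness
        + w_depth * experience_depth
    )


# A keyword-heavy but shallow profile vs. a slightly lower-scored deep one.
keyword_stuffed = final_score(0.95, completeness=0.4, experience_depth=0.2)
deep_experience = final_score(0.85, completeness=0.9, experience_depth=0.9)
```

With these example numbers the deeper profile ends up ahead even though the reranker alone preferred the keyword-heavy one.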
Monitor recall, not just latency. It's tempting to optimize purely for speed, and binary quantization does sacrifice some recall. We run weekly recall benchmarks against a second golden dataset of 5,000 query-profile pairs, this one rated manually by recruiters rather than scored by an LLM. This catches drift before it affects user experience. So far, we've maintained above 95% recall@100 against the full-precision baseline.
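Recall@k against a full-precision baseline can be computed per query as the overlap between the two top-k result lists. The sketch below assumes each run yields ordered lists of profile IDs; the function names and the sample IDs are illustrative.

```python
def recall_at_k(baseline_ids, candidate_ids, k=100):
    """Fraction of the baseline's top-k that the candidate's top-k also returns."""
    truth = set(baseline_ids[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(truth & set(candidate_ids[:k])) / len(truth)


def mean_recall_at_k(runs, k=100):
    """Average recall over (baseline_ids, candidate_ids) pairs, one per query."""
    scores = [recall_at_k(base, cand, k) for base, cand in runs]
    return sum(scores) / len(scores)


# Made-up IDs: the quantized index misses one of four baseline results.
example = recall_at_k([1, 2, 3, 4], [1, 2, 5, 4], k=4)
```

Tracking the mean over a fixed query set each week makes regressions visible as a single number that can be alerted on.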
Payload indexing is underrated. We index nearly 30 fields per profile as payloads — everything from skills and job titles to education, certifications, and company history. At our scale, the difference between a well-indexed payload filter and a naive one is the difference between 50ms and 5 seconds. Indexing these fields as keyword payloads in Qdrant allows them to participate in the HNSW search rather than being applied as a post-filter. Getting the payload schema right was one of the highest-leverage decisions in the whole architecture.
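Payload indexes are created per field through Qdrant's `PUT /collections/{collection}/index` endpoint. The sketch below builds those request bodies; the field names and the choice of which fields get a full-text index are assumptions for illustration.

```python
def keyword_index(field):
    """Exact-match index, suited to skills, titles, and company names."""
    return {"field_name": field, "field_schema": "keyword"}


def text_index(field):
    """Full-text index, needed before `match.text` filters work on a field."""
    return {
        "field_name": field,
        "field_schema": {"type": "text", "tokenizer": "word", "lowercase": True},
    }


# Hypothetical subset of the payload schema.
index_requests = [
    keyword_index("skills"),
    keyword_index("company_names"),
    keyword_index("seniority"),
    text_index("experience_summary"),
]
```

Each body is sent as its own request, so indexes can be added to an existing collection one field at a time.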
We're working on a few improvements to the search pipeline:
The core architecture — Qdrant, hybrid retrieval, cross-encoder reranking — has held up well as we've scaled from 100M to 800M+ profiles. The search quality that comes from this three-layer approach is what keeps recruiters using the product daily.


AI Consultant. 9+ years building production AI. Previously Chief Data Scientist at recruitRyte. IIT Dhanbad.