Stop Benchmarking Embedding Models for Search Quality
By Nino, Senior Tech Editor
In the world of AI development, we are currently obsessed with leaderboards. Every few weeks, a new 'state-of-the-art' embedding model is released, claiming to top the MTEB (Massive Text Embedding Benchmark) and promising better retrieval for your RAG (Retrieval-Augmented Generation) pipelines. Founders and engineers frequently ask: "Should we switch from OpenAI text-embedding-3-small to Voyage-3?" or "Is Gemini-embedding-001 good enough for production?"
As the CTO at Vaultt (formerly StudentVenture), a recruitment marketplace for the top 1% of non-traditional talent, I have spent the last year running semantic matching in production across more than 10,000 complex candidate profiles. Our findings suggest that the obsession with the model itself is largely misplaced. If you want to improve your search quality, you shouldn't be looking at the leaderboard; you should be looking at your data pipeline.
The Experiment: Model Delta vs. Data Delta
To prove this, we ran a controlled experiment on our candidate corpus. We used a consistent evaluation set consisting of real recruiter queries—not synthetic data, but the actual, messy text that recruiters type when looking for talent. We measured success based on a simple metric: Did the top 10 results contain the candidates the recruiter actually ended up interviewing?
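That success criterion is easy to pin down in code. Here is a minimal sketch of the hit-rate metric described above; the function names and data shapes are illustrative, not our production harness:

```python
def hit_at_k(retrieved_ids, interviewed_ids, k=10):
    """True if any candidate the recruiter actually interviewed
    appears in the top-k retrieved results."""
    return any(cid in set(interviewed_ids) for cid in retrieved_ids[:k])

def eval_score(queries):
    """Fraction of queries whose top-10 contains an interviewed candidate.
    `queries` is a list of (retrieved_ids, interviewed_ids) pairs."""
    hits = sum(hit_at_k(retrieved, interviewed) for retrieved, interviewed in queries)
    return hits / len(queries)
```

A metric this simple is deliberately blunt: it measures whether search surfaced the people who mattered, not whether the cosine scores looked pretty.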
First, we tested the 'Model Variable.' We kept the data constant (raw profile JSON) and swapped five different embedding models:
- An open-source local model (BGE-small)
- Google gemini-embedding-001
- OpenAI text-embedding-3-large
- Voyage voyage-3.5-large
- A specialized on-prem option
The difference between the best-performing model and the worst was a mere 7 points. In many cases, this spread is within the margin of error or what you might see between different random seeds. If you are using a reliable provider like those found on n1n.ai, the model choice is rarely the bottleneck.
Next, we tested the 'Data Variable.' We kept the worst-performing model from the first test but changed the input text.
- Version 1 (Raw): We fed the embedder raw JSON blobs containing bio, skills arrays, and work history.
- Version 2 (Structured Summary): We used a high-reasoning LLM to pre-process the data. This pipeline parses PDF portfolios, performs OCR on project images, and synthesizes a single, coherent natural-language paragraph describing the candidate's core identity and skills.
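The two input variants can be sketched roughly like this; the profile fields and prompt wording are illustrative stand-ins, and `build_summary_prompt` is a hypothetical helper, not our exact production prompt:

```python
import json

profile = {
    "bio": "Built three consumer apps while studying part-time.",
    "skills": ["React", "Postgres"],
    "work_history": [{"role": "Founder", "years": 2}],
}

# Version 1 (Raw): the JSON blob is embedded as-is,
# structural syntax and redundant keys included.
raw_input = json.dumps(profile)

# Version 2 (Structured Summary): an LLM first turns the blob into one
# coherent paragraph, and only that paragraph is embedded.
def build_summary_prompt(p):
    return (
        "Write one coherent paragraph describing this candidate's core "
        "identity and skills, for semantic search:\n" + json.dumps(p)
    )
```

Version 2 costs one extra LLM call per profile at ingestion, which is exactly the trade discussed below.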
The result? A 40-point jump in retrieval quality.
Why Data Upstream Matters More
Embedding models are designed to map semantically similar text to nearby points in a vector space. The keyword here is 'text.' When you feed a model a raw JSON dump, you are introducing massive amounts of semantic noise. The model sees structural syntax, redundant keys, and fragmented lists. It produces a vector that represents 'a recruitment JSON object.'
When you feed the model a clean, LLM-generated summary, you are giving it a high-signal narrative. By using an LLM aggregator like n1n.ai to access powerful models for this summarization step, you ensure that the embedding model is working with the best possible representation of the underlying entity.
The Architectural Blueprint for High-Quality Search
If you want to move the needle by 40 points instead of 7, you need to change your architecture. Here is how we built the pipeline at Vaultt.
1. Amortize LLM Costs at Ingestion
One of the biggest mistakes teams make is trying to do all the heavy lifting at query time. We call a powerful LLM (like Claude 3.5 Sonnet or GPT-4o) exactly once per candidate during the ingestion phase. This LLM creates the 'Searchable Summary.' Because this happens at ingestion, the cost is amortized over every search that candidate will ever appear in.
At query time, we use a much cheaper embedding model to convert the user's query into a vector. This keeps latency low (sub-20ms) while maintaining the intelligence of the high-end LLM. Developers can easily test different summarization models using the unified API at n1n.ai to find the best balance of cost and summary quality.
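The ingestion/query split above reduces to two small functions. This is a sketch under stated assumptions: `llm_summarize`, `embed`, and `index` are placeholders for your LLM call, your embedding call, and your vector store, not real APIs:

```python
def ingest_candidate(profile, llm_summarize, embed):
    """One expensive LLM call per candidate, paid once at ingestion.
    The cost is amortized over every future search the candidate appears in."""
    summary = llm_summarize(profile)   # e.g. Claude or GPT, called exactly once
    vector = embed(summary)            # cheap embedding model on clean text
    return {"summary": summary, "embedding": vector}

def search(query, embed, index):
    """Query time: only the cheap embedding model runs, keeping latency low."""
    query_vector = embed(query)
    return index.nearest(query_vector, k=10)
```

The key property is asymmetry: the expensive model never sits on the query path, so per-search latency and cost stay flat no matter how smart the summarizer is.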
2. The Power of pgvector and Postgres
We run everything on Postgres using the pgvector extension. For a corpus of 10,000 to 1,000,000 vectors, a dedicated vector database is often an unnecessary complexity.
With pgvector and HNSW indexing, we achieve incredible performance. The real advantage, however, is Hybrid Search. In a single SQL query, we can combine structured filters (location, salary, availability) with semantic distance:
SELECT name, bio_summary,
embedding <=> '[0.12, -0.05, ...]' AS distance
FROM candidates
WHERE location = 'New York'
AND availability = 'Immediate'
ORDER BY distance ASC
LIMIT 10;
This approach gives you one backup strategy, one permission model, and zero data synchronization headaches between your primary DB and a vector store.
3. Filtering vs. Embedding
Stop trying to make embeddings do everything. If a user wants a 'React Developer in London,' do not rely on the embedding to understand 'London.'
- Structured Data: Use B-tree indexes for hard filters (Location, Role Type, Years of Experience).
- Unstructured Meaning: Use embeddings for 'vibe' and 'technical depth' (The kind of thinker they are, the complexity of their projects).
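The division of labor above, hard filters first, semantic ranking second, is what the SQL query in section 2 does inside Postgres. As a self-contained illustration of the same logic (pure Python, with an in-memory candidate list standing in for the database):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

def hybrid_search(candidates, query_vector, location, k=10):
    """Hard filter first (the B-tree's job), then rank the
    survivors by semantic distance (the embedding's job)."""
    pool = [c for c in candidates if c["location"] == location]
    pool.sort(key=lambda c: cosine_distance(c["embedding"], query_vector))
    return pool[:k]
```

Because 'London' is resolved by an exact filter, the embedding never has to encode geography, and all of its capacity goes to the 'vibe' and technical depth it is actually good at.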
Cost Analysis: Don't Overpay for 7 Points
Let's look at the numbers. Google's gemini-embedding-001 costs roughly $0.18 per million tokens, while budget embedding models cost a small fraction of that: roughly a 30x price difference.
If switching to the more expensive model only buys you a 7-point boost on a benchmark that doesn't even reflect your specific domain, you are wasting capital. That money is better spent on a one-time LLM pass to clean your data. At a scale of 1 million profiles, the difference between these models can be thousands of dollars per month, dollars that yield far higher ROI when spent on data quality.
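The arithmetic is worth making explicit. The figures below are placeholder assumptions for illustration (500 tokens per profile, the $0.18 rate from above, and a flat 30x ratio), not quoted prices:

```python
# Illustrative assumptions only; substitute your provider's actual rates.
PRICE_PREMIUM = 0.18 / 1_000_000        # $ per token, premium embedding model
PRICE_BUDGET = PRICE_PREMIUM / 30       # ~30x cheaper budget model

def embedding_cost(profiles, tokens_per_profile, price_per_token):
    return profiles * tokens_per_profile * price_per_token

premium = embedding_cost(1_000_000, 500, PRICE_PREMIUM)  # $90.00
budget = embedding_cost(1_000_000, 500, PRICE_BUDGET)    # $3.00
# The gap widens with every re-embedding run; that delta is the budget
# for a one-time LLM cleaning pass instead.
```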
The 4-Step Roadmap to Better Search
If you are ready to stop chasing leaderboards and start building better products, follow this order of operations:
- Build a Real Eval Set: Collect 50-100 real queries from your users. Identify the 'Gold Standard' results for each. Without this, you are just 'vibe-checking' your AI.
- Optimize the Input: Use an LLM to transform your messy data into clean, narrative summaries. Focus on removing noise and highlighting the most important entities.
- Implement Hybrid Search: Move your vectors into your main database (like Postgres) and combine them with metadata filters.
- Swap Models Last: Only after steps 1-3 are complete should you consider testing a new embedding model.
Conclusion
Search quality is an engineering problem, not just a model selection problem. The 'unsexy' work of data cleaning, input construction, and evaluation rigor is where the 40% improvements live. The embedding leaderboard is a local maximum; the true peak is found upstream.
Get a free API key at n1n.ai.