Implementing Four Generations of Semantic Search: From TF-IDF to Transformers

The transition from matching keywords to understanding intent has been the most significant shift in Information Retrieval (IR) over the last decade. Today, developers building search engines or Retrieval-Augmented Generation (RAG) systems must choose among several architectures. This guide walks through the four generations of search, provides Python implementations along the way, and looks at how hosted APIs such as n1n.ai fit into the final, most sophisticated stage.

Generation 1: Lexical Search (TF-IDF and BM25)

In the early days, search was purely about lexical overlap. If a user searched for "feline," but the document used the word "cat," the system would find nothing. The mathematical backbone of this era was TF-IDF (Term Frequency-Inverse Document Frequency).

TF-IDF scores a term by how often it appears in a document, discounted by how common that term is across the entire corpus. BM25 refines the same idea with term-frequency saturation and document-length normalization.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The cat sits on the mat.",
    "Dogs are a man's best friend.",
    "Feline creatures enjoy resting on rugs."
]

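# Learn the vocabulary and IDF weights from the corpus, then vectorize it
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
# Project the query into the same TF-IDF space
query_vec = vectorizer.transform(["cat on mat"])

# One row of scores, one column per document in the corpus
similarity = cosine_similarity(query_vec, tfidf_matrix)
print(similarity)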
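
The heading for this generation also names BM25, so here is a minimal pure-Python sketch of BM25 scoring for comparison. The tokenizer, the function names, and the parameter defaults (k1=1.5, b=0.75) are assumptions chosen for illustration, and the IDF uses the common "+1 inside the log" variant so that scores never go negative. Treat it as a sketch of the idea rather than a production ranker.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Deliberately naive tokenizer: lowercase, keep letters, digits, apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc_tokens in tokenized:
        df.update(set(doc_tokens))

    scores = []
    for doc_tokens in tokenized:
        tf = Counter(doc_tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation (k1) and length normalization (b)
            denom = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    "The cat sits on the mat.",
    "Dogs are a man's best friend.",
    "Feline creatures enjoy resting on rugs.",
]
scores = bm25_scores("cat on mat", corpus)
print(scores)
```

The first document matches all three query terms, the third matches only "on", and the second matches nothing, so it scores zero. The "feline" sentence still gets no credit for being about cats, which is exactly the vocabulary gap that lexical methods cannot close.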

The Limitation: Lexical search suffers from the "vocabulary mismatch" problem. It cannot handle synonyms or polysemy (words with multiple meanings).

Generation 2: Static Embeddings (Word2Vec and GloVe)

Around 2013, the introduction of Word2Vec shifted the focus to dense vectors. Every word was mapped to a point in a multi-dimensional space where words with similar meanings were close to each other.

To search a document, we would average the word vectors in a sentence to create a "Sentence Embedding." While an improvement, these embeddings were context-free. The word "bank" in "river bank" and "bank account" would have the same vector.
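
To make the averaging step concrete, here is a small sketch using NumPy. The three-dimensional vectors in `toy_vectors` are made up by hand for illustration; real Word2Vec or GloVe vectors are learned from data and typically have 100 to 300 dimensions. The helper name `sentence_embedding` is likewise hypothetical.

```python
import numpy as np

# Hypothetical, hand-picked word vectors (NOT real Word2Vec output)
toy_vectors = {
    "river": np.array([0.9, 0.1, 0.0]),
    "bank": np.array([0.5, 0.5, 0.0]),
    "account": np.array([0.0, 0.9, 0.1]),
}

def sentence_embedding(sentence, vectors):
    """Average the vectors of all known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

river_bank = sentence_embedding("river bank", toy_vectors)
bank_account = sentence_embedding("bank account", toy_vectors)

print(river_bank)
print(bank_account)
```

Note that "bank" contributes the identical vector to both sentences. Only the neighboring words pull the two averages apart, which is the context-free limitation described above.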

Generation 3: Contextual Embeddings (Bi-Encoders)

With the advent of BERT and Transformers, we entered the era of Bi-Encoders. Models like Sentence-BERT (SBERT) allow us to encode entire sentences into a single vector that captures context.

In this architecture, the query and the documents are mapped to the same vector space. We use vector databases to find the nearest neighbors using Cosine Similarity. This is the foundation of most modern RAG systems.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# `corpus` is the list defined in the Generation 1 example
doc_embeddings = model.encode(corpus)
query_embedding = model.encode("a kitten on a carpet")

# The query shares no content words with the corpus, yet a cat-related sentence should rank first
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=1)
print(hits)
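
Under the hood, the nearest-neighbor step is a similarity ranking over vectors. The brute-force sketch below shows the idea with NumPy. The two-dimensional embeddings are invented for illustration, `top_k_cosine` is a hypothetical helper, and a real vector database would typically use an approximate index rather than scoring every document.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=1):
    """Return (row_index, cosine_similarity) for the k closest documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity of every row with the query
    order = np.argsort(-sims)[:k]     # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Made-up 2-D "embeddings", one row per document
doc_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
])
query_vec = np.array([0.9, 0.1])

results = top_k_cosine(query_vec, doc_matrix, k=2)
print(results)
```

The query points mostly along the first axis, so row 0 ranks first and the diagonal row 2 comes second.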

For production workloads, embedding latency and throughput matter. A hosted service such as n1n.ai gives you access to state-of-the-art embedding models without running your own inference servers, which can lower latency compared with self-hosting. Keeping search responsive over millions of documents also depends on the vector index you choose.

Generation 4: Cross-Encoders and LLM Reranking

While Bi-Encoders are fast, they lose information by compressing a sentence into a single vector. The fourth generation introduces Cross-Encoders.

In a Cross-Encoder, the query and a candidate document are fed into the Transformer together. The model attends to every word in the query relative to every word in the document. This provides significantly higher accuracy but is computationally expensive.

The Modern Hybrid Pipeline:

1. Retrieval: Use a Bi-Encoder (or BM25) to find the top 100 candidates.
2. Reranking: Use a Cross-Encoder or a powerful LLM like Claude 3.5 Sonnet via n1n.ai to rescore those 100 candidates and surface the best matches.
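
The two stages above can be sketched as a single function that takes a cheap scorer and an expensive scorer. Everything here is illustrative: `retrieve_then_rerank`, `token_overlap`, and `jaccard_pair_score` are hypothetical names, and the two scorers are simple stand-ins so the example runs without downloading a model. They are not a Bi-Encoder or a Cross-Encoder.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    docs: List[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    n_candidates: int = 100,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    # Stage 1: score every document cheaply and keep the best candidates
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: run the expensive scorer only on the surviving candidates
    rescored = [(d, expensive_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]

def token_overlap(query: str, doc: str) -> float:
    # Stand-in for stage 1 (BM25 or a Bi-Encoder): count shared tokens
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def jaccard_pair_score(query: str, doc: str) -> float:
    # Stand-in for stage 2 (a Cross-Encoder or LLM): Jaccard similarity of the pair
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

docs = [
    "the cat sat on the mat",
    "cat",
    "dogs chase the cat on the lawn near the mat and the house",
]
results = retrieve_then_rerank(
    "cat on the mat", docs, token_overlap, jaccard_pair_score,
    n_candidates=2, top_k=2,
)
print(results)
```

In a real pipeline, `cheap_score` would be replaced by a vector index lookup over precomputed embeddings, and `expensive_score` would wrap a Cross-Encoder (for example, the `CrossEncoder` class in sentence-transformers) or an LLM call. The point of the structure is that the expensive model only ever sees `n_candidates` documents, not the whole corpus.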

Implementation Guide: Building a Hybrid Searcher

To implement a robust system, you should combine lexical and semantic signals. This is often called Hybrid Search.
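
One common way to combine the two signals is Reciprocal Rank Fusion (RRF), which merges ranked lists without needing the underlying scores to be on the same scale. RRF is not named in this article; it is used here as one reasonable option among several. The function name, the document IDs, and the constant k=60 (a conventional default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Higher fused scores rank first.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs of a lexical ranker (e.g. BM25) and a semantic ranker (e.g. a Bi-Encoder)
lexical_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_c", "doc_a", "doc_d"]

fused_ranking = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print(fused_ranking)
```

Documents that appear near the top of both lists, such as `doc_a` and `doc_c` here, rise above documents that only one ranker found.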

Generation	Method	Pros	Cons
Gen 1	BM25	Fast, exact matches	No semantic understanding
Gen 2	Word2Vec	Better than keywords	Context-blind
Gen 3	Bi-Encoders	Fast semantic search	Information loss in compression
Gen 4	Cross-Encoders	Maximum accuracy	Slow for large datasets

For developers scaling these systems, managing multiple API keys for different models (DeepSeek-V3 for reasoning, Claude for reranking) can be a nightmare. This is where n1n.ai excels by providing a unified interface to all leading LLMs, allowing you to swap rerankers with a single line of code change.

Pro Tips for Semantic Search

Normalization: Normalize your vectors to unit length before indexing. A plain dot product then equals Cosine Similarity, so scores stay consistent across inner-product and cosine indexes.
Quantization: When storing millions of embeddings, use Product Quantization (PQ). Depending on the configuration it can cut memory usage by 90% or more, at some cost in recall.
Domain Adaptation: If you are searching medical or legal documents, fine-tune your Bi-Encoder on domain-specific data.
Leverage LLMs for Query Expansion: Before searching, use an LLM via n1n.ai to generate synonyms for the user's query to improve recall.
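
The normalization tip is easy to verify numerically. The sketch below uses random vectors as stand-ins for real embeddings, and `l2_normalize` is a hypothetical helper name. It checks that the dot product of unit-length vectors matches the cosine similarity of the original vectors.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length, guarding against zero-length rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

# Random stand-ins for real embeddings: 4 vectors of dimension 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))

unit = l2_normalize(embeddings)

# Cosine similarity computed the long way on the raw vectors
raw_norms = np.linalg.norm(embeddings, axis=1)
cosine = (embeddings @ embeddings.T) / np.outer(raw_norms, raw_norms)

# Dot product on the normalized vectors
dot = unit @ unit.T

print(np.allclose(dot, cosine))
```

Normalizing once at indexing time means the search path only needs a matrix multiplication, with no per-query division by norms.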

Conclusion

Semantic search has evolved from simple keyword matching to deep neural understanding. By combining the speed of Bi-Encoders with the precision of Cross-Encoder or LLM-based reranking, you can build search experiences that capture user intent rather than surface wording.

Get a free API key at n1n.ai.

Source: https://towardsdatascience.com/from-tf-idf-to-transformers-implementing-four-generations-of-semantic-search/