Hybrid Search

VeloDB Engineering Team· 2025/09/05

What is Hybrid Search?

Hybrid search combines multiple retrieval methods, primarily keyword-based (lexical) search and vector-based (semantic) search, into a single ranked result list. Exact term matching catches literal matches such as product names and error codes, while semantic matching surfaces content that is related in meaning but worded differently, so the results cover both literal matches and contextual meaning.

Limitations of Single Search Methods

Traditional keyword search excels at finding exact matches but struggles with synonyms, context, and semantic relationships. Vector search understands meaning but may miss exact term requirements or specific terminology that users expect.

Complementary Strengths

Keyword search provides precise control and explainable results, while vector search offers semantic understanding and handles intent-based queries. Combining both approaches leverages their respective strengths while mitigating individual weaknesses.

User Expectation Diversity

Different queries require different search approaches. Some users search for specific technical terms requiring exact matches, while others use natural language queries that benefit from semantic understanding. Hybrid search accommodates both patterns.

Improved Result Quality

Combining semantic and keyword retrieval often outperforms either approach on its own: documents that only one method would find still reach the result list, and documents that both methods rank highly rise to the top.

Hybrid Search Architecture

Multi-Query Processing

Hybrid search executes multiple search queries simultaneously, typically including BM25-based keyword search and k-nearest neighbor (kNN) vector search, processing user queries through parallel search pipelines.
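The parallel pipelines described above can be sketched with Python's standard thread pool. This is a minimal illustration, not a production design: `keyword_retriever` and `vector_retriever` are hypothetical stand-ins that return fixed ID lists, where a real system would call a BM25 index and a kNN index.

```python
from concurrent.futures import ThreadPoolExecutor

def keyword_retriever(query):
    # Hypothetical stand-in for a BM25 / inverted-index lookup.
    return ["doc3", "doc1"]

def vector_retriever(query):
    # Hypothetical stand-in for a kNN lookup over embeddings.
    return ["doc1", "doc2"]

def run_retrievers(query, retrievers):
    """Run every retriever on the same query concurrently.

    Returns {retriever_name: ranked_ids}, one ranked list per pipeline.
    """
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in retrievers.items()}
        return {name: fut.result() for name, fut in futures.items()}

results = run_retrievers(
    "hybrid search",
    {"keyword": keyword_retriever, "vector": vector_retriever},
)
print(results)
```

Each pipeline produces its own ranked list, and those lists are the input to the fusion step.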

Result Fusion Mechanism

Fusion algorithms such as Reciprocal Rank Fusion (RRF) merge the ranked lists produced by the different search methods into a single, unified result set.

Reciprocal Rank Fusion: RRF combines multiple result sets with different relevance indicators into a single ranked list without requiring manual weight tuning or score normalization.
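As a worked illustration, RRF scores each document by summing 1/(k + rank) over every ranked list in which it appears, where rank starts at 1 and k is a smoothing constant (60 is a common choice, and the value used in the examples later in this article). The sketch below assumes each retriever returns an ordered list of document IDs; the function name `rrf_fuse` is illustrative.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Returns [(doc_id, score), ...] ordered by descending score.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by ID so the output order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))

keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = rrf_fuse([keyword_hits, vector_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Only rank positions enter the formula, which is why raw BM25 scores and vector distances never need to be put on a common scale.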

Adaptive Weighting

Dynamic scoring mechanisms balance keyword and semantic results based on query characteristics, user context, and historical performance data to optimize relevance.
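One way to picture query-dependent weighting is a small heuristic that inspects the query text. The sketch below covers only query characteristics, not user context or historical performance data, and both the rules and the weight values are hypothetical choices made for illustration.

```python
import re

def choose_weights(query):
    """Return (keyword_weight, vector_weight) for a query.

    Hypothetical rules:
    - quoted phrases or identifier-like tokens (digits, underscores)
      suggest the user wants exact matches, so favor keyword search;
    - longer natural-language queries favor vector search;
    - everything else is balanced.
    """
    tokens = query.split()
    has_quoted_phrase = '"' in query
    has_identifier = any(re.search(r"[\d_]", token) for token in tokens)
    if has_quoted_phrase or has_identifier:
        return 0.7, 0.3
    if len(tokens) >= 6:
        return 0.3, 0.7
    return 0.5, 0.5

for q in ["iPhone 15 Pro", "fast smartphone with good camera and battery", "hybrid search"]:
    print(q, "->", choose_weights(q))
```

The returned weights would then scale each retriever's contribution during fusion.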

Unified Index Architecture

Modern implementations maintain both inverted indices for keyword search and vector indices for semantic search within unified data structures, enabling efficient hybrid operations.

Best-of-Both-Worlds Accuracy

Combines exact keyword matching with semantic understanding, ensuring users find both precise terminology matches and conceptually related content in unified result sets.

Automatic Result Fusion

Advanced algorithms like RRF automatically combine and rank results from different search methods without requiring manual configuration or complex scoring adjustments.

Query-Adaptive Behavior

Intelligent systems can adjust the balance between keyword and semantic search based on query characteristics, user behavior patterns, and content domains.

Scalable Performance

Optimized implementations leverage parallel processing and efficient indexing structures to maintain fast response times while processing multiple search algorithms simultaneously.

Enhanced User Experience

Users benefit from more relevant results without needing to understand or choose between different search methods, creating intuitive search experiences that handle diverse query types.

E-commerce Product Search

Combine exact product name matching with semantic similarity to help customers find both specific items and similar products, handling queries like "iPhone 15 Pro" and "fast smartphone with good camera."

Enterprise Knowledge Management

Search internal documentation, wikis, and knowledge bases using both specific technical terms and natural language questions, accommodating different user expertise levels and search preferences.

Academic and Research Databases

Enable researchers to find papers using exact citation searches, technical terminology, and conceptual queries, providing comprehensive coverage of relevant literature across different search approaches.

Customer Support Systems

Help support agents and customers find solutions using specific error codes, product names, or natural language descriptions of problems and symptoms.

Content Discovery Platforms

Power recommendation and search systems that understand both explicit user preferences and semantic content relationships, improving content discoverability across diverse media types.

Implementation Examples

Apache Doris Hybrid Search with RRF

Minimal schema (keyword + vectors)

CREATE DATABASE IF NOT EXISTS hybrid;
USE hybrid;

CREATE TABLE IF NOT EXISTS articles (
  id BIGINT,
  title STRING,
  content TEXT,
  title_embedding ARRAY<FLOAT>,     -- e.g., 384-dim
  content_embedding ARRAY<FLOAT>,   -- e.g., 384-dim
  updated_at DATETIME,
  INDEX idx_title(title) USING INVERTED PROPERTIES("parser"="english","support_phrase"="true"),
  INDEX idx_content(content) USING INVERTED PROPERTIES("parser"="english","support_phrase"="true")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_allocation"="tag.location.default: 1");

Basic Hybrid (Keyword + Vector) with RRF (SQL via Python)

# pip install pymysql sentence-transformers
import pymysql
from sentence_transformers import SentenceTransformer

enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def arr(v):
    # Render a list of floats as a Doris array literal, e.g. [0.100000,0.200000]
    return "[" + ",".join(f"{x:.6f}" for x in v) + "]"

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="hybrid", autocommit=True)
cur = conn.cursor()

def index_documents(docs):
    """
    docs = [
      {"id":1,"title":"Introduction to Machine Learning",
       "content":"ML is a subset of AI ..."},
      ...
    ]
    """
    for d in docs:
        t_emb = arr(enc.encode(d["title"]).tolist())
        c_emb = arr(enc.encode(d["content"]).tolist())
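        # Bind id, title, and content as parameters; only the embedding
        # literals (generated locally by arr()) are interpolated into the SQL.
        cur.execute(f"""
          INSERT INTO articles
            (id, title, content, title_embedding, content_embedding, updated_at)
          VALUES (%s, %s, %s, {t_emb}, {c_emb}, NOW())
        """, (d["id"], d["title"], d["content"]))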

def hybrid_rrf(query, top_k=10, kw_window=200, vec_window=200, rrf_k=60):
    # simple keywords from query
    keywords = " ".join([w for w in query.split() if len(w) > 2])
    qvec = arr(enc.encode(query).tolist())

    sql = f"""
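    WITH kw AS (
      -- Keyword matches, ranked by recency as a simple stand-in for a relevance score
      SELECT id, title, content,
             ROW_NUMBER() OVER (ORDER BY updated_at DESC, id) AS kw_rank
      FROM articles
      WHERE (title MATCH_ANY %s OR content MATCH_ANY %s)
      -- Without ORDER BY, LIMIT would keep an arbitrary subset of the matches
      ORDER BY kw_rank
      LIMIT {kw_window}
    ),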
    vec AS (
      SELECT id, title, content,
             cosine_distance(title_embedding, {qvec}) AS dist,
             ROW_NUMBER() OVER (ORDER BY cosine_distance(title_embedding, {qvec}) ASC) AS vec_rank
      FROM articles
      -- Keep only the nearest rows; LIMIT alone would not respect the ranking
      ORDER BY vec_rank
      LIMIT {vec_window}
    ),
    unioned AS (
      SELECT id, title, content, kw_rank, NULL AS vec_rank FROM kw
      UNION ALL
      SELECT id, title, content, NULL, vec_rank FROM vec
    ),
    agg AS (
      SELECT id,
             any_value(title)  AS title,
             any_value(content) AS content,
             MIN(kw_rank)  AS kw_rank,
             MIN(vec_rank) AS vec_rank
      FROM unioned
      GROUP BY id
    )
    SELECT id, title, content,
           (CASE WHEN kw_rank  IS NOT NULL THEN 1.0/({rrf_k}+kw_rank)  ELSE 0 END) +
           (CASE WHEN vec_rank IS NOT NULL THEN 1.0/({rrf_k}+vec_rank) ELSE 0 END) AS rrf_score
    FROM agg
    ORDER BY rrf_score DESC
    LIMIT {top_k};
    """
    cur.execute(sql, (keywords, keywords))
    return cur.fetchall()

# Example
index_documents([
  {"id": 1, "title": "Introduction to Machine Learning",
   "content": "Machine learning is a subset of artificial intelligence ..."},
  {"id": 2, "title": "Deep Learning Neural Networks",
   "content": "Deep learning uses multi-layer neural networks ..."}
])

rows = hybrid_rrf("artificial intelligence algorithms", top_k=5)
for r in rows:
    print(f"ID={r[0]} | Title={r[1]} | RRF={r[3]:.4f}")

Advanced Hybrid with Multiple Retrievers (Title + Content + Keyword) + RRF

def hybrid_multi_rrf(query, top_k=10, kw_window=200, vec_window=200, rrf_k=60):
    keywords = " ".join([w for w in query.split() if len(w) > 2])
    qvec = arr(enc.encode(query).tolist())

    sql = f"""
    WITH kw AS (
      -- Keyword matches, ranked by recency as a simple stand-in for a relevance score
      SELECT id, title, content,
             ROW_NUMBER() OVER (ORDER BY updated_at DESC, id) AS kw_rank
      FROM articles
      WHERE (title MATCH_ANY %s OR content MATCH_ANY %s)
      ORDER BY kw_rank
      LIMIT {kw_window}
    ),
    v_title AS (
      -- Nearest titles; ORDER BY ensures LIMIT keeps the top-ranked rows
      SELECT id, title, content,
             ROW_NUMBER() OVER (ORDER BY cosine_distance(title_embedding, {qvec}) ASC) AS vt_rank
      FROM articles
      ORDER BY vt_rank
      LIMIT {vec_window}
    ),
    v_content AS (
      -- Nearest content bodies, same pattern as v_title
      SELECT id, title, content,
             ROW_NUMBER() OVER (ORDER BY cosine_distance(content_embedding, {qvec}) ASC) AS vc_rank
      FROM articles
      ORDER BY vc_rank
      LIMIT {vec_window}
    ),
    unioned AS (
      SELECT id, title, content, kw_rank, NULL AS vt_rank, NULL AS vc_rank FROM kw
      UNION ALL
      SELECT id, title, content, NULL, vt_rank, NULL FROM v_title
      UNION ALL
      SELECT id, title, content, NULL, NULL, vc_rank FROM v_content
    ),
    agg AS (
      SELECT id,
             any_value(title) AS title,
             any_value(content) AS content,
             MIN(kw_rank) AS kw_rank,
             MIN(vt_rank) AS vt_rank,
             MIN(vc_rank) AS vc_rank
      FROM unioned
      GROUP BY id
    )
    SELECT id, title, content,
           (CASE WHEN kw_rank IS NOT NULL THEN 1.0/({rrf_k}+kw_rank) ELSE 0 END) +
           (CASE WHEN vt_rank IS NOT NULL THEN 1.0/({rrf_k}+vt_rank) ELSE 0 END) +
           (CASE WHEN vc_rank IS NOT NULL THEN 1.0/({rrf_k}+vc_rank) ELSE 0 END) AS rrf_score
    FROM agg
    ORDER BY rrf_score DESC
    LIMIT {top_k};
    """
    cur.execute(sql, (keywords, keywords))
    return cur.fetchall()

rows = hybrid_multi_rrf("database performance optimization", top_k=5)
for r in rows:
    print(f"ID={r[0]} | Title={r[1]} | RRF={r[3]:.4f}")

Custom Hybrid (Weighted fusion or RRF in Python using Doris retrievers)

from typing import Dict, List

def keyword_ids(query: str, limit=200) -> Dict[int, int]:
    # returns {id: rank} by recency among keyword matches
    keywords = " ".join([w for w in query.split() if len(w) > 2])
    cur.execute("""
      SELECT id,
             ROW_NUMBER() OVER (ORDER BY updated_at DESC, id) AS rnk
      FROM articles
      WHERE title MATCH_ANY %s OR content MATCH_ANY %s
      ORDER BY rnk
      LIMIT %s
    """, (keywords, keywords, limit))
    return {row[0]: row[1] for row in cur.fetchall()}

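VECTOR_FIELDS = {"title_embedding", "content_embedding"}

def vector_ids(query: str, field: str, limit=200) -> Dict[int, int]:
    # field is interpolated into the SQL text, so restrict it to known columns
    if field not in VECTOR_FIELDS:
        raise ValueError(f"unknown embedding column: {field}")
    q = arr(enc.encode(query).tolist())
    cur.execute(f"""
      SELECT id,
             ROW_NUMBER() OVER (ORDER BY cosine_distance({field}, {q}) ASC) AS rnk
      FROM articles
      ORDER BY rnk
      LIMIT %s
    """, (limit,))
    return {row[0]: row[1] for row in cur.fetchall()}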

def fuse_rrf(query: str, top_k=10, rrf_k=60) -> List[int]:
    kw = keyword_ids(query)
    vt = vector_ids(query, "title_embedding")
    vc = vector_ids(query, "content_embedding")

    # Sum 1/(rrf_k + rank) over every retriever that returned the document
    scores = {}
    for bucket in (kw, vt, vc):
        for _id, rnk in bucket.items():
            scores[_id] = scores.get(_id, 0.0) + 1.0 / (rrf_k + rnk)

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

print(fuse_rrf("Python machine learning tutorial", top_k=5))
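The weighted variant named in this section's title can be sketched as a small change to RRF: each retriever's contribution is multiplied by a weight. The function below is an illustrative assumption rather than a Doris feature. It takes `{id: rank}` maps shaped like those returned by `keyword_ids` and `vector_ids`, and the retriever names and weight values are hypothetical.

```python
def fuse_weighted_rrf(buckets, weights, rrf_k=60, top_k=10):
    """Weighted rank fusion over {retriever_name: {doc_id: rank}} maps.

    A retriever missing from `weights` defaults to weight 1.0, which
    reduces to plain RRF when no weights are given.
    """
    scores = {}
    for name, ranks in buckets.items():
        weight = weights.get(name, 1.0)
        for doc_id, rank in ranks.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    # Sort by descending score, then by ID for a deterministic order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]

# Hand-written ranks standing in for real retriever output.
buckets = {
    "keyword": {1: 1, 2: 2},
    "title_vec": {3: 1, 2: 2},
    "content_vec": {3: 1, 2: 2},
}
print(fuse_weighted_rrf(buckets, {}))                # equal weights
print(fuse_weighted_rrf(buckets, {"keyword": 3.0}))  # keyword-heavy
```

With equal weights, document 3 outranks document 1 because two vector retrievers agree on it; tripling the keyword weight reverses that order.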

Key Takeaways

Hybrid search combines the precision of keyword search with the semantic understanding of vector search. Each method covers the other's blind spots: keyword search catches exact terminology, and vector search catches related content that is worded differently. Fusion algorithms like Reciprocal Rank Fusion merge the two result lists without manual weight tuning, which keeps hybrid search accessible without complex configuration. For organizations seeking to improve search experiences, hybrid search offers a practical way to accommodate diverse query patterns and user expectations. As search technologies continue to advance, hybrid approaches will likely become the standard for applications requiring both precision and semantic understanding.

Frequently Asked Questions

Q: How does hybrid search determine the balance between keyword and vector results?

A: Hybrid search uses algorithms like Reciprocal Rank Fusion (RRF) or weighted scoring to automatically balance results. RRF works without manual tuning, while weighted approaches allow custom adjustment of keyword vs. semantic importance based on specific use cases.

Q: Is hybrid search slower than individual search methods?

A: Hybrid search runs more than one retrieval method per query, but modern implementations use parallel processing and optimized indexing to maintain fast response times. The added processing time is typically small relative to the improvement in result quality.

Q: What types of queries benefit most from hybrid search?

A: Hybrid search excels with natural language queries, product searches combining specific and descriptive terms, technical documentation searches, and any scenario where users might use both exact terminology and conceptual descriptions.

Q: How do I implement hybrid search without Elasticsearch?

A: You can build custom hybrid search by combining an existing keyword search engine (such as Solr or a custom BM25 implementation) with a vector database (such as Pinecone or Weaviate) and implementing a fusion algorithm like RRF or weighted combination.
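To make the "custom BM25 implementation" option concrete, here is a minimal in-memory sketch of Okapi BM25 ranking. It assumes whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the non-negative idf form log(1 + (N - n + 0.5)/(n + 0.5)); the function name `bm25_rank` is hypothetical, and a real deployment would use an inverted index rather than scanning every document.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return indices of docs ordered by BM25 score, best first.

    Documents that match no query term are dropped.
    """
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties broken by document position.
    return [idx for score, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]

docs = [
    "reciprocal rank fusion merges ranked lists",
    "vector search uses embeddings",
    "keyword search uses an inverted index",
]
print(bm25_rank("inverted index", docs))
print(bm25_rank("search", docs))
```

The ranked indices from a function like this form the keyword side of the fusion step, alongside the ranked list returned by the vector database.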

Q: Can hybrid search work with domain-specific terminology?

A: Yes, hybrid search handles domain-specific terms through both keyword matching for exact terminology and vector search for related concepts. This makes it particularly effective for technical, medical, legal, and other specialized domains.