Distributed Search

8 min read · Updated 2026-04-25

Search is one of the most-touched features in any SaaS product. As the dataset grows, "just SQL LIKE" doesn't cut it. Distributed search engines like Elasticsearch and OpenSearch power the search bars, log analysis, and analytics in much of the modern web.

The Problem That Search Solves

Relational databases excel at structured queries: "find all users where region = 'EU'." They struggle with:

Full-text search
Match relevance against natural language. For "find docs about distributed transactions," the database knows nothing about "about" or relevance scoring.
Fuzzy matching
Typos, synonyms, plurals, stemming. "Find Stripe" should also match "stripes" and "Stripe.com."
Faceted search
Aggregate filter counts. "127 results in Engineering, 89 in Sales, 23 in Support."
Multi-language
Stop words, tokenization, and stemming work differently per language; relational databases don't handle any of this.

The Inverted Index

The data structure that powers search.

Document 1: "the quick brown fox"
Document 2: "the lazy brown dog"
Document 3: "quick foxes are clever"

Inverted index:
  brown → [1, 2]
  clever → [3]
  dog → [2]
  fox → [1]
  foxes → [3]
  lazy → [2]
  quick → [1, 3]
  ...

Search for "quick brown" → look up "quick" → [1, 3] and "brown" → [1, 2]. Intersection: [1]. Document 1 matches.

This data structure makes "find documents containing word X" a single index lookup: the cost scales with the length of that term's posting list, not with the total document count.
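The lookup above can be sketched in a few lines of Python. This is a toy AND-search over an in-memory index, not a real engine (no stemming, scoring, or compression):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """AND-search: intersect the posting lists of every query term."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick foxes are clever",
}
index = build_index(docs)
print(search(index, "quick brown"))  # [1]
```

Note that "fox" and "foxes" land in separate posting lists here; real engines apply stemming at index time so both queries find both documents.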

Distributed Search Architecture

For datasets too big for one machine, search engines shard the index across nodes.

   [ Search request ]
           │
           ▼
   [ Coordinator node ]
           │
           │ scatter to all shards
    ┌──────┼──────┬───────┐
    ▼      ▼      ▼       ▼
  Shard1 Shard2 Shard3 Shard4
    │      │      │       │
    │      │      │       │ (each searches its slice
    │      │      │       │  and returns top-K)
    └──────┴──┬───┴───────┘
              │
              ▼ gather + re-rank
       [ Final top-K to client ]

This scatter-gather pattern means every query touches every shard. Latency is bounded by the slowest shard; throughput is limited by per-shard capacity.
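The scatter-gather flow can be sketched in Python. Toy term-count scoring stands in for BM25, and the shard contents are made up:

```python
import heapq

def shard_search(shard_docs, query_terms, k):
    """Each shard scores only its own slice and returns its local top-K."""
    scored = [(sum(t in text.split() for t in query_terms), doc_id)
              for doc_id, text in shard_docs.items()]
    return heapq.nlargest(k, scored)

def coordinator_search(shards, query, k=2):
    """Scatter the query to every shard, gather the partial top-Ks,
    then re-rank globally and return the final top-K document IDs."""
    terms = query.lower().split()
    partial = [hit for shard in shards for hit in shard_search(shard, terms, k)]
    return [doc_id for score, doc_id in heapq.nlargest(k, partial) if score > 0]

shards = [
    {1: "quick brown fox", 2: "lazy dog"},       # shard 1's slice of the index
    {3: "quick clever foxes", 4: "brown bear"},  # shard 2's slice
]
print(coordinator_search(shards, "quick brown"))  # [1, 4]
```

Note that each shard returns K candidates even though only K survive globally; that over-fetch is what makes the coordinator's re-rank correct.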

Major Search Engines

Elasticsearch / OpenSearch
The dominant choice
Distributed search built on Apache Lucene. Scales horizontally, rich query DSL, aggregations, geo-search. The default for log analysis (ELK stack) and search-as-a-feature.
Algolia / Meilisearch
Search-as-a-service
Hosted, opinionated, fast defaults. Less flexible than ES but dramatically simpler operations. Good fit for product search.
Solr
The other Lucene-based option
More mature than ES in some areas (legal, library, government use). Less popular for new projects but well-supported.
Postgres + pg_trgm + tsvector
Built-in search for modest needs
Full-text search inside Postgres. Saves running a separate system. Good enough for simple use cases under ~1M docs.

Scoring and Relevance

Elasticsearch's default scoring uses BM25, an evolution of TF-IDF.
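The standard BM25 formulation (with the usual defaults k1 ≈ 1.2, b ≈ 0.75) is:

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(t, D) is the frequency of term t in document D, |D| is the document's length, and avgdl is the average document length in the corpus. Intuitively: rare terms (high IDF) count more, repeated terms have diminishing returns, and matches in long documents count less.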

For most uses, BM25 defaults are good. For more sophisticated needs:

Custom field boosts
Title matches count more than body matches: title^3, body^1.
Function scoring
Combine BM25 with custom signals: recency, popularity, user-specific scores.
Learning to rank
ML models that re-rank top-K results based on click logs and user signals.
Vector / semantic search
Use embeddings (from LLMs or specialized models) to find semantically similar content. Combined with BM25 for hybrid search.
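As a concrete example of field boosting, the Elasticsearch query DSL expresses boosts directly in the field list of a multi_match query. The field names and query text below are illustrative:

```python
# Elasticsearch multi_match query body (as a Python dict): a hit in
# "title" contributes three times as much to the score as the same
# hit in "body".
query_body = {
    "query": {
        "multi_match": {
            "query": "distributed transactions",
            "fields": ["title^3", "body"],
        }
    }
}
```

This dict would be sent as the JSON body of a `_search` request against the index.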

The Modern Vector Search Wave

LLMs and embedding models have created a new search paradigm: semantic search based on vector similarity rather than keyword overlap.

Vector databases
Pinecone, Weaviate, Qdrant, Milvus. Store and search high-dimensional vectors using approximate nearest neighbor (ANN) algorithms.
Postgres pgvector
Extension that adds a vector column type and ANN indexes. Lets your existing Postgres handle embeddings.
Elasticsearch dense vectors
ES added vector search support. Hybrid (BM25 + vector) is the modern best-of-both pattern.
RAG (Retrieval-Augmented Generation)
Search the vector index for relevant chunks, feed them to an LLM as context. The new SaaS-AI standard pattern.
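At its core, semantic search is nearest-neighbor lookup by cosine similarity. A brute-force sketch follows; real systems use ANN indexes (e.g. HNSW) to avoid scanning every vector, and the tiny 3-dimensional "embeddings" here are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, index, k=2):
    """Exact (brute-force) nearest neighbors; ANN trades exactness for speed."""
    return sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]),
                  reverse=True)[:k]

# Toy embeddings; real models produce hundreds or thousands of dimensions.
index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.1],
    "doc_c": [0.8, 0.3, 0.1],
}
print(nearest([1.0, 0.0, 0.0], index))  # ['doc_a', 'doc_c']
```

In a hybrid setup, these similarity scores would be blended with BM25 scores rather than used alone.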

Operational Considerations

Running a search engine well is its own discipline.

Sharding strategy
Number of shards per index, replicas per shard. Affects throughput, latency, and recovery time. Shard count is hard to change after index creation.
Reindexing
Schema changes often require a full reindex. Plan for it: use index aliases so you can cut over to the new index atomically.
Storage tiering
Recent data on fast SSD; older data on cheaper storage. ES has hot/warm/cold tiers built in.
Security
Multi-tenant SaaS needs per-tenant index isolation. Either include the tenant ID as a filter in every query, or use separate indexes per tenant.
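The filter variant can be enforced by wrapping every user query in a bool query with a mandatory tenant filter. This is standard Elasticsearch DSL expressed as a Python dict; the field names are illustrative:

```python
def tenant_scoped_query(tenant_id, user_query):
    """Wrap the user's query so it can never match another tenant's documents.
    A `filter` clause is cacheable and does not affect relevance scoring."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": user_query}}],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }
```

The key discipline is that application code never builds queries directly; everything goes through a wrapper like this so the tenant filter cannot be forgotten.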

Indexing Patterns for SaaS

Where does the data come from? Two patterns:

Synchronous indexing
Write to DB and search engine together
On every DB write, also update the search index. Simple, but it adds latency to writes, and the failure modes are complicated (one write succeeds, the other fails).
Asynchronous indexing
Stream changes to search engine
CDC (Debezium, Postgres logical replication) or outbox pattern. DB writes commit immediately; search index updates a few seconds later. Eventually consistent but cleaner.

For most SaaS, async indexing is the right answer. Search results that are 5 seconds stale are fine for almost every use case.
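An outbox-style consumer for async indexing might look like the sketch below. The function names are hypothetical; in production the batches would come from an outbox table or a Debezium topic, and the loop would run continuously:

```python
def drain_outbox(fetch_batch, apply_to_index, mark_done):
    """Replay pending outbox rows into the search index; returns rows applied.
    Updates are keyed by document ID, so replaying a batch is idempotent."""
    applied = 0
    while batch := fetch_batch():      # rows committed by ordinary DB writes
        for change in batch:
            apply_to_index(change)     # upsert or delete in the search engine
        mark_done(batch)               # advance the outbox cursor
        applied += len(batch)
    return applied
```

Because the cursor only advances after the index update succeeds, a crash mid-batch leads to a harmless replay rather than a lost update.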

Recap