Distributed Search
Search is one of the most-touched features in any SaaS product. As the dataset grows, "just SQL LIKE" doesn't cut it. Distributed search engines like Elasticsearch and OpenSearch power the search bars, log analysis, and analytics in much of the modern web.
The Problem That Search Solves
Relational databases excel at structured queries: "find all users where region = 'EU'." They struggle with:
- Full-text search with relevance ranking
- Fuzzy matching and typo tolerance
- Faceted navigation over large result sets
- Multi-language text analysis
The Inverted Index
The data structure that powers search.
Document 1: "the quick brown fox"
Document 2: "the lazy brown dog"
Document 3: "quick foxes are clever"
Inverted index:
brown  → [1, 2]
clever → [3]
dog    → [2]
fox    → [1]
foxes  → [3]
lazy   → [2]
quick  → [1, 3]
...
Search for "quick brown" → look up "quick" → [1, 3], "brown" → [1, 2]. Intersection: [1]. Document 1 matches.
This data structure makes "find documents containing word X" a single index lookup: the cost scales with the length of the matching posting lists, not with the total document count.
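The lookup-and-intersect flow above can be sketched in a few lines of Python (a toy index: whitespace tokenization only, no stemming, stop words, or positions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """AND-query: intersect the posting lists of every query term."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick foxes are clever",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))  # → [1]
```

Real engines store postings in compressed, disk-friendly segments rather than Python sets, but the term → documents mapping is the same idea.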
Distributed Search Architecture
For datasets too big for one machine, search engines shard the index across nodes.
            [ Search request ]
                    │
                    ▼
           [ Coordinator node ]
                    │
                    │ scatter to all shards
      ┌─────────┬────┴────┬─────────┐
      ▼         ▼         ▼         ▼
   Shard 1   Shard 2   Shard 3   Shard 4
      │         │         │         │
      │         │         │         │  (each searches its slice
      │         │         │         │   and returns top-K)
      └─────────┴────┬────┴─────────┘
                     │
                     ▼ gather + re-rank
          [ Final top-K to client ]
This scatter-gather pattern means every query touches every shard. Latency = slowest shard. Throughput = limited by per-shard capacity.
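The gather step amounts to merging each shard's local top-K into a global top-K. A minimal sketch of the coordinator's merge (the shard results here are made-up scores):

```python
import heapq

# Hypothetical per-shard responses: each shard returns its own top-K as
# (score, doc_id) pairs. Note every shard must return a full K hits, not
# K/N, because the global top-K could all live on one shard.
shard_results = [
    [(0.92, "doc-17"), (0.80, "doc-3")],   # shard 1
    [(0.95, "doc-42"), (0.61, "doc-8")],   # shard 2
    [(0.88, "doc-29")],                    # shard 3
]

def gather(shard_results, k):
    """Coordinator step: merge the shards' local top-Ks into a global top-K."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits)     # compares tuples, so score first

print(gather(shard_results, 2))  # → [(0.95, 'doc-42'), (0.92, 'doc-17')]
```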
Major Search Engines
- Elasticsearch and OpenSearch: the dominant open-source engines, both built on Lucene.
- Algolia and Meilisearch: simpler hosted options when you don't want to run a cluster.
- Postgres full-text search: fine at modest scale, with no extra system to operate.
Scoring and Relevance
Elasticsearch's default scoring uses BM25 (an evolution of TF-IDF):
- Term frequency: how often the search term appears in the doc.
- Inverse document frequency: rare terms count more than common ones.
- Field length normalization: a match in a short field counts more than the same match in a long one.
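The three ingredients combine into one score per (term, document) pair. A minimal sketch using a Lucene-style IDF variant; the defaults k1=1.2 and b=0.75 are common conventions, not taken from this text:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len,
               k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    doc_freqs[t]: how many documents in the corpus contain term t.
    k1 caps term-frequency gains; b controls length normalization.
    """
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)        # term frequency in this doc
        df = doc_freqs.get(term, 0)       # document frequency in corpus
        if tf == 0 or df == 0:
            continue
        # Rare terms get a larger inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturating tf, normalized by doc length relative to the average.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score
```

Note how the pieces map to the bullets: `tf` saturates (doubling occurrences far less than doubles the score), `idf` rewards rare terms, and the `len(doc_terms) / avg_len` ratio penalizes long fields.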
For most uses, BM25 defaults are good. For more sophisticated needs:
- Custom boosts: weight matches in the title field higher than matches in the body.
- Function scoring: fold signals like recency or popularity into the relevance score.
- Learning to rank (LTR): train a model on click data to re-rank the top results.
The Modern Vector Search Wave
LLMs and embedding models have created a new search paradigm: semantic search based on vector similarity rather than keyword overlap.
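At its core, semantic search ranks documents by how close their embedding vectors sit to the query's embedding. A brute-force sketch (toy 2-dimensional vectors standing in for real model output; production systems use approximate nearest-neighbour indexes such as HNSW):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def knn(query_vec, corpus, k):
    """Brute-force k-nearest-neighbour search over {doc_id: vector}."""
    ranked = sorted(corpus,
                    key=lambda d: cosine_similarity(query_vec, corpus[d]),
                    reverse=True)
    return ranked[:k]

# Made-up embeddings: doc-a and doc-b point roughly the same way, doc-c doesn't.
corpus = {"doc-a": [1.0, 0.0], "doc-b": [0.9, 0.1], "doc-c": [0.0, 1.0]}
print(knn([1.0, 0.05], corpus, 2))  # → ['doc-a', 'doc-b']
```

The brute-force scan is O(corpus size) per query, which is exactly why vector engines trade a little recall for ANN index structures.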
Operational Considerations
Running a search engine well is its own discipline:
- Sharding strategy: shard count is hard to change later; resharding usually means reindexing.
- Reindexing: mapping or analyzer changes typically require rebuilding the index.
- Storage tiers: hot data on fast disks, older data on cheaper storage.
- Tenant isolation: one tenant's query load or index size shouldn't degrade everyone else.
Indexing Patterns for SaaS
Where does the data come from? Two patterns:
- Synchronous (dual write): the app writes to the database and the search index in the same request. Simple, but the two stores drift when one write fails.
- Asynchronous: changes flow to the index via change data capture (CDC) or a transactional outbox, at the cost of a small indexing lag.
For most SaaS, async indexing is the right answer. Search results that are 5 seconds stale are fine for almost every use case.
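A minimal sketch of the transactional-outbox flavour of async indexing, using SQLite for brevity (the table names and the `index_document` callback are illustrative, not a real engine's API):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def update_product(product_id, name):
    """Write the row AND the outbox event in one transaction,
    so the search index can never silently miss an update."""
    with db:  # commits both inserts atomically
        db.execute("INSERT OR REPLACE INTO products VALUES (?, ?)",
                   (product_id, name))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"op": "index", "id": product_id,
                                "name": name}),))

def relay_outbox(index_document):
    """Background relay: drain unpublished events into the search engine."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, payload in rows:
        index_document(json.loads(payload))  # e.g. bulk-index into OpenSearch
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    db.commit()
```

The relay runs on a timer or tails the table, which is where the few seconds of staleness come from; CDC achieves the same decoupling by reading the database's replication log instead of an outbox table.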
Recap
- Search engines solve what RDBMS struggle with: full-text, fuzzy matching, faceting, multi-language.
- The inverted index is the core data structure: it maps terms to documents.
- Distributed search uses scatter-gather across shards. Latency = slowest shard.
- Elasticsearch/OpenSearch dominate. Algolia and Meilisearch are simpler hosted options. Postgres FTS covers modest scales.
- Scoring: BM25 by default; custom boosts, function scoring, and LTR for sophistication; vector search for semantic queries.
- Operationally: sharding strategy, reindexing pain, storage tiers, tenant isolation.
- For SaaS: async indexing via CDC or outbox is usually the right pattern.