E-commerce Search Infrastructure
Rebuilt search infrastructure handling 10M+ queries/day, improving search click-through rate by 45% and reducing p99 latency from 800ms to under 100ms
Overview
Led the redesign of search infrastructure for an e-commerce platform, replacing a basic database-backed search with a modern, ML-enhanced search system.
Problem
The existing search was a simple LIKE query against PostgreSQL. It couldn't handle typos or synonyms, and it had no relevance ranking. Search conversion rates were poor, and the system couldn't scale beyond 1,000 QPS without significant latency degradation.
Constraints
- Must index 2M+ products with real-time updates
- p99 latency must be under 200ms
- Budget constraints ruled out managed search services
- Team had no prior Elasticsearch experience
Approach
Implemented Elasticsearch as the search backend with a custom relevance tuning pipeline. Built a real-time indexing system using CDC (Change Data Capture) to keep the search index synchronized with the product database. Added a query understanding layer for typo correction and synonym expansion.
Key Decisions
Use Elasticsearch over Algolia
Algolia's pricing at our scale was prohibitive ($50k+/year). Elasticsearch gave us more control over relevance tuning and the ability to run complex aggregations for faceted search.
Alternatives considered:
- Algolia managed search
- Apache Solr
- Meilisearch
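The faceted search mentioned in this decision maps onto Elasticsearch terms aggregations. The sketch below builds a request body with one aggregation per facet and applies already-selected facet values as filters. The function name and the field names (title, brand, color) are illustrative assumptions, not the project's actual schema, and the facet fields are assumed to be keyword-typed.

```python
def build_faceted_query(text, facet_fields, selected=None, facet_size=10):
    """Build a search body with one terms aggregation per facet field."""
    selected = selected or {}
    # Selected facet values become filters, which narrow results
    # without affecting the relevance score.
    filters = [{"term": {field: value}} for field, value in selected.items()]
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": filters,
            }
        },
        # One bucket list per facet, e.g. the top brands among the hits.
        "aggs": {
            f"by_{field}": {"terms": {"field": field, "size": facet_size}}
            for field in facet_fields
        },
    }


body = build_faceted_query("running shoes", ["brand", "color"], {"brand": "acme"})
```

The resulting dict can be sent as the body of a search request; the per-facet counts come back under the aggregations key of the response.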
Implement CDC-based indexing instead of dual writes
Dual writes are error-prone: if the database write succeeds and the index write fails (or the reverse), the two stores drift apart. CDC from the PostgreSQL write-ahead log (WAL) keeps the search index eventually consistent with the source of truth without application code changes.
Alternatives considered:
- Application-level dual writes
- Periodic batch reindexing
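To illustrate the CDC path, here is a minimal sketch of how a consumer might translate one Debezium change event into Elasticsearch bulk API lines. It assumes the event has already been unwrapped to its payload (with op, before, and after fields), that the table's primary key column is named id, and that the index is called products. All three are assumptions for illustration, and the Kafka consumer loop is omitted.

```python
import json

INDEX_NAME = "products"  # hypothetical index name


def to_bulk_lines(event, index=INDEX_NAME):
    """Translate one Debezium change event payload into bulk API lines."""
    op = event.get("op")
    if op in ("c", "u", "r"):  # create, update, snapshot read
        row = event["after"]
        action = {"index": {"_index": index, "_id": str(row["id"])}}
        # An "index" action overwrites the document, so replaying the
        # same event after a consumer restart is harmless.
        return [json.dumps(action), json.dumps(row)]
    if op == "d":  # delete: the row state is in "before"
        row = event["before"]
        action = {"delete": {"_index": index, "_id": str(row["id"])}}
        return [json.dumps(action)]
    return []  # ignore other event types in this sketch


lines = to_bulk_lines({"op": "u", "before": None, "after": {"id": 7, "title": "Mug"}})
```

Keying each document by the row's primary key is what makes the pipeline tolerant of at-least-once delivery: a duplicate event rewrites the same document instead of creating a second one.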
Build custom query understanding layer
Off-the-shelf solutions didn't handle our domain-specific vocabulary well. A custom layer let us incorporate product taxonomy and user behavior signals.
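As a rough sketch of what such a layer does, the example below corrects typos against a domain vocabulary and then expands each token with synonyms. The vocabulary, the synonym table, the similarity cutoff, and the use of Python's difflib for fuzzy matching are all assumptions made for illustration; the source does not describe the actual implementation, which also drew on taxonomy and behavior signals not modeled here.

```python
import difflib

# Hypothetical domain vocabulary and synonym table; a real system would
# derive these from the product taxonomy and search logs.
VOCABULARY = ["sneakers", "sofa", "laptop", "headphones"]
SYNONYMS = {"sofa": ["couch"], "sneakers": ["trainers"]}


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary term, or the token unchanged."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def understand_query(raw_query):
    """Lowercase, typo-correct, and synonym-expand a raw query string.

    Returns one list of alternatives per token, original term first.
    """
    tokens = [correct_token(t) for t in raw_query.lower().split()]
    return [[token] + SYNONYMS.get(token, []) for token in tokens]


expanded = understand_query("Sneekers")
```

Each inner list can then be turned into a group of OR-ed terms in the search query, so a search for one term also matches products described with its synonyms.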
Tech Stack
- Elasticsearch
- Python
- Debezium
- Kafka
- PostgreSQL
- Redis
- Kubernetes
Result & Impact
- Search Conversion: 45% improvement in click-through rate
- p99 Latency: under 100ms (down from 800ms)
- Query Capacity: 10,000+ QPS (up from 1,000)
- Zero-Result Searches: reduced by 60%
Search went from being a pain point to a competitive advantage. The merchandising team can now tune relevance without engineering involvement. The faceted search and autocomplete features have significantly improved the shopping experience.
Learnings
- Relevance tuning is an ongoing process, not a one-time setup—build tools for non-engineers to iterate
- CDC is powerful but adds operational complexity—invest in monitoring and alerting
- Search is a product, not just a feature—dedicate resources to continuous improvement
- Elasticsearch cluster management requires dedicated expertise
Relevance Tuning Journey
The initial Elasticsearch deployment actually performed worse than the PostgreSQL search for some queries. We spent significant time tuning BM25 parameters, field boosting, and function scores.
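The three tuning knobs named above each correspond to a specific part of an Elasticsearch request. The sketch below shows where they live: a custom BM25 similarity in the index settings, a per-field boost in a multi_match query, and a function_score wrapper. The field names (title, description, popularity), the similarity name, and the numeric values are placeholders rather than the values the project settled on; k1=1.2 and b=0.75 are simply Elasticsearch's defaults.

```python
def build_index_settings(k1=1.2, b=0.75):
    """Index settings that attach a custom-tuned BM25 similarity to title."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    "tuned_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "similarity": "tuned_bm25"},
                "description": {"type": "text"},
                "popularity": {"type": "float"},
            }
        },
    }


def build_relevance_query(text, title_boost=3.0):
    """Text match with a boosted title field plus a popularity signal."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": text,
                        # "field^N" multiplies that field's score by N.
                        "fields": [f"title^{title_boost}", "description"],
                    }
                },
                "functions": [
                    {
                        "field_value_factor": {
                            "field": "popularity",
                            "modifier": "log1p",
                            "missing": 0,
                        }
                    }
                ],
                # Add the signal to the text score so a missing or zero
                # popularity does not wipe out the text match.
                "boost_mode": "sum",
            }
        }
    }


query = build_relevance_query("ceramic mug")
```

Keeping these values as function parameters makes it easier to compare candidate settings against a fixed set of test queries before changing production.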
The breakthrough came when we incorporated click-through data into relevance scoring. Products that users actually clicked on after searching got boosted, creating a feedback loop that continuously improved results.
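One way to turn click logs into a ranking signal is a smoothed click-through rate per query and product pair. The sketch below is an assumption about how such a signal could be computed, not the project's actual scoring: the event tuple shape, the prior of 5%, and the prior weight of 20 impressions are all made up for illustration.

```python
from collections import Counter


def smoothed_ctr(clicks, impressions, prior_ctr=0.05, prior_weight=20):
    """Click-through rate pulled toward a prior.

    Smoothing keeps a product with one click on one impression from
    outranking a product with hundreds of clicks.
    """
    return (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)


def click_boosts(events):
    """Aggregate (query, product_id, clicked) tuples into CTR signals."""
    impressions, clicks = Counter(), Counter()
    for query, product_id, clicked in events:
        key = (query, product_id)
        impressions[key] += 1
        if clicked:
            clicks[key] += 1
    return {key: smoothed_ctr(clicks[key], n) for key, n in impressions.items()}


events = [("mug", "p1", True), ("mug", "p1", False), ("mug", "p2", False)]
boosts = click_boosts(events)
```

The resulting values could be written back to the index as a numeric field and consumed by a scoring function. Because users click top-ranked results more often regardless of quality, a feedback loop like this generally needs some guard against reinforcing its own ranking.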
Operational Lessons
Running Elasticsearch at scale taught us a lot about JVM tuning, shard management, and cluster topology. We had several incidents early on due to GC pauses and unbalanced shards. Building comprehensive monitoring and runbooks was essential.
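As an example of the kind of check such monitoring might include, the sketch below flags unbalanced shard allocation. It takes rows shaped like the JSON output of the _cat/shards API (each row carrying a node name) and reports how far the busiest node is above the average. The function name and the 1.3 alert threshold are assumptions, and counting shards ignores differences in shard size, which a real check would likely also consider.

```python
from collections import Counter


def shard_imbalance(shards):
    """Return per-node shard counts and the max-to-mean ratio.

    shards: list of dicts with a "node" key. Unassigned shards
    (node missing or None) are skipped.
    """
    counts = Counter(s["node"] for s in shards if s.get("node"))
    if not counts:
        return {}, 0.0
    mean = sum(counts.values()) / len(counts)
    return dict(counts), max(counts.values()) / mean


rows = [{"node": "es-1"}, {"node": "es-1"}, {"node": "es-1"}, {"node": "es-2"}]
counts, ratio = shard_imbalance(rows)
needs_attention = ratio > 1.3  # hypothetical alert threshold
```

A ratio of 1.0 means every node holds the same number of shards; higher values mean at least one node carries a disproportionate share.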