Building a Search System
Requirements
- Assume a social platform has 1.8 billion total users, with 900 million engaging daily.
- On average, the platform processes 450 million posts each day.
- Each post has an average size of 350 bytes.
- Daily search volume reaches approximately 600 million queries.
Our objective is to design a system capable of efficiently storing and querying user posts.
Capacity Estimation and Constraints
Storage Capacity: With 450 million new posts daily, and each post averaging 350 bytes, the daily storage requirement is:
450M posts * 350 bytes = 157.5GB/day
Storage per second:
157.5GB / 86400 seconds ≈ 1.8MB/second
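These figures are easy to sanity-check. Here is a short back-of-envelope script that reproduces the arithmetic using only the numbers from the requirements above:

```python
# Back-of-envelope check of the daily storage numbers above.
posts_per_day = 450_000_000
bytes_per_post = 350
seconds_per_day = 86_400

daily_bytes = posts_per_day * bytes_per_post
print(f"{daily_bytes / 1e9:.1f} GB/day")                    # 157.5 GB/day
print(f"{daily_bytes / seconds_per_day / 1e6:.1f} MB/sec")  # ~1.8 MB/second
```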
API
A potential API definition for search might look like this:
search(api_dev_key, search_terms, max_results, page_cursor)
Parameters:
- api_dev_key (string): API developer key for authentication and quota management.
- search_terms (string): The query string containing search keywords.
- max_results (number): Maximum number of posts to return.
- page_cursor (string): Cursor token to retrieve specific pages of the result set.
Returns: (JSON)
A JSON object with details about posts matching the query. Each entry can include user ID, username, post content, post ID, timestamp, like count, and more.
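As an illustration only, a handler for this API might be shaped roughly as follows. The parameter names mirror the definition above; the result fields come from the description, while the handler body and names like PostResult are placeholder assumptions.

```python
# A hypothetical shape for the search endpoint; the pipeline steps are stubs.
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PostResult:
    post_id: str
    user_id: str
    username: str
    content: str
    timestamp: int
    like_count: int

def search(api_dev_key: str, search_terms: str,
           max_results: int = 20, page_cursor: Optional[str] = None) -> str:
    """Return a JSON object describing posts that match the query."""
    # 1. Authenticate api_dev_key and enforce the caller's quota.
    # 2. Resolve search_terms against the index (see Indexing below).
    # 3. Apply max_results and page_cursor for pagination.
    results: list[PostResult] = []  # filled in by the query pipeline
    return json.dumps({"results": [asdict(r) for r in results],
                       "next_cursor": None})
```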
Architecture
At a high level, the system will store all posts in a database and maintain an index, which will be used for search queries.
- Data Ingestion and Preprocessing
- Collect and preprocess text posts (e.g., filtering, tokenization).
- Store metadata (e.g., post ID, timestamp, user ID).
- Embedding Model
- Train a model to generate embeddings (dense vectors) that represent the semantic meaning of text posts.
- Indexing System
- Use Approximate Nearest Neighbor (ANN) techniques to build scalable indexes for embeddings.
- Aggregator Service
- A query pipeline that transforms user search inputs into embeddings, performs nearest neighbor lookup, ranks, and returns results.
- Storage
- A distributed database for storing raw text posts, metadata, and embeddings.
- Load Balancer and Cache
- Handle high traffic and reduce latency by caching frequent queries and balancing load across servers.
Detailed Component Design
Storage: Storing 157.5GB of new data daily requires an efficient partitioning scheme to distribute data across multiple servers. Planning for five years:
157.5GB/day * 365 days/year * 5 years ≈ 287.4TB
To keep storage utilization under 80%, we would need roughly 360TB. Adding redundancy for fault tolerance (e.g., a second copy), total storage grows to 720TB. Assuming modern servers hold up to 5TB each, we would require 144 servers to meet this capacity.
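A quick sanity check of the five-year plan, under the same assumptions:

```python
# Five-year storage projection, mirroring the arithmetic above.
daily_gb = 157.5
raw_tb = daily_gb * 365 * 5 / 1000   # ~287.4 TB of raw data
provisioned_tb = raw_tb / 0.8        # ~359 TB to stay under 80% utilization
total_tb = provisioned_tb * 2        # ~719 TB with a second copy
print(round(total_tb / 5))           # ~144 servers at 5 TB each
```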
Partitioning could be based on PostID: using a hash function, the system could map each PostID to a specific storage server, as in the sketch below.
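A minimal sketch of this mapping, assuming the 144-server figure from above and MD5 as a stable hash (any uniform hash function works):

```python
# Hash-based partitioning: map each PostID to a storage server.
import hashlib

NUM_SERVERS = 144  # from the capacity estimate above

def server_for_post(post_id: int) -> int:
    """Deterministically assign a PostID to a storage server."""
    digest = hashlib.md5(str(post_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SERVERS

print(server_for_post(9_876_543_210))  # same PostID, same server, every time
```

Note that plain modulo hashing remaps most keys whenever the server count changes; consistent hashing limits that reshuffling if the cluster is expected to grow.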
Indexing:
The system allows users to search text posts with high accuracy and speed. A machine learning approach is used for creating embeddings of the text, which represent the semantics of the content. The embeddings are indexed using Approximate Nearest Neighbors (ANN) to allow for efficient similarity searches.
Workflow
- Data Ingestion
- New text posts are ingested via an API.
- Text is preprocessed (e.g., normalization, stop-word removal, tokenization).
- Pass the processed text through the embedding model to generate a vector representation.
- Store embeddings and metadata in a database.
- Training the Embedding Model
- Use a pre-trained transformer model fine-tuned on a dataset of text posts.
- Training involves minimizing a loss function like contrastive loss or triplet loss to ensure semantically similar texts have closer embeddings.
- Building the Index
- Use an ANN library (e.g., FAISS, Annoy, ScaNN) to build an index for the embeddings; a concrete sketch follows this workflow.
- Indexes are sharded to distribute across nodes and support scaling.
- Periodically rebuild the index to accommodate new posts.
- Query Handling
- Convert user queries into embeddings using the same embedding model.
- Perform a nearest neighbor search in the ANN index to find similar embeddings.
- Retrieve corresponding text posts and metadata from the database.
- Rank results.
- Scaling and Performance Optimization
- Sharding: Distribute data and indexes across multiple nodes based on hash-based sharding.
- Replication: Use replicas for high availability and fault tolerance.
- Caching: Cache frequent query results using Redis or Memcached.
- Batching: Batch multiple queries to reduce index lookup overhead.
- Horizontal Scaling: Add more nodes to handle increased traffic.
- Serving Read Requests
- A distributed search service orchestrates ANN lookups across shards.
- Results are merged and ranked before returning to the user.
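To make the workflow concrete, here is a minimal end-to-end sketch assuming FAISS for the ANN index. Random vectors stand in for a real embedding model, and names such as embed, DIM, and NLIST are illustrative assumptions rather than part of the design.

```python
# Build an ANN index over post embeddings and answer a query, using FAISS.
import numpy as np
import faiss

DIM = 128    # embedding dimensionality (assumption)
NLIST = 64   # number of IVF clusters (assumption)

def embed(texts):
    """Stand-in for a real embedding model (e.g., a fine-tuned transformer)."""
    seed = abs(hash(tuple(texts))) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(texts), DIM)).astype(np.float32)

# Indexing: embed all posts and add them to an inverted-file (IVF) index.
posts = [f"post number {i}" for i in range(10_000)]
vectors = embed(posts)
quantizer = faiss.IndexFlatL2(DIM)                 # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, DIM, NLIST)  # IVF ANN index
index.train(vectors)                               # learn cluster centroids
index.add(vectors)                                 # insert post embeddings
index.nprobe = 8                                   # clusters probed per query

# Query handling: embed the query and find the nearest posts.
query_vec = embed(["example search query"])
distances, ids = index.search(query_vec, 10)       # top-10 nearest posts
for post_id, dist in zip(ids[0], distances[0]):
    print(int(post_id), round(float(dist), 3), posts[int(post_id)])
```

In production, each shard would hold one slice of the index; the aggregator service would fan a query out to all shards and merge the per-shard top-k lists before ranking.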
For 600 million daily queries served from an ANN index, we must consider index size and distribution:
- Index Entries: Each post corresponds to one index entry. Over five years:
- 450M posts/day * 365 days/year * 5 years ≈ 821 billion entries
- Index Size: Assuming each entry in the index requires 200 bytes:
- 821 billion entries * 200 bytes/entry ≈ 164TB for the index
- Index Partitioning: Distributing the index via hash-based sharding across servers with 256GB of memory each (taking 1TB = 1024GB):
- 164TB / 256GB ≈ 656 servers
- Query Load: With 600 million daily queries distributed across servers:
- 600M queries/day / 656 servers ≈ 915k queries/server/day
- 915k queries/day / 86,400 seconds ≈ 10–11 queries/second per server
- Replication: If each index shard is replicated for fault tolerance, each server holds its own shard plus a replica of another, doubling memory needs to 512GB per server.
Hash-based partitioning ensures uniform distribution of both storage and query load across servers.
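The same estimates, reproduced as a script (using 1TB = 1024GB, matching the 656-server figure above):

```python
# Index sizing and per-server query load, mirroring the estimates above.
entries = 450e6 * 365 * 5           # ~821 billion index entries
index_gb = entries * 200 / 1e9      # ~164,250 GB, i.e. ~164 TB
servers = round(164 * 1024 / 256)   # 656 servers at 256 GB each
qps = 600e6 / servers / 86_400      # ~10.6 queries/second/server
print(f"{entries:.3g} entries, {index_gb:,.0f} GB, {servers} servers, {qps:.1f} qps")
```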
Fault Tolerance
What happens when an index server fails? A secondary replica can take over when the primary fails; both primary and secondary servers store identical copies of the index.
What if both primary and secondary servers fail simultaneously? In this case, a new server must be allocated, and the index rebuilt. To rebuild efficiently, a reverse index mapping PostIDs to servers can be maintained. An Index-Builder server holds this mapping in a Hashtable, where the ‘key’ is the server number and the ‘value’ is a HashSet of PostIDs stored on that server. This allows quick retrieval of all PostIDs needed to reconstruct the index. Having a replica of the Index-Builder server ensures additional fault tolerance.
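A minimal sketch of the Index-Builder's mapping, assuming integer server numbers and PostIDs:

```python
# Reverse mapping from server number to the set of PostIDs it holds,
# used to rebuild an index after both primary and secondary fail.
from collections import defaultdict

class IndexBuilder:
    def __init__(self) -> None:
        # 'key' = server number, 'value' = set of PostIDs stored there
        self.server_to_posts: dict[int, set[int]] = defaultdict(set)

    def record(self, server: int, post_id: int) -> None:
        """Remember that this PostID's index entry lives on this server."""
        self.server_to_posts[server].add(post_id)

    def posts_to_rebuild(self, failed_server: int) -> set[int]:
        """All PostIDs needed to reconstruct the failed server's index."""
        return self.server_to_posts.get(failed_server, set())

builder = IndexBuilder()
builder.record(42, 1001)
builder.record(42, 1002)
print(builder.posts_to_rebuild(42))  # {1001, 1002}
```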
Cache
To handle frequently accessed posts, a caching layer can be introduced. Using a system like Memcached, hot post objects can be stored in memory. Application servers query the cache before accessing the database. The number of cache servers can be adjusted based on usage patterns. For cache eviction, the Least Recently Used (LRU) policy is suitable for this system; a sketch follows.
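For illustration, the eviction policy itself is simple to sketch. A production deployment would rely on Memcached or Redis rather than this in-process version:

```python
# A minimal LRU cache keyed by PostID, showing the eviction policy.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: OrderedDict[int, str] = OrderedDict()

    def get(self, post_id: int):
        if post_id not in self.items:
            return None
        self.items.move_to_end(post_id)      # mark as most recently used
        return self.items[post_id]

    def put(self, post_id: int, post: str) -> None:
        if post_id in self.items:
            self.items.move_to_end(post_id)
        self.items[post_id] = post
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

cache = LRUCache(capacity=2)
cache.put(1, "first post")
cache.put(2, "second post")
cache.get(1)                 # touch 1 so it is most recently used
cache.put(3, "third post")   # evicts 2, the least recently used
print(cache.get(2))          # None
```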
Load Balancing
Load balancing can be applied at two levels:
- Between Clients and Application Servers: A load balancer distributes incoming requests evenly across application servers. A Round Robin strategy can be used initially, removing failed servers from the rotation automatically.
- Between Application Servers and Backend Servers: A more advanced load balancer can monitor server loads and direct traffic accordingly. This ensures no single server becomes a bottleneck.
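A minimal sketch of the first level, assuming hypothetical server names and that a separate health check calls mark_failed:

```python
# Round-robin load balancing with failed servers removed from rotation.
import itertools

class RoundRobinBalancer:
    def __init__(self, servers: list[str]) -> None:
        self.healthy = list(servers)
        self._cycle = itertools.cycle(self.healthy)

    def mark_failed(self, server: str) -> None:
        """Drop a failed server and restart the rotation without it."""
        self.healthy = [s for s in self.healthy if s != server]
        self._cycle = itertools.cycle(self.healthy)

    def next_server(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.next_server() for _ in range(4)])  # app-1, app-2, app-3, app-1
lb.mark_failed("app-2")
print([lb.next_server() for _ in range(3)])  # only app-1 and app-3 remain
```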