Caching
Load balancing allows you to scale horizontally by distributing traffic across an increasing number of servers. However, caching enables more efficient use of your existing resources. Caches work by leveraging the fact that data that has been recently accessed is likely to be requested again. This approach is applied across nearly every layer of computing, including hardware, operating systems, web browsers, and web applications.
A cache functions like short-term memory. It has limited capacity but is faster than the primary data source and holds the most recently accessed data. While caches can be implemented at any architectural level, they are commonly placed close to the front end. This placement allows them to provide rapid responses to requests while reducing the load on downstream systems.
Application Server Cache
Introducing a cache directly into a request-layer node allows response data to be stored and retrieved locally. When a service receives a request, the node can return cached data immediately if it is available; otherwise it fetches the data from the underlying store. The cache itself can live in the node’s memory, which offers extremely fast access, or on the node’s local disk, which is still quicker than reaching out to network storage.
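As a rough sketch of this two-tier lookup, the class below keeps an in-memory dict as the fast tier and a local directory as the disk tier. The `read_from_origin` callable is a hypothetical stand-in for whatever backs the cache (a database, network storage, etc.), and keys are assumed to be filename-safe for simplicity:

```python
import os
import pickle

class NodeLocalCache:
    """Two-tier cache for a single request-layer node: in-memory dict first,
    then the node's local disk, before falling back to the origin store."""

    def __init__(self, disk_dir, read_from_origin):
        self.memory = {}                           # fastest tier
        self.disk_dir = disk_dir                   # still faster than network storage
        self.read_from_origin = read_from_origin   # e.g., a database or network call
        os.makedirs(disk_dir, exist_ok=True)

    def _disk_path(self, key):
        # Assumes keys are safe to use as filenames; a real cache would encode them.
        return os.path.join(self.disk_dir, f"{key}.cache")

    def get(self, key):
        # 1. In-memory tier
        if key in self.memory:
            return self.memory[key]
        # 2. Local-disk tier
        path = self._disk_path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.memory[key] = value               # promote to the memory tier
            return value
        # 3. Origin store (slowest)
        value = self.read_from_origin(key)
        self.memory[key] = value
        with open(path, "wb") as f:
            pickle.dump(value, f)
        return value
```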
But what happens when the system scales to multiple nodes? Even with numerous nodes, each one can maintain its own local cache. However, if a load balancer assigns requests randomly across these nodes, identical requests may hit different nodes, leading to increased cache misses. To address this challenge, two potential solutions are global caches and distributed caches.
Distributed Cache
A distributed cache divides its stored data among multiple nodes, with each node being responsible for a specific portion. A consistent hashing algorithm often manages this division, allowing a request node to quickly locate the relevant piece of data within the distributed cache. Here, each node holds a fraction of the cache, and it can request data from another node if needed before accessing the origin store. A key advantage of this approach is the ability to expand the cache size simply by adding more nodes to the system.
However, distributed caching has its drawbacks. Managing missing nodes can be challenging. Some implementations address this by storing redundant copies of data on different nodes, though this adds complexity, particularly when nodes are added or removed from the system. Still, even if a node fails and some cached data is lost, requests can fall back to the origin store, preventing major disruptions.
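The "each node owns a slice of the key space" idea is commonly built on a consistent hash ring. The sketch below is a minimal, replication-free version; the node names, the number of virtual nodes, and the MD5-based hash are illustrative choices, not a prescription:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps cache keys to nodes so that adding or removing a node only
    remaps a small fraction of the keys."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each node appears at several points ("virtual nodes") for even spread.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def node_for(self, key):
        """Walk clockwise from the key's position to the first node point."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        if idx == len(self.ring):
            idx = 0             # wrap around the ring
        return self.ring[idx][1]

# Example: a request node decides which cache node should hold "user:42".
ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))
```

Because only the keys between the removed node's points and its clockwise neighbours move, a node failure invalidates a fraction of the cache rather than all of it, and those requests simply fall back to the origin store as described above.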
Global Cache
A global cache operates as a single, shared caching space accessible to all nodes. Typically, this setup involves a centralized server or a high-speed storage solution that acts as a unified cache. Nodes interact with this cache just as they would with their local caches. While effective in specific use cases—such as systems with specialized hardware or fixed datasets—a global cache has its challenges. As the number of clients and requests grows, the centralized cache can easily become a bottleneck.
There are two primary configurations for global caches, illustrated in the accompanying diagram. In one scenario, the cache itself retrieves missing data from the origin store when it isn't already cached. In the other, the responsibility to fetch uncached data falls to the request nodes. Both methods can be effective depending on the architecture and performance requirements.
Most applications that utilize global caches typically adopt the first approach, where the cache handles eviction and data retrieval. This strategy minimizes the risk of overwhelming the system with simultaneous requests for the same data from multiple clients. However, there are scenarios where the second approach is more advantageous.
For instance, when dealing with very large files, a low cache hit rate means the cache is constantly thrashed by misses, with entries evicted before they can be reused. In such cases, it helps to ensure that a large portion of the working (or "hot") dataset stays in the cache. Another example arises in architectures where the cached files are static and should never be evicted, typically because the application requires low-latency access to that specific data. In these scenarios, the application itself often has a better understanding of eviction strategies and hotspots within the dataset, and can therefore manage the cache more effectively than a generic caching system could.
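The two configurations differ only in which component fills the cache on a miss. A schematic sketch, with a plain dict standing in for the shared cache server and a hypothetical `read_from_origin` callable for the origin store:

```python
# Configuration 1: the cache itself is responsible for filling misses
# (the common case; request nodes never talk to the origin store directly).
class ReadThroughGlobalCache:
    def __init__(self, read_from_origin):
        self.store = {}
        self.read_from_origin = read_from_origin

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.read_from_origin(key)  # cache fetches the data
        return self.store[key]


# Configuration 2: the cache is a shared store only; each request node
# fetches missing data from the origin and decides what to cache.
class PassiveGlobalCache:
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)     # None signals a miss

    def put(self, key, value):
        self.store[key] = value


def handle_request(key, cache, read_from_origin):
    value = cache.get(key)
    if value is None:
        value = read_from_origin(key)  # request node fetches the data itself...
        cache.put(key, value)          # ...and chooses whether/how to cache it
    return value
```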
Content Delivery Network (CDN)
A Content Delivery Network (CDN) is a specialized caching system designed to handle large volumes of static media. In a typical CDN configuration, when a user requests a piece of static content, the CDN first checks if it has the requested file stored locally. If available, it serves the content directly; otherwise, the CDN retrieves it from the back-end servers, caches it locally, and delivers it to the user.
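On a single edge node, that request flow might look roughly like the sketch below, which caches fetched files on local disk; `fetch_from_origin` is a hypothetical function that returns the requested file's bytes from the back-end servers:

```python
from pathlib import Path

def serve_static(path, cache_dir, fetch_from_origin):
    """Serve a static asset from the edge cache, filling it from the
    back-end servers on a miss. (A real server would also sanitize the
    path and honour TTL / cache-control headers.)"""
    cached = Path(cache_dir) / path.lstrip("/")
    if cached.exists():
        return cached.read_bytes()            # cache hit: serve locally
    content = fetch_from_origin(path)         # cache miss: go to the origin
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(content)               # store for subsequent requests
    return content
```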
For smaller systems not yet equipped with their own CDN, preparing for future scalability is essential. This can be achieved by serving static content from a dedicated subdomain (e.g., static.yourservice.com) using a lightweight HTTP server such as Nginx. This setup allows for a seamless transition to a CDN later by simply updating the DNS entry to point from your servers to the CDN.
Cache Invalidation
While caching improves performance, it requires mechanisms to maintain consistency between the cache and the source of truth (e.g., a database). If database data is modified but not invalidated in the cache, inconsistencies can lead to erroneous application behavior.
Addressing this issue is known as cache invalidation. Common strategies include:
Write-through Cache:
In this method, data is simultaneously written to both the cache and the database. This ensures consistency between the cache and permanent storage. It also prevents data loss during crashes or system failures. However, the need to write data twice increases latency for write operations.
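A minimal sketch of this write path, with plain dicts standing in for the cache and the database:

```python
class WriteThroughCache:
    def __init__(self, cache, database):
        self.cache = cache          # e.g., an in-memory dict or a cache client
        self.database = database    # stand-in for the permanent store

    def write(self, key, value):
        # Both writes complete before the client is acknowledged,
        # so the cache and the database never disagree.
        self.database[key] = value
        self.cache[key] = value

    def read(self, key):
        return self.cache.get(key, self.database.get(key))
```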
Write-around Cache:
Here, data is written directly to the database, bypassing the cache. This reduces the likelihood of the cache being overwhelmed with data that is unlikely to be accessed again. However, this approach can result in cache misses for recently written data, leading to slower reads from back-end storage.
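Sketched the same way, the only change is that writes skip the cache; this version also drops any stale cached copy on write so reads never see outdated data:

```python
class WriteAroundCache:
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def write(self, key, value):
        # Writes bypass the cache, so write-once data never pollutes it...
        self.database[key] = value
        self.cache.pop(key, None)   # drop any stale copy

    def read(self, key):
        # ...but a read shortly after a write is a guaranteed cache miss.
        if key in self.cache:
            return self.cache[key]
        value = self.database[key]
        self.cache[key] = value
        return value
```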
Write-back Cache:
Data is written only to the cache initially, and the client receives an immediate confirmation of the write operation. The cache writes the data to permanent storage later, either at intervals or under specific conditions. While this approach reduces latency and enhances throughput for write-intensive applications, it introduces the risk of data loss during crashes, as the most recent changes exist only in the cache.
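A sketch of the same idea using a dirty-key set; the size-based flush threshold is just one possible trigger, and a real implementation would also flush on a timer and guard against crashes, which this toy version does not:

```python
class WriteBackCache:
    def __init__(self, cache, database, flush_threshold=100):
        self.cache = cache
        self.database = database
        self.dirty = set()                    # keys not yet persisted
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # Acknowledge after the cache write alone; the database is updated later.
        self.cache[key] = value
        self.dirty.add(key)
        if len(self.dirty) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist dirty entries in a batch. Anything still dirty when the
        # node crashes is lost -- the write-back trade-off.
        for key in self.dirty:
            self.database[key] = self.cache[key]
        self.dirty.clear()

    def read(self, key):
        return self.cache.get(key, self.database.get(key))
```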
Cache Eviction Policies
To manage limited cache capacity, eviction policies determine which items are removed when new data needs to be stored. Common policies include:
- First In First Out (FIFO): Removes the oldest cached item first, regardless of access frequency.
- Last In First Out (LIFO): Evicts the most recently cached item first, without considering usage patterns.
- Least Recently Used (LRU): Discards items that have not been accessed for the longest time.
- Most Recently Used (MRU): Removes the most recently accessed items first, contrary to LRU.
- Least Frequently Used (LFU): Evicts items that have been accessed the least number of times.
- Random Replacement (RR): Randomly selects an item for removal when space is needed.
Each policy has advantages and is suited to different use cases depending on workload characteristics and system requirements.
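As a concrete example, LRU is easy to sketch in Python on top of collections.OrderedDict, which tracks insertion order and lets entries be moved to the end on access; the capacity of 2 in the usage example is purely illustrative:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes most recently used
cache.put("c", 3)     # evicts "b", the least recently used entry
```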