Achieving Highly-Available Redis with AWS ElastiCache and HAProxy
Introduction
Redis, the open source in-memory data store, has revolutionized application performance by providing lightning-fast read and write access to frequently used data. By keeping data in memory and using a simple key-value model, Redis offers sub-millisecond response times, enabling an entirely new class of real-time applications.
At SendBird, we leverage Redis extensively for caching, queuing, and storing ephemeral data like user presence status. The performance gains are tremendous – reading a value from Redis is up to 200x faster than querying data from our MySQL database. Redis has quickly become a mission-critical component of our infrastructure.
To operate Redis reliably at scale, we leverage Amazon ElastiCache, a fully-managed Redis service. ElastiCache makes it simple to deploy, scale, and secure Redis workloads while offloading the undifferentiated heavy lifting of managing infrastructure.
However, we quickly realized that ElastiCache's built-in high availability features did not quite meet our needs. In this post, I'll share how we used the HAProxy load balancer to overcome these limitations and achieve a highly-available, self-healing Redis architecture.
The Challenges of ElastiCache High Availability
ElastiCache provides a turnkey way to deploy a highly-available Redis environment. When you provision an ElastiCache cluster, it automatically configures a single primary node for writes and one or more read replicas. Redis's asynchronous primary/replica replication keeps the replicas in sync with the primary node.
To scale read capacity, ElastiCache allows you to add additional replicas to the cluster and distribute read requests across them. As the load on your cluster grows, you can add more replicas to increase throughput.
ElastiCache exposes an endpoint for each node in the cluster. There is a single "primary" endpoint that always points to the current primary node. Each replica has its own individual endpoint for read requests.
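To make this concrete, here is a minimal sketch of how an application could use the two kinds of endpoints directly with the redis-py client. The hostnames are placeholders in the same spirit as the examples later in this post, not real endpoints.

import redis

# Writes go to the primary endpoint, which ElastiCache keeps pointed
# at the current primary node, even after a failover.
primary = redis.Redis(host="primary.xyz.com", port=6379)
primary.set("user:42:presence", "online")

# Reads can go to any replica endpoint, but the application has to pick
# one itself; ElastiCache does not load balance across replicas.
replica = redis.Redis(host="replica1.xyz.com", port=6379)
print(replica.get("user:42:presence"))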
The challenge is that ElastiCache does not provide any built-in load balancing across the replica endpoints. If a replica node becomes unhealthy or gets replaced as part of a scaling event, its endpoint will start failing requests.
Here's an illustration of this architecture:
As you can see, the application is responsible for distributing load across the replica endpoints. A common approach is to implement a client-side load balancing algorithm like round-robin or least-connections to select a replica for each request.
Here's some sample Python code using the popular redis-py client that shows one way to implement round-robin load balancing:
import redis

# The individual read replica endpoints exposed by ElastiCache.
replicas = [
    {"host": "replica1.xyz.com", "port": 6379},
    {"host": "replica2.xyz.com", "port": 6379},
    {"host": "replica3.xyz.com", "port": 6379},
]

replica_index = 0

def get_replica():
    """Return the next replica in round-robin order."""
    global replica_index
    replica = replicas[replica_index]
    replica_index = (replica_index + 1) % len(replicas)
    return replica

while True:
    replica = get_replica()
    r = redis.Redis(host=replica["host"], port=replica["port"])
    try:
        value = r.get("mykey")
        print(value)
    except redis.ConnectionError:
        # The failure is only discovered here, after the request was sent.
        print(f"Failed to connect to {replica}")
This logic will rotate through the list of replicas, sending each request to the next replica in the list. If a connection fails, it will simply print an error and move on to the next replica.
While this approach works, it has several significant drawbacks:
- The application must implement load balancing logic, adding complexity and code that is not core to the business domain.
- If a replica starts failing, the application will not know about it until a request is sent to that replica. This means the failure is only detected after it has already impacted the application.
- When a failed replica recovers, the application will not know to start sending traffic to it again until it is selected in the round-robin rotation.
- Each client (e.g. web server, microservice) must implement this logic independently, leading to duplication of effort.
- The necessary retry logic further adds to the application complexity (a sketch of such a wrapper follows this list).
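To illustrate that last point, here is a minimal sketch of the retry wrapper each service ends up carrying on top of the round-robin selection above. It assumes the same replicas list and get_replica() helper from the earlier snippet, and it is illustrative rather than production code.

def get_with_retry(key, max_attempts=3):
    """Try up to max_attempts replicas before giving up."""
    last_error = None
    for _ in range(max_attempts):
        replica = get_replica()
        r = redis.Redis(
            host=replica["host"],
            port=replica["port"],
            socket_timeout=0.5,  # fail fast instead of hanging on a dead node
        )
        try:
            return r.get(key)
        except (redis.ConnectionError, redis.TimeoutError) as exc:
            last_error = exc  # remember the error, then try the next replica
    raise last_error

Every service needs some variation of this, along with its own answers to questions like how long to wait before trusting a replica again.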
At SendBird, we initially implemented client-side load balancing in our application code. However, as our Redis usage grew, it became increasingly painful to manage this logic across all of our services. We needed a centralized, application-agnostic way to handle Redis load balancing and health checking.
Exploring Alternative Architectures
We considered several alternative approaches before landing on our ultimate solution.
Smart Redis Clients
One option was to use or build a "smart" Redis client library with built-in load balancing and health checking functionality. The client would track the error rate of each replica and avoid sending requests to replicas with high error rates.
While this approach would eliminate the need to implement load balancing logic in each application, it still has several drawbacks. The client would need complex logic to track errors, implement circuit breaking and handle retries. This would be a significant undertaking, likely requiring a fork of the Redis client library.
Even with a smart client, the failure detection would be delayed until the first failed request. Each client would still need to track replica health independently, reducing the efficiency of health checks.
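For illustration only, a heavily stripped-down sketch of such a client might look like the following. It tracks consecutive errors per replica and skips any replica over a threshold; a real implementation would also need thread safety, time-based recovery (the "half-open" state of a circuit breaker), retries, and support for commands beyond GET.

import redis

class SmartReplicaClient:
    """Round-robin over replicas, skipping ones with too many recent errors."""

    def __init__(self, replicas, error_threshold=3):
        self.replicas = replicas
        self.errors = {r["host"]: 0 for r in replicas}
        self.error_threshold = error_threshold
        self.index = 0

    def _next_replica(self):
        # Walk the ring at most once, returning the first replica that is
        # still under the error threshold.
        for _ in range(len(self.replicas)):
            replica = self.replicas[self.index]
            self.index = (self.index + 1) % len(self.replicas)
            if self.errors[replica["host"]] < self.error_threshold:
                return replica
        raise redis.ConnectionError("all replicas marked unhealthy")

    def get(self, key):
        replica = self._next_replica()
        conn = redis.Redis(host=replica["host"], port=replica["port"],
                           socket_timeout=0.5)
        try:
            value = conn.get(key)
            self.errors[replica["host"]] = 0   # success resets the counter
            return value
        except (redis.ConnectionError, redis.TimeoutError):
            self.errors[replica["host"]] += 1  # failures count toward tripping
            raise

Even this toy version hints at how much bookkeeping is involved, and it still only learns about a bad replica by failing a real request.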
Sidecar Proxies
Another approach was to deploy sidecar proxies like Envoy alongside each application instance. The proxies would be configured to load balance across the Redis replicas and would implement health checking and circuit breaking.
This service mesh approach is very powerful and would give us a lot of flexibility to control routing and observe traffic. However, it would require deploying and managing a fleet of sidecar proxies, adding operational complexity. It would also require modifying our application deployment processes to include the sidecar.
While a service mesh like Envoy is a great solution for certain use cases, it felt heavy-handed for the limited Redis routing capabilities we needed.
Central Redis Reverse Proxy
After some research, we landed on a third approach: run a centralized Redis reverse proxy that would sit between the application and ElastiCache. The proxy would expose a single endpoint to the application. Internally, it would load balance requests across the Redis replicas and monitor their health using active checks.
With this approach, the application remains simple. It doesn't need to know about individual Redis replicas at all – it just sends all read requests to the proxy endpoint. The proxy becomes the single source of truth for Redis routing and health.
We evaluated several reverse proxy options, including Twemproxy, Envoy, and HAProxy. After extensive testing, we decided on HAProxy due to its flexible load balancing and health check capabilities, including its ability to send Redis commands as part of its health checks.
Our Ultimate Solution: HAProxy
HAProxy is a high-performance, open source load balancer. It's used extensively by many of the largest websites in the world to distribute traffic across backend servers.
For load balancing Redis, HAProxy has several key advantages:
- Redis-aware health checks: while HAProxy proxies Redis traffic as plain TCP, its tcp-check mechanism can send Redis commands and verify the responses, enabling far more meaningful health checks than a blind TCP connect.
- Flexible health checking: HAProxy supports both passive health checks (based on observed errors) and active health checks that probe the Redis servers on a regular interval (a server line combining both is sketched after this list).
- Powerful load balancing: HAProxy supports a variety of load balancing algorithms including round-robin, least connections, and consistent hashing.
- Observability: HAProxy provides detailed stats on the health and performance of each backend server, enabling us to closely monitor our Redis infrastructure.
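As a small example of the health checking point above, this is a sketch (not taken from our production configuration) of a server line that combines an active check with passive observation of live traffic: observe layer4 watches real client connections, and on-error mark-down takes the server out of rotation once error-limit consecutive errors are seen.

server replica1 replica1.xyz.com:6379 check inter 1s observe layer4 error-limit 10 on-error mark-down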
Here's a simplified version of our HAProxy configuration for Redis load balancing:
global
    daemon
    maxconn 4096

defaults
    mode tcp
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

listen redis_read
    bind *:6379
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:slave
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server replica1 replica1.xyz.com:6379 check inter 1s
    server replica2 replica2.xyz.com:6379 check inter 1s
    server replica3 replica3.xyz.com:6379 check inter 1s
This configuration does the following:
- Listens on port 6379, the standard Redis port
- Uses round-robin load balancing to distribute requests across the replicas
- Configures active health checks using the Redis protocol:
  - Sends a PING command and expects a +PONG response
  - Sends an INFO replication command and verifies the node reports role:slave, i.e. it is a replica
  - Sends a QUIT command and expects a +OK response
- Performs the health check on each replica every second
- Specifies the hostname and port for each replica
With this configuration in place, we simply updated our application configuration to point to the HAProxy endpoint instead of the individual ElastiCache endpoints:
import redis

# All reads now go through the single HAProxy endpoint; HAProxy picks a
# healthy replica for each connection.
r = redis.Redis(host="haproxy.xyz.com", port=6379)
value = r.get("key")
Now the application is completely insulated from failures or scaling of the Redis replicas. If a replica fails, HAProxy's health checks, which run every second, will detect the failure within a few seconds and stop sending it traffic. When the replica recovers, it is automatically added back into the load balancing rotation.
We can inspect the state of each replica using the HAProxy stats page:
As you can see, HAProxy reports the health status, current connections, and various performance metrics for each replica. If a replica is down, it will be marked as DOWN and will not receive traffic.
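The stats page is not enabled by default. The following is a minimal sketch of the extra section we would add to the configuration above to expose it; the port and URI are arbitrary choices, and in practice access to this page should be restricted.

listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 10s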
The Results
Deploying HAProxy had an immediate positive impact on the stability and performance of our Redis infrastructure.
We track Redis errors as a key metric – before HAProxy, we would routinely see large error spikes whenever there was an issue with a Redis node. These errors would often lead to cascading failures as the sudden shift of load onto the remaining nodes caused them to become overloaded.
After deploying HAProxy, Redis errors became a rarity. HAProxy would quickly detect unhealthy nodes and remove them from rotation, giving the cluster time to recover without impacting application performance.
We also saw a significant improvement in Redis latency. By continuously monitoring the health of each replica and quickly removing slow or failing nodes from rotation, HAProxy keeps requests away from replicas that would otherwise respond slowly or not at all.
Quantitatively, we saw the error rate on Redis read requests drop by 95% and the mean latency decrease by 30% after introducing HAProxy.
From a qualitative standpoint, HAProxy greatly simplified our architecture and removed a significant operational burden. Instead of each application implementing custom Redis load balancing logic, the process is centralized in HAProxy. Our developers can focus on implementing the core product instead of reinventing infrastructure wheels.
After deploying HAProxy, we were able to delete thousands of lines of custom Redis client code from our applications. As one of our senior engineers put it: "HAProxy handles all the minutiae of Redis routing that I never want to think about again."
Conclusion
Operating a high-performance, highly-available Redis cluster at scale requires thoughtful design and robust infrastructure. While managed services like AWS ElastiCache greatly simplify Redis operations, they don't completely abstract away the challenges of distributed systems.
A key lesson we've learned is that load balancing and failure handling should be decoupled from the application whenever possible. Implementing infrastructure concerns like circuit breaking and health checks directly in application code is a quick path to unreliable, hard-to-maintain systems.
By deploying a smart reverse proxy like HAProxy in front of ElastiCache, we were able to centralize the complexities of Redis routing while keeping our application simple. HAProxy acts as an ever-vigilant gatekeeper for Redis traffic, maximizing availability and minimizing the impact of failures.
If you are using Redis at scale, I highly recommend evaluating whether a load balancing proxy would improve your reliability and simplify your architecture. The results might just surprise you – they certainly did for us.