A Deep Dive into Envoy Proxy Clusters
Envoy Proxy has emerged as one of the most popular open source service proxies for cloud native applications. Originating at Lyft, Envoy is now a graduated project in the Cloud Native Computing Foundation and is widely used in service mesh implementations like Istio, Consul Connect, and AWS App Mesh to handle communication between microservices.
At its core, Envoy is an L4 and L7 proxy that is typically deployed as a sidecar alongside services. It provides a range of powerful features for traffic management, observability, and security. One of the key concepts in Envoy is the cluster. In this post, we'll take an in-depth look at Envoy clusters and explore how they enable reliable and scalable service-to-service communication.
What is an Envoy Cluster?
In Envoy, a cluster is a group of logically similar upstream hosts that can be load balanced and health checked together. An upstream host is an IP/port combination that Envoy can forward requests to. Clusters allow Envoy to abstract away the individual instances of a service and treat them as a single entity for routing and load balancing decisions.
Envoy supports several types of clusters for different service discovery and resolution scenarios:
- Static Clusters: The upstream hosts are statically defined in the Envoy configuration file. This is suitable for fixed, known backends.
- Strict DNS Clusters: Envoy continuously resolves DNS and updates the cluster membership as the results change. Each returned address is treated as an independent host. This is useful for services with predictable DNS names.
- Logical DNS Clusters: The cluster has a single logical DNS name that may resolve to many IP addresses, but Envoy treats it as a single logical host and connects to an address from the most recent resolution when a new connection is needed, relying on DNS itself to rotate addresses. This suits large-scale services exposed behind a single name.
- EDS (Endpoint Discovery Service) Clusters: The cluster obtains its member hosts from a remote endpoint discovery service via gRPC, using the envoy.api.v2.ClusterLoadAssignment protocol buffer. This is the usual choice in service mesh scenarios where hosts are added and removed dynamically.
- Original Destination Clusters: The cluster uses the downstream connection's original destination address as the destination for the upstream connection. This is useful when Envoy receives traffic transparently (for example via an iptables REDIRECT or TPROXY rule) and should forward it to wherever it was originally headed.
Configuring Envoy Clusters
Clusters are statically defined in the Envoy configuration file or dynamically added via the xDS APIs. Here's an example of a basic static cluster configuration:
clusters:
- name: my_service
  connect_timeout: 0.25s
  type: STATIC
  load_assignment:
    cluster_name: my_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 1234
This defines a cluster named "my_service" with a single static endpoint at 127.0.0.1:1234. The connect_timeout field specifies the timeout for establishing new network connections to hosts in the cluster.
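For contrast, a strict DNS cluster looks almost identical but resolves its endpoints continuously. A minimal sketch, where my-service.internal is a placeholder hostname:

clusters:
- name: my_dns_service
  connect_timeout: 1s
  type: STRICT_DNS
  dns_lookup_family: V4_ONLY
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: my_dns_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: my-service.internal  # placeholder DNS name, re-resolved continuously
              port_value: 8080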
Envoy provides a wide range of configuration options for clusters to control various aspects such as connection pooling, health checking, circuit breaking, outlier detection, load balancing, TLS context, and more. The Envoy documentation provides a complete reference of all the available configuration fields.
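For instance, upstream TLS can be enabled on a cluster. A minimal sketch using the v2-era tls_context field (newer versions use a transport_socket instead); the SNI value is a placeholder and certificate validation settings are omitted for brevity:

tls_context:
  sni: my-service.example.com  # placeholder SNI; validation context omitted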
Load Balancing Clusters
Load balancing is one of the primary functions of a cluster. Envoy supports several load balancing algorithms and policies:
- Round Robin: Envoy cycles through the available hosts in the cluster in order.
- Least Request: Envoy picks the host with the fewest active requests (by default, the better of two randomly sampled hosts).
- Random: An available host is selected uniformly at random.
- Ring Hash: The request's hash key is used to look up the corresponding host on a consistent hash ring built from the cluster's hosts.
- Maglev: A consistent hashing algorithm that provides more even load distribution and faster lookups than ring hash.
The load balancing policy is set cluster-wide with the lb_policy field; for the hash-based policies, the hash key itself is derived from per-route hash_policy settings in the route configuration.
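For example, a cluster can be switched to least-request balancing like this; the choice_count override is optional and shown purely for illustration:

lb_policy: LEAST_REQUEST
least_request_lb_config:
  choice_count: 3  # sample three hosts per pick instead of the default two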
Envoy also supports zone-aware load balancing to distribute requests across failure domains. Each host can be assigned a locality consisting of a region, zone, and subzone. Envoy will attempt to select hosts from the same locality as the originating request to optimize latency.
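Localities are assigned per endpoint group in the load assignment. A sketch with hypothetical region and zone names (zone-aware routing additionally requires Envoy to know its own locality from the bootstrap node configuration):

load_assignment:
  cluster_name: my_service
  endpoints:
  - locality:
      region: us-east-1  # hypothetical region
      zone: us-east-1a   # hypothetical zone
    lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: 10.0.0.10
            port_value: 8080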
Panic Thresholds & Circuit Breaking
To protect upstream services from being overwhelmed, Envoy can be configured with panic thresholds and circuit breakers. The panic threshold sets a lower bound on the percentage of healthy hosts in a cluster. If the proportion of healthy hosts falls below this threshold, Envoy assumes the health checking data itself is unreliable (a systemic issue rather than individual host failures) and enters panic mode, disregarding health status and balancing across all hosts instead of concentrating all traffic on the few hosts still marked healthy.
Panic thresholds are configured as a percentage in the healthy_panic_threshold field:
common_lb_config:
  healthy_panic_threshold:
    value: 10.0
To disable panic mode, set the value to 0.
Envoy also supports more fine-grained circuit breakers to cap concurrent connections, requests, pending requests, and retries to an upstream cluster. These limits are configured in the circuit_breakers field of the cluster and help prevent upstream services from being flooded.
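A sketch of circuit breaker thresholds; the values are illustrative, not recommendations:

circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024       # concurrent upstream connections
    max_pending_requests: 1024  # requests queued waiting for a connection
    max_requests: 1024          # concurrent requests (HTTP/2)
    max_retries: 3              # concurrent retries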
Health Checking & Outlier Detection
Envoy actively health checks cluster members to determine their availability to serve requests. Several different health checking methods are supported:
- HTTP: Envoy sends an HTTP request to a configurable endpoint. If the response carries the expected status code, the host is considered healthy.
- TCP: Envoy attempts to open a TCP connection to the host (optionally exchanging configurable payloads). If the connection is established, the host is considered healthy.
- gRPC: Envoy sends a request following the standard gRPC health checking protocol (grpc.health.v1.Health) and expects a SERVING response.
Health checks are configured under the health_checks field in the cluster definition. Hosts that fail health checks are marked unhealthy and excluded from load balancing until they pass again.
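A sketch of an HTTP health check; the /healthz path is a placeholder:

health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3  # consecutive failures before marking a host unhealthy
  healthy_threshold: 2    # consecutive successes before reinstating it
  http_health_check:
    path: /healthz  # placeholder health endpoint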
In addition to active health checking, Envoy supports outlier detection, a form of passive health checking that determines whether some hosts in a cluster are performing statistically worse than the others. Outlier detection can eject hosts based on consecutive 5xx errors, consecutive gateway failures, and success rate analysis. This is configured under the outlier_detection field.
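A sketch of an outlier detection configuration, with illustrative thresholds:

outlier_detection:
  consecutive_5xx: 5        # eject after five consecutive 5xx responses
  interval: 10s             # analysis sweep interval
  base_ejection_time: 30s   # ejection duration, multiplied by the ejection count
  max_ejection_percent: 50  # never eject more than half of the cluster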
HTTP/2 & gRPC
Envoy has first-class support for HTTP/2 and gRPC for both incoming and outgoing connections. For clusters, the HTTP version and protocol options are configured via the http2_protocol_options and http_protocol_options fields:
clusters:
- name: grpc_service
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  http2_protocol_options: {}
  load_assignment:
    cluster_name: grpc_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: grpc_server
              port_value: 80
This configures an HTTP/2 cluster for a gRPC service. Envoy will use HTTP/2 for new connections to the cluster.
Envoy itself also uses gRPC when talking to management servers: dynamic resources such as the EDS clusters described earlier reference a gRPC-capable config source via the grpc_services field of an ApiConfigSource.
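A sketch of an EDS cluster wired to a management server over gRPC; xds_cluster is a hypothetical, separately defined cluster pointing at that server, and newer API versions may require additional fields such as transport_api_version:

- name: my_eds_service
  connect_timeout: 0.25s
  type: EDS
  eds_cluster_config:
    eds_config:
      api_config_source:
        api_type: GRPC
        grpc_services:
        - envoy_grpc:
            cluster_name: xds_cluster  # hypothetical management server cluster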
Monitoring Clusters
Envoy exposes comprehensive stats and metrics about cluster usage and performance. These can be emitted as statsd-style counters and gauges, exported as Prometheus metrics, or pulled from the admin interface's /stats endpoint; cluster stats are scoped under a cluster.<name>. prefix. Some key cluster-related stats include:
- upstream_cx_total: Total connections to the cluster
- upstream_rq_total: Total requests to the cluster
- upstream_rq_timeout: Requests that timed out to the cluster
- upstream_rq_retry: Requests that were retried
- upstream_rq_500: Requests that had a 500 response code
- upstream_rq_pending_overflow: Requests that overflowed connection or request limits
- upstream_cx_connect_fail: Connection failures to the cluster
- membership_healthy: Current healthy host count
- membership_degraded: Current degraded host count
- membership_total: Current total host count
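The /stats endpoint is served by Envoy's admin interface, which must be enabled in the bootstrap configuration. A minimal sketch; the port is arbitrary:

admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901  # arbitrary admin port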
These stats are essential for monitoring the health and performance of Envoy clusters and quickly identifying issues. They can be dashboarded and alerted on for observability.
Best Practices
Some best practices to follow when configuring Envoy clusters include:
- Ensure clusters are appropriately sized for the expected load. This may require load testing.
- Avoid excessively low connection/request circuit breaker limits that cause requests to be rejected prematurely.
- Tune outlier detection settings based on the typical response time and error rates for the service.
- Dashboard and alert on key cluster metrics to proactively identify failures and performance issues.
- Prefer DNS and service discovery-based clusters over statically defined hosts for flexibility.
- Use health checks to quickly remove unhealthy hosts and ensure high availability.
- Tune panic thresholds so that a systemic failure of health checking degrades gracefully rather than concentrating all traffic on a handful of hosts.
- Leverage zone-aware load balancing and outlier detection for clusters that span regions/zones.
By following Envoy cluster best practices and taking advantage of its advanced features, operators can build highly reliable, scalable, and observable service-to-service communication.
The Future
Envoy's cluster implementation continues to rapidly evolve to support new use cases and integrate new technologies. Some areas of active development include:
- Integration with UDPA (Universal Data Plane API) for a unified xDS transport layer
- Improved extensibility with WebAssembly for custom filters and protocols
- Aggregate clusters for weighted load balancing across multiple clusters
- Enhanced service discovery integrations (Consul, Eureka, etc.)
- Tighter integration with Istio and other service mesh implementations
- More granular traffic shadowing and mirroring controls
As Envoy and the service mesh ecosystem mature, expect clusters to gain even more sophisticated traffic shaping, observability, and resilience features to help operators keep microservices reliable and performant. Staying up-to-date with Envoy's latest releases and understanding cluster configuration best practices will be key to successfully managing complex service mesh architectures at scale.