Load Balancing: How It Works and Why It Matters

When your website grows beyond a single server, load balancing is what keeps it running reliably. It distributes incoming traffic across multiple servers, preventing any one from being overwhelmed, and keeping the service available even when individual servers fail.

What Is Load Balancing?

A load balancer is a device or service that sits between incoming client requests and your pool of backend servers. Its primary job is to distribute those requests across the available servers in a way that maximises throughput, minimises response time, and ensures no single server becomes a bottleneck. If one server in the pool fails or becomes unhealthy, the load balancer automatically stops sending traffic to it and routes requests to the remaining healthy servers, maintaining availability without any manual intervention.

Without load balancing, scaling a web application means pointing more traffic at a single server until it reaches its limit. With load balancing, you can add servers horizontally as demand grows, with the load balancer distributing work across all of them transparently. This is the foundation of horizontal scaling, the approach used by virtually every high-traffic website and API.

Load Balancing Algorithms

The algorithm a load balancer uses determines which server receives each incoming request. Different algorithms suit different workloads, server configurations, and application architectures. Most load balancers support several algorithms and allow you to choose the most appropriate one for your use case.

  • Round Robin: Each new request is sent to the next server in a fixed rotation; after the last server receives a request, the cycle restarts from the first. Best for servers with similar hardware specifications handling requests of similar duration and resource cost.
  • Weighted Round Robin: Each server is assigned a weight reflecting its capacity, and servers with higher weights receive proportionally more requests per cycle. Best for server pools with mixed hardware where some servers have significantly more CPU, RAM, or bandwidth than others.
  • Least Connections: Each new request goes to the server currently handling the fewest active connections, regardless of the order of previous requests. Best for long-lived connections such as WebSockets, file uploads, or streaming, where requests hold connections open for varying durations.
  • Weighted Least Connections: Combines the least connections approach with server weights, routing to the server with the best ratio of current connections to capacity. Best for mixed server pools with long-lived, resource-intensive connections that vary significantly in duration.
  • IP Hash: The client's IP address is hashed to produce a consistent server assignment, so the same client always reaches the same server as long as it remains in the pool. Best for applications that store session state in server memory and require a specific user to always reach the same server.
  • Resource-Based: The load balancer monitors each server's actual CPU and memory usage and routes requests to whichever server has the most available capacity at that moment. Best for applications with highly variable workloads where some requests are much more computationally expensive than others.
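Three of these strategies are simple enough to sketch in a few lines of Python. This is an illustration of the selection logic only, not production code; the server names are placeholders:

```python
import hashlib
import itertools
from collections import defaultdict

servers = ["app1", "app2", "app3"]

# Round robin: cycle through the pool in a fixed order.
_rotation = itertools.cycle(servers)
def round_robin():
    return next(_rotation)

# Least connections: pick the server with the fewest open connections.
active = defaultdict(int)  # server -> current active connection count
def least_connections():
    return min(servers, key=lambda s: active[s])

# IP hash: the same client IP deterministically maps to the same server.
def ip_hash(client_ip):
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Note that the IP hash sketch uses a stable hash (SHA-256) rather than Python's built-in `hash()`, which is randomised per process; real load balancers need the same stability so that assignments survive restarts.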

Layer 4 vs Layer 7 Load Balancing

Load balancers operate at different layers of the network stack, and the layer determines how much information about the request they can see and act on. Choosing the right layer depends on how much routing intelligence your application requires.

  • Routes based on: Layer 4 uses only the IP address and TCP or UDP port number and does not inspect the content of the request; Layer 7 can route on HTTP headers, URL path, hostname, cookies, query parameters, and request body content.
  • Speed: Layer 4 is very fast because no content parsing is required and decisions are made purely on network-level information; Layer 7 is slightly slower because the load balancer must parse and inspect the HTTP request before making a routing decision.
  • Routing granularity: Layer 4 is coarse, treating all traffic on a given port the same regardless of what it contains; Layer 7 is fine-grained, routing different URL paths, subdomains, or header values to completely different server pools.
  • TLS termination: Layer 4 typically passes TLS traffic through to the backend without decrypting it; Layer 7 can terminate TLS at the load balancer, decrypting traffic once and forwarding plain HTTP to backends.
  • Example use case: Layer 4 distributes all incoming HTTPS connections evenly across a pool of identical servers; Layer 7 routes /api requests to API servers, /media requests to a CDN, and all other traffic to web servers.
  • Common tools: Layer 4 includes HAProxy in TCP mode and AWS Network Load Balancer (NLB); Layer 7 includes Nginx, HAProxy in HTTP mode, AWS Application Load Balancer (ALB), and Cloudflare.
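The path-based routing in the Layer 7 example above amounts to a prefix-match dispatch. A minimal sketch, with made-up pool names (real load balancers express this declaratively in configuration rather than code):

```python
# Ordered list of (path prefix, backend pool); first match wins.
ROUTES = [
    ("/api", "api-pool"),
    ("/media", "cdn-pool"),
]
DEFAULT_POOL = "web-pool"  # everything else goes to the web servers

def route(path):
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

A Layer 4 balancer cannot make this decision at all, because the URL path only exists inside the HTTP payload it never parses.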

Health Checks

A load balancer that routes traffic to a failed server is worse than no load balancer at all. Health checks are the mechanism that prevents this by continuously monitoring the status of every server in the pool.

The load balancer periodically sends a request to each backend server, typically to a dedicated endpoint such as /health or /status, and expects a successful response within a defined timeout. If a server fails to respond or returns an error status code a specified number of times consecutively, the load balancer marks it as unhealthy and removes it from the active pool. No more requests are sent to it until it passes the health check again, at which point it is automatically returned to rotation.
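The consecutive-failure threshold described above is a small state machine. A simplified model, assuming a threshold of three consecutive failures (real load balancers make this, and a separate recovery threshold, configurable):

```python
FAILURE_THRESHOLD = 3  # consecutive failed checks before removal from the pool

class BackendHealth:
    """Tracks one backend's health based on periodic check results."""

    def __init__(self):
        self.consecutive_failures = 0
        self.healthy = True

    def record(self, check_passed):
        if check_passed:
            # A passing check resets the counter and returns the
            # server to rotation.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.healthy = False
```

The threshold exists so that a single dropped packet or transient timeout does not eject an otherwise healthy server from the pool.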

Example Nginx configuration. Note that open-source Nginx performs passive health checks via the max_fails and fail_timeout parameters (active HTTP probes require NGINX Plus or an external checker); the /health location shows a minimal endpoint of the kind a checker would probe:
upstream backend {
    # Passive health checks: after 3 failed attempts within 30 seconds,
    # Nginx stops sending traffic to that server for 30 seconds.
    server app1.example.com max_fails=3 fail_timeout=30s;
    server app2.example.com max_fails=3 fail_timeout=30s;
    server app3.example.com max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    # Lightweight endpoint for health probes.
    location /health {
        add_header Content-Type text/plain;
        return 200 'healthy';
    }

    location / {
        proxy_pass http://backend;
    }
}

Health check endpoints should be lightweight and verify that the application is genuinely able to serve requests, not just that the server is reachable. A good health check confirms the application process is running, the database connection is available, and any critical dependencies are responsive. A health endpoint that always returns 200 without checking underlying dependencies provides false confidence.
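A dependency-aware health endpoint can be sketched as an aggregation of individual checks. The check functions here are hypothetical placeholders; in a real application they would, for example, run a trivial query against the database pool or ping the cache:

```python
def check_database():
    # Placeholder: a real check might execute "SELECT 1" on the
    # connection pool and return False on error or timeout.
    return True

def check_cache():
    # Placeholder: a real check might send a Redis PING.
    return True

def health_status():
    """Aggregate dependency checks into an HTTP status and detail map."""
    checks = {"database": check_database(), "cache": check_cache()}
    ok = all(checks.values())
    # 503 tells the load balancer to stop routing traffic here.
    return (200 if ok else 503), checks
```

Returning 503 when any dependency fails is what lets the load balancer distinguish a server that is merely reachable from one that can actually serve requests.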

Sticky Sessions

By default, load balancers treat each request independently and may route consecutive requests from the same user to different servers. For stateless applications this is ideal, but for applications that store session data in server memory it creates a problem: a user might be logged in on server A and then have their next request sent to server B, which has no record of their session.

Sticky sessions, also called session persistence, solve this by ensuring a user's requests always reach the same server. This is typically achieved through IP hashing or by the load balancer inserting a cookie that identifies the target server. While sticky sessions solve the session problem, they reduce the effectiveness of load distribution because traffic from high-volume clients becomes pinned to specific servers. The preferred solution for most modern applications is to store session data in a shared external store such as Redis so that any server can handle any request.
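Cookie-based stickiness can be sketched as follows. This is a simplified model of the insertion approach; the cookie name is a made-up placeholder, and the fallback assignment reuses a stable IP hash:

```python
import hashlib

SERVERS = ["app1", "app2", "app3"]
COOKIE = "lb_target"  # hypothetical cookie name identifying the pinned server

def pick_server(cookies, client_ip):
    """Return (server, cookie_to_set). Honour an existing cookie if its
    server is still in the pool; otherwise assign a server and ask the
    caller to set the cookie on the response."""
    pinned = cookies.get(COOKIE)
    if pinned in SERVERS:
        return pinned, None  # already sticky, no cookie needed
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    server = SERVERS[int(digest, 16) % len(SERVERS)]
    return server, server
```

Note the fallback when the pinned server has left the pool: the user is reassigned, which is exactly the moment an in-memory session is lost — and the reason a shared session store is the more robust design.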

High Availability for Load Balancers

A load balancer that fails takes down your entire service, making it a single point of failure. Production environments address this by running multiple load balancer instances in redundant configurations.

  • Active-passive: One load balancer handles all traffic while a standby instance monitors it. If the active instance fails, the standby takes over automatically using a shared virtual IP address. This is simple to configure and provides immediate failover with no traffic split.
  • Active-active: Two or more load balancer instances share the traffic load simultaneously. DNS round-robin or a higher-level routing mechanism distributes connections between them. If one fails, the others absorb its traffic. This approach also increases total throughput capacity.
  • Cloud-managed load balancers: AWS ALB, Google Cloud Load Balancing, and Azure Load Balancer are managed services that handle their own redundancy, scaling, and health monitoring internally. Using a managed service eliminates the need to operate load balancer infrastructure yourself.

Frequently Asked Questions

  1. What is a single point of failure and how do load balancers relate to it?
    A single point of failure is any component whose failure would bring down the entire service. A single load balancer instance is itself a single point of failure. If it crashes, all traffic stops regardless of how many healthy backend servers remain. The solution is to run at least two load balancer instances in an active-passive or active-active configuration, so that the failure of any one instance does not interrupt service. Managed cloud load balancers handle this redundancy automatically as part of the service.
  2. How do load balancers handle sessions?
    Stateless APIs that store no server-side session data work correctly with any load balancing algorithm because each request is self-contained. Stateful applications that store session data in server memory require either sticky sessions to ensure a user always reaches the same server, or a shared external session store such as Redis or Memcached that all servers can access. The shared session store approach is strongly preferred in modern architectures because it preserves full load balancing effectiveness and handles server failures without losing session data.
  3. Do cloud providers include load balancing?
    Yes. All major cloud providers offer managed load balancing services. AWS provides the Application Load Balancer for Layer 7 HTTP routing and the Network Load Balancer for Layer 4 TCP traffic. Google Cloud offers Cloud Load Balancing with both regional and global options. Azure provides Azure Load Balancer for Layer 4 and Azure Application Gateway for Layer 7. These managed services include automatic health checks, auto-scaling, TLS termination, and integration with other cloud services, removing the need to operate load balancer infrastructure yourself.
  4. What is the difference between a load balancer and a reverse proxy?
    A reverse proxy sits between clients and one or more backend servers, forwarding requests on their behalf. Load balancing is one of the functions a reverse proxy can perform, but reverse proxies also handle TLS termination, caching, compression, request transformation, and rate limiting. Most load balancers act as reverse proxies (DNS-based balancing is a notable exception), but not all reverse proxies perform load balancing. Tools like Nginx are commonly used as both simultaneously, terminating TLS, caching static content, and distributing requests across a backend pool in a single configuration.
  5. Can a load balancer improve security?
    Yes, in several ways. By terminating TLS at the load balancer, you centralise certificate management and ensure all traffic between the load balancer and backends travels on a controlled private network. Load balancers can enforce rate limiting to protect against denial of service attacks by rejecting clients that exceed a request threshold. They can filter requests by IP address or geographic origin. Web Application Firewalls are often integrated at the load balancer layer to inspect and block malicious request patterns before they reach any backend server. Some cloud load balancers include DDoS protection built into the service.

Conclusion

Load balancing is the foundation of reliable, scalable web infrastructure. By distributing requests intelligently across a pool of backend servers using algorithms suited to your workload, continuously monitoring server health, and handling failures automatically, load balancers allow your application to grow horizontally and remain available under heavy traffic and during server failures. Choosing between Layer 4 and Layer 7 balancing depends on how much routing intelligence your architecture needs, and addressing session state through a shared store rather than sticky sessions keeps your application fully scalable. See also reverse proxy and stateless vs stateful systems to build a complete picture of scalable application architecture.