Scaling based on load balancing serving capacity
This document describes how to scale a managed instance group (MIG)
based on the serving capacity of an external Application Load Balancer or an internal Application Load Balancer.
This means that
autoscaling adds or removes VM instances in the group when the load balancer
indicates that the group has reached a configurable fraction of its fullness,
where fullness is defined by the
target capacity of the selected balancing mode
of the backend instance group.
If you haven't already, set up authentication.
Authentication verifies your identity for access to Google Cloud services and APIs. To run
code or samples from a local development environment, you can authenticate to
Compute Engine by selecting one of the following options:
Console
When you use the Google Cloud console to access Google Cloud services and
APIs, you don't need to set up authentication.
gcloud
Install the Google Cloud CLI.
After installation, initialize the Google Cloud CLI by running the gcloud init command.
Scaling based on HTTP(S) load balancing serving capacity
Compute Engine provides support for load balancing within your
instance groups. You can use autoscaling in conjunction with load balancing by
setting up an autoscaler that scales based on the load of your instances.
An external or internal HTTP(S) load balancer distributes requests
to backend services according to its URL map. The load balancer can have one or
more backend services, each supporting
instance group or network endpoint group (NEG) backends. When
backends are instance groups, the HTTP(S) load balancer offers two
balancing modes: UTILIZATION and RATE. With UTILIZATION, you can specify a
maximum target for average backend utilization of instances in the instance
group. With RATE, you must specify a target number of requests per second,
either per instance or for the whole group. Only zonal instance groups support
a maximum rate for the whole group; regional managed instance groups do not.
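The balancing mode rules described above can be sketched as a small validation routine. This is an illustrative model only: the Backend class, its field names, and the validate function are hypothetical and are not part of any Google Cloud library. The sketch encodes only the constraints stated in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    """Hypothetical, simplified model of an instance group backend."""
    balancing_mode: str                            # "UTILIZATION" or "RATE"
    regional_mig: bool = False                     # True for a regional MIG
    max_utilization: Optional[float] = None        # UTILIZATION target
    max_rate_per_instance: Optional[float] = None  # RATE target per instance (RPS)
    max_rate: Optional[float] = None               # RATE target per group (RPS)


def validate(backend: Backend) -> List[str]:
    """Return a list of problems; an empty list means the rules above hold."""
    problems = []
    if backend.balancing_mode not in ("UTILIZATION", "RATE"):
        problems.append("balancing mode must be UTILIZATION or RATE")
    if backend.balancing_mode == "RATE":
        has_target = (backend.max_rate_per_instance is not None
                      or backend.max_rate is not None)
        if not has_target:
            problems.append("RATE needs a per-instance or per-group target")
    if backend.max_rate is not None and backend.regional_mig:
        problems.append("regional MIGs do not support a maximum rate per group")
    return problems


# A zonal backend with a per-instance rate passes the checks.
print(validate(Backend("RATE", max_rate_per_instance=100)))
# A regional MIG with a per-group rate is reported as unsupported.
print(validate(Backend("RATE", regional_mig=True, max_rate=500)))
```

The first call prints an empty list. The second prints a single message about the unsupported per-group rate.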
The balancing mode and the target capacity that you specify determine when
Google Cloud considers a backend VM to be at full capacity. Google Cloud
attempts to send traffic to healthy VMs that have remaining capacity. If all
VMs are already at capacity, additional traffic causes the target utilization
or rate to be exceeded.
When you attach an autoscaler to an instance group backend of an
HTTP(S) load balancer, the autoscaler scales the managed instance group to
maintain a fraction of the load balancing serving capacity.
For example, assume the load balancing serving capacity of a managed instance
group is defined as 100 RPS per instance. If you create an autoscaler with
the HTTP(S) load balancing policy and set it to maintain a target utilization
level of 0.8 or 80%, the autoscaler adds or removes instances from the
managed instance group to maintain 80% of the serving capacity, or 80 RPS per
instance.
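The arithmetic in this example can be worked through in a few lines. The sketch below is a simplified model, not the autoscaler's actual algorithm: it assumes that traffic is spread evenly across instances and ignores stabilization periods, initialization time, and minimum or maximum group size. The function names are hypothetical.

```python
import math


def target_rps_per_instance(max_rate_per_instance: float,
                            target_utilization: float) -> float:
    """Per-instance RPS the autoscaler tries to maintain."""
    return max_rate_per_instance * target_utilization


def recommended_instances(total_rps: float,
                          max_rate_per_instance: float,
                          target_utilization: float) -> int:
    """Smallest instance count that keeps each instance at or below the target."""
    per_instance = target_rps_per_instance(max_rate_per_instance,
                                           target_utilization)
    return max(1, math.ceil(total_rps / per_instance))


# 100 RPS per instance at a 0.8 target gives about 80 RPS per instance.
print(target_rps_per_instance(100, 0.8))
# 1000 RPS / 80 RPS per instance = 12.5, rounded up to 13 instances.
print(recommended_instances(1000, 100, 0.8))
```

Under these assumptions, a group that receives 1000 RPS in total would need 13 instances to stay at or below 80 RPS per instance.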
The following diagram shows how the autoscaler interacts with a managed
instance group and backend service:
The autoscaler watches the serving capacity of the managed instance group,
which is defined in the backend service, and scales based on the target
utilization. In this example, the serving capacity is defined by the
maxRatePerInstance value.
Applicable load balancing configurations
When you first create the backend, you can set one of three options for its
load balancing serving capacity: maximum backend utilization, maximum requests
per second per instance, or maximum requests per second for the whole group.
Autoscaling works only with maximum backend utilization and maximum requests
per second per instance, because the autoscaler can influence the value of
these settings by adding or removing instances. For example, if you set a
backend to handle 10 requests per second per instance, and the autoscaler is
configured to maintain 80% of that rate, then the autoscaler adds or removes
instances to keep each instance near 8 requests per second.
Autoscaling does not work with maximum requests per second per group because
this setting is independent of the number of instances in the instance group.
The load balancer continuously sends the maximum number of requests per group
to the instance group, regardless of how many instances are in the group.
For example, if you set the backend to handle a maximum of 100 requests per
second per group, the load balancer sends 100 requests per second to the
group, whether the group has two instances or 100 instances. Because adding or
removing instances cannot change this value, autoscaling does not work with a
load balancing configuration that uses maximum requests per second per group.
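The difference between the per-instance and per-group settings can be illustrated with a small calculation. The sketch below assumes a simplified notion of fullness, defined here as total traffic divided by configured capacity. The function names are hypothetical and the model is illustrative, not the load balancer's actual accounting.

```python
def fullness_per_instance_rate(total_rps: float,
                               instances: int,
                               max_rate_per_instance: float) -> float:
    """Capacity grows with the group, so fullness drops as instances are added."""
    return total_rps / (instances * max_rate_per_instance)


def fullness_per_group_rate(total_rps: float,
                            instances: int,
                            max_rate: float) -> float:
    """Capacity is a fixed group-wide rate; the instance count has no effect."""
    return total_rps / max_rate


# Per-instance rate of 10 RPS with 40 RPS of traffic:
# adding instances lowers fullness, so the autoscaler has a lever to pull.
for n in (2, 4, 5):
    print(n, fullness_per_instance_rate(40, n, 10))

# Per-group rate of 100 RPS with 100 RPS of traffic:
# fullness is the same for 2 instances and for 100 instances.
for n in (2, 100):
    print(n, fullness_per_group_rate(100, n, 100))
```

In the per-instance case, fullness falls from 2.0 with two instances to about 0.8 with five, which is the signal the autoscaler can steer. In the per-group case, the value stays at 1.0 no matter how many instances exist.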
Enable autoscaling based on load balancing serving capacity
To enable autoscaling based on load balancing serving capacity, use one of the
following options.
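As a rough illustration of what such a configuration contains, the sketch below builds a request body for an autoscaler resource as a plain Python dictionary. The field names (autoscalingPolicy, loadBalancingUtilization, utilizationTarget, minNumReplicas, maxNumReplicas) reflect the Compute Engine API autoscaler resource as understood here and should be checked against the current API reference. The autoscaler_body helper, the resource names, and the target path are hypothetical placeholders.

```python
import json


def autoscaler_body(name: str,
                    mig_url: str,
                    max_replicas: int,
                    utilization_target: float,
                    min_replicas: int = 1) -> dict:
    """Build an autoscaler body that scales on load balancing serving capacity."""
    if utilization_target <= 0:
        raise ValueError("utilization_target must be a positive number")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be at least min_replicas")
    return {
        "name": name,
        "target": mig_url,  # URL of the managed instance group to scale
        "autoscalingPolicy": {
            "minNumReplicas": min_replicas,
            "maxNumReplicas": max_replicas,
            "loadBalancingUtilization": {
                # Fraction of the backend's serving capacity to maintain.
                "utilizationTarget": utilization_target,
            },
        },
    }


# Placeholder names; substitute your own project resources.
body = autoscaler_body(
    name="example-autoscaler",
    mig_url="zones/us-central1-a/instanceGroupManagers/example-mig",
    max_replicas=10,
    utilization_target=0.8,
)
print(json.dumps(body, indent=2))
```

The sketch only constructs and prints the body. Sending it to the API, and the credentials that requires, are outside its scope.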
Permissions required for this task
To perform this task, you must have the following
permissions:
Last updated 2026-06-11 UTC.