The reliability pillar in the Google Cloud Well-Architected Framework provides principles and recommendations to help you design, deploy, and manage reliable workloads in Google Cloud.
This document is intended for cloud architects, developers, platform engineers, administrators, and site reliability engineers.
Reliability is a system's ability to consistently perform its intended functions within the defined conditions and maintain uninterrupted service. Best practices for reliability include redundancy, fault-tolerant design, monitoring, and automated recovery processes.
As a part of reliability, resilience is the system's ability to withstand and recover from failures or unexpected disruptions, while maintaining performance. Google Cloud features, like multi-regional deployments, automated backups, and disaster recovery solutions, can help you improve your system's resilience.
Reliability is important to your cloud strategy for many reasons, including the following:
To make your systems reliable, you need a plan and an established strategy. This strategy must include education and the authority to prioritize reliability alongside other initiatives.
Set a clear expectation that the entire organization is responsible for reliability, including development, product management, operations, platform engineering, and site reliability engineering (SRE). Even the business-focused groups, like marketing and sales, can influence reliability.
Every team must understand the reliability targets and risks of their applications. The teams must be accountable to these requirements. Conflicts between reliability and regular product feature development must be prioritized and escalated accordingly.
Plan and manage reliability holistically, across all your functions and teams. Consider setting up a Cloud Center of Excellence (CCoE) that includes a reliability pillar. For more information, see Optimize your organization's cloud journey with a Cloud Center of Excellence.
The activities that you perform to design, deploy, and manage a reliable system can be categorized in the following focus areas. Each of the reliability principles and recommendations in this pillar is relevant to one of these focus areas.
The recommendations in the reliability pillar of the Well-Architected Framework are mapped to the following core principles:
This principle in the reliability pillar of the Google Cloud Well-Architected Framework helps you to assess your users' experience, and then map the findings to reliability goals and metrics.
This principle is relevant to the scoping focus area of reliability.
Observability tools provide large amounts of data, but not all of the data directly relates to the impacts on the users. For example, you might observe high CPU usage, slow server operations, or even crashed tasks. However, if these issues don't affect the user experience, then they don't constitute an outage.
To measure the user experience, you need to distinguish between internal system behavior and user-facing problems. Focus on metrics like the success ratio of user requests. Don't rely solely on server-centric metrics, like CPU usage, which can lead to misleading conclusions about your service's reliability. True reliability means that users can consistently and effectively use your application or service.
To help you measure user experience effectively, consider the recommendations in the following sections.
To truly understand your service's reliability, prioritize metrics that reflect your users' actual experience. For example, measure the users' query success ratio, application latency, and error rates.
Ideally, collect this data directly from the user's device or browser. If this direct data collection isn't feasible, shift your measurement point progressively further away from the user in the system. For example, you can use the load balancer or frontend service as the measurement point. This approach helps you identify and address issues before those issues can significantly impact your users.
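As a minimal sketch of the user-facing metrics described above, the following Python snippet computes a request success ratio and a latency percentile from a batch of request records. The `Request` type, its field names, and the choice to count only 5xx responses as failures are assumptions made for illustration; they are not part of any Google Cloud API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One user request as observed at the chosen measurement point."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # latency observed at the measurement point


def success_ratio(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # no traffic means no failed requests
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample collected at a load balancer or frontend service.
sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(503, 2000.0),
    Request(404, 80.0),
]
print(success_ratio(sample), latency_percentile(sample, 50))
```

Whether a 4xx response counts as a good request depends on the service. The sketch treats it as a client error rather than a reliability failure.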
To understand how users interact with your system, you can use tracing tools like Cloud Trace. By following a user's journey through your application, you can find bottlenecks and latency issues that might degrade the user's experience. Cloud Trace captures detailed performance data for each hop in your service architecture. This data helps you identify and address performance issues more efficiently, which can lead to a more reliable and satisfying user experience.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework helps you define reliability goals that are technically feasible for your workloads in Google Cloud.
This principle is relevant to the scoping focus area of reliability.
Design your systems to be just reliable enough for user happiness. It might seem counterintuitive, but a goal of 100% reliability is often not the most effective strategy. Higher reliability might result in a significantly higher cost, both in terms of financial investment and potential limitations on innovation. If users are already happy with the current level of service, then efforts to further increase happiness might yield a low return on investment. Instead, you can better spend resources elsewhere.
You need to determine the level of reliability at which your users are happy, and determine the point where the cost of incremental improvements begins to outweigh the benefits. When you determine this level of sufficient reliability, you can allocate resources strategically and focus on features and improvements that deliver greater value to your users.
To set realistic reliability targets, consider the recommendations in the following subsections.
Aim for high availability such as 99.99% uptime, but don't set a target of 100% uptime. Acknowledge that some failures are inevitable.
The gap between 100% uptime and a 99.99% target is the allowance for failure. This gap is often called the error budget. The error budget can help you take risks and innovate, which is fundamental to any business to stay competitive.
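To make the error budget concrete, the following sketch converts an availability target into allowed downtime for a time window, and reports how much of a request-based budget remains. The function names and the 30-day window are illustrative assumptions, not values prescribed by the framework.

```python
def _check_slo(slo: float) -> None:
    """Reject targets outside (0, 1); a 100% target leaves no error budget."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be greater than 0 and less than 1")


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    _check_slo(slo)
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget that is still unspent.

    A negative result means that the budget is overspent.
    """
    _check_slo(slo)
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


# A 99.99% target over 30 days allows roughly 4.32 minutes of downtime.
print(round(error_budget_minutes(0.9999), 2))
# With a 99.9% target, 250 failures out of 1,000,000 requests spend
# about a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```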
Prioritize the reliability of the most critical components in the system. Accept that less critical components can have a higher tolerance for failure.
To determine the optimal reliability level for your system, conduct thorough cost-benefit analyses.
Consider factors like system requirements, the consequences of failures, and your organization's risk tolerance for the specific application. Remember to consider your disaster recovery metrics, such as the recovery time objective (RTO) and recovery point objective (RPO). Decide what level of reliability is acceptable within the budget and other constraints.
Look for ways to improve efficiency and reduce costs without compromising essential reliability features.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to plan, build, and manage resource redundancy, which can help you to avoid failures.
This principle is relevant to the scoping focus area of reliability.
After you decide the level of reliability that you need, you must design your systems to avoid any single points of failure. Every critical component in the system must be replicated across multiple machines, zones, and regions. For example, a critical database can't be located in only one region, and a metadata server can't be deployed in only a single zone or region. In those examples, if the sole zone or region has an outage, the system has a global outage.
To build redundant systems, consider the recommendations in the following subsections.
Map out your system's failure domains, from individual VMs to regions, and design for redundancy across the failure domains.
To ensure high availability, distribute and replicate your services and applications across multiple zones and regions. Configure the system for automatic failover to make sure that the services and applications continue to be available in the event of zone or region outages.
For examples of multi-zone and multi-region architectures, see Design reliable infrastructure for your workloads in Google Cloud.
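The effect of replication on availability can be estimated with simple arithmetic. The following sketch assumes that replicas fail independently, which real zones and regions only approximate, so treat the results as an upper bound rather than a guarantee. The function names are hypothetical.

```python
import math


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of a component that works if at least one replica is up.

    Assumes independent failures across replicas.
    """
    if not 0 <= single <= 1 or replicas < 1:
        raise ValueError("availability must be in [0, 1] and replicas >= 1")
    return 1 - (1 - single) ** replicas


def serial_availability(components: list[float]) -> float:
    """Availability of a request path that needs every component to be up."""
    return math.prod(components)


# Two independent replicas at 99% each reach about 99.99% combined.
print(round(redundant_availability(0.99, 2), 6))
# A chain of two 99.9% dependencies is less available than either one alone.
print(round(serial_availability([0.999, 0.999]), 6))
```

The second function shows why a single non-replicated dependency in the request path can cancel the gains from replicating everything else.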
Continuously track the status of your failure domains to detect and address issues promptly.
You can monitor the current status of Google Cloud services in all regions by using the Google Cloud Service Health dashboard. You can also view incidents relevant to your project by using Personalized Service Health. You can use load balancers to detect resource health and automatically route traffic to healthy backends. For more information, see Health checks overview.
Like a fire drill, regularly simulate failures to validate the effectiveness of your replication and failover strategies.
For more information, see Simulate a zone outage for a regional MIG and Simulate a zone failure in GKE regional clusters.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you use horizontal scalability. By using horizontal scalability, you can help ensure that your workloads in Google Cloud can scale efficiently and maintain performance.
This principle is relevant to the scoping focus area of reliability.
Re-architect your system to a horizontal architecture. To accommodate growth in traffic or data, you can add more resources. You can also remove resources when they're not in use.
To understand the value of horizontal scaling, consider the limitations of vertical scaling.
A common scenario for vertical scaling is to use a MySQL database as the primary database with critical data. As database usage increases, more RAM and CPU are required. Eventually, the database reaches the memory limit on the host machine, and needs to be upgraded. This process might need to be repeated several times. The problem is that there are hard limits on how much a database can grow. VM sizes are not unlimited. The database can reach a point when it's no longer possible to add more resources.
Even if resources were unlimited, a large VM can become a single point of failure. Any problem with the primary database VM can cause error responses or cause a system-wide outage that affects all users. Avoid single points of failure, as described in Build highly available systems through resource redundancy.
Besides these scaling limits, vertical scaling tends to be more expensive. The cost can increase exponentially as machines with greater amounts of compute power and memory are acquired.
Horizontal scaling, by contrast, can cost less. The potential for horizontal scaling is virtually unlimited in a system that's designed to scale.
To transition from a single VM architecture to a horizontal multiple-machine architecture, you need to plan carefully and use the right tools. To help you achieve horizontal scaling, consider the recommendations in the following subsections.
Managed services remove the need to manually manage horizontal scaling. For example, with Compute Engine managed instance groups (MIGs), you can add or remove VMs to scale your application horizontally. For containerized applications, Cloud Run is a serverless platform that can automatically scale your stateless containers based on incoming traffic.
Modular components and clear interfaces help you scale individual components as needed, instead of scaling the entire application. For more information, see Promote modular design in the performance optimization pillar.
Design applications to be stateless, meaning no locally stored data. This lets you add or remove instances without worrying about data consistency.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you proactively identify areas where errors and failures might occur.
This principle is relevant to the observation focus area of reliability.
To maintain and improve the reliability of your workloads in Google Cloud, you need to implement effective observability by using metrics, logs, and traces.
Metrics, logs, and traces help you monitor your system continuously. Comprehensive monitoring helps you find out where and why errors occurred. You can also detect potential failures before errors occur.
To detect potential failures efficiently, consider the recommendations in the following subsections.
To track key metrics like response times and error rates, use Cloud Monitoring and Cloud Logging. These tools also help you to ensure that the metrics consistently meet the needs of your workload.
To make data-driven decisions, analyze default service metrics to understand component dependencies and their impact on overall workload performance.
To customize your monitoring strategy, create and publish your own metrics by using the Google Cloud SDK.
Implement robust error handling and enable logging across all of the components of your workloads in Google Cloud. Activate logs like Cloud Storage access logs and VPC Flow Logs.
When you configure logging, consider the associated costs. To control logging costs, you can configure exclusion filters on the log sinks to exclude certain logs from being stored.
Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detect under-provisioned and over-provisioned resources in services like GKE, Compute Engine, and Managed Service for Apache Spark. For a complete list of supported services, see Cloud Monitoring overview.
For alerts, focus on critical metrics, set appropriate thresholds to minimize alert fatigue, and ensure timely responses to significant issues. This targeted approach lets you proactively maintain workload reliability. For more information, see Alerting overview.
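One common way to reduce alert fatigue is to alert only when a metric stays above its threshold for several consecutive samples, instead of on every brief spike. The following sketch illustrates that idea in plain Python. It is not the Cloud Monitoring API, and the function name, threshold, and sample values are assumptions.

```python
def should_alert(values: list[float], threshold: float, min_consecutive: int) -> bool:
    """Return True only if the latest `min_consecutive` samples all exceed the threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(values) < min_consecutive:
        return False  # not enough data to call the breach sustained
    return all(v > threshold for v in values[-min_consecutive:])


# Error-rate samples, one per minute (hypothetical values).
brief_spike = [0.01, 0.20, 0.01]
sustained = [0.01, 0.20, 0.30]

print(should_alert(brief_spike, threshold=0.05, min_consecutive=2))  # False
print(should_alert(sustained, threshold=0.05, min_consecutive=2))    # True
```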
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you to design your Google Cloud workloads to fail gracefully.
This principle is relevant to the response focus area of reliability.
Graceful degradation is a design approach where a system that experiences a high load continues to function, possibly with reduced performance or accuracy. Graceful degradation ensures continued availability of the system and prevents complete failure, even if the system's work isn't optimal. When the load returns to a manageable level, the system resumes full functionality.
For example, during periods of high load, Google Search prioritizes results from higher-ranked web pages, potentially sacrificing some accuracy. When the load decreases, Google Search recomputes the search results.
To design your systems for graceful degradation, consider the recommendations in the following subsections.
Ensure that your replicas can independently handle overloads and can throttle incoming requests during high-traffic scenarios. This approach helps you to prevent cascading failures that are caused by shifts in excess traffic between zones.
Use tools like Apigee to control the rate of API requests during high-traffic times. You can configure policy rules to reflect how you want to scale back requests.
Configure your systems to drop excess requests at the frontend layer to protect backend components. Dropping some requests prevents global failures and enables the system to recover more gracefully. With this approach, some users might experience errors. However, you can minimize the impact of outages, in contrast to an approach like circuit-breaking, where all traffic is dropped during an overload.
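A token bucket is one widely used way to drop only the excess requests at the frontend. The following sketch is an illustration under stated assumptions: the class name, the rate and capacity values, and the `handle` wrapper are hypothetical, and a production system would typically rely on a managed gateway or load balancer feature instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Admits up to `capacity` requests in a burst and `rate_per_sec` on average."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so that tests can control time
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Refill tokens for the elapsed time, then try to spend one token."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle(bucket: TokenBucket, backend: Callable[[], str]) -> tuple[int, str]:
    """Shed excess load at the frontend before the backend is called."""
    if not bucket.allow():
        return 429, "overloaded, retry later"
    return 200, backend()


bucket = TokenBucket(rate_per_sec=100, capacity=20)
print(handle(bucket, lambda: "result"))
```

Requests within the configured rate are still served during an overload, and only the surplus receives an error.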
Build your applications to handle partial errors and retries seamlessly. This design helps to ensure that as much traffic as possible is served during high-load scenarios.
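Retries are usually combined with exponential backoff and jitter so that clients don't retry in lockstep and amplify an overload. The following sketch shows one such retry loop. The `TransientError` type, the default delays, and the function name are assumptions for illustration, and the operation that you retry should be idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def call_with_retries(operation: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 0.1, max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call `operation`, retrying transient failures with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            # The delay ceiling doubles per attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep for a random fraction of the ceiling.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

Capping the number of attempts matters as much as the backoff, because unbounded retries can keep an overloaded backend from recovering.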
To validate that the throttle and request-drop mechanisms work effectively, regularly simulate overload conditions in your system. Testing helps ensure that your system is prepared for real-world traffic surges.
Use analytics and monitoring tools to predict and respond to traffic surges before they escalate into overloads. Early detection and response can help maintain service availability during high-demand periods.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you design and run tests for recovery in the event of failures.
This principle is relevant to the learning focus area of reliability.
To be sure that your system can recover from failures, you must periodically run tests that include regional failovers, release rollbacks, and data restoration from backups.
This testing helps you to practice responses to events that pose major risks to reliability, such as the outage of an entire region. This testing also helps you verify that your system behaves as intended during a disruption.
In the unlikely event of an entire region going down, you need to fail over all traffic to another region. During normal operation of your workload, when data is modified, it needs to be synchronized from the primary region to the failover region. You need to verify that the replicated data is always very recent, so that users don't experience data loss or session breakage. The load balancing system must also be able to shift traffic to the failover region at any time without service interruptions. To minimize downtime after a regional outage, operations engineers also need to be able to manually and efficiently shift user traffic away from a region, in as little time as possible. This operation is sometimes called draining a region, which means you stop the inbound traffic to the region and move all the traffic elsewhere.
When you design and run tests for failure recovery, consider the recommendations in the following subsections.
Clearly define what you want to achieve from the testing. For example, your objectives can include the following:
Decide which components, services, or regions are in the testing scope. The scope can include specific application tiers like the frontend, backend, and database, or it can include specific Google Cloud resources like Cloud SQL instances or GKE clusters. The scope must also specify any external dependencies, such as third-party APIs or cloud interconnections.
Choose an appropriate environment, preferably a staging or sandbox environment that replicates your production setup. If you conduct the test in production, ensure that you have safety measures ready, like automated monitoring and manual rollback procedures.
Create a backup plan. Take snapshots or backups of critical databases and services to prevent data loss during the test. Ensure that your team is prepared to do manual interventions if the automated failover mechanisms fail.
To prevent test disruptions, ensure that your IAM roles, policies, and failover configurations are correctly set up. Verify that the necessary permissions are in place for the test tools and scripts.
Inform stakeholders, including operations, DevOps, and application owners, about the test schedule, scope, and potential impact. Provide stakeholders with an estimated timeline and the expected behaviors during the test.
Plan and execute failures by using tools like Chaos Monkey. You can use custom scripts to simulate failures of critical services such as a shutdown of a primary node in a multi-zone GKE cluster or a disabled Cloud SQL instance. You can also use scripts to simulate a region-wide network outage by using firewall rules or API restrictions based on your scope of test. Gradually escalate the failure scenarios to observe system behavior under various conditions.
Introduce load testing alongside failure scenarios to replicate real-world usage during outages. Test cascading failure impacts, such as how frontend systems behave when backend services are unavailable.
To validate configuration changes and to assess the system's resilience against human errors, test scenarios that involve misconfigurations. For example, run tests with incorrect DNS failover settings or incorrect IAM permissions.
Monitor how load balancers, health checks, and other mechanisms reroute traffic. Use Google Cloud tools like Cloud Monitoring and Cloud Logging to capture metrics and events during the test.
Observe changes in latency, error rates, and throughput during and after the failure simulation, and monitor the overall performance impact. Identify any degradation or inconsistencies in the user experience.
Ensure that logs are generated and alerts are triggered for key events, such as service outages or failovers. Use this data to verify the effectiveness of your alerting and incident response systems.
Measure how long it takes for the system to resume normal operations after a failure, and then compare this data with the defined RTO and document any gaps.
Ensure that data integrity and availability align with the RPO. To test database consistency, compare snapshots or backups of the database before and after a failure.
Evaluate service restoration and confirm that all services are restored to a functional state with minimal user disruption.
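The comparison against RTO and RPO can be reduced to timestamp arithmetic once the test records when the failure started, when service was restored, and when data was last replicated. The following sketch assumes that those three timestamps are available from your test logs. The function and key names are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_gaps(failure_at: datetime, restored_at: datetime,
                  last_replicated_at: datetime,
                  rto: timedelta, rpo: timedelta) -> dict[str, timedelta]:
    """Return how far the test exceeded the RTO and RPO (zero if within target)."""
    recovery_time = restored_at - failure_at            # measured time to recover
    data_loss_window = failure_at - last_replicated_at  # data at risk of loss
    zero = timedelta(0)
    return {
        "rto_gap": max(zero, recovery_time - rto),
        "rpo_gap": max(zero, data_loss_window - rpo),
    }


gaps = recovery_gaps(
    failure_at=datetime(2025, 1, 10, 12, 0),
    restored_at=datetime(2025, 1, 10, 12, 50),
    last_replicated_at=datetime(2025, 1, 10, 11, 58),
    rto=timedelta(minutes=30),
    rpo=timedelta(minutes=5),
)
print(gaps)  # recovery took 20 minutes longer than the RTO; the RPO was met
```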
Document each test step, failure scenario, and corresponding system behavior. Include timestamps, logs, and metrics for detailed analyses.
Highlight bottlenecks, single points of failure, or unexpected behaviors observed during the test. To help prioritize fixes, categorize issues by severity and impact.
Suggest improvements to the system architecture, failover mechanisms, or monitoring setups. Based on test findings, update any relevant failover policies and playbooks. Present a postmortem report to stakeholders. The report should summarize the outcomes, lessons learned, and next steps. For more information, see Conduct thorough postmortems.
To validate ongoing reliability and resilience, plan periodic testing (for example, quarterly).
Run tests under different scenarios, including infrastructure changes, software updates, and increased traffic loads.
Automate failover tests by using CI/CD pipelines to integrate reliability testing into your development lifecycle.
During the postmortem, use feedback from stakeholders and end users to improve the test process and system resilience.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you design and run tests for recovery from data loss.
This principle is relevant to the learning focus area of reliability.
To ensure that your system can recover from situations where data is lost or corrupted, you need to run tests for those scenarios. Instances of data loss might be caused by a software bug or some type of natural disaster. After such events, you need to restore data from backups and bring all of the services back up again by using the freshly restored data.
We recommend that you use three criteria to judge the success or failure of this type of recovery test: data integrity, recovery time objective (RTO), and recovery point objective (RPO). For details about the RTO and RPO metrics, see Basics of DR planning.
The goal of data restoration testing is to periodically verify that your organization can continue to meet business continuity requirements. Besides measuring RTO and RPO, a data restoration test must include testing of the entire application stack and all the critical infrastructure services with the restored data. This is necessary to confirm that the entire deployed application works correctly in the test environment.
When you design and run tests for recovering from data loss, consider the recommendations in the following subsections.
You need to verify that your backups contain consistent and usable snapshots of data that you can restore to immediately bring applications back into service. To validate data integrity, set up automated consistency checks to run after each backup.
To test backups, restore them in a non-production environment. To ensure your backups can be restored efficiently and that the restored data meets application requirements, regularly simulate data recovery scenarios. Document the steps for data restoration, and train your teams to execute the steps effectively during a failure.
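One simple form of automated consistency check is to record a checksum when a backup is taken and compare it with the checksum of the restored data. The following sketch uses SHA-256 over byte streams. The function names are assumptions, and a checksum match confirms only byte-level integrity, not that the application can use the restored data.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a stream without loading it all into memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def restore_matches(recorded_digest: str, restored: BinaryIO) -> bool:
    """Check that restored data has the digest that was recorded at backup time."""
    return sha256_of(restored) == recorded_digest


# In practice the streams would be files opened in binary mode; in-memory
# streams keep the example self-contained.
recorded = sha256_of(io.BytesIO(b"orders-table-export"))
print(restore_matches(recorded, io.BytesIO(b"orders-table-export")))  # True
print(restore_matches(recorded, io.BytesIO(b"orders-table-expor")))   # False
```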
To minimize data loss during restoration and to meet RPO targets, it's essential to have regularly scheduled backups. Establish a backup frequency that aligns with your RPO. For example, if your RPO is 15 minutes, schedule backups to run at least every 15 minutes. Optimize the backup intervals to reduce the risk of data loss.
Use Google Cloud tools like Cloud Storage, Cloud SQL automated backups, or Spanner backups to schedule and manage backups. For critical applications, use near-continuous backup solutions like point-in-time recovery (PITR) for Cloud SQL or incremental backups for large datasets.
Set a clear RPO based on your business needs, and monitor adherence to the RPO. If backup intervals exceed the defined RPO, use Cloud Monitoring to set up alerts.
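Adherence to an RPO can be checked by looking for gaps between consecutive backups that exceed the target. The following sketch assumes that you can export backup completion timestamps from your backup tooling. The function name is hypothetical, and in practice the result would feed an alert rather than a print statement.

```python
from datetime import datetime, timedelta


def rpo_violations(backup_times: list[datetime], rpo: timedelta,
                   now: datetime) -> list[tuple[datetime, datetime]]:
    """Return each interval between backups (or up to `now`) that is longer than the RPO."""
    points = sorted(backup_times) + [now]
    return [(start, end) for start, end in zip(points, points[1:]) if end - start > rpo]


backups = [
    datetime(2025, 1, 10, 10, 0),
    datetime(2025, 1, 10, 10, 15),
    datetime(2025, 1, 10, 10, 40),  # 25 minutes after the previous backup
]
violations = rpo_violations(backups, rpo=timedelta(minutes=15),
                            now=datetime(2025, 1, 10, 10, 50))
print(violations)  # one violation: the interval from 10:15 to 10:40
```

Including the current time as the last point catches the case where backups have silently stopped.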
Use Google Cloud Backup and DR service or similar tools to track the health of your backups and confirm that they are stored in secure and reliable locations. Ensure that the backups are replicated across multiple regions for added resilience.
Combine backups with disaster recovery strategies like active-active failover setups or cross-region replication for improved recovery time in extreme cases. For more information, see Disaster recovery planning guide.
This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you conduct effective postmortems after failures and incidents.
This principle is relevant to the learning focus area of reliability.
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve the incident, the root causes, and the follow-up actions to prevent the incident from recurring. The goal of a postmortem is to learn from mistakes and not assign blame.
The following diagram shows the workflow of a postmortem:
The workflow of a postmortem includes the following steps:
Conduct postmortem analyses after major events and non-major events like the following:
Define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.
To conduct effective postmortems, consider the recommendations in the following subsections.
Effective postmortems focus on processes, tools, and technologies, and don't place blame on individuals or teams. The purpose of a postmortem analysis is to improve your technology and processes for the future, not to find who is guilty. Everyone makes mistakes. The goal should be to analyze the mistakes and learn from them.
The following examples show the difference between feedback that assigns blame and blameless feedback:
For each piece of information that you plan to include in the report, assess whether that information is important and necessary to help the audience understand what happened. You can move supplementary data and explanations to an appendix of the report. Reviewers who need more information can request it.
Before you start to explore solutions for a problem, evaluate the importance of the problem and the likelihood of a recurrence. Adding complexity to the system to solve problems that are unlikely to occur again can lead to increased instability.
To ensure that issues don't remain unresolved, publish the outcome of the postmortem to a wide audience and get support from management. The value of a postmortem is proportional to the learning that occurs after the postmortem. When more people learn from incidents, the likelihood of similar failures recurring is reduced.
Last updated 2025-02-14 UTC.