Availability
What is availability?
Availability can be thought of in a couple of ways. One way to consider it is how resistant a system is to failures. For instance, what happens if a server in your system fails? What happens if your database fails? Will your system go down completely, or will it still be operational? This is often described as a system’s fault tolerance.
Another way to think about availability is the percentage of time in a given period, like a month or a year, during which a system is operational and capable of satisfying its primary functions.
Availability is crucial to consider when evaluating a system. In today’s world, most systems have an implied guarantee of availability.
Imagine the software systems that keep an airplane functioning properly. If such a system were to go down while the airplane is flying, it would be absolutely unacceptable. Similarly, consider stock or crypto exchange systems, where downtime could lead to customers losing trust and money.
Even with less critical examples like YouTube or Twitter, downtime would be detrimental, as hundreds of millions of people use these platforms daily.
Cloud providers such as AWS, Azure, and GCP are another example: they also need to maintain high availability. If parts of their systems go down, it affects all the businesses and customers relying on their services. For example, in summer 2019, Google Cloud Platform experienced a significant outage that lasted a few hours, affecting many businesses, including Vimeo.
In summary, availability is of great importance in system design and operations.
How to measure availability?
Availability is usually measured as the percentage of a system’s uptime in a given year.
For instance, if a system is up and operational for half of an entire year, then we can say that the system has 50% availability, which is quite poor.
In the industry, most services or systems aim for high availability, so we often measure availability in terms of “nines” rather than exact percentages.
“Nines” are essentially percentages, written as a count of consecutive nines. For example, a system with 99% availability has two nines of availability, because the digit nine appears twice in the percentage. Similarly, 99.9% is three nines of availability, and so on.
This terminology is a standard way that people discuss availability in the industry.
Below you can find a chart from Wikipedia that showcases a range of popular availability percentages, which can help illustrate the differences between various levels of system availability.
Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
---|---|---|---|---|
55.5555555% ("nine fives") | 162.33 days | 13.53 days | 74.67 hours | 10.67 hours |
90% ("one nine") | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
95% ("one and half nines") | 18.26 days | 36.53 hours | 8.40 hours | 1.20 hours |
97% | 10.96 days | 21.92 hours | 5.04 hours | 43.20 minutes |
98% | 7.31 days | 14.61 hours | 3.36 hours | 28.80 minutes |
99% ("two nines") | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
99.5% ("two and a half nines") | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes |
99.8% | 17.53 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes |
99.9% ("three nines") | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
99.95% ("three and a half nines) | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds |
99.99% ("fours nines") | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
99.995% ("four and a half nines") | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds |
99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
99.9999% ("six nines") | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
99.99999% ("seven nines") | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
99.999999% ("eight nines") | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
99.9999999% ("nine nines") | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
As you can see, even though 99% availability seems impressive, being down for three and a half days or more per year is still quite problematic and might be considered unacceptable. For systems that involve life-and-death situations, such downtime is undoubtedly unacceptable. Even for services like Facebook or YouTube, which serve billions of users, that amount of downtime is too high.
Five nines of availability (99.999%) is often considered the gold standard for availability. If your system achieves this level of availability, it can be regarded as a highly available system.
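You can reproduce the chart’s figures yourself: the allowed downtime is just the length of the period multiplied by the fraction of time the system may be unavailable. Here is a minimal Python sketch, assuming the chart’s convention of a 365.25-day year (and therefore a 730.5-hour average month); the function name is just for illustration:

```python
# Allowed downtime per period for a given availability percentage.
# Assumes the chart's convention: a 365.25-day year, so an average
# month is 365.25 * 24 / 12 = 730.5 hours.
SECONDS_PER = {
    "year": 365.25 * 24 * 3600,
    "month": 365.25 * 24 * 3600 / 12,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}

def allowed_downtime_seconds(availability_pct: float) -> dict[str, float]:
    """Seconds of permitted downtime per period at the given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return {p: s * unavailable_fraction for p, s in SECONDS_PER.items()}

# Five nines: about 5.26 minutes of downtime per year, as in the chart.
for period, seconds in allowed_downtime_seconds(99.999).items():
    print(f"{period}: {seconds:.2f} s")
```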
A Service Level Agreement (SLA) and a Service Level Objective (SLO) are related concepts used to define and measure the quality of service that a provider delivers.
SLA (Service Level Agreement)
An SLA is a formal, legally binding contract between a service provider and a client that outlines the expected level of service, performance metrics, and responsibilities of both parties. It specifies measurable targets such as availability, response time, and throughput, as well as consequences for not meeting these targets, such as refunds or service credits.
SLO (Service Level Objective)
An SLO is a specific, measurable goal or target within an SLA that defines a particular aspect of the service quality. SLOs serve as benchmarks to evaluate the service provider’s performance and ensure that the agreed-upon service levels are met. Examples of SLOs include system availability (e.g., 99.9% uptime), maximum response time for support requests, or error rates.
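As a toy illustration of how an availability SLO might be checked against measured downtime, here is a short sketch; the 30-day window and the `meets_slo` name are assumptions for illustration, not taken from any particular SLA:

```python
def meets_slo(target_pct: float, downtime_minutes: float,
              window_days: int = 30) -> bool:
    """Check whether measured downtime stays within an availability SLO."""
    window_minutes = window_days * 24 * 60
    achieved_pct = 100 * (1 - downtime_minutes / window_minutes)
    return achieved_pct >= target_pct

# A 99.9% SLO over a 30-day window allows about 43.2 minutes of downtime.
print(meets_slo(99.9, downtime_minutes=30))  # True: within the objective
print(meets_slo(99.9, downtime_minutes=60))  # False: objective missed
```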
While availability is a critical consideration in system design, it’s not always of utmost importance. Achieving five nines of availability (99.999%) isn’t always necessary, as high availability comes with trade-offs. Ensuring a high level of availability can be challenging and resource-intensive.
When designing a system, it’s essential to carefully evaluate whether your system requires high availability or if only specific components need it. This assessment helps allocate resources effectively and prioritize the most critical components for high availability while allowing other parts to function with lower levels of redundancy. Ultimately, balancing the need for high availability with the system’s overall requirements and constraints is crucial for efficient and sustainable system design.
For example, consider a payment processing system like Stripe or Visa. In such a system, the transaction processing component is critical and requires high availability to ensure uninterrupted service for customers making payments. Downtime or failures in this component could result in lost revenue and negatively impact the user experience.
On the other hand, there may be secondary components like an analytics dashboard or reporting tools that, while important, do not need the same level of high availability. These components can tolerate occasional downtime without significantly impacting the overall system performance or user experience.
How to achieve a highly available (HA) system?
Conceptually, achieving a high availability (HA) system is straightforward. First and foremost, you need to ensure that your system has no single points of failure (SPOFs): components that, if they fail, bring the entire system down. Keep in mind that even team members with specialized knowledge can be SPOFs.
To eliminate SPOFs, you can introduce redundancy. Redundancy involves duplicating, triplicating, or even further multiplying certain parts of the system.
For instance, if you have a simple system where clients interact with a server and the server communicates with a database, the server is a single point of failure. If it gets overloaded or goes down for any reason, the entire system fails.
To enhance availability and eliminate the SPOF, you can add more servers and a load balancer between clients and servers to distribute the load across multiple servers.
However, the load balancer itself could become a single point of failure, so it should also be replicated and run on multiple servers.
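To make that concrete, here is a minimal sketch of client-side failover across redundant load balancer addresses; the `send` function stands in for whatever transport your client actually uses and is purely hypothetical:

```python
def send_request(request, balancer_addresses, send):
    """Try each redundant load balancer in turn, so no single balancer
    is a single point of failure from the client's perspective."""
    last_error = None
    for address in balancer_addresses:
        try:
            return send(address, request)  # hypothetical transport call
        except ConnectionError as err:
            last_error = err  # this balancer is down; try the next one
    raise RuntimeError("no load balancer reachable") from last_error
```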
Passive Redundancy
In passive redundancy, a primary system/component is backed up by a secondary (standby) system/component, which remains idle until the primary fails. Upon failure, the secondary takes over the operation. This approach is also called “hot standby” or “failover.” Examples include redundant power supplies, database replication with a standby database, or a backup server that takes over when the primary server fails.
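A minimal sketch of the passive pattern might look like the following; the `is_healthy` and `process` methods are a hypothetical component interface, assumed for illustration:

```python
class FailoverPair:
    """Passive redundancy: a standby sits idle and only takes over
    when the primary's health check fails."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def handle(self, request):
        # Route to the primary while it is healthy; otherwise fail over.
        target = self.primary if self.primary.is_healthy() else self.standby
        return target.process(request)
```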
Active Redundancy
In active redundancy, multiple systems/components work in parallel and share the workload, ensuring continued operation even if one or more components fail. Active redundancy may also provide load balancing and improved performance. Examples include redundant network connections, RAID configurations for data storage, or multiple load-balanced servers providing the same service.
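And a corresponding sketch of the active pattern, where every healthy replica serves traffic and the load is rotated among them (same hypothetical replica interface as above):

```python
import itertools

class ActiveReplicaSet:
    """Active redundancy: all replicas serve traffic at once, and the
    load is spread round-robin across the healthy ones."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.rotation = itertools.cycle(self.replicas)

    def handle(self, request):
        # Skip unhealthy replicas; the remaining ones absorb the load.
        for _ in range(len(self.replicas)):
            replica = next(self.rotation)
            if replica.is_healthy():
                return replica.process(request)
        raise RuntimeError("all replicas are down")
```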
It’s also important to have a rigorous process in place for handling system failures, since recovery often requires human intervention. For instance, if servers in your system crash, someone needs to bring them back online, and it’s crucial to establish processes that ensure timely recovery. Keeping this in mind is essential for maintaining high availability in your system.
Trade-offs
Like many aspects of software engineering, achieving high availability involves trade-offs, such as:
- Cost: Higher availability typically requires redundant resources, such as additional servers, storage, and networking infrastructure, leading to increased operational costs.
- Complexity: Implementing fault tolerance, failover, and load balancing mechanisms to achieve high availability can introduce complexity into the system architecture, making it harder to understand, maintain, and troubleshoot.
- Performance: Highly available systems may require distributing data across multiple nodes or geographic locations, which can increase latency and reduce overall performance.
- Consistency: In distributed systems, maintaining high availability may involve sacrificing strong consistency for eventual consistency or using weaker consistency models, which can impact application logic and data integrity.
In summary, while high availability is crucial for many systems, it’s essential to carefully consider the trade-offs involved and balance them against the specific requirements and constraints of your application.