Fail to plan, plan to fail. Or so the old adage goes. And that concept is never more relevant than when we are choosing the right high availability tactics to handle workloads on the AWS cloud.
Companies are deploying more and more workloads on the AWS cloud. As those workloads grow, one of the key questions that arises is how to provide a reliable site or service setup. In this article we’ll explore the AWS cloud resources that can help you plan for failure.
High availability tactics – it’s all about zones
One of the key tools at our disposal comes from the way Amazon builds and expands its global infrastructure: each region is made up of availability zones (AZs). This overview map shows that most of the AWS regions worldwide have three AZs, or at least two in some cases.
An Availability Zone (AZ) is a data center environment within an AWS region that is isolated from the other AZs in that region. This allows cloud resources located in different AZs of the same region to remain available when a significant failure brings down a whole AZ.
Elastic IP address
Another important cloud resource for high availability planning on the AWS cloud is the Elastic IP address (EIP). An EIP can be used to address an EC2 instance while remaining decoupled from it. This means a system administrator can choose which machine the EIP is allocated to, and can then reassign the same EIP to another machine if the first instance fails.
Now, with AZs and EIPs at hand, a system administrator operating on the AWS cloud can plan a site setup that meets very demanding reliability requirements. First, though, there is the cost of meeting those requirements to consider. EIPs come at a standard, predictable price of a few tens of dollars. Using multiple AZs can be extremely expensive without the right planning, or when there is no real need to go down this path.
Network communication between AWS Availability Zones is charged on a per-usage basis, like external network connectivity but at lower rates. So, if your site setup requires a lot of data replication or any other type of network communication between AWS AZs, you will get the corresponding bill, based on the metered network consumption.
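The reassignment described here can be scripted. Below is a minimal sketch, assuming a boto3 EC2 client is passed in and that the EIP is identified by its allocation ID; the function name and the IDs in the usage note are hypothetical.

```python
def reassign_eip(ec2_client, allocation_id, target_instance_id):
    """Move an Elastic IP to another instance and return the new association ID.

    `ec2_client` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2_client.associate_address(
        AllocationId=allocation_id,
        InstanceId=target_instance_id,
        # Allow the move even if the EIP is still attached to the failed instance.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

A call might look like `reassign_eip(boto3.client("ec2"), "eipalloc-0123", "i-0standby")`, where both IDs are placeholders for your own resources.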
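To get a feel for how metered inter-AZ traffic adds up, here is a rough back-of-the-envelope sketch. The rate and the traffic volume are assumptions chosen for illustration, not quoted AWS prices; check the current pricing page for real figures.

```python
def monthly_inter_az_cost(gb_per_day, rate_per_gb, days=30):
    """Estimate the monthly charge for traffic crossing AZ boundaries.

    rate_per_gb is the price per GB you expect to pay; it is an input,
    not an official AWS figure.
    """
    return gb_per_day * days * rate_per_gb


# Hypothetical example: 500 GB/day of replication at an assumed 0.02 USD per GB.
print(round(monthly_inter_az_cost(500, 0.02), 2))  # 300.0
```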
High Availability tactics considerations
A complete set of site reliability requirements on the AWS cloud should also cover the following:
- Performance and network latency tolerance. In some applications the performance impact of deploying over multiple Availability Zones can be significant.
- Downtime tolerance. Whether the acceptable downtime for an outage of the service, or of part of it, is on the order of milliseconds or of a few seconds makes planning for High Availability completely different.
We can easily imagine that the higher the requirements, the higher the bill will be. In the cloud this comes at a comfortable monthly or hourly rate, but it still raises the cost significantly.
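One way to make the downtime requirement concrete is to turn an availability target into a downtime budget. The sketch below is simple arithmetic; the 30-day month and the example targets are assumptions for illustration.

```python
def downtime_budget_seconds(availability_percent, days=30):
    """Return the downtime allowed per period for a given availability target."""
    period_seconds = days * 24 * 60 * 60
    return (100 - availability_percent) / 100 * period_seconds


# 99.9% over 30 days allows about 2592 seconds (roughly 43 minutes);
# 99.99% allows about 259 seconds (a little over 4 minutes).
for target in (99.9, 99.99):
    print(target, round(downtime_budget_seconds(target), 1))
```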
At this point it’s important to highlight the basic architecture patterns that you can apply on AWS cloud deployments.
An active site deployment operates in a single AZ, while a mechanism is in place to clone an identical deployment in a different AZ in the event of a failure. The failover is based on reassigning EIPs from the failed primary deployment to the new clone being deployed. This scenario is the most affordable and comes at the cost of a few minutes of downtime: long enough for the clone deployment to stand up.
An active deployment spans two AZs, with the second AZ hosting only a subset of the resources hosted in the first. The failover should be based on managing DNS records that point to EIPs across both AZs, rotating over only the EIPs mapped to active server instances. This scenario increases the cost, because some kind of data redundancy must be in place during normal operation, that is, replication of either data files or databases. Also, some of the front-end servers have to communicate with data stores or databases across AZs, which is a chargeable action. Another impact is that during failover the active system may perform worse than the primary. The benefit is that downtime drops significantly, to just a few seconds or even less.
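The flow of this first pattern can be sketched as a small watchdog. This is only an illustration under stated assumptions: `is_healthy`, `launch_clone` and `move_eip` are hypothetical callables you would supply (for example, a health probe, a deployment script and an EIP reassignment), and the thresholds are arbitrary.

```python
import time


def cold_failover(is_healthy, launch_clone, move_eip,
                  checks=3, interval_s=10, sleep=time.sleep):
    """Watch the primary; after `checks` consecutive failed probes,
    stand up a clone in another AZ and move the EIP to it."""
    failures = 0
    while failures < checks:
        # A successful probe resets the counter; a failed one increments it.
        failures = 0 if is_healthy() else failures + 1
        if failures < checks:
            sleep(interval_s)
    clone_id = launch_clone()  # the slow step that accounts for minutes of downtime
    move_eip(clone_id)         # repoint the public address at the clone
    return clone_id
```

Requiring several consecutive failures before acting is a design choice to avoid failing over on a single dropped probe, at the price of slower detection.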
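Rotating DNS over only the healthy EIPs can be sketched as a pure function that builds the record update. The dictionary follows the change batch shape accepted by Route 53's `change_resource_record_sets` call; the function name, record name, TTL and addresses are assumptions for illustration.

```python
def build_dns_change(record_name, eip_health, ttl=30):
    """Build a Route 53 style change batch listing only EIPs whose server is healthy.

    eip_health maps each EIP (as a string) to True when its server is active.
    """
    active = sorted(ip for ip, healthy in eip_health.items() if healthy)
    if not active:
        raise ValueError("no healthy EIPs to publish")
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,  # a short TTL so clients pick up changes quickly
                "ResourceRecords": [{"Value": ip} for ip in active],
            },
        }]
    }


# Hypothetical health snapshot: one front-end in the second AZ is down.
change = build_dns_change(
    "www.example.com.",
    {"203.0.113.10": True, "203.0.113.20": False, "198.51.100.5": True},
)
```

The result could then be passed as the `ChangeBatch` argument together with your hosted zone ID. Note that DNS-based failover is only as fast as the TTL and client caching allow.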
Full scale failover
The deployment is spread evenly across two or more AZs, most commonly three. This means you distribute the load and the allocated resources, with failover always happening among similar setups. Here again, EIPs are assigned to the different front-ends across all the AZs, and the EIP of a failed front-end is transferred to one of those still online. In this case, extensive data replication must be in place, and the plan can include patterns to minimize the inter-AZ network traffic. Of course, in terms of performance and downtime this is the best option for a highly reliable site.
There you go folks: the right high availability tactics on the AWS cloud are essential when considering the reliability of your site, especially if you want to avoid something unexpected causing an outage of your production service. That is the last thing anyone wants.
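Choosing which online front-end should receive a failed front-end's EIP can be sketched as follows. The selection policy (prefer another AZ, then the front-end currently holding the fewest EIPs) and the dictionary keys are assumptions made for this example, not a prescribed method.

```python
def pick_takeover_target(front_ends, failed_id):
    """Choose the healthy front-end that should take over the failed one's EIP.

    front_ends: list of dicts with hypothetical keys
    "id", "az", "healthy" and "eip_count".
    """
    failed = next(fe for fe in front_ends if fe["id"] == failed_id)
    candidates = [fe for fe in front_ends
                  if fe["healthy"] and fe["id"] != failed_id]
    if not candidates:
        raise RuntimeError("no healthy front-end available")
    # False sorts before True, so front-ends in a different AZ come first;
    # ties are broken by the fewest EIPs held, then by id for determinism.
    best = min(candidates,
               key=lambda fe: (fe["az"] == failed["az"], fe["eip_count"], fe["id"]))
    return best["id"]


fleet = [
    {"id": "fe-a", "az": "az-1", "healthy": False, "eip_count": 1},
    {"id": "fe-b", "az": "az-1", "healthy": True, "eip_count": 1},
    {"id": "fe-c", "az": "az-2", "healthy": True, "eip_count": 1},
    {"id": "fe-d", "az": "az-3", "healthy": True, "eip_count": 2},
]
print(pick_takeover_target(fleet, "fe-a"))  # fe-c
```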
While we’re at it, one final thought here on site reliability is that at Stackmasters, we always look at the issue in combination with the right operational processes – namely automation and Disaster-Recovery planning. Any other way of working that isn’t Cloud agnostic and/or holistic, in our opinion, is just a sloppy shortcut inviting trouble.