Journey to a Successful AWS Deployment – Part 3 Cloud Reliability
Insight recently partnered with Amazon Web Services (AWS) to help customers with their cloud journey. Each customer’s requirements for the cloud can be different, from applications born in the cloud to a “lift and shift” migration of existing workloads to the cloud. A consultative approach is taken to ensure the migration to the cloud is efficient and fit for purpose.
What is common to each approach is how the solution is architected to result in a successful deployment. Insight approach each design following the AWS Well-Architected Framework, a published set of best practices for deploying to AWS services that provides consistency across deployments.
The AWS Well-Architected Framework consists of five pillars:
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimisation
The following post is part 3 of a blog series covering each of the five pillars and how Insight use these whilst architecting in the cloud. AWS provide a whitepaper on this subject – AWS Well-Architected Framework.
The third pillar in the AWS Well-Architected Framework is Reliability. This pillar covers how to architect reliable systems in AWS, how to recover from an infrastructure or service disruption and how to dynamically scale to meet demand.
The design principles covered are as follows.
- Test Recovery Procedures: Test how a system fails and validate the recovery procedures. Use automation to simulate failures to be able to recreate failure behaviours from various scenarios. Fix issues from testing before failures occur in the real environment.
- Automatically Recover From Failure: Monitor key performance indicators (KPIs) to trigger automation when a certain threshold is reached. Anticipate and remediate failures before they occur using automatic notifications and failure tracking.
- Scale Horizontally To Increase Aggregate System Availability: Increase aggregate availability by using multiple small resources instead of fewer, larger resources. Requests are distributed across more resources, ensuring resources don’t share a single point of failure.
- Stop Guessing Capacity: Traditional on-premises data centres were designed and sized to support the maximum performance, even if that was only met for a small period of time. Traditional data centres were also hard to scale and resources could be saturated during unexpected workload periods. Within the cloud, demand on resources can be monitored with automation configured to add and remove resources as required to satisfy optimal performance levels.
- Manage Change In Automation: Remove manual change processes. Changes to infrastructure should be made using automation, so that the changes that need to be managed are changes to the automation itself.
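As a rough illustration of the "Stop Guessing Capacity" principle, the Python sketch below applies one simple proportional rule to decide how many instances to run from a monitored metric. The function name, parameters and figures are hypothetical, and this is not a description of how AWS Auto Scaling is implemented.

```python
import math


def desired_capacity(current_instances: int,
                     observed_metric: float,
                     target_metric: float,
                     min_size: int,
                     max_size: int) -> int:
    """Scale the fleet in proportion to how far the metric is from its target.

    Hypothetical rule for illustration: if average CPU is 90% against a
    60% target, the fleet grows by a factor of 90 / 60.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    proposed = math.ceil(current_instances * observed_metric / target_metric)
    # Never go below the minimum or above the maximum fleet size.
    return max(min_size, min(max_size, proposed))


# Example figures (assumed): 4 instances, target of 60% average CPU.
for cpu in (15, 60, 90, 200):
    print(f"average CPU {cpu}% -> {desired_capacity(4, cpu, 60, 2, 10)} instances")
```

The minimum and maximum bounds matter as much as the rule itself: the minimum keeps enough resources running to survive a failure, and the maximum caps cost if the metric misbehaves.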
Availability is measured in various ways. Service availability can be measured as the percentage of time an application is operating normally. This percentage covers what is expected as normal operational time and can exclude maintenance windows. Expressing service availability as a percentage, such as 99.9% or 99.999%, is common practice throughout the industry and is often referred to by the number of 9s, such as “Five 9s”.
Availability – Expected Outage Per Year
- 99% – 3 days 15 hours
- 99.9% – 8 hours 45 minutes
- 99.95% – 4 hours 22 minutes
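The outage figures above follow directly from the availability percentage. As a minimal sketch, assuming a 365-day year and no excluded maintenance windows, the arithmetic can be reproduced in Python (the function name is illustrative):

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def expected_outage_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime allowed by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {expected_outage_per_year(target)}")
```

Each additional 9 cuts the permitted downtime by a factor of ten, which is why the higher targets are so much harder to meet.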
Higher availability requirements can result in increased costs. AWS offer availability SLAs on managed services out of the box, as the underlying infrastructure resiliency is managed by AWS. For instance, the Amazon S3 Standard storage class is designed for 99.99% availability. Services that are not managed in this way, such as Amazon EC2, must be architected to meet the required availability percentage.
Design for High Availability
Designing for high availability can vary depending on the service, and this is an area where Insight can add a lot of value. One area to consider, which is not initially obvious, is the limits of AWS services. A common cause of reduced availability is resource constraints. AWS apply default limits to services, such as the number of instances within an Auto Scaling group, the available instance network IO throughput and Amazon Relational Database Service (RDS) allocated storage.
Limits are enforced per AWS region and per AWS account. They can easily be raised from the default, but this needs to be planned; for instance, limits must be considered for each region if the application is architected to run in multiple regions. Amazon CloudWatch alarms can indicate when limits are close to being reached, allowing for either manual or automated action.
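The idea of alerting before a limit is reached can be sketched in plain Python. The example below is illustrative only: the `QuotaUsage` type, the 80% threshold and the usage figures are assumptions, and in practice the numbers would come from AWS monitoring rather than being hard-coded.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QuotaUsage:
    """Hypothetical record of how much of a per-region limit is in use."""
    name: str
    region: str
    used: float
    limit: float


def quotas_near_limit(usages: Iterable[QuotaUsage],
                      threshold: float = 0.8) -> List[QuotaUsage]:
    """Return every quota whose usage is at or above the threshold fraction."""
    return [u for u in usages if u.limit > 0 and u.used / u.limit >= threshold]


# Assumed figures: the same limit tracked separately in two regions.
usages = [
    QuotaUsage("running instances", "eu-west-1", used=18, limit=20),
    QuotaUsage("running instances", "eu-west-2", used=5, limit=20),
]

for quota in quotas_near_limit(usages):
    print(f"{quota.name} in {quota.region}: {quota.used:.0f} of {quota.limit:.0f} used")
```

Tracking each region as its own record reflects the point that limits apply per region and per account, so headroom in one region says nothing about another.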
Consider IP address assignment. Whether workloads are designed for an on-premises data centre or for the AWS cloud, IP address assignment can play a big part in availability. Amazon VPC allows subnets and CIDR blocks to be assigned. A VPC can be linked to on-premises resources, to additional VPCs and to resources in additional AWS accounts. It’s important that no CIDR blocks overlap in the environment. Plan enough CIDR block space for planned workloads, including room to scale for unexpected workloads and to absorb attacks such as a Denial of Service (DoS) attack.
High availability levels such as 99.999% can prove costly and difficult to achieve. Every part of the application, beyond just the underlying infrastructure, must be designed accordingly, and each aspect of the application must be tested. Remove human intervention from the operating model and adopt automation. Sources of interruption can include:
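Checking a planned address scheme for overlapping CIDR blocks is easy to automate. The following is a minimal sketch using Python's standard `ipaddress` module; the network names and ranges are made up for illustration.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(cidr_blocks: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return each pair of named networks whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr)
                for name, cidr in cidr_blocks.items()}
    return [(name_a, name_b)
            for (name_a, net_a), (name_b, net_b)
            in combinations(networks.items(), 2)
            if net_a.overlaps(net_b)]


# Hypothetical plan: one on-premises range and two VPCs.
plan = {
    "on-premises": "10.0.0.0/16",
    "vpc-prod": "10.1.0.0/16",
    "vpc-dev": "10.1.128.0/20",  # sits inside vpc-prod's range
}

for first, second in find_overlaps(plan):
    print(f"Overlap between {first} and {second}")
```

Running a check like this whenever a new VPC or on-premises link is planned catches clashes before they block connectivity between environments.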
- Hardware Failure
- Deployment Failure
- High load and data requests
- Service dependencies
It’s important to understand the availability needs of applications. An application may require higher availability for its front end interface and persistent storage than for another part of the application, such as reporting. It may also be acceptable for the application to have different availability requirements depending on the time of day and when it’s accessed. Designing an application for 99% compared to 99.999% availability can have a significant impact on effort, performance and cost. Insight can help analyse the availability requirements and determine which architecture in AWS can meet them.
Designing applications for high availability can be broken down into the following common practices:
- Fault Isolation Zones – A technique to increase availability beyond that of individual components. Adopt fault isolation by using multiple Availability Zones (AZs) within a region; a region consists of two or more AZs.
- Redundant Components – Avoid single points of failure within the system, no reliance on single AZ, single compute node, single storage location or database.
- Micro-Service Architecture – Create services with a minimal set of functionality. This helps with availability and with scaling deployment.
- Recovery-Oriented Computing – A term applied to systematic approaches to improving recovery, and a method of identifying the characteristics in systems that enhance recovery. Identify the types of failure that can affect the system, such as a hardware or communication failure, and implement procedures to react accordingly.
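The benefit of redundant components across fault isolation zones can be estimated with a short calculation. The sketch below assumes that each copy fails independently of the others, which is an idealisation (real failures are often correlated), and the 99% figure is purely illustrative.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group that works while at least one copy works.

    Assumes independent failures: the group is only down when every
    copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


# Illustrative figure: a component that is 99% available on its own.
for copies in (1, 2, 3):
    print(f"{copies} AZ(s): {redundant_availability(0.99, copies):.6f}")
```

Under these assumptions, a second copy in another AZ moves a 99% component to 99.99%, which is the reasoning behind avoiding any reliance on a single AZ, compute node or database.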
Operations have an impact on availability. Aside from software and hardware failures, erroneous configuration changes can have an unexpected impact. To meet availability design goals, plan the automated and human processes for the full lifecycle of the application.
Automate deployments to eliminate impact and reduce human involvement, where possible.
- Canary Deployment – The practice of directing a small number of users or customers to a new environment that is monitored for any errors generated. If critical problems arise, move the users or customers back to the previous deployment. If the canary deployment is successful, continue the roll out.
- Blue-Green Deployment – Deploy a full fleet of the application in parallel to the existing environment. Send groups of users or customers to the deployment that has the update; if there are problems, simply redirect users back to the original environment. AWS CodeDeploy can be configured to work with a blue-green deployment.
- Feature Toggles – Deploy software with a feature turned off so that customers do not see the feature until it is ready. The feature can then be enabled, or removed if problems are detected.
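The canary approach can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular AWS service routes traffic: the hashing scheme, the function names and the 1% error threshold are all assumptions.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 by hashing their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new deployment."""
    return "canary" if bucket_for(user_id) < canary_percent else "stable"


def should_roll_back(errors: int, requests: int,
                     max_error_rate: float = 0.01) -> bool:
    """Roll back when the canary's error rate exceeds the allowed rate."""
    if requests == 0:
        return False  # no traffic yet, so nothing to judge
    return errors / requests > max_error_rate


# Hypothetical users and canary metrics.
for user in ("alice", "bob", "carol"):
    print(user, "->", route(user, canary_percent=10))
print("roll back?", should_roll_back(errors=5, requests=100))
```

Hashing the user ID keeps each user on the same deployment between requests, and raising `canary_percent` step by step towards 100 continues the roll out once the canary looks healthy.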
As part of operations, monitoring is important to effectively detect failures that can affect availability. Failures in components must be detected and alerted to ensure remediation. Determine the metrics required to be monitored and logs that need to be collected to ensure availability for that system.
Examples of High Availability
AWS provide the building blocks for highly available applications to meet business requirements. Architecting these services, and how they work together whilst remaining highly available, is where Insight can assist. There are many different ways to design for resilience, and the right one depends on which services are in use.
The following shows, at a high level, a couple of examples of how to architect for high availability in AWS. The examples focus on a handful of core services such as Amazon EC2, Amazon RDS and VPC.
The first example looks at a simple application. EC2 instances are spread across multiple AZs within a region and traffic from users is load balanced using Elastic Load Balancer (ELB), a service for which AWS handle the resilience. Although this is a basic example, it shows the premise. This sort of scenario could fit a web application with generally static data. Amazon S3 could be leveraged for storage but is not shown in the illustration.
The next example takes the same premise, elaborates on it further and adds scale. This time we look at a 3 tier application. The application is split across AZs in a similar way to the first example, with application and database layers. The database uses the Amazon RDS service in a Multi-AZ configuration, which creates a synchronous copy of the data in another AZ. Read Replicas can then be created to help scale the database in terms of reads.
The web tier sits behind a public facing ELB that distributes the load, and this tier can be configured to Auto Scale with demand. The application tier sits behind an internal ELB that distributes the load and again can be Auto Scaled, with configuration management tools used to ensure consistency.
Availability, in this instance, can be measured for each layer of the application. The percentages here are just an example; however, the overall levels across the application must be considered to get a full idea of availability. Due to the configuration of the database tier, this tier can increase the overall availability of the application and its speed of recovery.
Reliability is the third pillar within the AWS Well-Architected Framework and, as with all of the framework, it represents an ongoing effort. Planning the AWS deployment up front and following this methodology ensures best practice is followed from the start and not bolted on as an afterthought.
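One common way to combine per-layer figures into an overall number is to multiply them, on the assumption that a request needs every tier to be working and that tiers fail independently. The sketch below uses made-up percentages, not measurements of any real deployment.

```python
from math import prod
from typing import Dict


def serial_availability(tiers: Dict[str, float]) -> float:
    """Overall availability when a request depends on every tier.

    Assumes tiers fail independently, so availabilities multiply.
    """
    for name, availability in tiers.items():
        if not 0.0 <= availability <= 1.0:
            raise ValueError(f"{name}: availability must be between 0 and 1")
    return prod(tiers.values())


# Illustrative figures only.
tiers = {"web": 0.999, "application": 0.999, "database": 0.9995}

overall = serial_availability(tiers)
print(f"Overall availability: {overall:.4%}")
```

The result is always lower than the weakest tier, so improving the least available layer usually does the most for the application as a whole.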
The remaining pillars in the framework will be covered in this blog series.
If you are interested in finding out more, please contact your Insight Account Manager or get in touch via our contact form here.
Why not also read 'Journey to a Successful AWS Deployment – Part 4 Cloud Performance'?