Journey to a Successful AWS Deployment – Part 1 Cloud Operations
Insight recently partnered with Amazon Web Services (AWS) to help customers with their cloud journey. Each customer’s requirements for the cloud can be different, from applications born in the cloud to a “lift and shift” migration of existing workloads to the cloud. A consultative approach is taken to ensure the migration to the cloud is efficient and fit for purpose.
Common to each approach is how the solution is architected to result in a successful deployment. Insight approach each design following the AWS Well-Architected Framework, a published set of best practices for deploying to AWS services that provides consistency across deployments.
The AWS Well-Architected Framework consists of five pillars:
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimisation
The following blog series will cover each of these pillars and how Insight use them whilst architecting in the cloud. AWS provide a whitepaper on this subject – AWS Well-Architected Framework.
The first pillar in the AWS Well-Architected Framework is Operational Excellence. This pillar of the framework covers the initial deployment of systems as well as gaining insight into ongoing operations. The six design principles considered for Operational Excellence are as follows:
- Perform Operations as Code: Apply the same engineering discipline used for application code to the environment. Applications and infrastructure can be deployed and managed as code. Using this methodology, operational procedures can be automated to reduce human error and respond to events, providing consistency across the environment.
- Annotate Documentation: Automate the creation of documentation after every build by using annotations as an input to operations code. This reduces the human error involved in manually documenting an environment; manual documentation is prone to mistakes and may not keep up with configuration drift. Annotated documentation can be used by both humans and systems.
- Make Frequent, Small, Reversible Changes: One of the biggest mindset changes versus traditional data centre workloads. Make regular but small, incremental changes in a way that can be easily reversed if they fail, without affecting customers. This lowers the impact not only to customers but also to individual components.
- Refine Operations Procedures Frequently: Look for ways to improve operations procedures as they are used, and evolve them with the workload. Perform regular “retrospective” meetings to progressively improve procedures. AWS use the term “Game Days”: a time to get the teams together and test procedures during a controlled, simulated busy period on the live environment. A great way to identify improvements without waiting for failures to occur.
- Anticipate Failure: Traditional data centre operations are usually reactive to failures, followed by a post-mortem of that failure. By accepting that failure will occur, the mindset here is to become proactive and perform pre-mortem exercises to identify potential sources of failure so they can be removed. Test different failure scenarios and gain a full understanding of their impact; “Game Days” can also help here.
- Learn From All Operational Failures: Establish a culture of continuous evolution by using the design principles above. Drive evolution by sharing between teams: share the impact of failures from post-mortems, and share lessons learned from operations and projects.
Operational Excellence is then broken down into three focus areas of best practice: Prepare, Operate and Evolve.
The following sections highlight the key takeaways of each area.
Preparation within Operational Excellence relates to understanding the workload and its expected behaviour. With this understanding, the correct procedures can be designed and built to support it.
Understanding a workload can involve multiple teams and their dependencies, such as development, operations and performance teams. It can also include any external regulatory and compliance requirements.
Educate each workload stakeholder on AWS and the impact their team can have on its services. Insight can work with each team to help educate on AWS services as well as advise on external regulatory and compliance requirements. For existing environments, Insight can help identify gaps in operations and provide core checks around the cloud infrastructure.
A large part of preparation is the design of workloads. Once the workload is fully understood, it can be designed in a way which covers the initial deployment and how it is updated and operated, implementing the design principles listed above such as Perform Operations as Code. Infrastructure, applications and operations can all be defined and updated using just code. By defining them in code, application engineering disciplines such as version control, continuous delivery and continuous integration can be adopted.
Version-controlled templates of the infrastructure can be achieved using AWS CloudFormation or HashiCorp's open-source Terraform.
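As a minimal sketch of infrastructure defined as version-controlled code, the function below generates a CloudFormation template in Python; the bucket name, logical resource ID and stack name are illustrative assumptions, not part of any real environment:

```python
import json

def make_bucket_template(bucket_name: str) -> str:
    """Build a minimal CloudFormation template defining one S3 bucket.

    The logical ID "AppDataBucket" and the bucket name are placeholders
    for illustration only.
    """
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example version-controlled infrastructure definition",
        "Resources": {
            "AppDataBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": bucket_name},
            }
        },
    }
    return json.dumps(template, indent=2)

# The resulting JSON can be committed to source control and deployed with
# the AWS CLI, e.g.:
#   aws cloudformation deploy --template-file bucket.json --stack-name app-data
print(make_bucket_template("example-app-data-bucket"))
```

Because the template is plain text, every change to the infrastructure becomes a reviewable, revertible commit in the same repository as the application code.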
Continuing the design principle, Continuous Integration / Continuous Deployment (CI/CD) pipelines should be adopted. Many DevOps tools are available to do this and Insight can advise further on products that could fit. From an AWS perspective they offer a comprehensive set of tools as part of AWS Developer Tools such as AWS CodeCommit, AWS CodeBuild, AWS CodePipeline and AWS CodeDeploy.
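As an illustrative sketch rather than a complete pipeline, the function below assembles a simplified three-stage definition of the shape accepted by the CodePipeline `create_pipeline` API; the repository, build project and application names are placeholders, and the artifact wiring between stages is omitted for brevity:

```python
def build_pipeline_definition(name: str, role_arn: str, artifact_bucket: str) -> dict:
    """Sketch a three-stage CodePipeline definition (source -> build -> deploy).

    All resource names in the stage configurations are hypothetical examples.
    A real definition would also declare input/output artifacts per action.
    """
    def stage(stage_name, action_name, category, provider, config):
        # Each stage holds one action; "owner": "AWS" selects AWS-managed providers.
        return {
            "name": stage_name,
            "actions": [{
                "name": action_name,
                "actionTypeId": {
                    "category": category,
                    "owner": "AWS",
                    "provider": provider,
                    "version": "1",
                },
                "configuration": config,
            }],
        }

    return {
        "name": name,
        "roleArn": role_arn,
        "artifactStore": {"type": "S3", "location": artifact_bucket},
        "stages": [
            stage("Source", "AppSource", "Source", "CodeCommit",
                  {"RepositoryName": "example-app", "BranchName": "main"}),
            stage("Build", "AppBuild", "Build", "CodeBuild",
                  {"ProjectName": "example-app-build"}),
            stage("Deploy", "AppDeploy", "Deploy", "CodeDeploy",
                  {"ApplicationName": "example-app",
                   "DeploymentGroupName": "example-app-group"}),
        ],
    }
```

Defining the pipeline itself as code keeps the delivery process under the same version control and review discipline as the workloads it deploys.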
Once deployed, monitoring and logging are key to operations and, importantly, should be baked in from the start, not bolted on. Logs can be collected from infrastructure components, such as network traffic via VPC Flow Logs, recorded API calls within the environment, and logs from services such as AWS Lambda. Service logs and application logs can be centralised in Amazon CloudWatch. Alarms based on certain log events, such as unauthorised API access to a key application, can then trigger automated actions.
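As a sketch of alerting on a log event such as unauthorised API access, the function below builds the parameters that would be passed to the CloudWatch Logs `put_metric_filter` API to turn matching CloudTrail log entries into a countable metric; the log group, metric name and namespace are illustrative assumptions (the filter pattern itself follows the commonly published one for unauthorised calls):

```python
def unauthorised_api_filter(log_group: str) -> dict:
    """Parameters for logs.put_metric_filter that count unauthorised API
    calls recorded by CloudTrail in a CloudWatch Logs log group.

    The metric name and "Security" namespace are placeholder choices.
    """
    return {
        "logGroupName": log_group,
        "filterName": "UnauthorizedApiCalls",
        # Match CloudTrail entries whose errorCode indicates a denied call.
        "filterPattern": '{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }',
        "metricTransformations": [{
            "metricName": "UnauthorizedApiCallCount",
            "metricNamespace": "Security",
            "metricValue": "1",  # each matching log event adds 1 to the metric
        }],
    }
```

A CloudWatch alarm on the resulting metric can then notify an operations team the moment unauthorised activity appears in the logs.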
As well as logging, key metrics of applications can be monitored using Amazon CloudWatch to understand an application's behaviour and to proactively adjust and scale as required.
By adopting this design principle, the environment can minimise human error and configuration drift. To assist further, the AWS Config service can be leveraged to track changes across the entire estate. By taking snapshots of the environment, it makes it easy to go back and see what changes were made and when.
Operation is focused on understanding operational health and responding to reported events. It is important for teams to understand the ongoing operational health of workloads using metrics-based monitoring with Amazon CloudWatch. These metrics can be used to build custom dashboards for monitoring.
Once the metrics are in place, alarms can be configured to alert when a certain threshold is met or on a state change. Targets for such alarms can include AWS services such as Lambda functions, Amazon SNS topics for notifications, Amazon ECS tasks such as scaling container instances, and AWS Systems Manager automation. CloudWatch can also directly scale services such as Amazon EC2 instances and the Amazon SQS message queuing service based on performance metrics.
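As a hedged sketch of such an alarm, the function below assembles the parameters that would be passed to the CloudWatch `put_metric_alarm` API to notify an SNS topic when an EC2 instance runs hot; the 80% threshold and two five-minute evaluation periods are illustrative values, not recommendations:

```python
def cpu_alarm_params(instance_id: str, sns_topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm: notify an SNS topic when
    an EC2 instance averages over 80% CPU for two consecutive 5-minute periods.

    The alarm name format, threshold and periods are placeholder choices.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,               # evaluate in 5-minute windows
        "EvaluationPeriods": 2,      # two consecutive breaches before alarming
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic the ops team subscribes to
    }
```

The same alarm action list could instead point at a Lambda function or an Auto Scaling policy, which is how alarms move from notification into automated response.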
To elaborate further on the logging covered in the previous section, logs can be centralised in CloudWatch Logs from VPC Flow Logs, AWS Lambda, AWS CloudTrail and agent-based instance application logs. Logs can also be stored in Amazon S3, which can prove cheaper than CloudWatch Logs; Insight can advise on the different approaches. Using CloudWatch Logs, third-party log analysis and Business Intelligence tools can leverage the APIs and SDKs.
Evolution draws further on the design principles: learning from experience and learning by sharing. Learning from experience comes from regularly analysing the environment, not only through post-mortems after a failure but also through discoveries made during “Game Days”. Analyse any failures, regularly review teams' lessons learned and validate improvements.
Centrally collected logs can be easily analysed not only for failures but also for operational activity and changes to the environment and applications. Over time, common trends can be detected and events and activities correlated to outcomes. CloudTrail, which tracks API activity in the environment, can be consumed by third-party software via APIs to detect activity that is abnormal for the environment.
Learning by sharing means sharing information between teams to avoid repeatable errors and to help with deployments using previous experience and challenges. As the environment can be operated as code, that code can be easily shared within and across AWS accounts. Use version-controlled code across the environment and share application libraries, scripts and procedure documentation.
Cross-account access can be easily enabled using AWS Identity and Access Management (IAM), AWS's tool for handling users, groups, passwords, certificates and permissions. Different AWS accounts can share resources by assuming IAM roles configured with granular permissions. This access can be temporary or permanent depending on the requirement. Cross-account management can be achieved by various methods and no one method fits all; Insight can evaluate and recommend the best approach.
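As a minimal sketch of cross-account access, the function below builds the request that would be passed to the STS `assume_role` API to obtain temporary credentials for a role in another account; the account ID, role name, session name and one-hour duration are all placeholder assumptions:

```python
def assume_role_request(account_id: str, role_name: str) -> dict:
    """Parameters for sts.assume_role to obtain temporary credentials for a
    role in another AWS account.

    The session name and duration are illustrative; the role in the target
    account must trust the calling account for the call to succeed.
    """
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
        "RoleSessionName": "cross-account-ops",
        "DurationSeconds": 3600,  # temporary credentials valid for one hour
    }

# With boto3 (not imported here), the call would look like:
#   sts = boto3.client("sts")
#   resp = sts.assume_role(**assume_role_request("111122223333", "OpsReadOnly"))
# resp["Credentials"] then holds the temporary AccessKeyId, SecretAccessKey
# and SessionToken for the target account.
```

Because the credentials expire automatically, this pattern grants temporary access without sharing long-lived keys between accounts.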
Operational Excellence is the first pillar within the AWS Well-Architected Framework and, as with the whole framework, it represents an ongoing effort. By planning the AWS deployment up front and following this methodology, best practice is followed from the start rather than bolted on as an afterthought.
The remaining pillars in the framework will be covered in this blog series.
If you are interested in finding out more, please contact your Insight Account Manager or get in touch via our contact form here.