AWS Deep Dive

AWS Well-Architected Framework

The Pillars of the Framework


Security

Best Practices

Key question: How do you securely operate your workload?

(Recall here that the Well-Architected Framework uses “workload” to mean “the collection of resources and processes that provides an atomic business function”.)

This section ends with the recommendation that each workload have a dedicated AWS account. That makes sense conceptually, and it is the first guidance I’ve seen on how to determine account boundaries.

Identity and Access Management

Finally, a definition of “principal”!

Principal: Something that performs an action within an AWS account (e.g., accounts themselves, users, roles, and — in some cases — services).

Key questions:

Identities are established through things like logins, AWS access keys, and identity providers (IdPs). Permissions are managed using roles and IAM policies (which can be attached to roles, or directly to users and groups; machines such as EC2 instances get their permissions through a role).
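To make the "permissions are managed using IAM policies" part concrete, here is a minimal sketch of what an identity-based policy document looks like, built as a Python dictionary and printed as JSON. The function name and the bucket name are my own placeholders, not anything from the framework; the `Version`, `Statement`, `Effect`, `Action`, and `Resource` keys are the standard policy elements.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build a policy document granting read-only access to one S3 bucket.

    The bucket name is a hypothetical input; nothing is sent to AWS here.
    """
    return {
        "Version": "2012-10-17",  # version of the IAM policy language
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_bucket_policy("example-reports-bucket"), indent=2))
```

A document like this could be attached to a role, and the role then assumed by whichever principal needs the access, which keeps the permission definition separate from the identity.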


Detection

Key question: How do you detect and investigate security events?

Infrastructure Protection

Key questions:

I knew that AMIs were used as images for EC2 instances, but apparently they’re also used by Amazon Elastic Container Service (for its EC2 container instances) and AWS Elastic Beanstalk.

Data Protection

Key questions:

Incident Response

Interesting idea: Using CloudFormation to spin up known-clean, isolated forensics environments. (That said, given that almost everything you deal with as an admin in AWS is virtualized and never accessed directly, there are real limits on how much you can do here. Still, this is useful, especially if/when the forensics investigation becomes a legal investigation…)
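As a rough illustration of the CloudFormation idea, the sketch below builds a template for an isolated network as a Python dictionary and serializes it to JSON. The resource names (`ForensicsVpc` and so on) and the CIDR range are my own assumptions. The isolation here comes from two things: the template defines no internet gateway, and the security group's only egress rule points at localhost, which replaces the default allow-all egress rule. A real forensics environment would need considerably more than this.

```python
import json


def forensics_template(cidr: str = "10.99.0.0/24") -> dict:
    """Return a CloudFormation template for a VPC with no route to the internet.

    All logical names and the CIDR block are illustrative placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Isolated forensics environment (illustrative sketch)",
        "Resources": {
            "ForensicsVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": cidr},
            },
            "ForensicsSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    "VpcId": {"Ref": "ForensicsVpc"},
                    "CidrBlock": cidr,
                },
            },
            "ForensicsSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "No inbound rules; outbound limited to localhost",
                    "VpcId": {"Ref": "ForensicsVpc"},
                    # Specifying any egress rule removes the default allow-all rule.
                    "SecurityGroupEgress": [
                        {"IpProtocol": "-1", "CidrIp": "127.0.0.1/32"}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(forensics_template(), indent=2))
```

Because the environment is defined as a template, the same known-clean stack can be created fresh for each investigation and deleted afterwards.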

Key question: How do you anticipate, respond to, and recover from incidents?

I wonder if there’s a way to take memory snapshots of EC2 instances…?


Reliability

The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.

Design Principles

Amazon suggests monitoring KPIs that measure some aspect of the business value a service provides, rather than purely technical operational performance. I’m not 100% sure what I think of this: some aspects of technical performance matter for providing business value, but don’t translate into that value in a straightforward way.

There’s also an emphasis on simulating workload failures in this section. This is obviously a lot easier to do with turnkey infrastructure.
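To get a feel for what simulating a failure means at the smallest possible scale, here is a toy sketch of fault injection in plain Python. It is not an AWS tool or API; `inject_failures`, `call_with_retries`, and `lookup_price` are all names I made up. The wrapper makes chosen calls fail on purpose, so you can check whether the calling code recovers.

```python
class InjectedFailure(Exception):
    """Raised in place of a real outage during a failure simulation."""


def inject_failures(func, fail_on_calls):
    """Wrap func so that the given 1-based call numbers raise InjectedFailure."""
    calls = {"count": 0}

    def wrapper(*args, **kwargs):
        calls["count"] += 1
        if calls["count"] in fail_on_calls:
            raise InjectedFailure(f"simulated failure on call {calls['count']}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retries(func, attempts=3):
    """Call func up to `attempts` times and return the first successful result."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as error:
            last_error = error
    # Every attempt failed, so surface the last simulated failure.
    raise last_error


def lookup_price():
    """Stand-in for a call to some dependency."""
    return 42


# The first two calls fail, the third succeeds, so the retry loop recovers.
flaky_lookup = inject_failures(lookup_price, fail_on_calls={1, 2})
result = call_with_retries(flaky_lookup)
```

The same experiment with failures on all three calls should end in an error, which is exactly the kind of behavior you want to find out about in a test rather than in production.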

Best Practices

Key questions:

These questions form a set with the “Infrastructure Protection” questions above. However, rather than focusing on the security parameters of the architecture, they are about capacity and structure.

Workload Architecture

Key questions:

These are all actually hard questions! Lots of trade-offs here, especially w.r.t. performance.

Change Management

Key questions:

Failure Management

Key questions:

I’d be curious what the best practices for implementing fault isolation are. I assume the key is a redundant, modular design within workloads, and the prevention of cross-module dependencies.
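One pattern I know of for limiting cross-module dependencies is the circuit breaker: after a module's dependency fails repeatedly, callers stop reaching it and fail fast instead. I am not claiming this is what the framework prescribes. The sketch below is a deliberately minimal version with hypothetical names, and it leaves out the timed "half-open" recovery step that real implementations usually have.

```python
class CircuitOpenError(Exception):
    """Raised when a call is short-circuited instead of reaching the dependency."""


class CircuitBreaker:
    """Stop calling a dependency after too many consecutive failures."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        # "Open" means the dependency is currently isolated from callers.
        return self.consecutive_failures >= self.failure_threshold

    def call(self, func, *args, **kwargs):
        if self.is_open:
            raise CircuitOpenError("dependency isolated after repeated failures")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success clears the failure streak.
        self.consecutive_failures = 0
        return result

    def reset(self):
        """Manually close the circuit, e.g. once the dependency is healthy again."""
        self.consecutive_failures = 0
```

Each module would wrap its calls to another module in its own breaker, so a failing dependency costs callers a fast error rather than tying up their resources while they wait.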