When superior performance comes at a higher price, innovation makes it affordable. This is quite evident from the way AWS has developed its services:
- gp3, successor to gp2 volumes: It offers the same durability, supported volume size, max. IOPS per volume and max. IOPS per instance. The main difference is that gp3 decouples IOPS, throughput and volume size, and this flexibility to configure each dimension independently is where the savings come in (a small boto3 sketch follows this list).
- AWS Graviton3 processors: Compared to Graviton2, Graviton3 offers up to 25% better compute performance, twice the floating-point performance and improved cryptographic performance. It also delivers up to 3x better performance for machine-learning workloads and supports DDR5 memory, which provides 50% more bandwidth than the DDR4 used with Graviton2.
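As a small illustration of that gp3 flexibility, here is a minimal sketch using boto3 that creates a gp3 volume whose IOPS and throughput are set independently of its size, something gp2 cannot do because its IOPS scale with volume size. The Availability Zone, size and performance numbers are placeholder assumptions.

```python
import boto3

# Minimal sketch: create a gp3 volume with IOPS and throughput chosen
# independently of the volume size. All values below are illustrative.
ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # assumed AZ
    Size=100,                       # GiB
    VolumeType="gp3",
    Iops=4000,                      # gp3 baseline is 3000, configurable up to 16000
    Throughput=250,                 # MiB/s; gp3 baseline is 125, up to 1000
)
print(volume["VolumeId"])
```

With gp2 you would have had to overprovision storage just to buy IOPS; with gp3 you pay for each dimension only when you actually need it.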
Knowing AWS services, however, is only half the battle when it comes to assessing your infrastructure needs. In my previous blog I discussed a number of areas where engineering teams often falter. Be sure to read it: Unpacking our findings from the assessment of numerous infrastructures – Part 1.
What we will discuss here is:
- Are your systems really reliable?
- How do you respond to a security incident?
- How to reduce defects, facilitate repair and improve production flow? (Operational Excellence)
Are your systems really reliable?
Almost 67% of teams showed high risk on the resilience-testing questions, starting with a lack of basic forethought about how things might go wrong and what they would do in that case. To be fair, teams did perform root cause analysis after things actually went wrong, which we can think of as learning from mistakes. But for most of them there is no runbook or procedure for failure investigation and post-incident analysis.
How do you plan for disaster recovery?
Eighty percent of the workloads we reviewed have high risk in this area. Despite disaster recovery being a vital need, many organizations avoid it because of its perceived complexity and cost. Other common reasons were insufficient time, inadequate resources, and an inability to set priorities due to a lack of qualified personnel.
An easy way to start is by noting:
- Recovery point objective (RPO): How much data are you willing to lose?
- Recovery time objective (RTO): How long can you tolerate being down before you must be serving your customers again?
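Once you have numbers for these two objectives, they map directly onto configuration. Below is a minimal sketch, assuming an RPO of one hour and the default AWS Backup vault, that schedules hourly backups with boto3; the plan name, windows and retention are illustrative.

```python
import boto3

backup = boto3.client("backup")

# Minimal sketch: an RPO of one hour translates into an hourly backup schedule.
# Plan name, vault, windows and retention below are assumptions.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "hourly-rpo-1h",
        "Rules": [{
            "RuleName": "hourly",
            "TargetBackupVaultName": "Default",         # vault created by AWS Backup
            "ScheduleExpression": "cron(0 * * * ? *)",  # top of every hour
            "StartWindowMinutes": 60,
            "CompletionWindowMinutes": 120,
            "Lifecycle": {"DeleteAfterDays": 35},
        }],
    },
)
```

The RTO side is then validated by actually restoring from these recovery points and timing the restore, which is exactly the recovery testing mentioned in the best practices further down.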
The next important step is to plan and work on recovery strategies. Let’s take a Lambda function as an example and think through different failure scenarios:
- Manual setup errors: the risk of deploying incorrect code or configuration changes.
- Cold start delay: Lambda needs time to initialize a fresh execution environment, so the first request after a period of inactivity takes longer to serve. This results in a poor user experience.
- Lambda concurrency limits: the risk of setting the concurrency limit too low, so that once it is exceeded the function is throttled and further requests are lost.
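The last two risks can be addressed with explicit configuration rather than defaults. Here is a hedged sketch using boto3 that reserves concurrency for a hypothetical function and keeps a few execution environments warm on an alias; the function name, alias and numbers are assumptions.

```python
import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "orders-api"  # hypothetical function name

# Reserve concurrency so this function cannot exhaust the account-wide pool
# and so you know exactly when throttling will begin.
lambda_client.put_function_concurrency(
    FunctionName=FUNCTION_NAME,
    ReservedConcurrentExecutions=50,
)

# Keep a few execution environments pre-initialized on a published alias
# to take the edge off cold starts for latency-sensitive paths.
lambda_client.put_provisioned_concurrency_config(
    FunctionName=FUNCTION_NAME,
    Qualifier="live",  # assumed alias
    ProvisionedConcurrentExecutions=5,
)
```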
Or answer questions like: what happens to your application if your database goes away? Does it reconnect? Does it reconnect correctly? Does it re-resolve DNS names?
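Those questions are worth answering in code, not just in a document. The minimal sketch below assumes a database behind a DNS record that moves on failover (the hostname and port are placeholders); it re-resolves the name on every attempt and backs off exponentially, with a plain TCP connection standing in for your real driver’s connect call.

```python
import socket
import time

DB_HOST = "db.example.internal"  # assumed hostname behind a failover DNS record
DB_PORT = 5432                   # assumed port

def resolve_db_endpoint(hostname: str) -> str:
    """Re-resolve DNS on every attempt so a failover that moves the record
    to a new IP is picked up instead of reusing a stale, cached address."""
    addrinfo = socket.getaddrinfo(hostname, DB_PORT, proto=socket.IPPROTO_TCP)
    return addrinfo[0][4][0]

def open_connection(ip: str) -> socket.socket:
    """Stand-in for your database driver's connect call."""
    return socket.create_connection((ip, DB_PORT), timeout=5)

def connect_with_retry(max_attempts: int = 5) -> socket.socket:
    """Reconnect loop with exponential backoff, capped at 30 seconds."""
    for attempt in range(max_attempts):
        try:
            return open_connection(resolve_db_endpoint(DB_HOST))
        except OSError:
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError(f"database unreachable after {max_attempts} attempts")
```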
While the cloud takes most of the “hard work” of managing your infrastructure off your plate, it does not manage your applications and business requirements for you.
Some best practices to follow
- Be aware of fixed service quotas, service limits and physical resource limits to prevent service interruptions or financial overruns.
- Validate backup integrity and processes by performing recovery tests.
- Ensure there is a sufficient gap between current quotas and your maximum usage to accommodate failover.
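One way to keep an eye on that gap is to pull the applied quotas programmatically and compare them with your own usage numbers. The hedged sketch below uses the Service Quotas API via boto3 for Lambda; the observed usage figure and the 80% threshold are assumptions.

```python
import boto3

quotas = boto3.client("service-quotas")

# Usage numbers would normally come from your monitoring; these are assumed.
observed_usage = {"Concurrent executions": 850}

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="lambda"):
    for quota in page["Quotas"]:
        name, value = quota["QuotaName"], quota["Value"]
        used = observed_usage.get(name)
        if used is not None and used > 0.8 * value:
            print(f"WARNING: {name}: {used} of {value} used, under 20% headroom")
```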
How do you respond to a security incident?
75% of technology teams are not doing a good job of responding to security incidents; they do not plan ahead for security events. Only 30% of teams knew which tools they would use to mitigate or investigate a security incident.
Here we are talking about security incidents caused by exploitable weaknesses. Some of the common warning signs we observed were:
- Allowing untrusted code execution on your machines.
- Failure to put appropriate access controls in place on storage services, for example leaving an S3 bucket open and potentially making its data public (see the sketch after this list).
- Accidental exposure of API keys, for example by checking them into a public Git repository.
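The S3 case in particular has a cheap, concrete guard. The minimal sketch below, assuming a hypothetical bucket name, turns on all four S3 Block Public Access settings via boto3 so a stray ACL or bucket policy cannot accidentally expose the data.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-reports-bucket"  # hypothetical bucket name

# Block public ACLs and policies so a misconfiguration cannot make objects public.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```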
Another aspect of security is understanding the health of your workload, which includes monitoring and telemetry. The framework distinguishes between monitoring user behavior (real user monitoring) and monitoring workload behavior. This matters because teams are undoubtedly collecting all kinds of data, but not doing much with it.
- More than half of them have clearly defined their KPIs, but fewer have actually established baselines for what normal looks like.
- The number further decreases when it comes to setting alerts for those monitored items.
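Closing that gap is mostly a matter of turning a baseline into an alarm. The hedged sketch below creates a CloudWatch alarm on load balancer response time with boto3; the metric dimension, threshold and SNS topic ARN are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Minimal sketch: once a baseline says "normal latency stays well under 500 ms",
# encode it as an alarm instead of leaving it on a dashboard nobody watches.
cloudwatch.put_metric_alarm(
    AlarmName="api-latency-above-baseline",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],  # assumed
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.5,  # seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # assumed topic
)
```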
Then comes access management and granting least privilege. Although the teams understood what they were doing and what approach they should take, few followed through. There was an almost complete absence of:
- Role-based access mechanism
- Multi-factor authentication
- Password rotation, and
- Secret vaults such as AWS Secrets Manager or HashiCorp Vault (secrets were instead simply injected into application configuration), etc.
In short, automation of credential management is almost non-existent.
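A first step toward automating it is simply reading credentials from a vault at runtime instead of baking them into configuration. The hedged sketch below pulls a secret from AWS Secrets Manager with boto3; the secret name and its JSON layout are assumptions, and the same idea applies to HashiCorp Vault.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

def load_db_credentials(secret_id: str = "prod/orders/db") -> dict:
    """Fetch database credentials at startup; the secret name is hypothetical."""
    response = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

creds = load_db_credentials()
# creds["username"] and creds["password"] are passed to the database driver,
# and rotation can then be handled in Secrets Manager without redeploying.
```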
How to reduce defects, facilitate repair and improve the production setup process?
Yes, finally, we are talking about the pillar – operational excellence. People are pretty familiar with version control systems and use Git (mostly). They do a lot of automated testing in their CIs, basically a lot of smoke tests and integration tests.
Operational excellence is focused on defining, executing, measuring and improving standard operating procedures in response to incidents and customer requests. Following the DevOps philosophy is not enough if the tools and workflow do not support it. A lack of proper documentation and sole dependence on DevOps engineers to drive automation led to burnout; manually cobbling together a solution for each situation resulted in slow workflow development and fragile operations.
According to Gartner, platform engineering is an emerging trend within digital transformation efforts that “improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations.” Hype aside, an internal developer platform is a curated set of tools, capabilities and processes packaged together for ease of use by development teams. Reduced dependence on individuals and a standardized workflow empower engineering teams to scale effectively.
I guess the primary takeaway for us from these reviews was that people today are better at building platforms than they are at securing or managing them. This is a real lesson, and chances are it applies to you too.
What’s next?
Over time, your workloads evolve and adapt to demanding business needs and customers who rely on them heavily, which makes it all the more necessary to ensure they remain secure, reliable and efficient in order to serve those customers well.
You should thoroughly try the AWS Well-Architected Tool available in your AWS console. You can start by working through its questions and following the linked guidance to better understand your own practices.
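The tool also has an API, so the review itself can feed your backlog. A minimal sketch, assuming your workloads are already registered in the AWS Well-Architected Tool and share a ‘prod’ name prefix, pulls out the high-risk questions with boto3.

```python
import boto3

wa = boto3.client("wellarchitected")

# Minimal sketch: list workloads by an assumed name prefix and print the
# questions marked as high risk in the Well-Architected lens review.
workloads = wa.list_workloads(WorkloadNamePrefix="prod")["WorkloadSummaries"]

for workload in workloads:
    answers = wa.list_answers(
        WorkloadId=workload["WorkloadId"],
        LensAlias="wellarchitected",
    )["AnswerSummaries"]
    high_risk = [a["QuestionTitle"] for a in answers if a.get("Risk") == "HIGH"]
    print(workload["WorkloadName"], "high-risk questions:", high_risk)
```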
Remove the ‘AWS’ tag from the Well-Architected Review (WAR) tool and you are still left with best practices that give you a consistent approach to designing secure and scalable systems on the AWS Cloud.