Testing cybersecurity resilience with Chaos Engineering

Chaos engineering principles, when applied to cybersecurity, can help build more dynamic, proactive and responsive security controls in the cloud

$300,000 an hour.

Unfortunately, this isn’t the salary of even the highest paid software engineer in the world. Far from it.

It is the cost incurred for a single hour of server downtime. According to Information Technology Intelligence Consulting’s 2021 Hourly Cost of Downtime survey, 91% of small and mid-sized enterprises (SMEs) and large enterprises stand to lose this amount or more per hour of downtime, sometimes as much as $5 million.

There is no room to gamble when outages have such enormous and immediate business and customer impact.

Chaos Engineering

The scalability and flexibility that come with cloud migration and a distributed or microservices architecture are unparalleled. It is easy to be dazzled when there is so much power at your fingertips – the cloud is your unrestricted playing field. However, the cloud’s complexity also makes it easy to miss the fissures until your interlinked systems are breached or other production-level issues surface.

A good way to test your systems and build resilience, security and reliability at scale, with minimal downtime and faster recovery, is fault injection, better known as chaos engineering.
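At its simplest, fault injection means deliberately delaying or failing parts of a system and watching how the rest copes. Here is a minimal, purely illustrative Python sketch; the decorator, function names and failure rates are hypothetical, and real chaos tooling usually injects faults at the infrastructure or network layer rather than inside application code:

```python
import random
import time
from functools import wraps

def inject_fault(failure_rate=0.05, max_delay_s=2.0):
    """Randomly delay or fail a call to simulate real-world faults.

    Illustrative only: production chaos tooling usually injects faults at
    the infrastructure or network layer, not inside application code.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                # Injected failure: does the caller retry, fall back or crash?
                raise RuntimeError("chaos: injected dependency failure")
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.1)
def fetch_recommendations(user_id):
    # Hypothetical downstream call whose resilience we want to observe
    return ["title-1", "title-2"]
```

Wrapping a downstream call this way lets you observe whether timeouts, retries and fallbacks actually behave the way they were designed to.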

Dare to let the monkey loose?

Recall the era when you ‘streamed’ videos on YouTube and a clip would crash and reload so many times that, halfway through, you had forgotten what it was about?

Streaming giant Netflix took a smart approach to cushion its migration to the cloud over a decade ago. To keep the streaming experience seamless, it decided to test how resilient its distributed systems were when critical components failed. With their in-house tool Chaos Monkey, they deliberately engineered chaos to stress-test their systems and build in fail-safes – by randomly disabling production instances and services. They later released the broader Simian Army toolset as open source.

It’s smart to carry out these experiments in a test environment first rather than jump head-first into production, for obvious reasons. A controlled environment and a limited blast radius give you a defined start point, end point and result, all arrived at within a framework.
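Putting those two ideas together, here is a minimal sketch of the core Chaos Monkey idea, assuming an AWS environment and a hypothetical opt-in tag that bounds the blast radius to instances teams have explicitly marked as fair game. It is not Netflix’s actual tool, just an illustration of the principle:

```python
import random
import boto3  # assumes AWS credentials and permissions are already configured

def terminate_random_instance(tag_key="chaos-opt-in", tag_value="true", dry_run=True):
    """Pick one opted-in, running instance at random and terminate it.

    The opt-in tag is a hypothetical convention that limits the blast radius
    to instances teams have explicitly marked as fair game.
    """
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    candidates = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not candidates:
        return None
    victim = random.choice(candidates)
    if not dry_run:
        ec2.terminate_instances(InstanceIds=[victim])
    return victim
```

Running it in dry-run mode against a test account first is exactly the kind of controlled starting point described above.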

Gartner estimates that by 2023, 80% of organizations that use Chaos Engineering practices as part of SRE initiatives will reduce their mean time to resolution (MTTR) by 90%.

The case for Chaos Engineering

Site reliability engineering (SRE) strategies are hardly new. Chaos engineering, however, puts a smart twist on them. According to Gremlin’s Kolton Andrus, one of the leading voices on the practice, by intentionally injecting harm into systems, organizations can monitor the impact and fix failures before they hurt customer experience. Businesses can avoid costly outages while reducing mean time to detection (MTTD) and MTTR, preparing teams for the unknown, and protecting the customer experience.

These observations are amply backed by data from Gremlin’s inaugural State of Chaos Engineering report (2021). Teams that frequently ran chaos engineering experiments boasted four nines (99.99%) of availability with an MTTR of under an hour.

The report outlined the common causes of SEV0 (catastrophic impact) and SEV1 (critical impact) incidents, which include:

  • Deploying bad code
  • Internal dependency issues
  • Configuration errors in cloud infrastructure or container orchestrator
  • Networking issues
  • Third-party dependency issues
  • Managed service provider issues
  • On-prem machine/infrastructure failure
  • Database/messaging/cache issues

Network and resource attacks are the most popular, accounting for a whopping 84% of experiments, while the most commonly targeted types are hosts (80%) and containers (29%). Adoption is also spreading across a diverse set of teams – from just the SREs earlier to teams managing platform, infrastructure, operations and application development.

Chaos waiting to happen: The cybersecurity angle

In early 2018, Charles Nwatu, then Director of Security at Stitch Fix, and Aaron Rinehart, who developed ChaoSlingr – one of the first ever security chaos engineering tools – wrote a detailed piece on opensource.com. They make a very strong case for security chaos engineering as the next phase of evolution: deliberately introducing security loopholes, imitating user behaviour and mimicking security system glitches or failures can help enterprises build stronger, more robust and targeted cybersecurity strategies.
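In that spirit, a hedged sketch of one such experiment, loosely modelled on ChaoSlingr’s idea of deliberately opening an unauthorized port, could introduce a misconfiguration in a sandbox account and then watch whether detection and response controls catch it. The security group ID and the choice of port below are placeholders:

```python
import boto3  # assumes a sandbox AWS account with suitable permissions

# Placeholder sandbox security group; never run this against production.
SANDBOX_SG_ID = "sg-0123456789abcdef0"

# The deliberately "bad" rule: Telnet open to the world.
BAD_RULE = {
    "IpProtocol": "tcp",
    "FromPort": 23,
    "ToPort": 23,
    "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "security chaos experiment"}],
}

ec2 = boto3.client("ec2")

def inject_misconfiguration():
    """Introduce the bad rule; detection tooling should flag or auto-revert it."""
    ec2.authorize_security_group_ingress(GroupId=SANDBOX_SG_ID, IpPermissions=[BAD_RULE])

def rollback():
    """Remove the injected rule regardless of whether an alert fired."""
    ec2.revoke_security_group_ingress(GroupId=SANDBOX_SG_ID, IpPermissions=[BAD_RULE])
```

The code is only the trigger; the real result of the experiment is whether an alert fires, how quickly, and who responds.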

Some pointers to keep in mind while assessing security chaos testing as your weapon of choice:

  • Keep it simple: Engineering a complicated first-time attack will get you nowhere. Going overboard with a whole host of variables will make you lose the plot – getting your teams to understand the concept and the result. Start simple, with just a single variable: discuss known security gaps and issues within your teams, categorize them, and begin with the one that has the least impact.
  • Get the team involved: Every team in the DevSecOps pipeline should be part of the chaos testing experiments. This will help them understand first-hand how to build, secure and recover the system at fault with minimal outage, and how to translate this to an at-scale scenario.
  • Chart out the nitty-gritties: Clearly charting out the process is a great way to get teams to understand and contribute while assessing whether everything is going according to plan. Define the start, progress, response and end: how the experiment will take off, which application or system will be targeted, what kind of security fault will be injected, how the system should ideally respond, how (and how quickly) the response team tackles any issue that arises, what to do if the experiment gets out of hand, and at what point it will end. This kind of approach also allows fresh and different perspectives to emerge.
  • Set a recovery point: The experiment could throw up unexpected issues or create network instability. Setting a threshold beyond which the experiment is terminated ensures the problem doesn’t cascade. This is also the point at which to execute a quick rollback to the original stable state, with all injected faults removed (see the sketch after this list).
  • Review before a repeat: Deep-dive into the results, discuss the possible fixes and execute them before repeating the experiment. The discussion could surface some of the finer aspects of initiating security controls while helping to gauge the team’s skillsets and readiness to tackle large-scale critical incidents.
  • Validate in the production environment: It is imperative to validate controls in a production environment. Since the test environment is a controlled one, things could work very differently in a real-world scenario.
  • Be proactive and monitor constantly: In a world where the cloud is dynamic and business-critical applications cannot afford to fail, chaos testing proactively builds robust security, resilience and availability. A larger number and greater diversity of security fault scenarios gives response teams the agility, experience, confidence and reference points to plug gaps with minimal downtime. Team members across the board will also be able to spot potential security gaps faster.
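As a closing illustration, here is a hypothetical Python skeleton that ties several of these pointers together: a single injected variable, a defined start and end, an abort threshold acting as the recovery point, and an automatic rollback. Every callable is a placeholder to be supplied by the team running the experiment:

```python
import time

def run_experiment(inject, rollback, probe_success_rate,
                   min_success_rate=0.95, duration_s=300, check_interval_s=10):
    """Run a single injected fault for a fixed window.

    All three callables are placeholders to be supplied by the team:
    inject() introduces the one fault, probe_success_rate() returns the
    observed success rate of the system under test, and rollback() restores
    the known-good state.
    """
    inject()                                  # defined start: one variable only
    deadline = time.time() + duration_s       # defined end point
    try:
        while time.time() < deadline:
            if probe_success_rate() < min_success_rate:
                # Recovery point: abort before the problem cascades
                print("Abort: threshold breached, rolling back early")
                return False
            time.sleep(check_interval_s)
        return True                           # steady state held for the full window
    finally:
        rollback()                            # always return to the stable state
```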