Design for failure

Photo by Ivan Vranić on Unsplash https://unsplash.com/@hvranic

  • On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes.

  • Amazon Web Services (AWS) had a widespread outage in US-EAST-1 on November 25, 2020, which impacted thousands of third-party online services for several hours.

  • Google Cloud had a 45-minute outage on December 14, 2020, caused by a failure in its global authentication system.

These are some examples of infrastructure outages at major cloud companies within a period of only six weeks.
These outages had a severe impact on many companies, ranging from being completely offline to not being able to send email, and they show how dependent you can be on other systems.
And not only outages of third-party components, but also outages of your own services can have a catastrophic impact on your system.

When you apply the principle “Design for Failure”, your system should be able to react to outages and failures of other components.
Ideally it would be highly fault-tolerant, so the system keeps running with full functionality even when things are failing.
This is the case for the power supply of a data center: when the primary power supply goes down, backup generators take over the complete power supply for the data center.

This can be very difficult or expensive, so most of the time you have to choose graceful degradation, where only the most important functionality stays available. To stick with the power supply example: in a hospital, the backup generators will only support the most important functions, like the ICU and the operating theaters.

An example in software development is designing, building, or changing a system for ordering photos.
You can imagine this will consist of:

  • An authentication component;
  • A CDN for storing the photos;
  • A frontend for showing and searching photos;
  • Cloud VMs as web servers;
  • A stock component;
  • A payment component;
  • An ordering component;
  • A cloud database.

You should think about risks like these and how to mitigate them:

Risk                                      Mitigation
The authentication component is down      Still serve viewing of photos
Network traffic is getting out of hand    Serve photos at a lower quality
The database is down                      A geo-redundant database takes over
A web server goes down                    A load balancer routes traffic to the healthy servers
Can I do zero-downtime deployments?       A load balancer and a redundant database
The stock component is down               The ordering component still accepts orders, even if out of stock
The payment component is down             The ordering component still accepts orders, paid by invoice
The ordering component is down            Users can mark their interest in photos, so they can order later

As you can see, the design is not limited to the software; it also affects the infrastructure (and maybe the hardware). Even the way you deploy your applications must be designed with no (or limited) downtime in mind.

Another important factor is build versus buy. For instance, Auth0 has an uptime above 99.99%. Are you able (in skills, time, and money) to build an authentication component with the same uptime?

For building your own components, I will describe some common patterns and best practices for building and operating a highly available system, and how it should interact with other components. Because each topic deserves an article of its own, I will only scratch the surface of each.

To keep it clear, I will talk about the system and its components (software that can run independently, like an API or a web frontend). Components run on instances (the infrastructure on which a component is deployed). I assume that a component is always deployed on multiple instances (step 1 for availability).

change management

Roughly 70% of outages are due to deployments of new software or configuration changes on live systems. You must design your change process so that it can quickly identify failures and restore normal operation as soon as possible.

redundancy

For availability, redundancy is a no-brainer. To avoid dependency on a single cloud vendor you can go for a multi-cloud strategy: for instance, run your web servers on both Azure and Google Compute Engine, or run one Kubernetes cluster on premises and one in the cloud. Redundant components can be on hot, warm, or cold standby, depending on how fast a component must be available after a failure.

immutable infrastructure

When you use immutable infrastructure, you never do an in-place upgrade. Once an instance exists, it is never changed. For an upgrade you create a brand-new instance; if there is any error you throw it away and try again, and if it is successful you switch your traffic over. Blue/green and canary deployments give you a rolling strategy for deploying components.

canary deployment

With canary deployment you deploy the component to a limited number of instances. Once it is clear that everything is functioning normally, the deployment can be rolled out to the remaining instances.
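
The traffic split itself is usually handled by a load balancer or service mesh, but the idea fits in a few lines of Python; the pool names and the 5% canary share below are purely illustrative:

    import random

    def route(request_id: str) -> str:
        # Send roughly 5% of the traffic to the canary instances; the rest
        # goes to the stable fleet (both pool names are made-up examples).
        return "canary-pool" if random.random() < 0.05 else "stable-pool"

    print(route("req-1"))  # widen the canary share once the error rate looks normal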

blue/green deployment

In this case you run two sets of instances, one active (blue) and one inactive (green). You deploy the new component to the inactive environment and then route the traffic to the new deployment. In case of failures you can route the traffic back to the old deployment.

feature flags

With feature flags you are able to switch (new) functionality on or off. This is very helpful for continuous delivery, because you can deploy with unfinished functionality switched off, but it is also good for recovering from failures introduced by new software: just switch the new functionality off and go back to the proven behaviour.
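
As a minimal sketch (the flag names and the in-memory store are made up; real systems usually read flags from a config service or a tool like Unleash or LaunchDarkly), a feature flag can be little more than a lookup that defaults to off:

    FLAGS = {
        "new-checkout-flow": False,  # deployed, but switched off until it is finished
        "photo-watermark": True,
    }

    def is_enabled(flag: str) -> bool:
        # Unknown flags default to off, so unfinished functionality stays dark.
        return FLAGS.get(flag, False)

    def checkout(order_id: str) -> str:
        if is_enabled("new-checkout-flow"):
            return f"order {order_id} handled by the new flow"
        return f"order {order_id} handled by the proven flow"

    print(checkout("42"))  # flip the flag to roll forward, or back after a failure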

operations

After deployment you need to check whether your system is behaving properly. It is preferable to automate these checks, and to automate the mitigating actions as well.

observability

An important aspect is the observability of the system. We need data (buzzwords: KPI, SLI, SLO) to see whether our system is healthy, we need logging to see whether the flow through our components behaves correctly, and we need monitoring to raise alerts and warnings. There are some great tools, like the ELK stack (Elasticsearch, Logstash, Kibana), the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor), Prometheus, Splunk, and Grafana, each with its own benefits and complexity.

It is important to make a conscious decision about which tools you are going to use for which purpose. You don't want to be stuck with having everything in your logging stack, but you also don't want to learn a gazillion new tools.

logging

Logs should be aggregated in a single place (in tools like Splunk or Elasticsearch). It is also important that log levels are used correctly: an error must be an error, not a warning or some business exception. It is a good idea to add a correlation or transit ID to the logging. This is a unique identifier for a certain chain of transactions or events and is passed on with every next request in the chain, which makes it much easier to identify and investigate issues in the system.
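
A sketch of what that can look like with Python's standard logging module; the component names are hypothetical, and in a real system the ID would travel to the next component as, for example, an HTTP header:

    import logging
    import uuid

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [corr=%(correlation_id)s] %(message)s",
    )

    def handle_order(correlation_id: str = None) -> None:
        # Reuse the caller's ID if there is one; otherwise this request starts a new chain.
        correlation_id = correlation_id or str(uuid.uuid4())
        log = logging.LoggerAdapter(logging.getLogger("ordering"),
                                    {"correlation_id": correlation_id})
        log.info("order received")
        # Pass the same ID on to the stock component, so its log lines can be
        # joined with ours in the log aggregator.
        log.info("calling stock component")

    handle_order()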

metrics

Metrics are important indicators of system health. Data from your instances, like response times, CPU usage, and the number of requests, should be collected and stored in a time-series database. Next to the metrics of the instance, some metrics about the component itself are also valuable. For instance, you can track how often a method is called and what the response time of that method is. You can use something like StatsD; it has libraries for most programming languages and plugins for most databases.
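
A sketch with the Python statsd client; it assumes the statsd package is installed and a StatsD daemon is listening on localhost:8125, and the metric names are invented for the example:

    import statsd

    stats = statsd.StatsClient("localhost", 8125, prefix="photostore.api")

    @stats.timer("search_photos")          # records the response time of every call
    def search_photos(query: str) -> list:
        stats.incr("search_photos.calls")  # counts how often the method is called
        return ["photo-1", "photo-2"]

    search_photos("sunset")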

monitoring

From the logs and the metrics you can create dashboards to visualize the state of the system and its components. The monitoring can also hold the rules for sending out alerts when a threshold is crossed.

health-check

Each instance must report its health. That is the health of the instance itself (CPU, memory usage, disk space), but also the health of the component running on it. The component should report whether it is able to connect to the components it depends on. For instance, an API should have a health endpoint that also reports whether it can reach the database and whether the database is fast enough. Mind that the database itself should also report its health, so in case of an issue you can determine where the root cause is (the database or the component).
These health-checks can be used in dashboards for monitoring, but they can also be used by load balancers and for event-driven automation.
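
A sketch of such an endpoint using Flask; the route, the latency threshold, and the database check are illustrative, and a real check would run an actual query against the real database:

    import time
    from flask import Flask, jsonify

    app = Flask(__name__)

    def check_database() -> dict:
        # Placeholder: a real implementation would run e.g. "SELECT 1" and time it.
        start = time.monotonic()
        reachable = True
        latency_ms = round((time.monotonic() - start) * 1000, 1)
        return {"reachable": reachable, "latency_ms": latency_ms,
                "healthy": reachable and latency_ms < 100}

    @app.route("/health")
    def health():
        db = check_database()
        ok = db["healthy"]
        # Load balancers and monitoring act on the HTTP status code.
        return jsonify({"status": "healthy" if ok else "degraded", "database": db}), \
               (200 if ok else 503)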

service registry

A service registry is like a data store of components, including information about the available instances and their locations. When a component needs to call another component, it gets a healthy instance from the registry. The registry must verify whether an instance is healthy by calling its health-check. In a simple system this can be implemented with an NGINX load balancer, but in more complicated systems you can use tools like Consul or F5 BIG-IP.
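
A toy in-memory version of the idea; the component name, the addresses, and the 30-second freshness window are invented, and tools like Consul do this (plus the health-checking itself) for you:

    import random
    import time

    # component -> {instance address: timestamp of its last successful health-check}
    registry = {
        "ordering": {"10.0.0.5:8080": time.time(),
                     "10.0.0.6:8080": time.time() - 120},  # stale: check failed a while ago
    }

    def healthy_instance(component: str, max_age_s: float = 30.0) -> str:
        # Only hand out instances whose health-check succeeded recently.
        now = time.time()
        candidates = [addr for addr, last_ok in registry[component].items()
                      if now - last_ok <= max_age_s]
        if not candidates:
            raise RuntimeError(f"no healthy instance of {component}")
        return random.choice(candidates)

    print(healthy_instance("ordering"))  # -> 10.0.0.5:8080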

event driven automation

By monitoring your components you can restart unhealthy instances, or just restart a single service on an instance. Ideally a component takes the necessary steps to recover from a failure itself, but you can also implement this in an external system that monitors the health. Self-healing can be very useful, but you must be aware of flapping systems, where your self-healing process restarts components over and over again. Most implementations are IFTTT-like applications such as StackStorm, Puppet Relay, or Zapier.
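
A minimal watcher in that spirit; the health URL, the systemd unit name, and the restart limit are all made-up examples:

    import subprocess
    import time
    import urllib.request

    def healthy(url: str = "http://localhost:8080/health") -> bool:
        try:
            return urllib.request.urlopen(url, timeout=2).status == 200
        except OSError:
            return False

    restarts = 0
    while True:
        if healthy():
            restarts = 0
        elif restarts >= 3:
            # Stop condition against flapping: don't keep restarting forever.
            raise SystemExit("still unhealthy after 3 restarts, alert a human")
        else:
            subprocess.run(["systemctl", "restart", "ordering.service"])
            restarts += 1
        time.sleep(30)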

isolation

When a component is failing, you want to limit the impact on the rest of the system: the so-called blast radius must be limited. You must be aware of cascading effects between your components. When your stock component fails, you don't want your ordering component to go down as well; just let it run, with the risk that an ordered article is sold out or has a later delivery.

failover caching

Failover caching provides the necessary data to a component when another component is not available. The cache can still provide the (possibly outdated) data, and outdated data is better than nothing.
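
A sketch of the pattern; the stock component call is a stand-in that simply fails, so the value served comes from the (five-minute-old) cache:

    import time

    cache = {}  # photo_id -> (stored_at, value)

    def call_stock_component(photo_id: str) -> int:
        # Hypothetical downstream call; it fails here to simulate an outage.
        raise ConnectionError("stock component unavailable")

    def get_stock(photo_id: str) -> int:
        try:
            value = call_stock_component(photo_id)
            cache[photo_id] = (time.time(), value)   # refresh the failover cache
            return value
        except ConnectionError:
            if photo_id in cache:
                _stored_at, value = cache[photo_id]
                return value                         # possibly outdated, better than nothing
            raise

    cache["photo-7"] = (time.time() - 300, 3)        # cached five minutes ago
    print(get_stock("photo-7"))                      # -> 3, served from the cache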

exponential back-off

When a component is unresponsive it is tempting to call it again and again, but that way you will overload an already overloaded system. With exponential back-off you don't retry a call immediately; instead you multiply the waiting time by a certain factor before calling again: after 500 ms, then 1000 ms, then 2000 ms, and so on. This limits the load on the target component and gives it time to cool down. You do need to give some thought to the design of the (asynchronous) call, have a proper callback handler, and think about stop conditions.
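
A sketch of a retry helper with exponential back-off; the base delay, factor, jitter, and retry limit are illustrative choices:

    import random
    import time

    def call_with_backoff(call, base_delay=0.5, factor=2, max_retries=5):
        # Wait 0.5 s, 1 s, 2 s, 4 s, ... between attempts (plus a little jitter)
        # and give up after max_retries, so the target gets time to cool down.
        for attempt in range(max_retries):
            try:
                return call()
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise                      # stop condition reached
                delay = base_delay * (factor ** attempt) + random.uniform(0, 0.1)
                time.sleep(delay)

    def flaky_api():
        raise ConnectionError("still overloaded")

    try:
        call_with_backoff(flaky_api)
    except ConnectionError:
        print("gave up after the configured number of retries")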

A real-world example: due to some dodgy coding (it was me), we were calling an external API every 10 ms, with an exponential back-off with factor 2. We deployed it on a Friday (yes, we were not scared). The next Monday the creators of the external API asked WTF we were doing, but we didn't break their system.

circuit breakers

On the other side, the receiving component is tempted to try to process every request, and when being overloaded you do not control the number of calls you receive. To prevent a component from being overloaded you can add a circuit breaker: if the component is overloaded, the circuit breaker trips and incoming requests fail immediately without being handled by the component.
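
A minimal circuit breaker sketch; the failure threshold and the reset timeout are illustrative, and existing libraries implement this (including half-open probing) more thoroughly:

    import time

    class CircuitBreaker:
        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures   # consecutive failures before tripping
            self.reset_after = reset_after     # seconds before allowing a trial call
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None          # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()   # trip the breaker
                raise
            self.failures = 0                  # success resets the count
            return result

    # usage sketch: breaker = CircuitBreaker(); breaker.call(fetch_stock, "photo-7")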

rate limiters

Rate limiting means limiting the number of requests processed for a particular client during a given timeframe. This way you can throttle demanding customers or prioritize more valuable customers over others.
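
A sketch of a per-client sliding-window limiter; the 10 requests per 60 seconds are made-up numbers:

    import time
    from collections import defaultdict

    LIMIT = 10                   # allowed requests per client...
    WINDOW_S = 60                # ...within this many seconds
    recent = defaultdict(list)   # client_id -> timestamps of recent requests

    def allow(client_id: str) -> bool:
        now = time.time()
        recent[client_id] = [t for t in recent[client_id] if now - t < WINDOW_S]
        if len(recent[client_id]) >= LIMIT:
            return False         # over the limit: reject, queue, or deprioritize
        recent[client_id].append(now)
        return True

    print(allow("demanding-customer"))   # True for the first 10 calls in a minute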

load shedders

Load shedders are like circuit breakers: when an instance approaches overload, it rejects excess requests. But a load shedder also ensures there are always enough resources to serve critical transactions: it keeps some resources reserved for high-priority requests and doesn't allow low-priority transactions to use all of them.
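
The core idea in a few lines; the capacity numbers and priority labels are illustrative:

    CAPACITY = 100               # total requests the instance can handle at once (example)
    RESERVED = 20                # head-room kept free for critical transactions
    in_flight = 90               # pretend we are already close to overload

    def accept(priority: str) -> bool:
        # Low-priority work is shed first, so critical requests always fit.
        limit = CAPACITY if priority == "critical" else CAPACITY - RESERVED
        return in_flight < limit

    print(accept("low"), accept("critical"))   # -> False True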

bulkheads

Bulkheads can be applied to segregate resources. With the bulkhead pattern you protect limited resources from being exhausted. Suppose your payment component uses the same data API as your frontend; the frontend can flood the API with requests, but you want to make sure the payment process keeps working. The payment process should therefore use another instance of the data API, so it is not affected by the traffic from the frontend, and the component that overuses the API won't bring down all the other components.
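
The example above separates API instances; the same idea also works inside one process by giving each consumer its own bounded pool, as in this sketch (the pool sizes are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    # Separate, bounded pools per consumer: the frontend can exhaust its own
    # bulkhead without taking capacity away from the payment component.
    frontend_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="frontend")
    payment_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="payment")

    def query_data_api(request_id: str) -> str:
        return f"data for {request_id}"

    # The frontend floods its own pool with requests...
    frontend_futures = [frontend_pool.submit(query_data_api, f"fe-{i}") for i in range(1000)]
    # ...while a payment request still gets a worker immediately.
    print(payment_pool.submit(query_data_api, "pay-1").result())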

defensive programming

As Steven Seagal says in Under Siege, “assumption is the mother of all fuckups”. In IT, Murphy's law deeply applies.
So when coding: always validate input, handle exceptions properly, always check for null values, and think about default return values. You should do this not only for calls outside your component, but even for calls inside the component, even though everybody says it is not going to happen.
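
A small example of what that looks like in practice; the function and its rules are invented for illustration:

    def order_photo(photo_id, quantity):
        # Validate input instead of assuming the caller behaves.
        if photo_id is None or not str(photo_id).strip():
            raise ValueError("photo_id is required")
        try:
            quantity = int(quantity)
        except (TypeError, ValueError):
            quantity = 1                      # sensible default instead of crashing
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        return {"photo_id": str(photo_id), "quantity": quantity}

    print(order_photo("photo-7", "2"))        # -> {'photo_id': 'photo-7', 'quantity': 2}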

testing

Even when you’ve made sure all risks are mitigated, still something unexpected can and will happen. Tests will help you find issues you didn’t even think about.

failover test

You should check whether the switch to redundant instances works correctly. Especially for cold or warm standby instances, you should regularly switch to the standby instance to see whether failover still works. A special form is the disaster recovery test, where you check for the complete outage of a huge part of your infrastructure (like your data center burning down).

disaster recovery

A disaster recovery (DR) test is the examination of each step in a disaster recovery plan, as outlined in an organization's business continuity/disaster recovery (BCDR) planning process. Evaluating the DR plan helps ensure that an organization can recover data, restore business-critical applications, and continue operations after an interruption of services.

chaos testing

One of the most popular testing solutions is the Chaos Monkey resiliency tool by Netflix, which randomly terminates instances in production so you can verify that the system survives that kind of failure.

Conclusion

Designing, building, or changing a highly available system is not easy, but once you have laid out the foundations, stick to them: understand the patterns, implement them, monitor, and see where you have to improve.
Focus on the important parts of your system and choose wisely where to put in your own effort and where to use or buy the tools available. Be aware that there is no silver bullet, and that most systems are not Netflix or Twitter, so you can get away with much less complex solutions.