A microservice application with an uptime of 99.9% can be considered highly available, but 0.1% of downtime quickly becomes pronounced as volume increases. Per 1000 requests, there might be only one failure, but per million requests? That's 1000 failures.
It's impossible to eliminate failure in microservice applications. Instead, you need to focus on designing microservices that are failure-tolerant: able to recover gracefully from failures or to mitigate their impact on your system.
In this article, we’ll look at the sources of failure in a microservice application, how to mitigate them or their impact on a system, and how to maximize service availability — essentially, distinct ways to design and ensure a reliable, failure-tolerant microservice.
Sources of failure
Failures lead to unreliable microservices, but in a complex system failure is inevitable: at some point in the lifetime of an application, a failure will occur.
You need to understand the different types of failure your application might be susceptible to, as this would enable you to react rapidly and architect appropriate mitigation strategies.
In a microservice, every point of interaction between your service and another component represents a possible point of failure. There are four major areas where failures can occur:
- Hardware
- Communication
- Dependencies
- Internal
Let’s look at each point separately:
- Hardware:
Regardless of where you run your services, in a public cloud or on-premises, their reliability depends heavily on the physical and virtual infrastructure that underpins them. Failure at this layer is often the most catastrophic because hardware failure affects the operation of multiple services within an organization. Some causes of failure within the hardware layer of an application include:
- Host
- Data center
- Host configuration
- Physical network
- Operating system and resource isolation
- Communication:
Microservices are all about communication: synchronous communication; asynchronous communication; communication with internal dependencies; communication with external dependencies. Failures can occur at any of these points of interaction, and the effect, if not handled properly, can cascade through the whole application.
The possible sources of communication failures include:
- Network
- Firewall
- Messaging systems
- DNS
- Poor health checks
- Dependencies:
A microservice has multiple dependencies, both external and internal. Externally, a microservice application might rely on a third-party API; internally, a service might rely on a database. These dependencies are potential points of failure. Some sources of dependency-related failures include:
- Timeouts
- Non-backwards-compatible functionality
- Internal component failures
- External dependencies
- Internal:
This source of failure is largely caused by poor design, inadequate development, poor testing, or incorrect deployment of services. If a poorly designed service fails in production, its effect on the whole system can be detrimental.
You’ve seen the various areas where failures are likely to occur in your system. Next, let’s look at the different strategies you can utilize to mitigate the impact of these failures in your system.
Strategies to mitigate the impact of failures in a system
If you're out to build a complex system, it is inevitable that failures will occur. With this knowledge, you need to design and build your services to minimize the impact of failures, maximize availability, and recover rapidly.
The following strategies can be employed to help us achieve a fault-tolerant and reliable system:
- Retries
This strategy is a tricky one. It can mask abnormal behavior from end users, but when used wrongly it can exacerbate the original issue and further degrade the system.
Failures might be persistent or isolated. If the failure is isolated, a retry strategy helps mitigate its effect; if the failure is persistent, retries may worsen the issue and further destabilize the system.
You need to use a retry strategy that would improve your system’s resiliency during intermittent failures without leading to the collapse of your system when persistent failures occur. This strategy is known as exponential back-off. The idea is to give a system under load time to recover.
How can this be achieved?
Use a variable delay between successive retries, typically doubling the delay after each failed attempt and adding random jitter, to spread retries out evenly and reduce the frequency of the retry-based load.
Retries are well suited to intermittent failures, but apply them with caution in the face of persistent failures, as they can destabilize your system.
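As an illustration, here is a minimal retry helper with exponential back-off and jitter in Python. The function name, the delay values, and the decision to treat any exception as retryable are assumptions made for this sketch, not prescriptions from the article.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call `operation`, retrying failures with exponential back-off and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Double the delay on each attempt, cap it, and add random jitter
            # so that concurrent callers don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(delay + random.uniform(0, delay))
```

A caller would wrap a single remote call, for example `retry_with_backoff(lambda: fetch_profile(user_id))`, where `fetch_profile` is a hypothetical function that makes the actual request.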
- Timeouts
Resources are consumed while you wait for a response after making a request to another service. You can conserve those resources by setting a deadline or timeout on your HTTP requests. You want to time out if you haven't received any response at all, but not merely because the response is slow to download.
Some errors happen instantly, but many failures are slow. For instance, if a service is overloaded with requests, its responses might be slow, which in turn consumes resources of the calling service while it waits for a response that may never come. If you don't set request timeouts, it's easy for unresponsiveness to cascade through your entire microservice architecture.
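As a sketch, the widely used Python requests library accepts separate connect and read timeouts per request; the service URL and the timeout values below are placeholder assumptions.

```python
import requests

def fetch_orders(base_url="http://orders.internal"):  # hypothetical internal service URL
    # (connect timeout, read timeout) in seconds: fail fast if the service is
    # unreachable or takes too long to start responding.
    try:
        response = requests.get(f"{base_url}/orders", timeout=(0.5, 2.0))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Deadline exceeded: surface the failure (or trigger a fallback) instead
        # of letting the caller block indefinitely.
        raise
```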
- Fallbacks
There are three fallback options we can employ when failures occur:
— Graceful degradation: Say you're building a micro-blogging application; you would want users who are unable to post new updates to still be able to view, like, and interact with other users' updates. The graceful degradation technique is well suited to this situation. Rather than a user being unable to use the application at all because of a single service failure, we degrade that functionality gracefully and let them carry out the rest of the application's features.
— Caching: If we cache the results of a query to a service, then when that service fails we can still provide the last known information to our customers and service collaborators. This also improves the performance of our system and contributes to its reliability (a minimal sketch follows this list).
— Redundancy: Implementing redundancy in your system ensures you have multiple sources you can make requests to in case your primary source fails. If your system is globally distributed, you could even fall back on services hosted in another region — this is how Amazon can serve a given customer from any of their global data centers.
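Here is a minimal sketch of the caching fallback, assuming a hypothetical `fetch_exchange_rates` remote call and a simple in-process dictionary as the cache; a real service would more likely use a shared cache such as Redis.

```python
import logging

_last_known_rates = {}  # in-process cache of the last successful result

def get_exchange_rates(fetch_exchange_rates):
    """Return fresh data when possible, falling back to the last cached value."""
    global _last_known_rates
    try:
        rates = fetch_exchange_rates()  # hypothetical remote call that may fail
        _last_known_rates = rates       # refresh the cache on every success
        return rates
    except Exception:
        logging.warning("rates service unavailable, serving cached data")
        if _last_known_rates:
            return _last_known_rates    # possibly stale, but keeps callers working
        raise                           # nothing cached yet, so propagate the failure
```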
- Circuit breakers:
In electrical wiring, the basic function of a circuit breaker is to interrupt current flow after protective relays detect a fault. Similarly, in distributed systems, a circuit breaker is a pattern for pausing requests made to a failing service to prevent cascading failures.
Two principles are behind the design of a circuit breaker:
— Remote communication should fail quickly in the event of an issue, rather than wasting resources waiting for responses that might never come.
— If a dependency is failing consistently, it’s better to stop making further requests until that dependency recovers.
You can track the number of failed requests; if the error rate exceeds a threshold or configured limit, the circuit breaker is opened. Further requests to the collaborating service should be short-circuited, and appropriate fallbacks should be performed where possible.
The circuit shouldn't stay open forever once it has been opened. The circuit breaker needs to send a trial request to determine whether the service has recovered. In this trial state, the circuit is half open. If the request succeeds, the circuit is closed; otherwise, it remains open. As with other retries, these trial requests should be scheduled with an exponential back-off.
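The sketch below shows one way to express these principles as a tiny circuit breaker. The failure threshold, the reset timeout, and the choice to count every exception as a failure are assumptions of the sketch rather than fixed rules.

```python
import time

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures, then allow a single
    trial call once `reset_timeout` seconds have passed (the half-open state)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request short-circuited")
            # Enough time has passed: fall through and let one trial request probe the service.
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failure_count = 0  # success closes the circuit and resets the count
        self.opened_at = None
        return result
```

A caller would keep one breaker per downstream dependency and wrap each remote call, for example `orders_breaker.call(lambda: fetch_orders())`, combining it with the fallbacks described above when the circuit is open.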
- Asynchronous communication:
Long chains of synchronous communication between services lead to low overall availability of the system. Asynchronous communication, using a communication broker such as a message queue, is a strategy you can use to improve the reliability of your system.
This technique is recommended where you don't need an immediate response. Asynchronous communication enforces a higher level of microservice autonomy and helps prevent the problems common to services that interact synchronously.
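As an illustration, a producer can publish an event to a broker and move on without waiting for the consumer. The sketch below assumes RabbitMQ with the pika client; the broker address, queue name, and message shape are placeholders, and any message queue would serve the same purpose.

```python
import json

import pika  # RabbitMQ client library, assumed to be installed

def publish_order_created(order_id):
    """Publish an event and return immediately; consumers process it later."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="order-events", durable=True)  # survive broker restarts
    channel.basic_publish(
        exchange="",
        routing_key="order-events",
        body=json.dumps({"type": "order_created", "order_id": order_id}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()
```

A production service would reuse the connection and handle publish failures (for example with the retry strategy above); the point of the sketch is that the caller never blocks waiting on a downstream consumer.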
Maximizing service availability
In the previous section, we looked at the strategies that can be employed to ensure fault tolerance in the communication between services. In this section, we’ll explore techniques we can use to maximize availability within an individual service:
- Load balancing and service health:
You can deploy multiple instances of your application to ensure redundancy. A load balancer will distribute requests from other services between those instances, depending on their health and ability to serve requests.
With the strategies described so far, you could only ascertain a service's health and ability to serve requests by actually making a request to it. That's not optimal: you want to know, before making a request, whether the service is healthy enough to serve it. This can be achieved through health checks.
Every service should implement an appropriate health check. If a service instance becomes unhealthy, it should no longer receive traffic from other services. Health checks can be classified into two types:
— Liveness: This check determines whether an application has started and is running, and if it’s able to accept requests and respond. If a service instance is unhealthy — if it’s unresponsive, or returns an error message — the load balancer should not deliver requests there.
— Readiness: That a service is alive doesn't guarantee that it will serve requests. The readiness check indicates whether a service is ready to serve requests; it also helps us determine whether the service's dependencies — databases, configuration, third-party services, etc. — are healthy enough for it to do so.
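Below is a minimal sketch of separate liveness and readiness endpoints, assuming the Flask framework and a hypothetical `database_is_reachable()` helper that probes a dependency; the paths and response shapes are arbitrary choices.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_is_reachable():
    # Hypothetical dependency probe, e.g. a cheap "SELECT 1" against the database.
    return True

@app.route("/health/live")
def liveness():
    # The process is up and able to respond at all.
    return jsonify(status="alive"), 200

@app.route("/health/ready")
def readiness():
    # Report ready only when required dependencies are available, so the load
    # balancer stops routing traffic here when they are not.
    if database_is_reachable():
        return jsonify(status="ready"), 200
    return jsonify(status="not ready"), 503
```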
- Rate limits:
If service calls are not limited, unhealthy usage patterns can arise. Upstream collaborators might make several calls where a single batch call would be more appropriate. Explicitly limiting the rate of requests, or the total number of requests available to a collaborating service within a time window, is a strategy we can employ to ensure that a service isn't overloaded.
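One common way to enforce such a limit is a token bucket. The sketch below is an in-process illustration with assumed capacity and refill values; a real deployment would usually enforce limits at an API gateway or in a shared store so that all instances see the same counts.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilling at `refill_rate` tokens per second."""

    def __init__(self, capacity=100, refill_rate=10.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self):
        now = time.monotonic()
        # Add tokens for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller should reject the request, e.g. with HTTP 429
```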
Performance testing
The strategies and techniques we’ve explored will enable you to maximize service availability, but you need a way to validate that your services can tolerate failure and recover gracefully. This can be achieved through thorough testing. Testing provides assurance that your design is effective when both predictable and unpredictable failures occur. In this section we’ll explore load testing and chaos testing:
- Load testing:
Load testing helps you understand how an application behaves when a large volume of requests and data flows between services. It also helps you expose components of the application that are not optimized for scalability. With this, you can prevent failures caused by heavy user load in the production environment.
These are some tips to keep in mind when load testing microservices (a minimal load-test sketch follows the list):
- Aim for high-risk services, rather than 100% testing
- Leverage multiple runtime environments
- Test to a service level agreement
- Measure beyond request/response ratio as performance metric
- Use service virtualization, instead of waiting for functional dependencies
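As an example of what such a test might look like, the sketch below uses Locust, an open-source load-testing tool chosen here as an assumption rather than a recommendation from the article; the endpoints, task weights, and wait times are placeholders.

```python
from locust import HttpUser, task, between

class StorefrontUser(HttpUser):
    # Each simulated user pauses 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def browse_catalog(self):
        self.client.get("/products")  # hypothetical read-heavy endpoint

    @task(1)
    def place_order(self):
        self.client.post("/orders", json={"product_id": 42, "quantity": 1})
```

You would point a run at a staging environment (for example `locust -f loadtest.py --host https://staging.example.com`) and ramp up the user count while watching latency, error rate, and resource usage against your service level agreement.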
- Chaos testing:
Chaos testing pushes your microservice application to fail in production. It tests for failures caused by external factors, that is, failures that don't arise from within the microservices themselves: network failures, virtual machine failures, database failures, and so on. By introducing instability and failure, it mimics real system failures. This exposes your system's capability to withstand real chaos and prepares your engineering team to react to those failures.
According to the Principles of Chaos Engineering website, chaos tests are experiments to uncover systemic weaknesses, and they can be carried out via these four steps:
- Define the measurable output of a normal system as ‘steady state.’
- Hypothesize that this system will be unchanged in both the control group and experimental group.
- Introduce variables that reflect real-world failure events — for example, removing servers, malfunctioning hard drives, severed network connections, spikes in traffic, etc.
- Try to disprove the hypothesis in the second step by looking for a difference in the steady state between the control group and the experimental group.
If the steady state remains unchanged, we’ll have an increased confidence in our system, but if a weakness is found, we’ll know what part of our system to improve to prevent or reduce the probability of that failure occurring in production.
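As a crude sketch of the "introduce variables" step, the script below kills a random Docker container, assuming the services under test run as local containers. Real chaos experiments would use dedicated tooling, guard rails, and a defined blast radius rather than an unrestricted script like this.

```python
import random
import subprocess

def kill_random_container():
    """Pick one running Docker container at random and kill it, simulating a crash."""
    result = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    )
    container_ids = result.stdout.split()
    if not container_ids:
        print("no running containers to disrupt")
        return
    victim = random.choice(container_ids)
    subprocess.run(["docker", "kill", victim], check=True)
    print(f"killed container {victim}; now check whether the steady state holds")
```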
In a microservice application, failure is inevitable. In this article, we looked at the various points where failures are likely to occur in a microservice application, explored the strategies and techniques we can use to mitigate the impact of those failures on our system, and finally saw how we can use performance testing to validate that our services are truly fault-tolerant and can recover gracefully.
Additional resources
- Strategies for handling partial failure
- Designing a Microservices Architecture for Failure - RisingStack Engineering
- Resiliency and high availability in microservices