This content originally appeared on Level Up Coding - Medium and was authored by Chris Bao
In this article, I want to talk about the circuit breaker pattern, based on the popular open source project hystrix (in fact, I will look at the Golang version, hystrix-go, instead of the original version, which is written in Java).
In the first section of this article, I will give a general introduction to the circuit breaker, so you know what it is and why it is important. Then we will review the background of the hystrix and hystrix-go projects and cover the basic usage with a small demo example.
Software in distributed architectures generally has many dependencies, and every dependency (even the most reliable service) will inevitably fail at some point.
What happens if a failing service becomes unresponsive? All services that rely on it risk becoming unresponsive, too. This is called a catastrophic cascading failure.
The basic idea behind the circuit breaker is very simple. A circuit breaker wraps calls to a target service and keeps monitoring the failure rate. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit return with a fault or error.
The design philosophy behind the circuit breaker pattern is fail fast: when a service becomes unresponsive, other services relying on it should stop waiting for it and start dealing with the fact that the failing service may be unavailable. By preventing a single service's failure from cascading through the entire system, the circuit breaker pattern contributes to the stability and resilience of the whole system.
The circuit breaker pattern can be implemented as a finite-state machine shown below:
There are three statuses: open, closed and half-open.
- closed: Requests are passed to the target service. The breaker keeps monitoring metrics such as error rate, request count and timeouts. When these metrics exceed a specific threshold (which is set by the developer), the breaker trips and transitions into open status.
- open: Requests are not passed to the target service. Instead, the fallback logic (which is defined by the developer as well) is called to handle the failure. The breaker stays in open status for a period of time called the sleeping window, after which it can transition from open to half-open.
- half-open: In this status, a limited number of requests are passed to the target service to test whether the breaker can be reset. If the target service responds successfully, the breaker is reset back to closed status. Otherwise, the breaker transitions back to open status.
That's the basic background about the circuit breaker. You can find much more information about it online.
Next, let’s investigate the project hystrix.
hystrix is a very popular open source project. You can find everything about it in this link.
I want to quote several important points from the above link. Hystrix is designed to do the following:
- Give protection from and control over latency and failure from dependencies accessed (typically over the network) via third-party client libraries.
- Stop cascading failures in a complex distributed system.
- Fail fast and rapidly recover.
- Fallback and gracefully degrade when possible.
- Enable near real-time monitoring, alerting, and operational control.
You can see hystrix perfectly implements the idea of circuit breaker pattern we talked about in the last section, right?
The hystrix project is written in Java. In this article I prefer to use the Golang version, hystrix-go, which is simplified but implements all the main designs and ideas of the circuit breaker.
For the usage of hystrix-go, you can find it in this link, which is very straightforward to understand. You can also easily find many other articles online with demo examples that show more usage-level details. Please go ahead and read them.
In my article, I want to go into the source code of hystrix-go and take a deeper look at how the circuit breaker is implemented. Please read on through the following sections.
Three service degradation strategies
Hystrix provides three different service degradation strategies to keep a cascading failure from spreading through the entire system: timeout, maximum concurrent request number and request error rate.
- timeout: if the service call doesn't return a response successfully within a predefined time duration, the fallback logic runs. This strategy is the simplest one.
- maximum concurrent request number: when the number of concurrent requests goes beyond the threshold, the fallback logic handles the subsequent requests.
- request error rate: hystrix records the response status of each service call. After the error rate reaches the threshold, the breaker opens, and the fallback logic executes until the breaker status changes back to closed. The error rate strategy is the most complex one.
This can be seen from the basic usage of hystrix as follows:
In the above usage case, you can see that the timeout is set to 10 seconds, the maximum request number is 100, and the error rate threshold is 25 percent.
At the consumer application level, that's nearly all of the configuration you need to set up. hystrix will make the magic happen internally.
In this article, I plan to show you the internals of hystrix by reviewing the source code.
Let's start with the easy ones: max concurrent requests and timeout. After that, you can move on to explore the complex strategy, request error rate.
Based on the above example, you can see that the Go function is the door to the source code of hystrix, so let's start from it as follows:
The Go function accepts three parameters:
- name: the command name, which is bound to the circuit created inside hystrix.
- run: a function that contains the normal logic, which sends the request to the dependency service.
- fallback: a function that contains the fallback logic.
The Go function just wraps run and fallback with Context, which is used to control and cancel goroutines. If you're not familiar with it, refer to my previous article. Finally, it calls the GoC function.
GoC function goes as follows:
I admit it’s complex, but it’s also the core of the entire hystrix project. Be patient, let’s review it bit by bit carefully.
First of all, the code structure of GoC function is as follows:
- Construct a new Command object, which contains all the information for each call to GoC function.
- Get the circuit breaker by name (create it if it doesn’t exist) by calling GetCircuit(name) function.
- Declare the condition variable ticketCond (a sync.Cond) and the flag ticketChecked, which are used to communicate between goroutines.
- Declare function returnTicket. What is a ticket? What does it mean by returnTicket? Let’s discuss it in detail later.
- Declare another function reportAllEvent. This function is critical to error rate strategy.
- Declare an instance of sync.Once, which is another interesting synchronization primitive provided by Golang.
- Launch two goroutines, each of which contains a lot of logic too. Simply speaking, the first one contains the logic of sending requests to the target service and the max concurrent request number strategy, and the second one contains the timeout strategy.
- Return a channel.
Let’s review each of them one by one.
The command struct is as follows. It embeds sync.Mutex and defines several fields:
Note that the command object itself doesn't contain the command name, and its lifecycle is limited to the scope of one GoC call. This means that the statistics about service requests, such as error rate and concurrent request number, are not stored inside the command object. Instead, such metrics are stored inside the circuit field, which is of CircuitBreaker type.
As we mentioned in the workflow of GoC function, GetCircuit(name) is called to get or create the circuit breaker. It is implemented inside circuit.go file as follows:
The logic is very straightforward. All the circuit breakers are stored in a map object circuitBreakers with the command name as the key.
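A get-or-create lookup on a shared map is usually written with a read lock on the fast path and a second check under the write lock. The following is a sketch of that technique under stated assumptions: the `registry` and `circuit` types are hypothetical stand-ins for this example, not the hystrix types.

```go
package main

import (
	"fmt"
	"sync"
)

// circuit is a stand-in for the real circuit breaker type.
type circuit struct{ name string }

// registry keeps one circuit per command name.
type registry struct {
	mu       sync.RWMutex
	circuits map[string]*circuit
}

func newRegistry() *registry {
	return &registry{circuits: make(map[string]*circuit)}
}

// get returns the circuit for name, creating it on first use.
// The bool reports whether this call created it.
func (r *registry) get(name string) (*circuit, bool) {
	// Fast path: most calls find an existing circuit under a read lock.
	r.mu.RLock()
	c, ok := r.circuits[name]
	r.mu.RUnlock()
	if ok {
		return c, false
	}

	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-check: another goroutine may have created it between the two locks.
	if c, ok := r.circuits[name]; ok {
		return c, false
	}
	c = &circuit{name: name}
	r.circuits[name] = c
	return c, true
}

func main() {
	r := newRegistry()
	a, created := r.get("my_command")
	fmt.Println(a.name, created) // my_command true
	b, created := r.get("my_command")
	fmt.Println(a == b, created) // true false
}
```

Because the circuit outlives any single call, it is the natural place to keep metrics that must accumulate across calls.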
The newCircuitBreaker constructor function and CircuitBreaker struct are as follows:
All the fields of CircuitBreaker are important to understand how the breaker works.
Two fields are not simple types and need more analysis: executorPool and metrics.
- executorPool: used for the max concurrent request number strategy.
- metrics: used for the request error rate strategy, all right?
We can find executorPool logics inside the pool.go file:
It makes use of a Golang channel to implement the max concurrent request number strategy. Note that the Tickets field is created as a buffered channel with a capacity of MaxConcurrentRequests. The following for loop then fills the buffered channel by sending values into it until it reaches capacity.
As we have shown above, in the first goroutine of GoC function, the Tickets channel is used as follows:
Each call to the GoC function takes a ticket from the circuit.executorPool.Tickets channel until no ticket is left, which means the number of concurrent requests has reached the threshold. In that case, the default case executes, and the service is gracefully degraded with the fallback logic.
On the other side, after each call to GoC is done, the ticket needs to be sent back to circuit.executorPool.Tickets, right? Do you remember the returnTicket function mentioned in the section above? Yes, it is used for exactly this purpose. The returnTicket function defined in the GoC function is as follows:
It calls executorPool.Return function:
The design and implementation of Tickets is a great example of Golang channels in a real-world application.
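The whole ticket mechanism fits in a few lines of standard Go. This is a minimal sketch of the technique, assuming a hypothetical `pool` type with `tryAcquire` and `release` methods; the names are mine and the real pool also tracks metrics, which is omitted here.

```go
package main

import "fmt"

// pool limits concurrency with a buffered channel of tickets.
type pool struct{ tickets chan struct{} }

// newPool creates the channel and fills it to capacity, one ticket per slot.
func newPool(max int) *pool {
	p := &pool{tickets: make(chan struct{}, max)}
	for i := 0; i < max; i++ {
		p.tickets <- struct{}{}
	}
	return p
}

// tryAcquire takes a ticket without blocking. It returns false when the
// channel is empty, i.e. the concurrency limit has been reached.
func (p *pool) tryAcquire() bool {
	select {
	case <-p.tickets:
		return true
	default:
		return false
	}
}

// release puts a ticket back so another call can proceed.
func (p *pool) release() { p.tickets <- struct{}{} }

func main() {
	p := newPool(2)
	fmt.Println(p.tryAcquire()) // true
	fmt.Println(p.tryAcquire()) // true
	fmt.Println(p.tryAcquire()) // false: limit reached, run the fallback
	p.release()
	fmt.Println(p.tryAcquire()) // true: a ticket is available again
}
```

The `select` with a `default` case is what makes the check non-blocking: a caller that finds no ticket degrades immediately instead of queueing.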
In summary, the max concurrent request number strategy can be illustrated as follows:
In the above section, max concurrent requests strategy in hystrix is reviewed carefully, and I hope you can learn something interesting from it.
Now let’s investigate timeout strategy together in the next section.
Compared with max concurrent request number strategy, timeout is very straightforward to understand.
As we mentioned in the previous section, the core logic of hystrix is inside the GoC function, which internally runs two goroutines. You have already seen that the first goroutine contains the logic to send the request to the target service and the max concurrent request number strategy. How about the second goroutine? Let's review it as follows:
Note that a Timer is created with the timeout duration from the settings, and a select statement makes this goroutine wait until one of its cases receives a value from its channel. The timeout case is the third one (it fires when the first two cases are not triggered), and it runs the fallback logic with the ErrTimeout error.
By now the main structure and functionality of these two goroutines should be clear. But two Golang techniques in the details need your attention: sync.Once and sync.Cond.
You may already notice the following code block, which is repeated several times inside GoC function:
returnOnce is of type sync.Once, which makes sure that the callback function passed to its Do method runs only once, no matter how many goroutines call it.
In this specific case, it guarantees that both returnTicket() and reportAllEvent() execute only once. This really makes sense, because if returnTicket() ran multiple times for one GoC call, the current concurrent request number would no longer be correct, right?
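A small experiment shows the guarantee. In this sketch, `finishOnce` is a hypothetical helper that lets several goroutines race to run the same cleanup and reports how many times it actually ran.

```go
package main

import (
	"fmt"
	"sync"
)

// finishOnce lets several goroutines race to clean up the same call and
// returns how many times the cleanup actually ran.
func finishOnce(racers int) int {
	var once sync.Once
	var wg sync.WaitGroup
	cleanups := 0

	for i := 0; i < racers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Only the first caller runs the function; the others wait for
			// it to finish and then return without running it again.
			once.Do(func() { cleanups++ })
		}()
	}

	wg.Wait()
	return cleanups
}

func main() {
	fmt.Println(finishOnce(2))  // 1
	fmt.Println(finishOnce(50)) // 1
}
```

No extra lock is needed around the counter here, because sync.Once already orders the single write before every Do call returns.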
I wrote another article about sync.Once in detail. You can refer to it for a more in-depth explanation.
The implementation of returnTicket function goes as follows:
ticketCond is a condition variable, and in Golang it is of type sync.Cond.
A condition variable is useful for communication between different goroutines. Concretely, the Wait method of sync.Cond suspends the current goroutine, and the Signal method wakes up a blocked goroutine so it can continue executing.
In the hystrix case, ticketChecked being false means that the first goroutine has not yet finished trying to take a ticket, so the ticket should not be returned yet. ticketCond.Wait() is therefore called to block this goroutine until that step is completed, which is announced by the Signal method.
Note that setting ticketChecked to true and calling ticketCond.Signal() always happen together. Once ticketChecked is true, the ticket check is done and the ticket is ready to be returned. Moreover, the Wait call that suspends the goroutine is placed inside a for loop, which is also a best practice.
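The flag, the loop and the signal work together as follows. This is a generic sketch of the sync.Cond pattern with hypothetical names (`waitForFlag`, `ready`), not the hystrix code itself.

```go
package main

import (
	"fmt"
	"sync"
)

// waitForFlag starts a waiter that blocks until ready becomes true, then
// sets the flag and signals. It returns what the waiter observed.
func waitForFlag() bool {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	ready := false
	observed := make(chan bool, 1)

	go func() {
		mu.Lock()
		for !ready { // loop: re-check the condition after every wake-up
			cond.Wait() // releases mu while blocked, reacquires before returning
		}
		observed <- ready
		mu.Unlock()
	}()

	mu.Lock()
	ready = true  // update the shared condition...
	cond.Signal() // ...and wake the waiter, both under the same lock
	mu.Unlock()

	return <-observed
}

func main() {
	fmt.Println(waitForFlag()) // true
}
```

The flag is what makes the pattern safe in either order: if the signal is sent before the waiter even starts, the waiter sees `ready` already set and never calls Wait at all.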
I will write another post to explain sync.Cond in more detail in the future.
Finally, let's see how the fallback function is called when the target service is not responsive.
Recall that each GoC call creates a new command instance. The fallback function is assigned to the field with the same name, which is used later.
As we saw in the sections above, the errorWithFallback method is triggered when the timeout or the max concurrent request number threshold is hit.
The errorWithFallback method runs the fallback by calling tryFallback and reports metric events such as fallback-failure and fallback-success.
In the above, we talked about the timeout strategy, which is the simplest of all the strategies provided by hystrix. Some detailed Golang techniques were reviewed as well to give a better understanding of the complex code logic.
In this article, we talked about the detailed implementation of the max concurrent requests strategy and the timeout strategy provided by hystrix, along with the Golang techniques they rely on.
I leave the error rate strategy to you. Please dive into the codebase and explore more about circuit breaking. Have fun!