Circuit Breaker

In distributed systems, interactions with remote resources and services can fail due to various faults such as slow network connections, timeouts, or temporary unavailability of resources. These issues are often short-lived and can be mitigated by implementing strategies like the Retry pattern, which allows the system to attempt the operation again after a brief pause.
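
As an illustration only (not HARP-specific; the function and parameter names below are hypothetical), a minimal retry loop pauses briefly between attempts and gives up after a fixed number of tries:

    import time

    def call_with_retry(operation, attempts=3, delay_seconds=0.5):
        # Try the operation up to `attempts` times, pausing briefly between tries.
        for attempt in range(attempts):
            try:
                return operation()
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the failure to the caller
                time.sleep(delay_seconds)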

However, not all faults are short-lived. Some issues may arise from unanticipated events that take longer to resolve, such as partial connectivity loss or complete service failure. In these cases, continuously retrying the operation may be futile and can lead to wasted resources. Instead, the system should quickly recognize the failure and handle it appropriately, avoiding unnecessary retries.

Moreover, in a heavily loaded system, a failure in one part can cascade to others. For instance, if a service is overwhelmed, operations invoking it may block until a timeout expires, holding critical resources such as memory and threads. Those resources can become exhausted, causing failures elsewhere in the system. It is therefore better for an operation to fail fast when success is unlikely than to wait for a timeout. Shorter timeouts help, but they must be balanced: set too low, they cause operations to fail that would otherwise have succeeded.

Implementation

The Circuit Breaker pattern addresses these issues by detecting failures and preventing repeated, unsuccessful attempts at an operation. When a failure is detected, the circuit breaker trips, and subsequent attempts fail immediately instead of reaching the remote service. This isolates the fault, prevents it from cascading, and gives the failing service time to recover.

A Circuit Breaker functions as a proxy for operations that might fail. It tracks recent failures and uses this data to determine whether to allow the operation to proceed or to immediately return an exception.
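
From the caller's side, the breaker wraps the risky operation and either forwards it or raises at once. A minimal sketch of that contract, assuming a hypothetical call() method and CircuitOpenError exception (these names are illustrative, not HARP's API; a matching breaker is sketched after the state list below):

    class CircuitOpenError(Exception):
        """Raised when the breaker rejects a call without contacting the service."""

    def get_profile(breaker, http_get, user_id):
        # `breaker` is any object whose call() either forwards the operation
        # or raises CircuitOpenError immediately, based on recorded failures.
        try:
            return breaker.call(lambda: http_get(f"/users/{user_id}/profile"))
        except CircuitOpenError:
            # Fail fast: fall back instead of waiting on a request that is
            # unlikely to succeed.
            return None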

The circuit breaker can be seen as a state machine that mediates remote calls, with three states (a code sketch follows the list):

  • Closed: Forwards requests to the actual service. If failures exceed a threshold, it switches to the Open state.
  • Open: Instantly fails requests, without invoking the actual service. After a timeout, it switches to the Half-Open state to test if the service has recovered.
  • Half-Open: Allows a limited number of requests to pass through. If these succeed, it resets the failure counter and switches back to the Closed state. If any request fails, it reverts to the Open state and restarts the timeout timer, preventing a recovering service from being overwhelmed.
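
Assuming Python and illustrative names (not HARP's actual implementation), a minimal, non-thread-safe sketch of this state machine might look as follows. For simplicity, the Half-Open state here admits a single trial call, and the failure counter is a plain consecutive count rather than the sliding window described below:

    import time

    class CircuitOpenError(Exception):
        """Raised when the breaker rejects a call without contacting the service."""

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30.0):
            self.failure_threshold = failure_threshold  # failures that trip the breaker
            self.recovery_timeout = recovery_timeout    # seconds to stay Open
            self.state = "closed"
            self.failure_count = 0
            self.opened_at = 0.0

        def call(self, operation):
            if self.state == "open":
                if time.monotonic() - self.opened_at < self.recovery_timeout:
                    raise CircuitOpenError("circuit open: failing fast")
                self.state = "half_open"  # timeout elapsed: allow a trial call

            try:
                result = operation()
            except Exception:
                self._on_failure()
                raise
            self._on_success()
            return result

        def _on_failure(self):
            self.failure_count += 1
            if self.state == "half_open" or self.failure_count >= self.failure_threshold:
                # Trip (or re-trip) the breaker and restart the recovery timer.
                self.state = "open"
                self.opened_at = time.monotonic()

        def _on_success(self):
            # A success in the Half-Open state closes the circuit again.
            self.state = "closed"
            self.failure_count = 0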

The failure counter works as a sliding window, tracking the number of failures within a specified time frame. In HARP, you can fine-tune the window length and failure threshold to match your system's requirements.
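
The exact configuration keys are HARP-specific and not reproduced here. As an illustration of the mechanism only, a sliding-window counter can keep the timestamps of recent failures and evict those older than the window before checking the threshold:

    import time
    from collections import deque

    class SlidingWindowFailureCounter:
        # Counts failures observed within the last window_seconds.
        def __init__(self, window_seconds=60.0, failure_threshold=5):
            self.window_seconds = window_seconds
            self.failure_threshold = failure_threshold
            self._failures = deque()  # timestamps of recent failures

        def record_failure(self):
            self._failures.append(time.monotonic())

        def threshold_exceeded(self):
            cutoff = time.monotonic() - self.window_seconds
            # Drop failures that have aged out of the window.
            while self._failures and self._failures[0] < cutoff:
                self._failures.popleft()
            return len(self._failures) >= self.failure_threshold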