# tower-resilience
A comprehensive resilience and fault-tolerance toolkit for Tower services, inspired by Resilience4j.
## Resilience Patterns
- **Adaptive Concurrency** - Dynamic concurrency limiting using AIMD or Vegas algorithms
- **Bulkhead** - Isolates resources to prevent system-wide failures
- **Cache** - Response memoization to reduce load
- **Chaos** - Inject failures and latency for testing resilience (development/testing only)
- **Circuit Breaker** - Prevents cascading failures by stopping calls to failing services
- **Coalesce** - Deduplicates concurrent identical requests (singleflight pattern)
- **Executor** - Delegates request processing to dedicated executors for parallelism
- **Fallback** - Graceful degradation when services fail
- **Hedge** - Reduces tail latency by racing redundant requests
- **Health Check** - Proactive health monitoring with intelligent resource selection
- **Outlier Detection** - Fleet-aware instance ejection based on consecutive error tracking
- **Rate Limiter** - Controls request rate with fixed or sliding window algorithms
- **Reconnect** - Automatic reconnection with configurable backoff strategies
- **Retry** - Intelligent retry with exponential backoff, jitter, and retry budgets
- **Router** - Weighted traffic routing for canary deployments and progressive rollout
- **Time Limiter** - Advanced timeout handling with cancellation support
## Quick Start

```toml
[dependencies]
tower-resilience = "0.9"
tower = "0.5"
```

```rust
use tower::{Layer, ServiceBuilder};
use tower_resilience::prelude::*;

let circuit_breaker = CircuitBreakerLayer::builder()
    .failure_rate_threshold(0.5)
    .build();

let service = ServiceBuilder::new()
    .layer(circuit_breaker)
    .layer(BulkheadLayer::builder()
        .max_concurrent_calls(10)
        .build())
    .service(my_service);
```

## Presets: Get Started in One Line
Every pattern includes preset configurations with sensible defaults. Start immediately without tuning parameters - customize later when you need to:
```rust
use tower_resilience::retry::RetryLayer;
use tower_resilience::circuitbreaker::CircuitBreakerLayer;
use tower_resilience::ratelimiter::RateLimiterLayer;
use tower_resilience::bulkhead::BulkheadLayer;
use tower_resilience::timelimiter::TimeLimiterLayer;
use tower_resilience::hedge::HedgeLayer;

// Retry with exponential backoff (3 attempts, 100ms base)
let retry = RetryLayer::<(), (), MyError>::exponential_backoff().build();

// Circuit breaker with balanced defaults
let breaker = CircuitBreakerLayer::standard().build();

// Rate limit to 100 requests per second
let limiter = RateLimiterLayer::per_second(100).build();

// Limit to 50 concurrent requests
let bulkhead = BulkheadLayer::medium().build();

// 5 second timeout with cancellation
let timeout = TimeLimiterLayer::standard().build();

// Reduce tail latency with hedged requests
let hedge = HedgeLayer::standard();
```

### Available Presets
| Pattern | Preset | Description |
|---|---|---|
| Bulkhead | `small()` | 10 concurrent calls |
| | `medium()` | 50 concurrent calls |
| | `large()` | 200 concurrent calls |
| Circuit Breaker | `standard()` | 50% threshold, 100 calls - balanced |
| | `fast_fail()` | 25% threshold, 20 calls - fail fast |
| | `tolerant()` | 75% threshold, 200 calls - high tolerance |
| Hedge | `conservative()` | 500ms delay, 2 attempts |
| | `standard()` | 100ms delay, 3 attempts |
| | `aggressive()` | 50ms delay, 5 attempts |
| Rate Limiter | `per_second(n)` | n requests per second |
| | `per_minute(n)` | n requests per minute |
| | `burst(rate, size)` | Sustained rate with burst capacity |
| Retry | `exponential_backoff()` | 3 attempts, 100ms base - balanced default |
| | `aggressive()` | 5 attempts, 50ms base - fast recovery |
| | `conservative()` | 2 attempts, 500ms base - minimal overhead |
| Time Limiter | `fast()` | 1s timeout, cancel on timeout |
| | `standard()` | 5s timeout, cancel on timeout |
| | `slow()` | 30s timeout, cancel on timeout |
| | `streaming()` | 60s timeout, no cancellation |
Presets return builders, so you can customize any setting:
```rust
// Start with a preset, override what you need
let breaker = CircuitBreakerLayer::fast_fail()
    .name("payment-api")                              // Add observability
    .wait_duration_in_open(Duration::from_secs(30))   // Custom recovery time
    .build();
```

## Examples
### Adaptive Concurrency
Dynamically adjust concurrency limits based on observed latency and error rates:
```rust
use tower_resilience::adaptive::{AdaptiveLimiterLayer, Aimd, Vegas};
use tower::ServiceBuilder;
use std::time::Duration;

// AIMD: Classic TCP-style congestion control.
// Increases the limit on success, decreases on failure/high latency.
let layer = AdaptiveLimiterLayer::new(
    Aimd::builder()
        .initial_limit(10)
        .min_limit(1)
        .max_limit(100)
        .increase_by(1)         // Add 1 on success
        .decrease_factor(0.5)   // Halve on failure
        .latency_threshold(Duration::from_millis(100))
        .build()
);

// Vegas: More stable; uses RTT to estimate queue depth.
let layer = AdaptiveLimiterLayer::new(
    Vegas::builder()
        .initial_limit(10)
        .alpha(3)   // Increase when queue < 3
        .beta(6)    // Decrease when queue > 6
        .build()
);

let service = ServiceBuilder::new()
    .layer(layer)
    .service(my_service);
```

Use cases:
- Auto-tuning: No manual concurrency limit configuration needed
- Variable backends: Adapts to changing downstream capacity
- Load shedding: Automatically reduces load when backends struggle
Full examples: adaptive.rs
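For intuition, the AIMD update rule described above can be sketched in a few lines of plain Rust. This is an illustration of the algorithm only (the struct and method names here are made up), not the crate's implementation:

```rust
// Illustrative AIMD limit controller: additive increase on success,
// multiplicative decrease on failure. Not tower-resilience's code.
struct AimdLimit {
    limit: f64,
    min: f64,
    max: f64,
    increase_by: f64,
    decrease_factor: f64,
}

impl AimdLimit {
    fn on_success(&mut self) {
        // Additive increase, clamped to the maximum.
        self.limit = (self.limit + self.increase_by).min(self.max);
    }

    fn on_failure(&mut self) {
        // Multiplicative decrease, clamped to the minimum.
        self.limit = (self.limit * self.decrease_factor).max(self.min);
    }

    fn current(&self) -> usize {
        self.limit as usize
    }
}

fn main() {
    let mut l = AimdLimit {
        limit: 10.0,
        min: 1.0,
        max: 100.0,
        increase_by: 1.0,
        decrease_factor: 0.5,
    };
    l.on_success();
    assert_eq!(l.current(), 11); // 10 + 1
    l.on_failure();
    assert_eq!(l.current(), 5);  // 11 * 0.5 = 5.5, truncated
}
```

Vegas differs mainly in *when* it adjusts: it uses RTT samples to estimate queue depth and reacts before failures occur, rather than only after them.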
### Bulkhead
Limit concurrent requests to prevent resource exhaustion:
```rust
use tower_resilience::bulkhead::BulkheadLayer;
use std::time::Duration;

let layer = BulkheadLayer::builder()
    .name("worker-pool")
    .max_concurrent_calls(10)                    // Max 10 concurrent
    .max_wait_duration(Duration::from_secs(5))   // Wait up to 5s
    .on_call_permitted(|concurrent| {
        println!("Request permitted (concurrent: {})", concurrent);
    })
    .on_call_rejected(|max| {
        println!("Request rejected (max: {})", max);
    })
    .build();

let service = layer.layer(my_service);
```

Full examples: bulkhead.rs | bulkhead_advanced.rs
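Conceptually, a bulkhead is just a permit counter: admit a call while fewer than the maximum are in flight, reject otherwise. A minimal sketch of that admission logic (illustrative only, not the crate's implementation):

```rust
// Illustrative bulkhead permit counter. The real layer also supports
// waiting up to max_wait_duration for a permit; this sketch only shows
// the immediate accept/reject decision.
struct Bulkhead {
    max: u32,
    in_flight: u32,
}

impl Bulkhead {
    fn try_acquire(&mut self) -> bool {
        if self.in_flight < self.max {
            self.in_flight += 1;
            true
        } else {
            false // caller gets a BulkheadFull-style rejection
        }
    }

    fn release(&mut self) {
        // Called when a permitted call completes, freeing capacity.
        self.in_flight -= 1;
    }
}

fn main() {
    let mut b = Bulkhead { max: 2, in_flight: 0 };
    assert!(b.try_acquire());
    assert!(b.try_acquire());
    assert!(!b.try_acquire()); // third concurrent call is rejected
    b.release();
    assert!(b.try_acquire()); // capacity freed up again
}
```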
### Cache
Cache responses to reduce load on expensive operations:
```rust
use tower_resilience::cache::{CacheLayer, EvictionPolicy};
use std::time::Duration;

let layer = CacheLayer::builder()
    .max_size(1000)
    .ttl(Duration::from_secs(300))          // 5 minute TTL
    .eviction_policy(EvictionPolicy::Lru)   // LRU, LFU, or FIFO
    .key_extractor(|req: &Request| req.id.clone())
    .on_hit(|| println!("Cache hit!"))
    .on_miss(|| println!("Cache miss"))
    .build();

let service = layer.layer(my_service);
```

Full examples: cache.rs | cache_example.rs
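To make the LRU policy concrete, here is a toy least-recently-used cache in plain Rust (illustrative only; the crate's cache also handles TTLs and other policies, and this toy uses O(n) recency updates for simplicity):

```rust
use std::collections::{HashMap, VecDeque};

// Toy LRU cache: a map for lookups plus a recency queue. The front of
// the queue is the least-recently-used key and is evicted when full.
struct LruCache {
    map: HashMap<String, String>,
    order: VecDeque<String>,
    max_size: usize,
}

impl LruCache {
    fn new(max_size: usize) -> Self {
        LruCache { map: HashMap::new(), order: VecDeque::new(), max_size }
    }

    fn get(&mut self, key: &str) -> Option<String> {
        if let Some(v) = self.map.get(key) {
            let v = v.clone();
            // Cache hit: move the key to the most-recently-used position.
            self.order.retain(|k| k.as_str() != key);
            self.order.push_back(key.to_string());
            Some(v)
        } else {
            None
        }
    }

    fn put(&mut self, key: &str, value: &str) {
        if self.map.len() == self.max_size && !self.map.contains_key(key) {
            // Full and inserting a new key: evict the least-recently-used entry.
            if let Some(lru) = self.order.pop_front() {
                self.map.remove(&lru);
            }
        }
        self.order.retain(|k| k.as_str() != key);
        self.order.push_back(key.to_string());
        self.map.insert(key.to_string(), value.to_string());
    }
}

fn main() {
    let mut c = LruCache::new(2);
    c.put("a", "1");
    c.put("b", "2");
    assert_eq!(c.get("a"), Some("1".to_string())); // "a" is now most recent
    c.put("c", "3");                               // evicts "b"
    assert_eq!(c.get("b"), None);
    assert_eq!(c.get("a"), Some("1".to_string()));
}
```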
### Chaos (Testing Only)
Inject failures and latency to test your resilience patterns:
```rust
use tower_resilience::chaos::ChaosLayer;
use std::time::Duration;

// Types are inferred from the closure signature - no type parameters needed
let chaos = ChaosLayer::builder()
    .name("test-chaos")
    .error_rate(0.1)    // 10% of requests fail
    .error_fn(|_req: &String| std::io::Error::new(
        std::io::ErrorKind::Other, "chaos!"
    ))
    .latency_rate(0.2)  // 20% delayed
    .min_latency(Duration::from_millis(50))
    .max_latency(Duration::from_millis(200))
    .seed(42)           // Deterministic chaos
    .build();

let service = chaos.layer(my_service);
```

**WARNING:** Only use in development/testing environments. Never in production.
Full examples: chaos.rs | chaos_example.rs
### Circuit Breaker
Prevent cascading failures by opening the circuit when error rate exceeds threshold:
```rust
use tower::Layer;
use tower_resilience::circuitbreaker::CircuitBreakerLayer;
use std::time::Duration;

let layer = CircuitBreakerLayer::builder()
    .name("api-circuit")
    .failure_rate_threshold(0.5)                     // Open at 50% failure rate
    .sliding_window_size(100)                        // Track last 100 calls
    .wait_duration_in_open(Duration::from_secs(60))  // Stay open 60s
    .on_state_transition(|from, to| {
        println!("Circuit breaker: {:?} -> {:?}", from, to);
    })
    .build();

let service = layer.layer(my_service);
```

Full examples: circuitbreaker.rs | circuitbreaker_fallback.rs | circuitbreaker_health_check.rs
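The core decision - open when the failure rate over a count-based sliding window crosses the threshold - can be sketched as follows. This is an illustration of the windowing math only, not the crate's implementation (which also manages open/half-open timing and probing):

```rust
use std::collections::VecDeque;

// Illustrative count-based sliding window: track the last `size`
// outcomes and report when the failure rate reaches the threshold.
struct SlidingWindow {
    outcomes: VecDeque<bool>, // true = failure
    size: usize,
    threshold: f64,
}

impl SlidingWindow {
    fn new(size: usize, threshold: f64) -> Self {
        SlidingWindow { outcomes: VecDeque::new(), size, threshold }
    }

    fn record(&mut self, failed: bool) {
        if self.outcomes.len() == self.size {
            self.outcomes.pop_front(); // forget the oldest call
        }
        self.outcomes.push_back(failed);
    }

    // Only trip once the window is full, so a single early failure
    // (1/1 = 100%) cannot open the circuit on its own.
    fn should_open(&self) -> bool {
        if self.outcomes.len() < self.size {
            return false;
        }
        let failures = self.outcomes.iter().filter(|&&f| f).count();
        failures as f64 / self.size as f64 >= self.threshold
    }
}

fn main() {
    let mut w = SlidingWindow::new(4, 0.5);
    w.record(true);
    assert!(!w.should_open()); // window not full yet
    w.record(false);
    w.record(true);
    w.record(false);
    // 2 failures / 4 calls = 50% -> trips at the 0.5 threshold
    assert!(w.should_open());
}
```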
### Coalesce
Deduplicate concurrent identical requests (singleflight pattern):
```rust
use tower_resilience::coalesce::CoalesceLayer;
use tower::ServiceBuilder;

// Coalesce by request ID - concurrent requests for the same ID share one execution
let layer = CoalesceLayer::new(|req: &Request| req.id.clone());

let service = ServiceBuilder::new()
    .layer(layer)
    .service(my_service);

// Use with a cache to prevent stampede on cache miss
let service = ServiceBuilder::new()
    .layer(cache_layer)      // Check cache first
    .layer(coalesce_layer)   // Coalesce cache misses
    .service(backend);
```

Use cases:
- Cache stampede prevention: When cache expires, only one request refreshes it
- Expensive computations: Deduplicate identical report generation requests
- Rate-limited APIs: Reduce calls to external APIs by coalescing identical requests
Note: Response and error types must implement `Clone` to be shared with all waiters.
### Executor
Delegate request processing to dedicated executors for parallel execution:
```rust
use tower_resilience::executor::ExecutorLayer;
use tower::ServiceBuilder;

// Use a dedicated runtime for CPU-heavy work
let compute_runtime = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(8)
    .thread_name("compute")
    .build()
    .unwrap();

let layer = ExecutorLayer::new(compute_runtime.handle().clone());

// Or use the current runtime
let layer = ExecutorLayer::current();

let service = ServiceBuilder::new()
    .layer(layer)
    .service(my_service);
```

Use cases:
- CPU-bound processing: Parallelize CPU-intensive request handling
- Runtime isolation: Process requests on a dedicated runtime
- Thread pool delegation: Use specific thread pools for certain workloads
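The underlying idea - handing work to a dedicated pool and getting results back - can be sketched with plain std threads and channels. This is illustrative only; the crate delegates to a tokio runtime handle rather than raw threads:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative delegation to a dedicated worker thread: each request is
// paired with a reply channel and processed off the caller's thread.
fn spawn_worker() -> (mpsc::Sender<(u64, mpsc::Sender<u64>)>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<(u64, mpsc::Sender<u64>)>();
    let handle = thread::Builder::new()
        .name("compute".into())
        .spawn(move || {
            // Drain the queue; each item is (request, reply channel).
            for (req, reply) in rx {
                let _ = reply.send(req * 2); // stand-in for CPU-heavy work
            }
        })
        .unwrap();
    (tx, handle)
}

fn main() {
    let (tx, handle) = spawn_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send((21, reply_tx)).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), 42);
    drop(tx); // close the queue so the worker exits
    handle.join().unwrap();
}
```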
### Fallback
Provide fallback responses when the primary service fails:
```rust
use tower_resilience::fallback::FallbackLayer;

// Return a static fallback value on error
let layer = FallbackLayer::<Request, Response, MyError>::value(
    Response::default()
);

// Or compute the fallback from the error
let layer = FallbackLayer::<Request, Response, MyError>::from_error(|err| {
    Response::error_response(err)
});

// Or use a backup service
let layer = FallbackLayer::<Request, Response, MyError>::service(|req| async {
    backup_service.call(req).await
});

let service = layer.layer(primary_service);
```

### Hedge
Reduce tail latency by firing backup requests after a delay:
```rust
use tower_resilience::hedge::HedgeLayer;
use std::time::Duration;

// Fire a hedge request if the primary takes > 100ms
let layer = HedgeLayer::builder()
    .delay(Duration::from_millis(100))
    .max_hedged_attempts(2)
    .build();

// Or fire all requests in parallel (no delay)
let layer = HedgeLayer::<(), String, MyError>::builder()
    .no_delay()
    .max_hedged_attempts(3)
    .build();

let service = layer.layer(my_service);
```

Note: Hedge requires `Req: Clone` (requests are cloned for parallel execution) and `E: Clone` (for error handling). If your types don't implement `Clone`, consider wrapping them in `Arc`.
Full examples: hedge.rs
### Health Check
Proactive health monitoring with intelligent resource selection:
```rust
use tower_resilience::healthcheck::{HealthCheckWrapper, HealthStatus, SelectionStrategy};
use std::time::Duration;

// Create a wrapper with multiple resources
let wrapper = HealthCheckWrapper::builder()
    .with_context(primary_db, "primary")
    .with_context(secondary_db, "secondary")
    .with_checker(|db| async move {
        match db.ping().await {
            Ok(_) => HealthStatus::Healthy,
            Err(_) => HealthStatus::Unhealthy,
        }
    })
    .with_interval(Duration::from_secs(5))
    .with_selection_strategy(SelectionStrategy::RoundRobin)
    .build();

// Start background health checking
wrapper.start().await;

// Get a healthy resource
if let Some(db) = wrapper.get_healthy().await {
    // Use the healthy database
}
```

Note: Health Check is not a Tower layer - it's a wrapper pattern for managing multiple resources with automatic failover.
Full examples: healthcheck_basic.rs
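Round-robin selection over healthy resources amounts to cycling through the pool while skipping anything that last failed its check. A small illustrative sketch (the `HealthyPicker` type is hypothetical, not the crate's API):

```rust
// Illustrative round-robin over healthy resources only: skip unhealthy
// entries and remember where to resume after the last pick.
struct HealthyPicker {
    healthy: Vec<bool>, // index i is true if resource i passed its last check
    next: usize,
}

impl HealthyPicker {
    fn get_healthy(&mut self) -> Option<usize> {
        let n = self.healthy.len();
        for step in 0..n {
            let idx = (self.next + step) % n;
            if self.healthy[idx] {
                self.next = idx + 1; // resume after the one we returned
                return Some(idx);
            }
        }
        None // every resource failed its health check
    }
}

fn main() {
    let mut p = HealthyPicker { healthy: vec![true, false, true], next: 0 };
    assert_eq!(p.get_healthy(), Some(0));
    assert_eq!(p.get_healthy(), Some(2)); // index 1 is unhealthy, skipped
    assert_eq!(p.get_healthy(), Some(0)); // wraps around
    p.healthy = vec![false, false, false];
    assert_eq!(p.get_healthy(), None);    // nothing left to fail over to
}
```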
### Outlier Detection
Fleet-aware instance ejection that tracks per-instance health and routes around unhealthy backends:
```rust
use tower_resilience::outlier::{OutlierDetectionLayer, OutlierDetector};
use tower::{ServiceBuilder, service_fn};
use std::time::Duration;

// Shared detector coordinates ejection across all instances
let detector = OutlierDetector::new()
    .max_ejection_percent(50)
    .base_ejection_duration(Duration::from_secs(30));

// Register instances (eject after 5 consecutive errors)
detector.register("backend-1", 5);
detector.register("backend-2", 5);

// Per-instance layers share the detector
let layer = OutlierDetectionLayer::builder()
    .detector(detector.clone())
    .instance_name("backend-1")
    .build();

let service = ServiceBuilder::new()
    .layer(layer)
    .service(service_fn(|req: String| async move { Ok::<_, std::io::Error>(req) }));
```

Key features:
- Backpressure mode (default): `poll_ready` returns `Pending` for ejected instances, integrating naturally with Tower load balancers
- Fleet protection: `max_ejection_percent` prevents cascading ejections
- Exponential backoff: repeatedly-ejected instances stay out longer
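The consecutive-error rule is simple to state in code. An illustrative per-instance tracker (not the crate's detector, which additionally handles ejection duration, backoff, and fleet-wide limits):

```rust
// Illustrative consecutive-error ejection: eject an instance after N
// consecutive errors; any success resets the streak.
struct InstanceHealth {
    consecutive_errors: u32,
    threshold: u32,
    ejected: bool,
}

impl InstanceHealth {
    fn new(threshold: u32) -> Self {
        InstanceHealth { consecutive_errors: 0, threshold, ejected: false }
    }

    fn on_result(&mut self, ok: bool) {
        if ok {
            self.consecutive_errors = 0; // success breaks the streak
        } else {
            self.consecutive_errors += 1;
            if self.consecutive_errors >= self.threshold {
                self.ejected = true; // routed around until re-admitted
            }
        }
    }
}

fn main() {
    let mut h = InstanceHealth::new(5);
    for _ in 0..4 {
        h.on_result(false);
    }
    assert!(!h.ejected); // 4 consecutive errors: still in rotation
    h.on_result(true);   // success resets the streak
    for _ in 0..5 {
        h.on_result(false);
    }
    assert!(h.ejected);  // 5 consecutive errors: ejected
}
```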
### Rate Limiter
Control request rate to protect downstream services:
```rust
use tower_resilience::ratelimiter::RateLimiterLayer;
use std::time::Duration;

let layer = RateLimiterLayer::builder()
    .limit_for_period(100)                         // 100 requests
    .refresh_period(Duration::from_secs(1))        // per second
    .timeout_duration(Duration::from_millis(500))  // Wait up to 500ms
    .on_permit_acquired(|wait| {
        println!("Request permitted (waited {:?})", wait);
    })
    .build();

let service = layer.layer(my_service);
```

Full examples: ratelimiter.rs | ratelimiter_example.rs
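A fixed-window limiter boils down to a counter that resets each refresh period. An illustrative sketch using logical millisecond timestamps instead of wall-clock time, so the behavior is easy to follow (not the crate's implementation):

```rust
// Illustrative fixed-window rate limiter: `limit` permits per
// `period_ms` window; the counter resets when a new window starts.
struct FixedWindow {
    limit: u32,
    period_ms: u64,
    window_start: u64,
    used: u32,
}

impl FixedWindow {
    fn new(limit: u32, period_ms: u64) -> Self {
        FixedWindow { limit, period_ms, window_start: 0, used: 0 }
    }

    fn try_acquire(&mut self, now_ms: u64) -> bool {
        if now_ms - self.window_start >= self.period_ms {
            // A new window has started: align the start and reset the counter.
            self.window_start = now_ms - (now_ms - self.window_start) % self.period_ms;
            self.used = 0;
        }
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false // over the limit for this window
        }
    }
}

fn main() {
    let mut rl = FixedWindow::new(2, 1_000);
    assert!(rl.try_acquire(0));
    assert!(rl.try_acquire(10));
    assert!(!rl.try_acquire(20));    // window exhausted
    assert!(rl.try_acquire(1_000));  // next window, counter reset
}
```

A sliding-window variant smooths the boundary effect (bursting at the edge of two adjacent windows) by weighting the previous window's count.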
### Reconnect
Automatically reconnect on connection failures with configurable backoff:
```rust
use tower_resilience::reconnect::{ReconnectLayer, ReconnectConfig, ReconnectPolicy};
use std::time::Duration;

let layer = ReconnectLayer::new(
    ReconnectConfig::builder()
        .policy(ReconnectPolicy::exponential(
            Duration::from_millis(100),  // Start at 100ms
            Duration::from_secs(5),      // Max 5 seconds
        ))
        .max_attempts(10)
        .retry_on_reconnect(true)    // Retry the request after reconnecting
        .connection_errors_only()    // Only reconnect on connection errors
        .on_state_change(|from, to| {
            println!("Connection: {:?} -> {:?}", from, to);
        })
        .build()
);

let service = layer.layer(my_service);
```

Full examples: reconnect.rs | reconnect_basic.rs | reconnect_custom_policy.rs
### Retry
Retry failed requests with exponential backoff and jitter:
```rust
use tower_resilience::retry::RetryLayer;
use std::time::Duration;

let layer = RetryLayer::<(), (), MyError>::builder()
    .max_attempts(5)
    .exponential_backoff(Duration::from_millis(100))
    .on_retry(|attempt, delay| {
        println!("Retrying (attempt {}, delay {:?})", attempt, delay);
    })
    .on_success(|attempts| {
        println!("Success after {} attempts", attempts);
    })
    .build();

let service = layer.layer(my_service);
```

Full examples: retry.rs | retry_example.rs
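The backoff schedule itself is easy to reason about in isolation: the delay doubles per attempt from a base, is capped at a maximum, and jitter spreads retries out so clients don't synchronize. An illustrative sketch (the tiny LCG stands in for a real RNG; none of this is the crate's code):

```rust
// Illustrative capped exponential backoff: base * 2^attempt, capped.
fn backoff_ms(base: u64, attempt: u32, cap: u64) -> u64 {
    base.saturating_mul(1u64 << attempt.min(20)).min(cap)
}

// "Full jitter": draw uniformly-ish from [0, delay] so concurrent
// retriers don't stampede in lockstep. A deterministic LCG keeps the
// sketch dependency-free; real code would use a proper RNG.
fn with_full_jitter(delay: u64, seed: &mut u64) -> u64 {
    *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    if delay == 0 { 0 } else { *seed % (delay + 1) }
}

fn main() {
    assert_eq!(backoff_ms(100, 0, 10_000), 100);
    assert_eq!(backoff_ms(100, 1, 10_000), 200);
    assert_eq!(backoff_ms(100, 3, 10_000), 800);
    assert_eq!(backoff_ms(100, 10, 10_000), 10_000); // capped

    let mut seed = 42;
    let jittered = with_full_jitter(backoff_ms(100, 3, 10_000), &mut seed);
    assert!(jittered <= 800); // always within the undithered delay
}
```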
### Router
Weighted traffic routing for canary deployments and progressive rollout:
```rust
use tower::util::BoxService;
use tower_resilience::router::WeightedRouter;

// Route 90% to stable, 10% to canary
let svc_v1: BoxService<String, String, MyError> =
    BoxService::new(tower::service_fn(|req: String| async move {
        Ok(format!("v1: {}", req))
    }));

let svc_v2: BoxService<String, String, MyError> =
    BoxService::new(tower::service_fn(|req: String| async move {
        Ok(format!("v2: {}", req))
    }));

let router = WeightedRouter::builder()
    .name("canary-deploy")
    .route(svc_v1, 90)
    .route(svc_v2, 10)
    .on_request_routed(|idx, weight| {
        println!("Routed to backend {} (weight: {})", idx, weight);
    })
    .build();
```

Key features:
- Deterministic (default): Atomic counter for exact, repeatable distribution
- Random: Probabilistic selection for high-volume statistical distribution
- Composable: Wrap each backend with circuit breakers, bulkheads, etc.
Note: Router is a standalone `Service`, not a `Layer`. Use `BoxService` to type-erase different backend implementations.
Full examples: router.rs
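The deterministic mode's atomic-counter selection can be sketched like this: a shared counter walks positions `0..total_weight`, and each position maps to the backend whose cumulative weight covers it. Illustrative only; the crate's internals may differ:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative deterministic weighted pick: position = counter % total,
// backend = first whose cumulative weight exceeds the position.
fn pick(counter: &AtomicUsize, weights: &[u32]) -> usize {
    let total: u32 = weights.iter().sum();
    let pos = (counter.fetch_add(1, Ordering::Relaxed) as u32) % total;
    let mut acc = 0;
    for (i, &w) in weights.iter().enumerate() {
        acc += w;
        if pos < acc {
            return i;
        }
    }
    unreachable!("pos is always < total")
}

fn main() {
    let counter = AtomicUsize::new(0);
    let weights = [90, 10]; // 90% stable, 10% canary
    let mut hits = [0u32; 2];
    for _ in 0..100 {
        hits[pick(&counter, &weights)] += 1;
    }
    // Exact, repeatable 90/10 split over one full cycle of the counter.
    assert_eq!(hits, [90, 10]);
}
```

Random mode trades this exactness for contention-free selection, which only matters at high volume where the statistical split converges anyway.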
### Time Limiter
Enforce timeouts on operations with configurable cancellation:
```rust
use tower_resilience::timelimiter::TimeLimiterLayer;
use std::time::Duration;

let layer = TimeLimiterLayer::builder()
    .timeout_duration(Duration::from_secs(30))
    .cancel_running_future(true)  // Cancel on timeout
    .on_timeout(|| {
        println!("Operation timed out!");
    })
    .build();

let service = layer.layer(my_service);
```

Full examples: timelimiter.rs | timelimiter_example.rs
## Error Handling

When composing multiple resilience layers, each layer has its own error type (e.g., `CircuitBreakerError`, `BulkheadError`). The `ResilienceError<E>` type unifies these into a single error type, eliminating boilerplate.
### The Problem
Without a unified error type, you'd need `From` implementations for every layer combination:

```rust
// Without ResilienceError: ~80 lines of boilerplate for 4 layers
impl From<BulkheadError> for ServiceError { /* ... */ }
impl From<CircuitBreakerError> for ServiceError { /* ... */ }
impl From<RateLimiterError> for ServiceError { /* ... */ }
impl From<TimeLimiterError> for ServiceError { /* ... */ }
```

### The Solution
Use `ResilienceError<E>` as your service error type - all layer errors automatically convert:
```rust
use tower_resilience::core::ResilienceError;

// Your application error
#[derive(Debug, Clone)]
enum AppError {
    DatabaseDown,
    InvalidRequest,
}

// That's it! Zero From implementations needed
type ServiceError = ResilienceError<AppError>;
```

### Pattern Matching
Handle different failure modes explicitly:
```rust
use tower_resilience::core::ResilienceError;

fn handle_error<E: std::fmt::Display>(error: ResilienceError<E>) {
    match error {
        ResilienceError::Timeout { layer } => {
            eprintln!("Timeout in {}", layer);
        }
        ResilienceError::CircuitOpen { name } => {
            eprintln!("Circuit breaker {:?} is open - fail fast", name);
        }
        ResilienceError::BulkheadFull { concurrent_calls, max_concurrent } => {
            eprintln!("Bulkhead full: {}/{} - try again later", concurrent_calls, max_concurrent);
        }
        ResilienceError::RateLimited { retry_after } => {
            if let Some(duration) = retry_after {
                eprintln!("Rate limited, retry after {:?}", duration);
            }
        }
        ResilienceError::Application(app_err) => {
            eprintln!("Application error: {}", app_err);
        }
    }
}
```

### Helper Methods
Quickly check error categories:
```rust
if err.is_timeout() {
    // Handle timeout from any layer (TimeLimiter or Bulkhead)
}

if err.is_circuit_open() {
    // Circuit breaker is protecting the system
}

if err.is_rate_limited() {
    // Backpressure - slow down
}

if err.is_application() {
    // Get the underlying application error
    if let Some(app_err) = err.application_error() {
        // Handle app-specific error
    }
}
```

### When to Use
Use `ResilienceError<E>` when:
- Building new services with multiple resilience layers
- You want zero boilerplate error handling
- Standard error categorization is sufficient
Use manual `From` implementations when:
- You need very specific error semantics
- Integrating with legacy error types
- You need specialized error logging per layer
See the `tower_resilience_core::error` module for full documentation.
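To make the mechanism concrete, here is a heavily simplified sketch of the unification idea, with hypothetical names (`ErrorSketch`, `TimeoutErr`) rather than the crate's real types: the library writes one `From` impl per layer error, so application code gets `?` conversion for free:

```rust
// Hypothetical, heavily simplified stand-in for ResilienceError<E>.
#[allow(dead_code)]
#[derive(Debug)]
enum ErrorSketch<E> {
    Timeout,
    CircuitOpen,
    Application(E),
}

// Hypothetical stand-in for a layer's error type.
struct TimeoutErr;

// The library provides this conversion once...
impl<E> From<TimeoutErr> for ErrorSketch<E> {
    fn from(_: TimeoutErr) -> Self {
        ErrorSketch::Timeout
    }
}

// ...so application code can use `?` with zero From impls of its own.
fn call_with_timeout<E>() -> Result<u32, ErrorSketch<E>> {
    Ok(Err(TimeoutErr)?) // TimeoutErr converts via From automatically
}

fn main() {
    let err = call_with_timeout::<String>().unwrap_err();
    assert!(matches!(err, ErrorSketch::Timeout));
}
```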
## Pattern Composition
Stack multiple patterns for comprehensive resilience:
```rust
use tower::ServiceBuilder;

// Client-side: timeout -> circuit breaker -> retry
let client = ServiceBuilder::new()
    .layer(timeout_layer)
    .layer(circuit_breaker_layer)
    .layer(retry_layer)
    .service(http_client);

// Server-side: rate limit -> bulkhead -> timeout
let server = ServiceBuilder::new()
    .layer(rate_limiter_layer)
    .layer(bulkhead_layer)
    .layer(timeout_layer)
    .service(handler);
```

For comprehensive guidance on composing patterns effectively, see:
- Composition Guide - Pattern selection, recommended stacks, layer ordering, and anti-patterns
- Composition Tests - Working examples of all documented stacks that verify correct compilation
## Benchmarks
Happy path overhead (no failures triggered):
| Pattern | Overhead |
|---|---|
| Retry (no retries) | ~80-100 ns |
| Time Limiter | ~107 ns |
| Rate Limiter | ~124 ns |
| Bulkhead | ~162 ns |
| Cache (hit) | ~250 ns |
| Circuit Breaker (closed) | ~298 ns |
```shell
cargo bench --bench happy_path_overhead
```

## Examples
```shell
cargo run --example circuitbreaker
cargo run --example bulkhead
cargo run --example retry
```

See examples/ for more.
## Stress Tests
```shell
cargo test --test stress -- --ignored
```

## MSRV
1.64.0 (matches Tower)
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
## Contributing
Contributions are welcome! Please see the contributing guidelines for more information.