Context
Problem Statement:
The HDIM microservices architecture lacked resilience patterns, leaving it exposed to cascading failures.
Business Context:
Technical Context:
Decision
We will implement Resilience4j circuit breakers, retry policies, and rate limiters across all services that make external calls.
Specific Implementation:
Circuit Breaker:
- 50% failure rate threshold
- 30-second wait in open state
- 3 calls permitted in half-open state
- Sliding window of 10 calls
Retry Policy:
- Maximum of 3 attempts (the initial call counts as the first attempt)
- 2-second base wait duration
- 2x exponential backoff multiplier
- Retry on: IOException, SocketTimeoutException
Rate Limiter:
- Default limit of 100 requests per second
- 5-second timeout when waiting to acquire a rate limit permit
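To make the interaction of these numbers more concrete, the following is a minimal plain-Java sketch of the retry wait schedule and a count-based circuit breaker driven by the values above. It is not Resilience4j's implementation and does not use its API. The names `ResilienceSketch`, `Breaker`, `retryWaits`, `tryAcquire`, and `record` are hypothetical. The sketch assumes that the failure rate is evaluated only once the window is full, that a rate equal to the threshold opens the circuit, and that the caller supplies the clock.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative sketch only; this is not how Resilience4j is implemented. */
final class ResilienceSketch {

    /** Waits before each retry: base, base * multiplier, and so on. */
    static List<Duration> retryWaits(int maxAttempts, Duration base, double multiplier) {
        List<Duration> waits = new ArrayList<>();
        double millis = base.toMillis();
        // The first attempt is the original call, so only maxAttempts - 1 waits occur.
        for (int retry = 1; retry < maxAttempts; retry++) {
            waits.add(Duration.ofMillis(Math.round(millis)));
            millis *= multiplier;
        }
        return waits;
    }

    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Count-based breaker; the caller passes the current time to keep it deterministic. */
    static final class Breaker {
        private final int windowSize;
        private final double failureRateThreshold; // percent
        private final long openWaitMillis;
        private final int halfOpenCalls;
        private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
        private State state = State.CLOSED;
        private long openedAtMillis;
        private int halfOpenPermitsUsed;

        Breaker(int windowSize, double failureRateThreshold, long openWaitMillis, int halfOpenCalls) {
            this.windowSize = windowSize;
            this.failureRateThreshold = failureRateThreshold;
            this.openWaitMillis = openWaitMillis;
            this.halfOpenCalls = halfOpenCalls;
        }

        State state() {
            return state;
        }

        /** Returns true if a call may proceed at time nowMillis. */
        boolean tryAcquire(long nowMillis) {
            if (state == State.OPEN && nowMillis - openedAtMillis >= openWaitMillis) {
                state = State.HALF_OPEN; // wait elapsed: allow a few trial calls
                outcomes.clear();
                halfOpenPermitsUsed = 0;
            }
            if (state == State.OPEN) {
                return false;
            }
            if (state == State.HALF_OPEN) {
                if (halfOpenPermitsUsed >= halfOpenCalls) {
                    return false;
                }
                halfOpenPermitsUsed++;
            }
            return true;
        }

        /** Records the outcome of a permitted call and updates the state. */
        void record(boolean failed, long nowMillis) {
            if (state == State.OPEN) {
                return; // results arriving after the breaker opened are ignored
            }
            int required = (state == State.HALF_OPEN) ? halfOpenCalls : windowSize;
            outcomes.addLast(failed);
            if (outcomes.size() > required) {
                outcomes.removeFirst(); // keep only the most recent calls
            }
            if (outcomes.size() < required) {
                return; // assumption: no evaluation until the window is full
            }
            long failures = outcomes.stream().filter(Boolean::booleanValue).count();
            double failureRate = 100.0 * failures / outcomes.size();
            if (failureRate >= failureRateThreshold) {
                state = State.OPEN;
                openedAtMillis = nowMillis;
                outcomes.clear();
            } else if (state == State.HALF_OPEN) {
                state = State.CLOSED;
                outcomes.clear();
            }
        }
    }

    public static void main(String[] args) {
        // Values taken from the decision above.
        System.out.println(retryWaits(3, Duration.ofSeconds(2), 2.0));
        Breaker breaker = new Breaker(10, 50.0, 30_000L, 3);
        for (int i = 0; i < 10; i++) {
            breaker.record(i >= 5, 0L); // 5 successes followed by 5 failures
        }
        System.out.println(breaker.state());
    }
}
```

With the values above, the schedule contains two waits (2 seconds, then 4 seconds), so a call that fails all three attempts spends about 6 seconds waiting in addition to the time taken by the calls themselves. In the sketch, five failures within a full window of ten calls reach the 50% threshold and open the circuit, after which calls are rejected for 30 seconds.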
Alternatives Considered
Alternative 1: Hystrix (Netflix)
Description: Netflix's original circuit breaker library
Pros:
Cons:
Why Not Chosen: Hystrix is in maintenance mode and no longer actively developed; Resilience4j is the recommended successor
Alternative 2: Spring Cloud Circuit Breaker
Description: Spring's abstraction over circuit breaker implementations
Pros:
Cons:
Why Not Chosen: Using Resilience4j directly provides more control and simpler debugging
Alternative 3: Service Mesh (Istio/Linkerd)
Description: Infrastructure-level resilience
Pros:
Cons:
Why Not Chosen: Too heavy for the current deployment model; can be added later
Consequences
Positive Consequences
Negative Consequences
Mitigation
Configuration
resilience4j:
  circuitbreaker:
    instances:
      fhirService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
  retry:
    instances:
      fhirService:
        maxAttempts: 3
        waitDuration: 2s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
Implementation Plan
Files Modified
Dependencies Added:
- backend/modules/services/cql-engine-service/build.gradle.kts
- backend/modules/services/fhir-service/build.gradle.kts
- backend/modules/services/care-gap-service/build.gradle.kts
- backend/modules/services/patient-service/build.gradle.kts
- backend/modules/services/gateway-service/build.gradle.kts
Configuration Added:
- backend/modules/services/cql-engine-service/src/main/resources/application.yml
- backend/modules/services/gateway-service/src/main/resources/application.yml
Success Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Circuit open incidents | <5/day | Prometheus counter |
| Mean time to recovery | <60s | Circuit open duration |
| Cascading failures | 0 | Incident reports |
| Retry success rate | >90% | Retry metrics |