Context
Problem Statement:
The HDIM API Gateway needed comprehensive resilience patterns:
Business Context:
Technical Context:
Decision
We will implement a custom Spring Boot API Gateway that combines Resilience4j circuit breakers, Bucket4j rate limiting, and path-based service routing with retries.
Specific Implementation:
Rate limiting (Bucket4j):
- Per-user limits (authenticated requests)
- Per-tenant limits (aggregate across users)
- Per-IP limits (unauthenticated/fallback)
- Different limits by endpoint tier (standard/premium)
Circuit breakers (Resilience4j):
- Individual circuit breakers for each backend
- Custom fallback responses per service
- Health indicator integration
Service routing:
- Path-based routing to backend services
- Load balancing across service instances
- Retry with exponential backoff
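The per-user, per-tenant, and per-IP limits listed above imply that each request is mapped to one or more bucket keys. The ADR does not show that mapping, so the following is only a minimal sketch under assumptions: the class `RateLimitKeyResolver`, its method names, and the `user:`/`tenant:`/`ip:` key prefixes are hypothetical and are not taken from the gateway code.

```java
import java.util.Optional;

/** Hypothetical helper (not from the gateway code): picks the bucket keys for a request. */
final class RateLimitKeyResolver {

    private RateLimitKeyResolver() {}

    /**
     * Key for the per-request limit.
     * Authenticated requests are keyed by user; anything else falls back to the client IP.
     */
    static String requestKey(String userId, String clientIp) {
        if (userId != null && !userId.isBlank()) {
            return "user:" + userId;
        }
        return "ip:" + clientIp;
    }

    /** Key for the tenant-wide aggregate limit, or empty when no tenant is known. */
    static Optional<String> tenantKey(String tenantId) {
        if (tenantId == null || tenantId.isBlank()) {
            return Optional.empty();
        }
        return Optional.of("tenant:" + tenantId);
    }
}
```

With keys like these, a filter could consult the user or IP bucket first and then the tenant bucket, rejecting the request if either is exhausted. Whether the actual filter checks both is not stated in this ADR.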
```
Request → RateLimitFilter → Auth → CircuitBreaker → Retry → Backend
```
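The Retry stage in the pipeline above uses exponential backoff. This ADR does not state the base delay, multiplier, or cap, so the sketch below only illustrates the arithmetic. The class `ExponentialBackoff` and the values in the usage note are assumptions, not the gateway's actual retry settings.

```java
import java.time.Duration;

/** Hypothetical backoff calculator; illustrates the arithmetic only. */
final class ExponentialBackoff {
    private final Duration baseDelay;
    private final double multiplier;
    private final Duration maxDelay;

    ExponentialBackoff(Duration baseDelay, double multiplier, Duration maxDelay) {
        this.baseDelay = baseDelay;
        this.multiplier = multiplier;
        this.maxDelay = maxDelay;
    }

    /**
     * Delay to wait before retry number {@code attempt} (1-based):
     * baseDelay * multiplier^(attempt - 1), capped at maxDelay.
     */
    Duration delayFor(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        double millis = baseDelay.toMillis() * Math.pow(multiplier, attempt - 1);
        long capped = (long) Math.min(millis, maxDelay.toMillis());
        return Duration.ofMillis(capped);
    }
}
```

For example, with an assumed base of 100 ms, a multiplier of 2, and a 1 s cap, the first four retries wait 100, 200, 400, and 800 ms, and every later retry waits 1 s. In the gateway itself the delays would come from the Resilience4j retry configuration rather than a hand-written class.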
Alternatives Considered
Alternative 1: Kong API Gateway
Description: Open-source API gateway with plugin architecture
Pros:
Cons:
Why Not Chosen: A custom Spring Boot gateway provides more control and integrates with the existing codebase.
Alternative 2: AWS API Gateway
Description: Fully managed AWS service
Pros:
Cons:
Why Not Chosen: We need a cloud-agnostic solution that can host custom business logic.
Alternative 3: Envoy Proxy
Description: High-performance edge proxy
Pros:
Cons:
Why Not Chosen: Its complexity is overkill; a Spring Boot gateway is sufficient at the current scale.
Consequences
Positive Consequences
Negative Consequences
Mitigation
Configuration
Rate Limiting (RateLimitFilter.java)
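```java
@Component
public class RateLimitFilter extends OncePerRequestFilter {

    // One bucket per user key, created lazily; ConcurrentHashMap makes lookups thread-safe.
    private final Map<String, Bucket> userBuckets = new ConcurrentHashMap<>();

    // doFilterInternal(...) is omitted from this excerpt.

    private Bucket createBucket(RateLimitTier tier) {
        return Bucket.builder()
            // First limit: tokens refill continuously at requestsPerSecond.
            .addLimit(Bandwidth.classic(tier.getRequestsPerSecond(),
                Refill.greedy(tier.getRequestsPerSecond(), Duration.ofSeconds(1))))
            // Second limit: burstCapacity tokens, refilled all at once every minute.
            .addLimit(Bandwidth.classic(tier.getBurstCapacity(),
                Refill.intervally(tier.getBurstCapacity(), Duration.ofMinutes(1))))
            .build();
    }
}
```
Circuit Breaker Configuration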
```yaml
resilience4j:
  circuitbreaker:
    instances:
      cqlEngine:
        registerHealthIndicator: true
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        slidingWindowType: COUNT_BASED
        minimumNumberOfCalls: 5
      qualityMeasure:
        # ... similar config
      fhirService:
        # ... similar config
      patientService:
        # ... similar config
      careGapService:
        # ... similar config
```
Service Routing
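Before the routing code, it may help to see how the `cqlEngine` circuit breaker settings translate into state transitions. The following is a simplified, single-threaded sketch of a count-based breaker and is not Resilience4j's implementation. The class `SimpleCircuitBreaker` and its methods are hypothetical, and time is passed in explicitly so the behaviour is easy to trace.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified count-based circuit breaker; an illustration, not Resilience4j's code. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // slidingWindowSize
    private final int minimumCalls;        // minimumNumberOfCalls
    private final double failureThreshold; // failureRateThreshold, in percent
    private final long waitMillis;         // waitDurationInOpenState
    private final int halfOpenCalls;       // permittedNumberOfCallsInHalfOpenState

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;
    private int halfOpenPermitsUsed;

    SimpleCircuitBreaker(int windowSize, int minimumCalls, double failureThreshold,
                         long waitMillis, int halfOpenCalls) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
        this.halfOpenCalls = halfOpenCalls;
    }

    /** Returns true if a call may proceed at time nowMillis. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < waitMillis) {
                return false; // still inside the open-state wait
            }
            state = State.HALF_OPEN; // wait elapsed: allow a few trial calls
            window.clear();
            halfOpenPermitsUsed = 0;
        }
        if (state == State.HALF_OPEN) {
            if (halfOpenPermitsUsed >= halfOpenCalls) {
                return false;
            }
            halfOpenPermitsUsed++;
        }
        return true;
    }

    /** Records the outcome of a permitted call and updates the state. */
    void record(boolean failed, long nowMillis) {
        boolean halfOpen = state == State.HALF_OPEN;
        window.addLast(failed);
        if (window.size() > (halfOpen ? halfOpenCalls : windowSize)) {
            window.removeFirst(); // keep only the most recent calls
        }
        if (window.size() < (halfOpen ? halfOpenCalls : minimumCalls)) {
            return; // not enough calls to evaluate the failure rate yet
        }
        long failures = window.stream().filter(f -> f).count();
        double rate = 100.0 * failures / window.size();
        if (rate >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
            window.clear();
        } else if (halfOpen) {
            state = State.CLOSED;
            window.clear();
        }
    }

    State state() {
        return state;
    }
}
```

With the values from the configuration (window 10, minimum 5 calls, 50% threshold, 30 s wait, 3 half-open calls), five consecutive failures open the breaker, calls are rejected for the next 30 seconds, and three successful trial calls close it again.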
```java
@Service
public class ServiceRoutingService {

    // cqlEngineUrl and executeRequest(...) are defined elsewhere in the class (omitted here).

    @CircuitBreaker(name = "cqlEngine", fallbackMethod = "cqlEngineFallback")
    @Retry(name = "cqlEngine")
    @TimeLimiter(name = "cqlEngine")
    public CompletableFuture<ResponseEntity<String>> routeToCqlEngine(
            String path, HttpMethod method, String body, HttpHeaders headers) {
        // Run the backend call asynchronously so the TimeLimiter can bound its duration.
        return CompletableFuture.supplyAsync(() ->
            executeRequest(cqlEngineUrl + path, method, body, headers));
    }

    // Fallback: same parameters as the routed method plus the triggering Throwable.
    public CompletableFuture<ResponseEntity<String>> cqlEngineFallback(
            String path, HttpMethod method, String body,
            HttpHeaders headers, Throwable t) {
        return CompletableFuture.completedFuture(
            ResponseEntity.status(503)
                .body("{\"error\": \"CQL Engine temporarily unavailable\"}"));
    }
}
```
Rate Limit Tiers
| Tier | Requests/Second | Burst | Use Case |
|------|-----------------|-------|----------|
| Standard | 100 | 150 | Regular API users |
| Premium | 500 | 750 | Enterprise tenants |
| Internal | 1000 | 1500 | Service-to-service |
| Anonymous | 10 | 20 | Unauthenticated |
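The `RateLimitTier` type used by `RateLimitFilter` is not shown in this ADR. A minimal sketch of how the table above could map onto it follows, assuming an enum with the two getters the filter calls. The `fromName` lookup and its fallback to `ANONYMOUS` are hypothetical additions for illustration.

```java
import java.util.Locale;

/** Sketch of the tier type referenced by RateLimitFilter; values come from the tier table. */
enum RateLimitTier {
    STANDARD(100, 150),    // Regular API users
    PREMIUM(500, 750),     // Enterprise tenants
    INTERNAL(1000, 1500),  // Service-to-service
    ANONYMOUS(10, 20);     // Unauthenticated

    private final long requestsPerSecond;
    private final long burstCapacity;

    RateLimitTier(long requestsPerSecond, long burstCapacity) {
        this.requestsPerSecond = requestsPerSecond;
        this.burstCapacity = burstCapacity;
    }

    long getRequestsPerSecond() {
        return requestsPerSecond;
    }

    long getBurstCapacity() {
        return burstCapacity;
    }

    /** Hypothetical lookup: missing or unknown tier names fall back to ANONYMOUS. */
    static RateLimitTier fromName(String name) {
        if (name == null) {
            return ANONYMOUS;
        }
        try {
            return valueOf(name.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return ANONYMOUS;
        }
    }
}
```

Falling back to the most restrictive tier is one possible choice; it keeps a misconfigured client from receiving more capacity than intended.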
Implementation Plan
Files Created/Modified
New Files:
- backend/modules/services/gateway-service/src/main/java/com/healthdata/gateway/filter/RateLimitFilter.java
- backend/modules/services/gateway-service/src/main/java/com/healthdata/gateway/service/ServiceRoutingService.java
Modified Files:
- backend/modules/services/gateway-service/build.gradle.kts - Added Resilience4j, Bucket4j
- backend/modules/services/gateway-service/src/main/resources/application.yml - Full resilience config
Success Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Gateway p99 latency overhead | <50 ms | APM metrics |
| Legitimate requests rejected by rate limiting | <1% | Prometheus counter |
| Circuit breaker opens | <5/day | Health indicators |
| Gateway availability | 99.99% | Uptime monitoring |