Context
Problem Statement
HDIM required an enterprise event streaming platform that provides asynchronous communication between microservices, event replay, and guaranteed message delivery. The platform needed to let one service publish an event and multiple services consume it reliably.
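The requirements above (publish once, consume in several services, replay later) can be illustrated with a toy in-memory log. This is a minimal sketch and not a Kafka client: the `EventLog` class, the event type names, and the service names are hypothetical and exist only to show why an append-only log with per-consumer offsets satisfies all three needs.

```python
from collections import defaultdict


class EventLog:
    """Toy in-memory stand-in for a single topic partition (hypothetical)."""

    def __init__(self):
        self._events = []                 # append-only sequence of events
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, event):
        """Append an event and return the offset it was stored at."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group):
        """Return events this group has not read yet and advance its offset."""
        start = self._offsets[group]
        batch = self._events[start:]
        self._offsets[group] = len(self._events)
        return batch

    def seek(self, group, offset):
        """Move a group's offset so retained events can be read again."""
        if not 0 <= offset <= len(self._events):
            raise ValueError("offset out of range")
        self._offsets[group] = offset


log = EventLog()
log.publish({"type": "patient.created", "id": 1})  # hypothetical event types
log.publish({"type": "patient.updated", "id": 1})

# Two hypothetical services read the same events independently.
care_gap_batch = log.poll("care-gap-service")
quality_batch = log.poll("quality-measure-service")

# Replay: rewind one group to the start and read everything again.
log.seek("care-gap-service", 0)
replayed = log.poll("care-gap-service")
```

Reading does not remove events from the log, so each consumer group tracks its own position and a rewind is enough to replay history.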
Specific challenges:
Background
Phase 4 context (Oct 2025):
Assumptions
Options Considered
Option 1: Apache Kafka 3.x Cluster
Description: Deploy a 3-broker Kafka cluster with replication factor 3, 30-day retention, and automatic topic creation
Pros:
Cons:
Estimated Effort: 1 week deployment
Risk Level: Low (proven, well-documented)
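The Option 1 parameters can be worked through as simple arithmetic. The sketch below uses the replication factor and retention from the description and the minimum in-sync replica count of 2 from the Docker Compose settings; it assumes producers use `acks=all`, which this document does not state. Kafka expresses time-based retention in hours, minutes, or milliseconds, so the 30-day figure is converted as well.

```python
# Values from the Option 1 description and the Docker Compose settings.
BROKERS = 3
REPLICATION_FACTOR = 3
MIN_INSYNC_REPLICAS = 2
RETENTION_DAYS = 30

# Assumption: producers use acks=all. A write is then accepted only while
# at least MIN_INSYNC_REPLICAS replicas are in sync, so this many replicas
# may be offline before writes are rejected.
replicas_down_while_writable = REPLICATION_FACTOR - MIN_INSYNC_REPLICAS

# An acknowledged write exists on at least MIN_INSYNC_REPLICAS replicas,
# so it is guaranteed to survive the loss of this many of them.
guaranteed_survivable_losses = MIN_INSYNC_REPLICAS - 1

# Retention converted to the units Kafka's retention settings use.
retention_hours = RETENTION_DAYS * 24
retention_ms = retention_hours * 60 * 60 * 1000

print(replicas_down_while_writable, guaranteed_survivable_losses)
print(retention_hours, retention_ms)
```

Under these assumptions the cluster keeps accepting writes with one broker down, and 30 days corresponds to 720 hours.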
Option 2: RabbitMQ with Classic Queue
Description: Deploy a RabbitMQ cluster for message queuing with persistent queues
Pros:
Cons:
Estimated Effort: 1 week
Risk Level: High (does not meet the event replay requirement)
Option 3: AWS SQS/SNS
Description: Use managed AWS services for event streaming
Pros:
Cons:
Estimated Effort: 2 weeks (cloud integration)
Risk Level: Medium (vendor lock-in)
Decision
We chose Option 1 (Apache Kafka 3.x) because:
Consequences
Positive
Negative
Implementation
Configuration
Cluster Setup:
Topics:
patient.events - Patient lifecycle
quality-measure.events - Measure evaluations
care-gap.events - Gap detections
clinical-workflow.events - Workflow updates
Docker Compose
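kafka:
  image: confluentinc/cp-kafka:7.5.0
  environment:
    KAFKA_BROKER_ID: 1
    # Maps to default.replication.factor, used for automatically created topics
    KAFKA_DEFAULT_REPLICATION_FACTOR: 3
    KAFKA_MIN_INSYNC_REPLICAS: 2
    # Kafka has no log.retention.days setting; 30 days = 720 hours
    KAFKA_LOG_RETENTION_HOURS: 720
    KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
Success Criteria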
Monitoring & Validation
Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Broker availability | 99.9% | 99.95% |
| Message retention | 30 days | 30 days |
| Replication lag (p99) | <10ms | 2-5ms |
| Consumer lag | <5sec | 1-3sec |
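The table can also be checked mechanically. The following is a minimal sketch that compares the worst value in each "Current" range against its target; the comparison direction for availability and retention ("at least") is an assumption, since the table lists those targets as plain values, and the `meets_target` helper is hypothetical.

```python
# (metric, comparison, target, worst current value from the table)
metrics = [
    ("Broker availability (%)", ">=", 99.9, 99.95),
    ("Message retention (days)", ">=", 30, 30),
    ("Replication lag p99 (ms)", "<", 10, 5),
    ("Consumer lag (s)", "<", 5, 3),
]


def meets_target(comparison, target, observed):
    """Return True when the observed value satisfies the target."""
    if comparison == ">=":
        return observed >= target
    if comparison == "<":
        return observed < target
    raise ValueError(f"unsupported comparison: {comparison}")


results = {
    name: meets_target(comparison, target, observed)
    for name, comparison, target, observed in metrics
}
all_met = all(results.values())

for name, ok in results.items():
    print(f"{name}: {'OK' if ok else 'MISSED'}")
```

Using the upper end of each observed range keeps the check conservative: a metric passes only if its worst reported value is still within target.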
References
Footer
ADR #: 003
Version: 1.0
Last Updated: 2026-01-19
Status: Active and Deployed
_Created: January 19, 2026_
_Decision Date: October 2025 (Phase 4)_