Context and Problem Statement
HDIM's 28 microservices require asynchronous communication for:
The messaging solution must support:
Decision Drivers
Considered Options
Decision Outcome
Chosen option: "Apache Kafka 3.x"
Rationale: Apache Kafka provides the best combination of:
Consequences
Positive
Negative
Mitigations:
Neutral
Pros and Cons of Options
Option 1: Apache Kafka 3.x
Distributed event streaming platform from Apache.
| Criterion | Assessment |
|-----------|------------|
| Durability | Good - Persistent storage with configurable retention |
| Throughput | Good - Millions of messages/second proven |
| Spring Integration | Good - Spring Kafka is mature and well-documented |
| Multi-tenancy | Good - Topic-level isolation, ACLs available |
| Operational Complexity | Neutral - Requires cluster management |
| Healthcare Adoption | Good - Widely used in healthcare (Epic, Cerner integrations) |
Summary: Industry standard for event streaming with excellent durability and throughput.
Option 2: RabbitMQ
Traditional AMQP message broker.
| Criterion | Assessment |
|-----------|------------|
| Durability | Neutral - Persistent queues available but less robust than Kafka |
| Throughput | Neutral - Good for moderate loads, not designed for massive scale |
| Spring Integration | Good - Spring AMQP is mature |
| Multi-tenancy | Neutral - Virtual hosts provide isolation |
| Operational Complexity | Good - Simpler than Kafka |
| Message Replay | Bad - No built-in replay capability |
Summary: Simpler to operate, but its durability is weaker than Kafka's and it has no built-in replay, both of which are critical for healthcare.
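Message replay is a deciding criterion in this comparison, so a toy illustration may help. The sketch below is written in Python and is not broker code: `EventLog` and `WorkQueue` are hypothetical in-memory stand-ins, assuming conventional queue semantics where a message is removed once it is consumed and log semantics where records stay in place and consumers choose the offset to read from.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Toy append-only log: records stay in place after being read."""
    records: list = field(default_factory=list)

    def append(self, event: str) -> int:
        self.records.append(event)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset: int) -> list:
        # Any consumer can re-read from any retained offset.
        return self.records[offset:]


@dataclass
class WorkQueue:
    """Toy queue: a message is gone once it has been consumed."""
    messages: list = field(default_factory=list)

    def publish(self, event: str) -> None:
        self.messages.append(event)

    def consume(self) -> str:
        return self.messages.pop(0)


log, queue = EventLog(), WorkQueue()
for event in ("patient.created", "patient.updated"):
    log.append(event)
    queue.publish(event)

first_pass = log.read_from(0)
replayed = log.read_from(0)  # rewinding to offset 0 returns the same events
drained = [queue.consume(), queue.consume()]  # the queue is now empty
```

With the log, a consumer that was deployed late or had a bug can rewind and reprocess history; with the queue, that history has to be archived somewhere else first.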
Option 3: Amazon SQS/SNS
AWS managed messaging services.
| Criterion | Assessment |
|-----------|------------|
| Durability | Good - AWS managed, highly durable |
| Throughput | Good - Scales automatically |
| Spring Integration | Neutral - Requires AWS SDK |
| Multi-tenancy | Good - Queue-level isolation |
| Vendor Lock-in | Bad - AWS-specific |
| Message Replay | Bad - No built-in replay (requires separate S3 archival) |
Summary: Good managed option but creates AWS dependency and lacks replay.
Option 4: Apache Pulsar
Cloud-native messaging and streaming platform.
| Criterion | Assessment |
|-----------|------------|
| Durability | Good - BookKeeper provides strong durability |
| Throughput | Good - Designed for high throughput |
| Spring Integration | Neutral - Less mature than Spring Kafka |
| Multi-tenancy | Good - Native multi-tenancy support |
| Operational Complexity | Bad - More complex than Kafka |
| Healthcare Adoption | Neutral - Less common than Kafka in healthcare |
Summary: Technically capable but smaller ecosystem and less healthcare adoption.
Option 5: Redis Streams
Lightweight streaming built into Redis.
| Criterion | Assessment |
|-----------|------------|
| Durability | Neutral - Depends on Redis persistence configuration |
| Throughput | Good - In-memory store with low-latency reads and writes |
| Spring Integration | Good - Spring Data Redis supports Streams |
| Multi-tenancy | Neutral - Key-based isolation |
| Operational Complexity | Good - Already running Redis for caching |
| Feature Set | Bad - Limited compared to Kafka (no compaction, limited retention) |
Summary: Simple but lacks enterprise features needed for healthcare messaging.
Implementation Notes
Version Selected
Apache Kafka 3.6.x - Latest stable release
Deployment Model
Topic Naming Convention
{domain}.{event-type}
Examples:
- patient.created
- patient.updated
- measure.evaluation.completed
- care-gap.detected
- audit.phi-access
Partition Strategy
| Topic Pattern | Partition Key | Rationale |
|---------------|---------------|-----------|
| patient.* | patientId | Ensures patient event ordering |
| measure.* | tenantId + measureId | Tenant isolation, measure grouping |
| care-gap.* | tenantId + patientId | Tenant isolation, patient ordering |
| audit.* | tenantId | Tenant isolation |
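The partition strategy above can be sketched as a lookup from topic pattern to key fields. This is a minimal Python illustration, not the services' actual producer code: the `partition_key` helper, the `:` separator for composite keys, and the sample event are all assumptions, since the ADR only names the fields.

```python
from fnmatch import fnmatchcase

# Topic pattern -> fields that make up the partition key (from the table above).
PARTITION_KEY_FIELDS = {
    "patient.*": ("patientId",),
    "measure.*": ("tenantId", "measureId"),
    "care-gap.*": ("tenantId", "patientId"),
    "audit.*": ("tenantId",),
}

KEY_SEPARATOR = ":"  # assumption: the ADR does not say how composite keys are joined


def partition_key(topic: str, event: dict) -> str:
    """Build the record key for `topic` from the fields of `event`."""
    for pattern, fields in PARTITION_KEY_FIELDS.items():
        if fnmatchcase(topic, pattern):
            return KEY_SEPARATOR.join(str(event[name]) for name in fields)
    raise ValueError(f"no partition strategy defined for topic {topic!r}")


# Hypothetical event payload used only for illustration.
event = {"tenantId": "tenant-a", "patientId": "p-123", "measureId": "m-7"}
patient_key = partition_key("patient.updated", event)
measure_key = partition_key("measure.evaluation.completed", event)
gap_key = partition_key("care-gap.detected", event)
```

Kafka's default partitioner hashes the record key, so records with equal keys land on the same partition and keep their relative order as long as the partition count does not change.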
Configuration
Topic retention
kafka:
  topics:
    patient-events:
      retention-ms: 604800000 # 7 days
    audit-events:
      retention-ms: 2592000000 # 30 days (HIPAA requirement)
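The millisecond values in this configuration are easy to mistype, so a small conversion check is useful. The Python sketch below uses a hypothetical `retention_ms` helper to derive both values from the day counts in the comments. The `kafka.topics.*.retention-ms` keys appear to be application-level properties; Kafka's own topic-level setting is named `retention.ms`.

```python
from datetime import timedelta


def retention_ms(days: int) -> int:
    """Convert a retention period in whole days to milliseconds."""
    return timedelta(days=days) // timedelta(milliseconds=1)


patient_events_ms = retention_ms(7)   # 7 days
audit_events_ms = retention_ms(30)    # 30 days
```

Both results match the configured values: 604800000 for patient events and 2592000000 for audit events.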
Key Topics
| Topic | Purpose | Producers | Consumers |
|-------|---------|-----------|-----------|
| patient.events | Patient lifecycle events | Patient Service | Care Gap, Quality Measure, Analytics |
| measure.evaluation.completed | Measure results | Quality Measure | Care Gap, Analytics, Notification |
| care-gap.detected | New care gaps | Care Gap Service | Notification, Analytics |
| audit.phi-access | PHI access events | All services | Audit Service |
| notification.requests | Notification triggers | Various | Notification Service |
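A naming check can keep new topics aligned with the `{domain}.{event-type}` convention. The Python sketch below assumes one reading of that convention: lowercase segments of letters, digits, and hyphens, joined by dots, with at least two segments so that names such as `measure.evaluation.completed` qualify. The regular expression and the `is_valid_topic` helper are illustrative, not part of the ADR.

```python
import re

# Assumed reading of {domain}.{event-type}: lowercase segments of letters,
# digits and hyphens, joined by dots, with at least two segments.
TOPIC_NAME = re.compile(r"[a-z][a-z0-9-]*(\.[a-z][a-z0-9-]*)+")


def is_valid_topic(name: str) -> bool:
    """Return True if `name` follows the assumed naming convention."""
    return TOPIC_NAME.fullmatch(name) is not None


# Topic names copied from the Key Topics table.
key_topics = [
    "patient.events",
    "measure.evaluation.completed",
    "care-gap.detected",
    "audit.phi-access",
    "notification.requests",
]
invalid = [name for name in key_topics if not is_valid_topic(name)]
```

Under this reading every topic in the Key Topics table passes, while a hyphen-only name such as `patient-events` would be rejected.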
Performance Targets
| Metric | Target | Actual (Dec 2024) |
|--------|--------|-------------------|
| Producer Throughput | 5,000 msg/sec | 6,200 msg/sec |
| Consumer Latency (p95) | <100ms | 75ms |
| End-to-End Latency | <500ms | 350ms |
| Message Loss Rate | 0% | 0% |
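The comparison in this table can be expressed as a simple automated check. The Python sketch below copies the targets and December 2024 actuals from the table; treating the throughput target as a minimum and the loss-rate target as an exact value is an assumption, and the `CHECKS` structure is illustrative only.

```python
import operator

# (metric, comparison applied as compare(actual, target), target, actual)
CHECKS = [
    ("producer throughput (msg/sec)", operator.ge, 5000, 6200),  # assumed minimum
    ("consumer latency p95 (ms)", operator.lt, 100, 75),
    ("end-to-end latency (ms)", operator.lt, 500, 350),
    ("message loss rate (%)", operator.eq, 0, 0),  # assumed exact match
]

# Metrics whose actual value does not satisfy the target.
failures = [
    name for name, compare, target, actual in CHECKS
    if not compare(actual, target)
]

# Fraction by which measured producer throughput exceeds its target.
throughput_headroom = (6200 - 5000) / 5000
```

All four metrics meet their targets, and producer throughput is 24% above the target.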
Links
Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-Q3 | Architecture Team | Initial decision |
| 1.1 | 2024-12-30 | Architecture Team | Added performance actuals, topic details |
*This ADR follows the template in /docs/templates/ADR_TEMPLATE.md*