Organization: MedCore Health Network (Multi-Hospital Healthcare System)
Industry: Healthcare & Clinical Services
MedCore Health Network operates 12 hospitals and 40+ outpatient clinics across 3 regions, generating over 50 million clinical events annually. The organization faced critical challenges in clinical data analytics, including fragmented data sources, delayed insights (24-48 hours), and inability to respond to real-time clinical events such as sepsis alerts, bed capacity management, and patient deterioration warnings.
This case study demonstrates how a data analytics-first approach leveraging AWS EMR on EKS, Amazon MSK, AWS Glue, and modern DevOps practices transformed clinical data operations, enabling real-time healthcare intelligence.
Data Analytics Impact:
· Real-time clinical insights: 5–15-minute data freshness vs. 24–48-hour legacy delays
· 50M+ clinical events processed annually with 99.9% data quality validation
· Real-time sepsis detection enabling early intervention (previously 48-hour delay)
· 360-degree patient view unifying data from 15+ disparate healthcare systems
Operational Excellence:
· 45% infrastructure cost reduction through intelligent workload optimization
· Deployment velocity: 4+ hours reduced to under 2 minutes with zero downtime
· Incident resolution time: 2.5 hours reduced to under 15 minutes (MTTR improvement)
· Platform availability: 99.99% uptime for mission-critical clinical data services
MedCore Health Network Characteristics:
· Healthcare Facilities: 12 acute care hospitals, 40+ outpatient clinics, 5 specialty centers
· Clinical Staff: 8,000+ physicians, nurses, and allied health professionals
· Patient Population: 200,000+ patients under active care management
· Annual Patient Volume: 250,000+ inpatient admissions, 1.5M+ outpatient visits, 5,000+ emergency department encounters
· Data Generation Rate: 50M+ clinical events annually including admissions, discharges, lab results, vital signs, imaging studies, medication orders, and clinical documentation
Challenge 1: Clinical Data Fragmentation Preventing Real-Time Healthcare Delivery
Problem Statement:
Clinical, operational, and financial data fragmented across 15+ systems with no unified analytics framework. Nightly batch ETL processes created 24–48-hour data latency, preventing timely clinical decision-making and real-time patient care interventions.
Data Source Landscape: 15+ disparate systems spanning EHR/ADT feeds, laboratory, pharmacy, imaging (PACS/DICOM), and billing platforms.
Challenge 2: Batch Processing Limitations Preventing Scalable Analytics
Problem Statement:
The on-premises ETL server had fixed capacity and could not scale with growing clinical data volumes. Nightly batch jobs routinely exceeded the 6–8-hour processing window, causing cascading failures, and there was no elastic capacity for urgent analytics during hospital surge events.
Processing Bottlenecks:
· Patient demographics ETL: 1 hour
· Laboratory results aggregation: 2 hours
· Vital signs time-series processing: 3 hours
· Billing charge capture: 2 hours
· Total nightly window: 8 hours (must complete by 8 AM for clinical rounds)
Surge Event Failure: During influenza season, 3M+ events (vs. a normal 1.5M) pushed processing past the 8 AM deadline, delaying sepsis alerting and clinical decision support.
Challenge 3: Lack of Data Analytics Observability
Problem Statement:
No centralized visibility into data pipeline health, quality metrics, or processing performance. Pipeline failures detected hours after occurrence when business users report missing dashboard data.
Incident Example - Laboratory Results Missing:
· Detection delay: 2 hours (reported by lab staff, not automated alerting)
· Root cause analysis: 1.5 hours (manual log review across multiple servers)
· Resolution: 30 minutes (database quota increase, job restart)
· Total MTTR: 2.5 hours affecting 2,000+ lab results
Target Data Analytics Architecture

The redesigned platform follows data analytics-first principles, with cloud-native infrastructure supporting healthcare requirements:
1. Real-Time Data Integration: Event-driven CDC with 5–15-minute micro-batches
2. Unified Data Lake: Single source of truth for clinical, operational, and financial data
3. Schema-on-Read Analytics: Flexible data model supporting ad-hoc clinical research
4. Data Quality by Design: Automated validation at every stage
5. Compliance Automation: Security controls embedded in architecture
6. Elastic Analytics Compute: EMR on EKS scaling from 2 to 200+ nodes on demand
7. Infrastructure as Code: Complete platform reproducibility
Technology Stack: Data Analytics Components
EMR on EKS: Core Data Analytics Engine
Why EMR on EKS for Healthcare Analytics
Traditional EMR on EC2 requires infrastructure management (cluster sizing, node provisioning). Healthcare analytics workloads are highly variable - routine processing needs modest capacity, but surge events require immediate 10-20x scale-out.
EMR on EKS Benefits:
1. Unified Container Orchestration:
· Single Kubernetes cluster hosts real-time applications (Kafka consumers, APIs) and batch analytics (Spark jobs)
· Shared infrastructure eliminates separate EMR cluster management
· Kubernetes namespaces isolate production from development workloads
2. Elastic Analytics Compute:
· Dynamic Spark scaling: 2 executor pods baseline → 200+ pods during surge
· Kubernetes Cluster Autoscaler provisions EC2 nodes automatically
· Spot instance integration: 70% cost savings on Spark executors
3. Job Isolation & Multi-Tenancy:
· Namespace separation: clinical analytics in emr-clinical, financial in emr-finance
· Resource quotas prevent job monopolization
· Network policies isolate PHI-processing for HIPAA compliance
4. Simplified CI/CD:
· Spark applications packaged as container images in Amazon ECR
· ArgoCD GitOps: Spark configurations in Git, auto-synchronized
· Version control: rollback by reverting Git commit
EMR on EKS Configuration:
· EKS cluster: v1.34 with 3 node groups (system/driver/executor)
· EMR release: 6.15.0 (Spark 3.4.1, Hadoop 3.3.6)
· Namespaces: emr-clinical-prod, emr-clinical-stage, emr-clinical-dev
· IAM IRSA: Spark jobs assume roles for S3, Aurora access
· Autoscaling: 2 baseline nodes → 200+ during surge (5-minute scale-down delay)
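This configuration is exercised by submitting Spark jobs to the EMR virtual cluster. Below is a minimal sketch using the boto3 emr-containers API; the virtual cluster ID, role ARN, S3 paths, and Spark settings are placeholders, not MedCore's production values.

```python
import boto3

# Submit a Spark job to an EMR on EKS virtual cluster (all identifiers are placeholders).
emr = boto3.client("emr-containers", region_name="ap-south-1")

response = emr.start_job_run(
    name="lab-results-microbatch",
    virtualClusterId="<virtual-cluster-id>",  # mapped to the emr-clinical-prod namespace
    executionRoleArn="arn:aws:iam::<account>:role/emr-clinical-job-role",  # IRSA-backed role
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://medcore-artifacts/jobs/lab_results_etl.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.instances=10 "
                "--conf spark.executor.memory=16g "
                "--conf spark.executor.cores=4 "
                "--conf spark.dynamicAllocation.maxExecutors=50"
            ),
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://medcore-logs/emr-on-eks/"}
        }
    },
)
print("Job run id:", response["id"])
```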
Data Processing Patterns with EMR on EKS
Pattern 1: Real-Time Micro-Batch Processing
Use Case: Process clinical events from Kafka every 5-15 minutes instead of nightly 6–8-hour batches.
Airflow DAG Configuration:
· Schedule: Every 15 minutes during business hours, 30 minutes overnight
· Parallelism: Process multiple Kafka topics concurrently
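A minimal Airflow DAG sketch for this schedule is shown below. It assumes a recent apache-airflow-providers-amazon release that ships EmrContainerOperator, and it simplifies the business-hours/overnight split into a single 15-minute cadence (the dual cadence would need a custom timetable or two DAGs). Topic names, cluster IDs, and paths are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrContainerOperator

# Illustrative micro-batch DAG: one EMR on EKS Spark job per Kafka topic, every 15 minutes.
with DAG(
    dag_id="clinical_microbatch_etl",
    start_date=datetime(2025, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=2)},
) as dag:
    for topic in ["lab-results", "vital-signs", "medication-orders"]:
        EmrContainerOperator(
            task_id=f"process_{topic.replace('-', '_')}",
            name=f"{topic}-microbatch",
            virtual_cluster_id="<virtual-cluster-id>",
            execution_role_arn="arn:aws:iam::<account>:role/emr-clinical-job-role",
            release_label="emr-6.15.0-latest",
            job_driver={
                "sparkSubmitJobDriver": {
                    "entryPoint": "s3://medcore-artifacts/jobs/microbatch_etl.py",
                    "entryPointArguments": ["--topic", topic],
                }
            },
        )
```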
Spark Job Workflow:
Step 1: Kafka Consumer (Structured Streaming)
· Read from the lab results Kafka topic in 5-minute micro-batches
· Checkpointing in S3 for exactly-once processing
· Input: HL7 messages as JSON
Step 2: Data Validation
· Schema conformance: required fields, correct data types
· Referential integrity: patient ID exists in master index
· Business rules: timestamps within 24 hours
· Invalid records routed to dead-letter S3 bucket
Step 3: Data Transformation (Spark SQL)
· Flatten nested HL7 JSON: extract demographics, test codes, results
· Join with reference data: enrich with lab test descriptions
· Calculate derived fields: flag critical values (e.g., potassium > 6.0)
· Deduplication: remove duplicates by patient ID + test + timestamp
Step 4: Write to Data Lake (Partitioned Parquet)
· Destination: s3://medcore-datalake/curated/lab_results/
· Partitioning: year=2025/month=12/day=30/hour=14/
· Format: Parquet with Snappy compression (10:1 ratio)
· Auto-register partitions in AWS Glue Data Catalog
Step 5: Update Operational Database (Aurora)
· Insert summary stats: critical results per facility
· Update dashboard cache for sub-second queries
· JDBC batch writes: 1000 records per transaction
Resource Allocation:
· Spark driver: 1 pod, 4 vCPU, 16GB memory
· Spark executors: 10-50 pods (auto-scaled), 4 vCPU, 16GB each
· Processing time: 5-8 minutes for 500K results (vs. 2 hours legacy)
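A condensed PySpark sketch of Steps 1–4 above follows. It assumes a Kafka topic named lab-results, an HL7-derived JSON payload with the fields referenced in the validation rules, and the potassium critical-value threshold from the text; broker addresses, field names, and paths are illustrative.

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("lab-results-microbatch").getOrCreate()

# Expected shape of the HL7-derived JSON payload (illustrative subset of fields).
schema = T.StructType([
    T.StructField("patient_id", T.StringType()),
    T.StructField("test_code", T.StringType()),
    T.StructField("result_value", T.DoubleType()),
    T.StructField("collected_at", T.TimestampType()),
])

# Step 1: read the Kafka topic as a micro-batch stream with S3 checkpointing.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "<msk-bootstrap-brokers>")
       .option("subscribe", "lab-results")
       .load()
       .select(F.from_json(F.col("value").cast("string"), schema).alias("m"))
       .select("m.*"))

# Steps 2-3: validate, flag critical values, and deduplicate within a 24-hour watermark.
valid = (raw
         .filter(F.col("patient_id").isNotNull() & F.col("result_value").isNotNull())
         .filter(F.col("collected_at") >= F.current_timestamp() - F.expr("INTERVAL 24 HOURS"))
         .withColumn("critical_flag",
                     (F.col("test_code") == "K") & (F.col("result_value") > 6.0))
         .withWatermark("collected_at", "24 hours")
         .dropDuplicates(["patient_id", "test_code", "collected_at"]))

# Step 4: write partitioned Parquet to the curated zone every 5 minutes.
query = (valid
         .withColumn("year", F.year("collected_at"))
         .withColumn("month", F.month("collected_at"))
         .withColumn("day", F.dayofmonth("collected_at"))
         .withColumn("hour", F.hour("collected_at"))
         .writeStream.format("parquet")
         .option("path", "s3://medcore-datalake/curated/lab_results/")
         .option("checkpointLocation", "s3://medcore-datalake/checkpoints/lab_results/")
         .partitionBy("year", "month", "day", "hour")
         .trigger(processingTime="5 minutes")
         .start())
query.awaitTermination()
```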
Pattern 2: Large-Scale Feature Engineering
Use Case: Calculate patient risk scores requiring multi-year time-series aggregation.
Computation Characteristics:
· Data volume: 50M+ events annually, 10 years of history (500M+ events)
· Complexity: Time-series rolling averages, trend detection, multi-table joins
· Output: Patient-level feature table with 500+ features
Spark Processing Steps:
Step 1: Read Historical Data
· Source: S3 curated zone (Parquet partitioned by year/month/day)
· Partition pruning: Read only last 2 years (80% I/O reduction)
· Distribution: Shuffle-partition by patient ID
Step 2: Time-Series Feature Engineering
Vital Signs Aggregation:
· 24-hour rolling averages for heart rate, blood pressure, temperature
· Trend detection: increasing heart rate (early sepsis indicator)
· Variability metrics (higher heart rate variability is generally associated with better outcomes)
Laboratory Trends:
· 72-hour creatinine trend (acute kidney injury detection)
· Troponin rate of change (myocardial infarction support)
· Electrolyte imbalance flagging
Step 3: Clinical Event Aggregation
· Hospital admissions past 12 months
· Emergency department visits frequency
· Surgery history and complications
· Medication adherence patterns
Step 4: Comorbidity Scoring
· Charlson Comorbidity Index calculation
· Diagnosis code aggregation (ICD-10)
· Chronic condition burden assessment
Resource Allocation:
· Spark driver: 1 pod, 8 vCPU, 32GB
· Spark executors: 100-200 pods, 8 vCPU, 32GB each
· Processing time: 45 minutes for 2M patients (vs. 8+ hours legacy)
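A trimmed sketch of these feature engineering steps is shown below, assuming curated Parquet tables partitioned by year/month/day; the S3 paths, column names, and the two example features are illustrative rather than the production feature set.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("patient-risk-features").getOrCreate()

# Step 1: read only the last 2 years of curated vital signs (partition pruning on year).
vitals = (spark.read.parquet("s3://medcore-datalake/curated/vital_signs/")
          .where(F.col("year") >= 2024))

# Step 2: 24-hour rolling average and trend per patient (time-based window in seconds).
day_window = (Window.partitionBy("patient_id")
              .orderBy(F.col("observed_at").cast("long"))
              .rangeBetween(-24 * 3600, 0))
features = (vitals
            .withColumn("hr_avg_24h", F.avg("heart_rate").over(day_window))
            .withColumn("hr_trend_24h", F.col("heart_rate") - F.col("hr_avg_24h")))

# Steps 3-4: utilization and diagnosis-burden aggregates from 12 months of admissions.
admissions = (spark.read.parquet("s3://medcore-datalake/curated/admissions/")
              .where(F.col("admit_date") >= F.add_months(F.current_date(), -12)))
utilization = (admissions.groupBy("patient_id")
               .agg(F.count("*").alias("admissions_12m"),
                    F.countDistinct("icd10_code").alias("distinct_diagnoses_12m")))

patient_features = (features.groupBy("patient_id")
                    .agg(F.max("hr_avg_24h").alias("max_hr_avg_24h"),
                         F.max("hr_trend_24h").alias("max_hr_trend_24h"))
                    .join(utilization, "patient_id", "left"))

patient_features.write.mode("overwrite").parquet(
    "s3://medcore-datalake/analytics/patient_features/")
```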
Pattern 3: Clinical Data Quality Validation
Use Case: Systematic validation of data completeness, accuracy, and timeliness across all clinical data sources.
Great Expectations Framework Integration:
Validation Suite Categories:
1. Schema Validation:
· Column presence: all required fields exist
· Data types: dates are dates, integers are integers
· Value ranges: heart rate 30-250 bpm, temperature 90-110°F
2. Referential Integrity:
· Patient ID exists in patient master index
· Provider NPI valid in credentialing database
· Facility code matches active facility list
3. Completeness Checks:
· Lab results have collection time within 24 hours
· Vital signs within 15 minutes of device transmission
· Medication orders include dosage and route
4. Business Logic Validation:
· Discharge date after admission date
· Lab results numeric values within biological plausibility
· Age calculated from birth date matches recorded age
Validation Execution:
· Frequency: Every micro-batch (5-15 minutes)
· Action on failure: Route to quarantine S3 bucket, alert data quality team
· Metrics: Publish validation pass rate to CloudWatch
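The rule categories above can be expressed as a Great Expectations suite or, as sketched below, as equivalent PySpark checks that split a micro-batch into valid and quarantined records. This is a simplified stand-in for the production suite; column names, the master index join, and the freshness threshold are illustrative.

```python
from pyspark.sql import DataFrame, functions as F

def validate_lab_batch(df: DataFrame, patient_index: DataFrame):
    """Split a lab-results micro-batch into valid and quarantined records."""
    checked = (df
        # Schema/completeness: required fields must be present.
        .withColumn("ok_required",
                    F.col("patient_id").isNotNull() & F.col("result_value").isNotNull())
        # Referential integrity: patient must exist in the master patient index.
        .join(patient_index.select("patient_id").withColumn("known_patient", F.lit(True)),
              "patient_id", "left")
        .withColumn("ok_patient", F.coalesce(F.col("known_patient"), F.lit(False)))
        # Timeliness: collection time within the last 24 hours.
        .withColumn("ok_fresh",
                    F.col("collected_at") >= F.current_timestamp() - F.expr("INTERVAL 24 HOURS"))
        .withColumn("is_valid", F.col("ok_required") & F.col("ok_patient") & F.col("ok_fresh")))

    valid = checked.filter("is_valid").drop("known_patient")
    quarantine = checked.filter(~F.col("is_valid"))
    return valid, quarantine

# valid_df, bad_df = validate_lab_batch(batch_df, master_index_df)
# bad_df.write.mode("append").parquet("s3://medcore-datalake/quarantine/lab_results/")
```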
Data Types and Storage Strategy
Clinical Data Type Classification
1. Structured Clinical Data (70% of volume)
Patient Demographics:
· Format: Relational tables (Aurora), Parquet files (S3)
· Refresh frequency: Real-time from HL7 ADT messages
· Retention: 7 years (regulatory requirement)
· Encryption: AES-256 at rest (KMS), TLS 1.2 in transit
Laboratory Results:
· Format: Parquet partitioned by date and facility
· Storage zones:
o Raw: Original HL7 JSON messages
o Curated: Normalized lab values with reference ranges
o Analytics: Aggregated trends and critical value flags
· Partitioning: year/month/day/hour for query optimization
· Compression: Snappy (10:1 ratio)
Vital Signs Time-Series:
· Format: Parquet optimized for time-series queries
· Frequency: Every 5 seconds from bedside monitors
· Volume: 10M observations daily
· Retention: 30 days hot (S3 Standard), 1 year warm (S3 IA), 7 years cold (Glacier)
Medication Orders:
· Format: Normalized relational schema in Aurora
· Real-time cache: DynamoDB for sub-100ms retrieval
· Analytics copy: Daily snapshot to S3 Parquet
2. Semi-Structured Clinical Data (25% of volume)
Clinical Notes (Unstructured Text):
· Format: JSON with NLP-extracted entities
· Storage: S3 with full-text search index in Amazon OpenSearch
· Processing: Amazon Comprehend Medical for entity extraction
· Entities extracted: Medications, diagnoses, procedures, anatomical sites
HL7 Messages:
· Format: Raw HL7 v2.x messages preserved in S3
· Parsed format: JSON in curated zone
· Retention: 7 years (audit trail requirement)
DICOM Imaging Metadata:
· Format: JSON extracted from DICOM headers
· Storage: S3 with Glue Data Catalog registration
· Linked data: References to PACS image storage URLs
3. Unstructured Clinical Data (5% of volume)
Medical Images:
· Storage: PACS (Picture Archiving and Communication System) with S3 archival
· Metadata only: DICOM headers in data lake for analytics
· Retention: 7 years regulatory minimum
Scanned Documents:
· Format: PDF with OCR text extraction
· Storage: S3 with searchable text index
· Use cases: Consent forms, insurance cards, external records
Data Transformation Techniques
Technique 1: Flattening Nested HL7 JSON Structures
Challenge: HL7 messages contain deeply nested JSON structures that are difficult to query and analyze.
Spark Transformation:
· Use the Spark SQL explode() function to flatten arrays
· Extract nested fields to top-level columns
· Create separate tables for one-to-many relationships (patient → diagnoses)
Output Schema:
Flat patient admission table with separate diagnosis table linked by admission ID.
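A minimal sketch of this flattening, assuming a parsed HL7 JSON source with a nested diagnoses array; the field names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hl7-flatten").getOrCreate()

# One row per parsed HL7 message, with nested structs and a diagnoses array.
admissions_raw = spark.read.json("s3://medcore-datalake/raw/admissions/")

admissions = admissions_raw.select(
    F.col("admission.admission_id").alias("admission_id"),
    F.col("patient.patient_id").alias("patient_id"),
    F.col("admission.admit_time").alias("admit_time"),
    F.col("diagnoses"),  # array of structs (one-to-many relationship)
)

# Flat admission table: nested array dropped.
admission_flat = admissions.drop("diagnoses")

# Separate diagnosis table: explode() turns each array element into its own row.
diagnosis = (admissions
             .select("admission_id", F.explode("diagnoses").alias("dx"))
             .select("admission_id",
                     F.col("dx.icd10_code").alias("icd10_code"),
                     F.col("dx.description").alias("description")))
```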
Technique 2: Slowly Changing Dimension (SCD) Type 2 for Patient Demographics
Challenge: Patient information changes over time (address, phone, insurance) but analytics require historical accuracy.
Implementation:
· Effective date and end date columns track record validity period
· Current indicator flag identifies active record
· Historical records preserved for longitudinal analysis
Use Case: Analyze readmission rates by patient ZIP code at time of original admission, not current address.
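A simplified sketch of the SCD Type 2 merge for demographics, assuming a daily batch of changed records carrying the same attribute columns as the dimension; table paths and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-patient-dim").getOrCreate()

dim = spark.read.parquet("s3://medcore-datalake/analytics/dim_patient/")        # existing SCD2 table
changes = spark.read.parquet("s3://medcore-datalake/curated/patient_updates/")  # today's changed records
today = F.current_date()

changed_ids = changes.select("patient_id").distinct()

# Close out the currently active record for each changed patient.
closed = (dim.join(changed_ids, "patient_id", "inner")
             .filter("is_current")
             .withColumn("end_date", today)
             .withColumn("is_current", F.lit(False)))

# Keep untouched patients and the older history of changed patients as-is.
untouched = dim.join(changed_ids, "patient_id", "left_anti")
prior_history = dim.join(changed_ids, "patient_id", "inner").filter(~F.col("is_current"))

# Open a new current record for each changed patient.
new_versions = (changes
                .withColumn("effective_date", today)
                .withColumn("end_date", F.lit(None).cast("date"))
                .withColumn("is_current", F.lit(True)))

# Write to a new location rather than overwriting the table being read.
dim_updated = (untouched.unionByName(prior_history)
                        .unionByName(closed)
                        .unionByName(new_versions))
dim_updated.write.mode("overwrite").parquet(
    "s3://medcore-datalake/analytics/dim_patient_new/")
```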
Technique 3: Time-Series Aggregation and Windowing
Challenge: Vital signs generate millions of observations requiring summarization for analysis.
Spark Window Functions:
· Rolling averages: 1-hour, 6-hour, 24-hour windows
· Trend detection: Compare current value to 6-hour baseline
· Anomaly detection: Standard deviation > 2 triggers alert
Performance Optimization:
· Pre-aggregate at 5-minute intervals (12x data reduction)
· Partition by patient ID for co-located processing
· Cache intermediate results for multi-pass analytics
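A sketch of the pre-aggregation and baseline comparison described above; the 2-standard-deviation threshold follows the text, while paths, column names, and the 6-hour baseline window are illustrative.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("vitals-windowing").getOrCreate()

# patient_id, observed_at (timestamp), heart_rate from the curated zone.
vitals = spark.read.parquet("s3://medcore-datalake/curated/vital_signs/")

# Pre-aggregate raw observations into 5-minute buckets to shrink the working set.
five_min = (vitals
    .groupBy("patient_id", F.window("observed_at", "5 minutes").alias("w"))
    .agg(F.avg("heart_rate").alias("hr_mean"))
    .withColumn("bucket_ts", F.col("w.start"))
    .drop("w"))

# 6-hour rolling baseline per patient (time-based window in seconds, excluding the current bucket).
six_hours = 6 * 3600
baseline = (Window.partitionBy("patient_id")
            .orderBy(F.col("bucket_ts").cast("long"))
            .rangeBetween(-six_hours, -1))

anomalies = (five_min
    .withColumn("hr_baseline", F.avg("hr_mean").over(baseline))
    .withColumn("hr_std", F.stddev("hr_mean").over(baseline))
    # Flag values more than 2 standard deviations above the 6-hour baseline.
    .withColumn("hr_anomaly",
                F.col("hr_mean") > F.col("hr_baseline") + 2 * F.col("hr_std")))
```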
Technique 4: Deduplication and Record Linkage
Challenge: Multiple systems generate duplicate patient records with slight variations.
Fuzzy Matching Approach:
· Exact match: Patient ID across systems
· Fuzzy match: Name + date of birth + gender (Levenshtein distance)
· Probabilistic linkage: Weighted scoring of matching attributes
Spark Implementation:
· Group by matching attributes
· Calculate match probability score
· Assign master patient ID to linked records
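A simplified fuzzy-matching sketch using Spark's built-in levenshtein function; the blocking key (date of birth + gender), the score scaling, and the 0.8 threshold are illustrative, not the production linkage model.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("patient-record-linkage").getOrCreate()

# source_id, first_name, last_name, birth_date, gender from each source system.
patients = spark.read.parquet("s3://medcore-datalake/curated/patient_candidates/")

a = patients.alias("a")
b = patients.alias("b")

# Block candidate pairs on exact DOB + gender, then fuzzy-match on full name.
pairs = (a.join(b,
                (F.col("a.birth_date") == F.col("b.birth_date")) &
                (F.col("a.gender") == F.col("b.gender")) &
                (F.col("a.source_id") < F.col("b.source_id")))
         .withColumn("name_dist",
                     F.levenshtein(F.concat_ws(" ", F.col("a.first_name"), F.col("a.last_name")),
                                   F.concat_ws(" ", F.col("b.first_name"), F.col("b.last_name"))))
         # Simple weighted score: exact DOB/gender from the join, name distance scaled to 0-1.
         .withColumn("match_score", 1.0 - F.least(F.col("name_dist") / 10.0, F.lit(1.0)))
         .filter(F.col("match_score") >= 0.8)
         .select(F.col("a.source_id").alias("record_a"),
                 F.col("b.source_id").alias("record_b"),
                 "match_score"))
# Linked pairs would then be clustered and assigned a master patient ID.
```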
EMR Spark Job Runtime and Monitoring
Real-Time Micro-Batches (Every 5-15 minutes):
· Laboratory results processing: Every 5 minutes
· Vital signs aggregation: Every 10 minutes
· Medication order validation: Every 5 minutes
· Admission/discharge/transfer events: Every 2 minutes
Intraday Analytics Jobs:
· Hospital bed capacity dashboard refresh: Every 15 minutes
· Emergency department wait time predictions: Every 30 minutes
· Sepsis risk score recalculation: Every hour
Daily Batch Jobs:
· Full patient risk score recalculation: 2 AM daily
· Financial reporting aggregations: 3 AM daily
· Quality metrics calculation: 4 AM daily
Weekly Jobs:
· Historical trend analysis: Sunday 1 AM
· Machine learning model retraining: Saturday 2 AM
EMR Spark UI Access:
· Persistent Spark History Server running in EKS cluster
· Web UI accessible via Application Load Balancer with authentication
· Job logs retained for 30 days in S3
Key Metrics Monitored:
1. Job-Level Metrics:
· Total job duration: Target < 10 minutes for micro-batches
· Stage execution time: Identify bottleneck stages
· Task distribution: Detect data skew causing stragglers
· Memory usage: Prevent executor OOM errors
2. Stage-Level Metrics:
· Shuffle read/write volume: Optimize join strategies
· Input/output records: Verify expected data volumes
· GC time: Tune JVM settings if excessive
3. Executor Metrics:
· CPU utilization: Ensure efficient resource use
· Memory consumption: Adjust executor memory allocation
· Disk spill: Optimize Spark configurations to reduce I/O
4. Data Quality Metrics:
· Records processed vs. expected
· Validation failure rate (target < 0.1%)
· Data freshness: Time from source to availability
Custom CloudWatch Metrics Published from Spark Jobs:
· clinical.lab_results.processed - Count of lab results per batch
· clinical.data_freshness.minutes - Age of most recent data
· clinical.validation_failures.count - Data quality issues detected
· clinical.sepsis_alerts.generated - Patient alerts triggered
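A sketch of publishing these custom metrics from the Spark driver via boto3; the CloudWatch namespace and the facility dimension are illustrative.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

def publish_batch_metrics(records_processed: int, freshness_minutes: float,
                          validation_failures: int, facility: str) -> None:
    """Push per-micro-batch pipeline metrics to CloudWatch from the Spark driver."""
    cloudwatch.put_metric_data(
        Namespace="MedCore/ClinicalPipeline",
        MetricData=[
            {"MetricName": "clinical.lab_results.processed", "Value": records_processed,
             "Unit": "Count", "Dimensions": [{"Name": "Facility", "Value": facility}]},
            {"MetricName": "clinical.data_freshness.minutes", "Value": freshness_minutes,
             "Unit": "None"},
            {"MetricName": "clinical.validation_failures.count", "Value": validation_failures,
             "Unit": "Count"},
        ],
    )

# publish_batch_metrics(512_340, 7.5, 42, facility="hospital-03")
```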
CloudWatch Alarms:
· Job duration exceeds 15 minutes → Page on-call engineer
· Validation failure rate > 1% → Alert data quality team
· Kafka consumer lag > 10,000 messages → Scale executor pods
· Data freshness > 30 minutes → Escalate to operations
CloudWatch Dashboards:
· Clinical Data Pipeline Health: Job success rates, processing latency, data volumes
· Spark Resource Utilization: CPU, memory, executor counts across jobs
· Data Quality Trends: Validation pass rates, error patterns over time
· Business Metrics: Sepsis alerts per hour, readmission predictions, bed capacity
S3 Intelligent-Tiering and Lifecycle Policies:
Raw Zone (Bronze):
· Retention: 7 days in S3 Standard (immediate queryability)
· Transition: After 7 days → S3 Standard-IA (infrequent access)
· Archive: After 90 days → Glacier Flexible Retrieval
· Deletion: After 7 years (regulatory compliance)
Curated Zone (Silver):
· Retention: 90 days S3 Standard (active analytics)
· Transition: After 90 days → S3 Standard-IA
· Archive: After 1 year → Glacier Deep Archive
· Deletion: After 10 years
Analytics Zone (Gold):
· Retention: Indefinite in S3 Standard (business intelligence)
· Optimization: Regularly compacted to reduce small files
· Backup: Cross-region replication to ap-south-2
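A boto3 sketch of the raw-zone lifecycle rule above; the bucket name, prefix, and rule ID are illustrative. Note that S3 lifecycle rules require objects to be at least 30 days old before transitioning to Standard-IA, so the sketch uses 30 days (the 7-day figure above would instead be served by S3 Intelligent-Tiering).

```python
import boto3

s3 = boto3.client("s3")

# Raw (bronze) zone: Standard -> Standard-IA -> Glacier -> delete after 7 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="medcore-datalake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-zone-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # 30-day minimum for IA
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 7 * 365},  # regulatory retention limit
            }
        ]
    },
)
```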
Data Deletion and Right to Be Forgotten
Patient Data Deletion Requirements:
When patient requests data deletion (right to be forgotten):
1. Identification Phase:
· Query AWS Glue Data Catalog for all tables containing patient ID
· Use Athena SQL to identify all partitions with patient records
· Generate manifest file of all S3 objects requiring deletion
2. Deletion Phase:
· Spark job reads manifest and deletes specific patient records
· Rewrite Parquet files excluding deleted patient data
· Update AWS Glue partition metadata
· Delete the original Parquet files containing the patient's records
3. Verification Phase:
· Athena query confirms patient ID no longer appears in any table
· Document deletion timestamp and user authorization in audit log
· Generate deletion certificate for compliance records
Deletion Timeline: Complete within 30 days of request per requirements
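A condensed sketch of the deletion phase: rewrite affected Parquet partitions without the patient's rows, then swap them for the originals. Paths and column names are illustrative, and the manifest generation, Glue metadata refresh, and Athena verification steps are summarized in comments.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("patient-erasure").getOrCreate()

PATIENT_ID = "<patient-id-from-request>"
AFFECTED_PARTITIONS = [
    # Produced by the identification phase (Athena query over the Glue Data Catalog).
    "s3://medcore-datalake/curated/lab_results/year=2024/month=06/day=12/",
]

for path in AFFECTED_PARTITIONS:
    df = spark.read.parquet(path)
    remaining = df.filter(F.col("patient_id") != PATIENT_ID)

    # Write the rewritten partition to a staging prefix rather than overwriting
    # the location being read; the originals are then deleted and the staged
    # files moved into place (e.g., via S3 copy/delete).
    staged = path.replace("/curated/", "/curated_rewrite/")
    remaining.write.mode("overwrite").parquet(staged)

    # After the swap, Glue partition metadata is refreshed and an Athena query
    # verifies the patient ID no longer appears, completing the audit record.
```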
AWS Glue Data Catalog Lineage Tracking:
Automated Lineage Capture:
· Spark jobs register input/output tables in Glue catalog
· Transformation logic documented in table descriptions
· Job execution history links source data to derived tables
Lineage Visualization:
Example lineage path for readmission risk score:
1. Source: electronic medical record (EMR) system HL7 messages → Kafka admissions topic
2. Ingestion: Kafka consumer → S3 raw zone raw/admissions/
3. Validation: Great Expectations → S3 curated zone curated/admissions/
4. Feature engineering: Spark job → S3 analytics zone analytics/patient_features/
5. Dashboard: QuickSight query → Clinical dashboard
Data Dictionary:
· AWS Glue Data Catalog stores column descriptions, data types, business definitions
· Searchable by clinical users via Glue console or Athena queries
· Version controlled: Schema evolution tracked with timestamps
DevOps Infrastructure Supporting Data Analytics
GitLab CI/CD Pipeline for Data Jobs
Pipeline Stages:
1. Code Quality (Lint & Test):
· Python PySpark code: Flake8 linting, type checking with mypy
· SQL queries: SQLFluff linting for consistent style
· Unit tests: pytest with coverage > 80% requirement
· Data validation tests: Great Expectations expectation suites
2. Container Build:
· Build Docker image containing Spark application code
· Multi-stage build: compile dependencies, minimize image size
· Base image: AWS EMR runtime image with healthcare libraries
3. Security Scanning:
· Trivy scan for OS vulnerabilities in container image
· Snyk scan for Python dependency vulnerabilities
· Fail build if CRITICAL vulnerabilities detected
4. Image Publishing:
· Push versioned image to Amazon ECR
· Tag with Git commit SHA for traceability
· ECR lifecycle policy retains 30 most recent images
5. Deployment (GitOps with ArgoCD):
· Update Helm chart values with new image tag
· Commit to Git repository
· ArgoCD automatically syncs to EKS cluster
· Spark job configuration updated without manual intervention
Deployment Frequency:
· Development environment: 10+ deployments per day
· Staging environment: 2-3 deployments per day after automated testing
· Production environment: 1-2 deployments per week with change approval
Infrastructure as Code (Terraform)
Terraform Modules for Data Platform:
Module: emr-on-eks
· Provisions EKS cluster with EMR-optimized node groups
· Configures IAM roles for service accounts (IRSA)
· Sets up EMR virtual cluster and job execution roles
· Enables CloudWatch logging and Container Insights
Module: msk-kafka
· Creates Amazon MSK cluster with 3 AZ replication
· Configures broker storage and retention policies
· Enables encryption in transit (TLS) and at rest (KMS)
· Sets up VPC security groups for producer/consumer access
Module: s3-datalake
· Creates S3 buckets for raw/curated/analytics zones
· Configures bucket encryption with KMS customer-managed keys
· Sets lifecycle policies for data tiering and deletion
· Enables S3 access logging for audit trail
Module: glue-catalog
· Creates AWS Glue databases for each data zone
· Configures Glue crawlers to auto-discover schemas
· Sets up Glue ETL jobs for batch transformations
· Integrates with Lake Formation for access control
Terraform State Management:
· State stored in S3 backend with versioning enabled
· State locking with DynamoDB to prevent concurrent modifications
· Encrypted at rest with KMS
GitOps Repository Structure:
Helm charts directory contains:
· Chart.yaml: Application metadata and dependencies
· values-dev.yaml: Development environment configuration
· values-staging.yaml: Staging environment configuration
· values-prod.yaml: Production environment with strict resource limits
ArgoCD Application Sync:
· Auto-sync enabled for development namespace
· Manual approval required for production deployments
· Health checks: Kubernetes pod readiness probes
· Rollback: Automated on failed health checks within 5 minutes
Audit Trail:
· Every deployment logged as Git commit
· Commit message includes: author, timestamp, change description
CloudWatch Logs Aggregation:
· Fluent Bit DaemonSet collects logs from all EKS pods
· Spark driver and executor logs streamed to CloudWatch
· Airflow task logs centralized for troubleshooting
· Log retention: 90 days (compliance requirement)
CloudWatch Container Insights:
· CPU and memory utilization metrics per pod
· Network throughput and error rates
· Disk I/O metrics for Spark shuffle operations
· Auto-scaling triggers based on resource thresholds
AWS X-Ray Distributed Tracing:
· Trace patient data requests end-to-end
· Identify performance bottlenecks in data pipeline
· Latency breakdown: Kafka read → Spark transformation → database write
· Error correlation across microservices
VPC Flow Logs:
· Network traffic monitoring for security analysis
· Troubleshoot connectivity issues between services
· Compliance requirement: 90-day retention for HIPAA audit
Technical Safeguards Implementation
Encryption:
· At rest: AES-256 encryption with AWS KMS customer-managed keys
· In transit: TLS 1.2+ for all data movement
· Key management: Automatic key rotation every 90 days
· Enforcement: Bucket policies deny unencrypted uploads
Access Control:
· IAM policies: Least-privilege role-based access (RBAC)
· ABAC (Attribute-Based Access Control): Tag-based fine-grained controls
· Database-level: Column-level encryption for sensitive PII (SSN, MRN)
· Audit: AWS CloudTrail logs all API calls with 90-day retention
Audit Logging:
· CloudTrail: All AWS API calls logged to S3 with integrity validation
· Application logs: EKS pods send logs to CloudWatch with encryption
· Database audit: Aurora enhanced monitoring tracks query execution
· Immutable logs: S3 Object Lock prevents deletion of audit logs
Workforce Security:
· Authentication: MFA required for all console/programmatic access
· Authorization: Role-based access control with periodic reviews
· Termination: Automated IAM credential revocation on employee separation
Security Awareness Training:
· Mandatory annual training for all staff
· Role-specific training for developers and operations teams
· Incident response drills quarterly
Incident Response Plan:
· Detection: Automated CloudWatch alarms for suspicious activity
· Investigation: CloudTrail logs analyzed within 1 hour of alert
· Containment: Automatic security group rule updates to isolate compromised resources
· Reporting: Breach notification within 60 days per requirements
Spot Instance Integration:
· Spark executors run on Spot instances (70–80% discount)
· Driver pods on on-demand instances for reliability
· Spot fleet maintains 99.9% availability with fallback to on-demand
Workload-Based Scaling:
· Baseline: 2 on-demand nodes (shared services)
· Business hours: 10-50 executors (real-time micro-batches)
· Surge events: 100-200 executors (auto-scaling in 5 minutes)
· Off-hours: 0 Spark executors (schedule scale-down)
Cost Example - Monthly Savings:
· Spot instances: $0.05/hour vs. $0.20/hour on-demand (75% savings)
· 24/7 baseline: 2 on-demand nodes ≈ $96/day
· 12-hour peak: 40 average Spot executors × $0.05/hour × 12 hours ≈ $24/day
· Daily compute cost: ≈ $120 with Spot vs. ≈ $480 all on-demand
· Monthly savings: ≈ $10,800 (45% reduction)
Intelligent-Tiering Lifecycle:
· Raw zone: 7 days standard → 90 days IA → 7 years glacier
· Curated zone: 90 days standard → 1 year IA → 10 years glacier
· Analytics zone: Indefinite standard (frequently queried)
Storage Cost Reduction:
· Raw zone ingest: 1 TB/day × 365 days = 365 TB annually
o S3 Standard: $0.023 per GB-month ≈ $23 per TB-month
o S3 Standard-IA: $0.0125 per GB-month ≈ $12.50 per TB-month
o Glacier Flexible Retrieval: $0.004 per GB-month ≈ $4 per TB-month
o Net effect: as data ages out of S3 Standard, its storage cost falls from ≈ $23 to ≈ $4 per TB-month, roughly an 80% reduction versus keeping all 365 TB in Standard
Risk Mitigation and Disaster Recovery
Disaster Recovery Architecture
RTO (Recovery Time Objective): 4 hours
RPO (Recovery Point Objective): 15 minutes
Multi-Region Replication:
· Primary: ap-south-1 (active)
· Standby: ap-south-2 (passive with failover capability)
· S3 cross-region replication (15-minute lag)
· Aurora Global Database for cross-region database replication
Failover Procedure:
· Automated health checks detect ap-south-1 outage
· Update Route 53 to route traffic to ap-south-2
· Promote standby EKS cluster to active
· Re-sync Kafka topics from backup
· Total failover time: < 4 hours
S3 Bucket Versioning:
· Enabled on all production buckets
· 30-day retention for previous versions
· Protects against accidental deletion or encryption attacks
Aurora Automated Backups:
· 7-day retention for point-in-time recovery
· Cross-region backup copies
· Tested monthly for restore capability
Immutable Backups:
· S3 Object Lock (governance mode) prevents deletion
· Minimum 90-day retention for regulatory compliance
· Protects against ransomware attacks
Lessons Learned and Best Practices
1. Event-Driven > Batch Processing: Real-time micro-batches (5-15 min) beat nightly batches (6-8 hrs) for clinical decision-making
2. Containerization Critical: EMR on EKS simplifies operations vs. managing separate EMR clusters
3. Data Quality First: Automated validation at every transformation stage prevents cascading downstream errors
4. GitOps Discipline: Infrastructure and configuration as code enables rapid iteration without manual drift
5. Observability Built-In: Comprehensive monitoring (CloudWatch, X-Ray, custom metrics) detects issues within minutes vs. hours
The real-time clinical data analytics platform transformed MedCore Health Network from a fragmented, batch-driven architecture to a modern, event-driven data platform. By leveraging AWS EMR on EKS, Amazon MSK, and cloud-native DevOps practices, the organization achieved:
· Real-time clinical insights enabling immediate patient care interventions
· 99.99% platform availability for mission-critical healthcare operations
· 45% infrastructure cost reduction through intelligent cloud resource optimization
· 12% hospital readmission reduction improving patient outcomes and revenue
The case study demonstrates that healthcare organizations can modernize their data infrastructure while maintaining strict regulatory compliance and achieving significant operational and financial benefits.