· The customer is a large omni-channel e-commerce platform operating across web, mobile, and physical retail stores, processing millions of daily transactions and billions of behavioral events across multiple regions.
· Analytics plays a mission-critical role in revenue reporting, demand forecasting, fraud detection, inventory optimization, and executive decision-making, especially during peak sales and seasonal campaigns.
· The existing ETL platform relied on single-AZ Amazon EMR clusters with batch-oriented processing, exposing the business to availability risks, unpredictable recovery behavior, and rising operational costs as data volume and complexity increased.
· As an AWS Partner, we designed a multi-AZ–aware, resilient ETL architecture using AWS-native services, introducing controlled execution across Availability Zones, centralized orchestration, and restart-safe data processing.
· The solution enabled near-zero data loss, predictable recovery times, and continuous analytics availability while improving processing efficiency and reducing operational overhead.
· Industry: Retail and E-Commerce
· Business Overview:
The customer is a large omni-channel e-commerce platform operating across online channels (web and mobile applications) and offline channels (physical retail stores and POS systems). The platform serves customers across multiple geographic regions and supports high-volume, high-velocity transactional and behavioral workloads.
· Operational Scale:
The customer processes millions of transactions per day and ingests billions of user interaction events, including clickstream, search, cart, and purchase activity. This data is central to business operations such as sales analytics, demand forecasting, fraud detection, inventory planning, and executive reporting.
· Technology Landscape:
The customer’s core transactional systems are built on AWS-managed services, with Amazon Aurora and Amazon DynamoDB supporting OLTP workloads, Amazon S3 serving as the central data lake, and Amazon Redshift acting as the primary analytics warehouse.
· Mission-critical dependency on ETL for business operations:
ETL pipelines supported revenue reporting, inventory planning, fraud detection, and executive dashboards. Any delay or failure in data processing directly impacted business visibility, decision-making, and downstream systems.
· Single-AZ dependency for EMR-based ETL workloads:
Amazon EMR clusters were provisioned in a single Availability Zone. Any AZ-level degradation, such as capacity constraints, networking issues, or service impairment, posed a direct risk to ETL continuity, with no deterministic recovery mechanism.
· Lack of predictable recovery and failover behavior:
While no major outage had occurred, the existing design had no controlled failover strategy. Recovery depended on manual intervention, cluster recreation, and job restarts, resulting in unpredictable recovery times.
· Growing memory pressure and Spark instability:
Increasing data volumes from clickstream, POS, and transactional joins caused frequent Spark executor out-of-memory (OOM) errors. Skewed joins led to uneven resource utilization, extended runtimes, and repeated job retries.
· Tight coupling between jobs and long-lived clusters:
EMR clusters were tightly coupled to ETL execution. Failures often required manual debugging and cluster restarts, increasing operational overhead and delaying downstream processing.
· Improper data handling and reprocessing overhead:
The pipelines lacked strong guarantees around idempotency, checkpointing, and intermediate persistence. Partial job failures frequently triggered full reprocessing of large datasets, increasing compute cost and processing latency.
· Inconsistent data correctness and reconciliation effort:
Duplicate records and partial writes occasionally appeared in analytical datasets due to failed or retried jobs. Manual validation and reconciliation were required to ensure data accuracy, especially during peak sales events.
· Inefficient orchestration and dependency management:
ETL execution relied on custom scripts and CRON-based scheduling. There was no centralized orchestration layer to manage dependencies, retries, or failure isolation between ingestion, transformation, and aggregation stages.
· Limited visibility into pipeline health:
The existing setup provided no unified view of ETL execution status, failure points, or recovery progress. Diagnosing issues required manual log inspection across clusters and services.
· Escalating infrastructure cost and inefficiency:
EMR clusters often ran longer than required due to retries and reprocessing. Scaling decisions were reactive rather than workload-aware, leading to unpredictable runtimes and inflated costs during peak traffic periods.
The customer selected AWS as the cloud platform for their analytics modernization because most of their application infrastructure and operational databases were already hosted on AWS, enabling seamless integration without introducing additional platforms or operational complexity. AWS provided a mature set of managed and serverless services for data ingestion, transformation, orchestration, and analytics, allowing the customer to scale processing dynamically with data volume while avoiding always-on infrastructure. Native services such as event-driven orchestration, managed Spark for complex transformations, and a fully managed analytics warehouse enabled the customer to implement a secure, resilient, and cost-optimized data platform that a small engineering team could operate efficiently.
The customer chose to engage the partner, Ancrew Global Services, based on an existing relationship in which the partner was already managing the customer’s application infrastructure and DevOps operations on AWS. Through ongoing collaboration, the partner gained deep visibility into the customer’s data flows, operational constraints, and growth challenges, which led to early identification of limitations in the existing analytics approach. This established trust and hands-on understanding of the environment positioned the partner to effectively assess the problem, conduct data discovery, and design a scalable analytics solution aligned with the customer’s technical and business goals.

Before redesigning the platform, we conducted a structured assessment of the existing ETL environment.
· Operational analysis
o Reviewed historical ETL failures and Spark job metrics
o Analyzed EMR cluster runtimes, retry behavior, and recovery patterns
o Identified stages triggering repeated full dataset reprocessing
· Risk identification
o Single-AZ dependency for mission-critical ETL workloads
o Frequent executor out-of-memory (OOM) failures
o Skewed joins causing uneven resource utilization
o Lack of deterministic recovery during infrastructure degradation
Based on these findings, the architecture was redesigned to separate execution control, data persistence, and processing logic, ensuring restart safety and predictable recovery.
To remove Availability Zone dependency and introduce deterministic failover:
· Multi-AZ EMR deployment
o Identical Amazon EMR clusters provisioned in separate Availability Zones
o Each cluster capable of executing the full ETL workload independently
· Execution control with AWS Application Recovery Controller (ARC)
o ARC routing controls enforced an active/passive execution model
o Guaranteed that only one EMR cluster could process ETL jobs at any time
o Prevented split-brain and concurrent processing scenarios
· Deterministic failover
o Failover decisions driven by orchestration logic
o No reliance on ad-hoc scripts or manual intervention
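The active/passive model above can be sketched with the ARC routing control APIs. The control ARNs, the AZ-to-control mapping, and the function names are illustrative assumptions, not the customer's actual configuration:

```python
# Sketch: enforcing active/passive EMR execution via ARC routing controls.
# Exactly one control is "On" at a time, which prevents split-brain
# (concurrent) processing. ARNs and AZ names are hypothetical placeholders.

# Hypothetical mapping of Availability Zone -> ARC routing control ARN.
ROUTING_CONTROLS = {
    "us-east-1a": "arn:aws:route53-recovery-control::123456789012:controlpanel/abc/routingcontrol/az1",
    "us-east-1b": "arn:aws:route53-recovery-control::123456789012:controlpanel/abc/routingcontrol/az2",
}

def desired_control_states(active_az: str) -> dict:
    """Return the On/Off state each routing control should hold so that
    exactly one AZ is active at any time."""
    if active_az not in ROUTING_CONTROLS:
        raise ValueError(f"Unknown AZ: {active_az}")
    return {
        arn: ("On" if az == active_az else "Off")
        for az, arn in ROUTING_CONTROLS.items()
    }

def apply_failover(active_az: str) -> None:
    """Push the desired states to the ARC data plane. In practice this
    client must target a regional ARC cluster endpoint; shown here only
    for illustration."""
    import boto3  # local import keeps the pure helper testable offline
    client = boto3.client("route53-recovery-cluster")
    for arn, state in desired_control_states(active_az).items():
        client.update_routing_control_state(
            RoutingControlArn=arn, RoutingControlState=state
        )
```

Because the routing control state is authoritative and atomic, orchestration can always derive which cluster is allowed to run, rather than relying on ad-hoc coordination.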
AWS Step Functions served as the control plane for the complete ETL lifecycle.
· Orchestration responsibilities
o Validate ingestion readiness
o Query ARC to identify the active Availability Zone
o Submit Spark jobs to the active EMR cluster
o Monitor execution status and job health
· Failure handling
o Stage-level retries with controlled backoff
o Explicit success, retry, and failure states per ETL stage
o Automated transition to failover states when required
This provided end-to-end visibility, controlled progression, and predictable execution.
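The stage-level retry and failover pattern can be expressed as an Amazon States Language fragment, built here as a Python dict. The state names, the spark-submit arguments, and the cluster-ID input path are illustrative assumptions; the EMR `addStep.sync` integration is a native Step Functions service integration:

```python
# Sketch of one ETL stage in the Step Functions state machine: the Task
# submits a Spark step to the active EMR cluster, retries with exponential
# backoff, then hands control to a failover state on repeated failure.
# State names and job paths are hypothetical placeholders.
TRANSFORM_STAGE = {
    "TransformStage": {
        "Type": "Task",
        # Native Step Functions -> EMR integration; ".sync" waits for the step.
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            "ClusterId.$": "$.activeClusterId",  # resolved from ARC state earlier
            "Step": {
                "Name": "transform",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://example-etl/jobs/transform.py"],
                },
            },
        },
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
            "IntervalSeconds": 60,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        "Catch": [{
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "TriggerFailover",  # flips ARC routing, then re-runs the stage
        }],
        "Next": "AggregateStage",
    }
}
```

Modeling retries and the failover transition declaratively, rather than in job code, is what makes execution state visible and auditable per stage.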
The ingestion layer consolidated batch and streaming data sources into a single landing pattern.
· Real-time ingestion
o Web and mobile clickstream events
o POS transaction streams
o Ingested via Amazon Kinesis Data Firehose
· Batch ingestion
o Transactional data from Amazon Aurora
o Customer and reference data from Amazon DynamoDB
o Ingested via AWS Glue ingestion jobs
· Raw data landing
o All sources landed into Amazon S3 Raw Zone
o Data stored immutably with:
§ Source identifiers
§ Ingestion timestamps
§ Partitioning by date and channel
This normalization simplified downstream processing and recovery.
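The landing pattern implies a deterministic key layout. A minimal sketch, assuming hypothetical bucket and field names, following the partitioning described above (source identifier, ingestion timestamp, date and channel):

```python
# Sketch: deterministic S3 Raw Zone key layout. Bucket and field names
# are illustrative; the pattern encodes source, ingestion time, and
# date/channel partitions so downstream jobs and recovery logic can
# locate data without scanning.
from datetime import datetime, timezone

RAW_BUCKET = "example-datalake-raw"  # hypothetical bucket

def raw_zone_key(source: str, channel: str, ingested_at: datetime,
                 object_id: str) -> str:
    """Build an immutable Raw Zone object key partitioned by date and channel."""
    d = ingested_at.astimezone(timezone.utc)
    return (
        f"raw/source={source}"
        f"/dt={d:%Y-%m-%d}"
        f"/channel={channel}"
        f"/{d:%H%M%S}-{object_id}.json"
    )
```

Because keys are deterministic and objects are never rewritten, a restarted job can re-list exactly the partitions it needs.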
ETL processing was implemented as restart-safe, layered Spark jobs running on the active EMR cluster.
· Data quality checks and filtering of malformed records
· Deduplication of events and transactions
· Schema normalization across online and offline datasets
· Output persisted to S3 Processed Zone for checkpointing
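The production cleansing jobs run on Spark; the keep-latest deduplication rule they apply can be sketched in plain Python for illustration (field names are assumptions, not the customer's schema):

```python
# Plain-Python sketch of the cleansing layer's semantics: drop malformed
# records missing required fields, then keep only the most recently
# ingested record per event_id. The Spark job achieves the same result
# with dropDuplicates/window functions over much larger data.
def deduplicate(events: list[dict]) -> list[dict]:
    """Filter malformed records and keep the latest record per event_id."""
    required = {"event_id", "ingested_at"}
    latest: dict[str, dict] = {}
    for e in events:
        if not required.issubset(e):  # data-quality filter
            continue
        prev = latest.get(e["event_id"])
        if prev is None or e["ingested_at"] > prev["ingested_at"]:
            latest[e["event_id"]] = e
    return list(latest.values())
```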
· Enrichment using:
o Customer profile data
o Product and catalogue metadata
o Store and regional reference data
· Sessionization using Spark window functions
· Enriched datasets persisted to S3 to avoid recomputation
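Sessionization in the Spark jobs uses window functions; the underlying rule can be sketched in plain Python. The 30-minute inactivity timeout is an assumed parameter, not a figure from the engagement:

```python
# Sketch of the sessionization rule: one user's events are split into
# sessions wherever the gap between consecutive events exceeds the
# inactivity timeout. The Spark implementation applies the same rule
# with a lag() window over (user_id, timestamp).
SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout

def sessionize(timestamps: list[int]) -> list[list[int]]:
    """Split one user's event timestamps (epoch seconds) into sessions."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP_SECONDS:
            sessions[-1].append(ts)  # continue current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions
```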
· Computation of business-critical metrics:
o GMV (daily and hourly)
o Conversion funnels
o Inventory turnover and stock movement
o Channel and region performance
· Idempotent aggregation logic
· Output written in Parquet and partitioned by date, region, and channel
· File compaction and layout optimization
· Partition tuning for downstream analytics performance
· Final outputs stored in Amazon S3 Curated Zone
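Idempotent aggregation hinges on deterministic output locations: a rerun overwrites the same partition prefix instead of appending duplicates. In Spark this is commonly paired with dynamic partition overwrite (`spark.sql.sources.partitionOverwriteMode=dynamic`); the path convention itself, with illustrative names, is simply:

```python
# Sketch: deterministic Curated Zone prefix for one aggregate partition.
# Because a rerun of the same (metric, date, region, channel) always maps
# to the same prefix and overwrites it, retried jobs cannot produce
# duplicate output. Metric and layout names are illustrative.
def curated_partition_prefix(metric: str, dt: str, region: str, channel: str) -> str:
    """Build the Curated Zone prefix for one aggregate partition."""
    return f"curated/{metric}/dt={dt}/region={region}/channel={channel}/"
```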
· Curated datasets loaded into Amazon Redshift (Multi-AZ / RA3)
· Used for:
o BI dashboards
o Sales and funnel analytics
o Executive reporting
· Amazon Aurora retained strictly for OLTP workloads
· Amazon S3 served as the long-term analytical and ML data lake
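Loading a curated Parquet partition into Redshift is typically a COPY per partition. A minimal sketch, with hypothetical table, prefix, and role values (authentication details for the Data API are omitted for brevity):

```python
# Sketch: building and submitting the Redshift COPY that loads one
# curated Parquet partition from S3. All identifiers are placeholders.
def build_copy_sql(table: str, prefix: str, iam_role: str) -> str:
    """Build the COPY statement for a curated Parquet partition."""
    return (
        f"COPY {table} "
        f"FROM 's3://{prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

def run_copy(sql: str, cluster_id: str, database: str) -> None:
    """Submit the COPY through the Redshift Data API (illustrative;
    DbUser/SecretArn credentials omitted)."""
    import boto3  # local import keeps the SQL builder testable offline
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier=cluster_id, Database=database, Sql=sql
    )
```

Because the source prefix is a deterministic partition path, a reload after failover replaces the same slice of data rather than appending to it.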
· Controlled retries
o Implemented at each ETL stage via Step Functions
· Automated failover
o On repeated failures or AZ degradation:
§ Step Functions triggered ARC routing changes
§ Active EMR cluster deactivated
§ Standby cluster activated in alternate AZ
§ Processing resumed from last S3 checkpoint
· Operational alerting
o Failures captured in Amazon CloudWatch Logs
o Log subscription filters triggered Datadog handlers
o Errors enriched and forwarded to:
§ Datadog for monitoring
§ ServiceNow for incident creation
§ Email notifications with relevant log excerpts
This reduced mean time to resolution and eliminated manual log investigation.
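A subscription-filter handler receives log events in the documented CloudWatch Logs delivery format: a gzipped, base64-encoded JSON payload under `awslogs.data`. Decoding it, before any enrichment or forwarding to Datadog/ServiceNow (which is out of scope here), looks like:

```python
# Sketch: decoding the payload CloudWatch Logs delivers to a
# subscription-filter handler. This is the standard delivery format;
# the downstream enrichment/forwarding is omitted.
import base64
import gzip
import json

def decode_log_events(event: dict) -> list[dict]:
    """Decode the gzipped, base64-encoded CloudWatch Logs payload and
    return the contained log events."""
    payload = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(payload))["logEvents"]
```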
· EMR clusters lifecycle-managed by orchestration logic
· No long-lived idle clusters
· Intermediate persistence eliminated full reprocessing
· Stage-level isolation reduced failure blast radius and compute waste
The final architecture combined:
· Multi-AZ Amazon EMR execution
· ARC-based execution control and failover
· Step Functions–driven orchestration
· S3-backed checkpointing and persistence
· Redshift-based analytics consumption
This delivered a resilient, observable, and cost-efficient ETL platform capable of supporting peak e-commerce workloads with predictable recovery behavior.

· Achieved AZ-level fault tolerance for ETL workloads, eliminating single-AZ dependency and enabling deterministic recovery during Availability Zone impairments without manual intervention.
· Reduced ETL recovery time from hours to under 30 minutes by resuming processing from S3-persisted checkpoints and activating standby EMR clusters through ARC-controlled failover.
· Improved ETL stability and job success rate, significantly reducing Spark OOM failures and full-pipeline reruns through layered processing, idempotent job design, and stage-level retries.
· Lowered EMR compute waste by 25–35% by avoiding full dataset reprocessing after partial failures and managing cluster lifecycles through centralized orchestration.
· Ensured continuous analytics availability during peak sales events, maintaining reliable BI dashboards, fraud detection, and operational reporting even under high load and infrastructure stress.