ENTERPRISE DATA ANALYTICS PLATFORM FOR FINTECH

2026-01-19
Data Analytics

EXECUTIVE SUMMARY

Neowise Digital operates a high-throughput payment processing platform handling 2M+ daily transactions across 50,000+ merchants globally. The organization faced critical challenges in data ingestion, transformation, storage, and analytics delivery, including fragmented data pipelines, inconsistent data quality, lack of centralized data governance, and manual deployment processes.

This case study outlines how a data analytics-focused architecture leveraging AWS EMR Serverless, AWS Glue, Amazon S3, Amazon Redshift, and containerized orchestration on EKS with GitOps practices resolved data processing bottlenecks, enabled scalable ETL workflows, and delivered enterprise-grade analytics capabilities.

Key Outcomes:

·        Data Processing Latency: Reduced from 4-6 hours (batch) to 15-20 minutes (near real-time)

·        Data Quality: Improved from 85% accuracy to 99.5% through automated validation

·        Analytics Query Performance: 50x improvement through Redshift optimization

·        Deployment Time: Reduced from 4+ hours to <2 minutes via GitOps

·        Cost Optimization: 40% savings through serverless EMR and lifecycle policies

·        Data Lake Storage: 2.5 PB of structured financial data with 99.99% durability

Data Types and Sources

Neowise processes diverse financial data types requiring different processing strategies:

1. Transactional Data (Structured)

·        Volume: 2M records/day (15 GB/day)

·        Schema: Transaction ID, merchant ID, customer ID, amount, currency, timestamp, status, metadata

·        Source: Apache Kafka (real-time stream)

·        Retention: 7 years (regulatory requirement)

·        Format: JSON in Kafka → Parquet in S3 → Redshift tables

2. Customer Profile Data (Semi-Structured)

·        Volume: 50K updates/day

·        Schema: Customer demographics, account details, preferences, nested address objects, notification settings

·        Source: PostgreSQL databases (20+ instances)

·        Complexity: Nested JSON fields requiring flattening (addresses array, preferences object)

·        Format: PostgreSQL → change data capture via DMS → S3 (JSON) → Glue ETL → Parquet

3. Merchant Metadata (Structured)

·        Volume: 50,000 merchants, daily updates

·        Schema: Business details, categories, locations, payment terms, revenue metrics

·        Source: MySQL operational database

·        Format: MySQL → Glue Crawler → S3 → Redshift dimension tables

4. Historical Transaction Archives (Unstructured)

·        Volume: 2.5 PB accumulated over 5 years

·        Format: CSV files on SFTP servers (legacy)

·        Challenge: Inconsistent schemas across years, missing values, encoding issues (UTF-8/Latin1 mix)

·        Processing: One-time migration + transformation to standardized Parquet

5. External Risk Intelligence (Semi-Structured)

·        Volume: API responses, 50K lookups/day

·        Format: JSON with nested arrays (risk factors, historical incidents)

·        Source: Third-party REST APIs (fraud databases, KYC services)

·        Latency: 500-2000ms per API call

·        Processing: Lambda ingestion → S3 → Glue transformation

6. Log and Event Data (Unstructured)

·        Volume: 500 GB/day

·        Types: Application logs, infrastructure logs, audit trails

·        Format: Plain text, JSON logs

·        Storage: CloudWatch Logs → S3 (long-term archival)

Existing Technology Stack (Legacy)

Component         | Technology                                | Status
------------------|-------------------------------------------|------------------------------------------
Data Ingestion    | Apache Kafka (on‑premises, 5 clusters)    | Siloed, no monitoring
Batch Processing  | Hadoop (8‑node on‑premises)               | Slow, manual scaling
Data Storage      | PostgreSQL (20+ instances), CSV on SFTP   | Fragmented, inconsistent schemas
Data Warehouse    | On‑premises PostgreSQL                    | Cannot handle analytical queries at scale
ETL Orchestration | Apache Airflow (dev only) + shell scripts | Limited orchestration, manual execution
Data Quality      | Manual SQL checks                         | No automated validation framework
Deployment        | Manual SSH + shell scripts                | Error‑prone, 4+ hours per release

PAIN POINTS: DATA ANALYTICS PERSPECTIVE

1: Data Pipeline Fragmentation & Quality Issues

Data exists in 7+ disconnected systems with no unified ingestion, transformation, or quality validation framework. Inconsistent data schemas, missing values, and lack of standardization prevent reliable analytics.

Impact:

·        Data Inconsistency: 15% of transactions missing merchant category codes due to failed joins

·        Analytics Delays: Data scientists spend 60% of time on data cleaning vs. analysis

·        Report Inaccuracy: Monthly revenue reports require 2-3 iterations due to data quality issues

·        Storage Waste: 400 GB of duplicate records across systems costing $12K annually

·        Root Cause: 

o   No centralized data lake, no standardized ETL framework, no automated data quality validation, no schema registry for consistency.

2: Scalability & Performance Bottlenecks

On-premises Hadoop cluster cannot scale for growing data volumes. Batch processing takes 4-6 hours, preventing timely analytics. No auto-scaling, no spot instance optimization.

Impact:

·        Delayed Analytics: Business reports available by 10 AM instead of the 7 AM target

·        Scalability Ceiling: Cannot handle 35% YoY growth without hardware procurement (9-month lead time)

·        Peak Load Failures: Black Friday batch jobs timeout after 8 hours, causing 12-hour delays

·        Cost Inefficiency: $180K/quarter in on-premises hardware with <40% average utilization

·        Innovation Blocked: Data scientists cannot experiment with ad-hoc queries on full dataset due to resource constraints

Root Cause:

·        No elastic compute (EMR Serverless), no distributed processing framework at scale, no auto-scaling for peak loads, no spot instance utilization.

3: Data Storage & Query Performance

No purpose-built data warehouse for analytics. Using PostgreSQL for both operational and analytical workloads causes resource contention. No columnar storage, no query optimization, no partitioning strategy.

PostgreSQL Warehouse (On-Premises):

·        Storage: 8 TB transaction data, row-based storage (inefficient for analytics)

·        Query Performance:

o   Simple aggregation (daily revenue): 45 seconds

o   Complex join (customer lifetime value): 12 minutes

o   Historical trend analysis (6-month): Timeout after 30 minutes

o   Month-end reporting: 4+ hours

·        Concurrency: 5 simultaneous users before performance degradation

·        Data Access Pattern: 80% of queries scan full tables due to no partitioning

·        Indexing: Over-indexed (47 indexes per table) causing slow writes

Impact:

·        Slow Dashboards: Executive dashboards timeout during business hours, requiring off-peak scheduling

·        Resource Contention: Operational queries delayed by 30+ seconds when analytical workloads run

·        Limited Analytics: Cannot run complex queries on full historical data (5+ years)

·        Cost Inefficiency: Over-provisioned database instances ($25K/month) to handle occasional heavy queries

·        User Frustration: Analysts wait 20+ minutes for query results, reducing productivity by 40%

Root Cause: 

No columnar data warehouse (Redshift), no data partitioning strategy, no separation of analytical and operational workloads, no query result caching.

4: Data Transformation & ETL Complexity

ETL processes implemented as fragile shell scripts with no dependency management, error handling, or data lineage tracking. Complex nested JSON structures not properly flattened for analytics.

Customer Profile Data Challenges

Source Data Structure:

·        Customer records contain nested personal information objects, an array of multiple addresses (billing, shipping, historical), and deeply nested preference settings including notification configurations. This multi-level nesting creates significant challenges for analytical queries.

Current Processing Issues:

·        Nested addresses array contains 1-5 addresses per customer but not flattened, preventing joins with transaction data on specific address types

·        Preferences and notification settings are 3 levels deep, making it impossible to query "all customers who opted into email notifications"

·        No standardized field extraction logic across different data sources

·        Manual Python scripts fail on malformed JSON (12% error rate)

Manual ETL Process:

The current process exports JSON from PostgreSQL (20 minutes), runs Python scripts to flatten the JSON structures (40 minutes, with frequent failures on schema variations), writes results to CSV (10 minutes), and loads the warehouse (30 minutes). Total time is approximately 100 minutes with a 12% failure rate.

Impact:

·        Data Quality: 12% of customer records fail flattening due to schema variations

·        Maintenance Burden: ETL scripts require weekly updates for new schema changes

·        No Reusability: Each data source has custom flattening logic, creating 40+ disparate scripts

·        Debugging Difficulty: No lineage tracking when data quality issues arise, taking 2-3 hours per investigation

·        Incomplete Analytics: Cannot analyze customer preferences due to unflattened nested structures

Root Cause:

·        No managed ETL service (Glue), no standardized data transformation framework, no automated schema evolution handling, no data quality checkpoints.

5: Monitoring & Observability for Data Pipelines

No visibility into data pipeline health, job execution status, or data quality metrics. Failures discovered hours after occurrence through user reports rather than proactive monitoring.

Monitoring Gaps:

Component          | Current State                      | Impact
-------------------|------------------------------------|--------------------------------------------------------------------------
ETL Jobs           | No execution logs, manual checking | Cannot debug failed jobs, 3–4 hour MTTR
Data Quality       | Manual SQL checks run weekly       | Issues discovered days later, affecting 50+ downstream reports
Pipeline Latency   | No metrics tracked                 | SLA breaches undetected, discovered by business users
Data Completeness  | No validation of row counts        | Missing data in reports (discovered monthly during reconciliation)
EMR Jobs           | Logs on HDFS (not centralized)     | Cannot correlate failures across jobs, no Spark UI access after job completion
Kafka Consumer Lag | Manual checks via CLI              | Lag spikes go unnoticed, causing hours of data delays

Impact:

·        High MTTR: 3-4 hours to diagnose ETL failures due to log searching across 15+ systems

·        Silent Data Loss: Missing records discovered during monthly reconciliation, too late to recover

·        No Proactive Alerts: Pipeline failures detected by downstream users filing tickets, not by monitoring

·        Compliance Risk: Cannot produce audit trail of data transformations for regulatory reviews

·        Resource Waste: Cannot identify bottlenecks in Spark jobs, leading to over-provisioning

Root Cause:

·        No centralized logging (CloudWatch Logs), no EMR Spark UI persistence, no custom metrics for data quality, no distributed tracing of data flow.

Proposed Architecture:

The redesigned architecture implements a cloud-native, serverless data analytics platform with:

Unified Data Lake: S3-based multi-layer architecture (raw/enriched/curated)

Managed ETL: AWS Glue for scalable data transformation and cataloging

Serverless Processing: EMR Serverless for large-scale Spark workloads with auto-scaling

Data Warehouse: Amazon Redshift for high-performance analytical queries

Orchestration: Apache Airflow on EKS for pipeline coordination

Data Governance: Glue Data Catalog for centralized metadata and schema management

Observability: EMR Spark UI with persistent logs, CloudWatch metrics, and custom data quality dashboards

SOLUTION COMPONENTS:

1. Multi-Layer Data Lake Architecture (Amazon S3)
Organize data by processing stage with clear separation of concerns, enabling efficient data governance and lifecycle management.

Layer 1: Raw Data (Immutable Source Data)

Characteristics:

·        Immutability: Never modified after ingestion, preserving original data for audit and replay

·        Format: Original format preserved (JSON, CSV, Parquet as-is from source)

·        Retention: 7 years (regulatory requirement for financial data)

·        Partitioning: By ingestion date for efficient lifecycle management

·        Compression: gzip compression to reduce storage costs by 70%

·        Size: 2.5 PB total, 8 TB daily ingestion

Data Organization:

The raw zone is organized into subdirectories by source system. Transaction data is partitioned by year, month, and day with hourly files. Customer profile data uses snapshot date partitioning. Merchant metadata includes extract date partitioning. Historical archives are organized by year ranges for legacy data migration.

Storage Policies:

Lifecycle management transitions data to S3 Glacier after 90 days for cost optimization (85% cost reduction). Data is retained for 7 years to meet audit requirements and regional financial regulations. Versioning is enabled to protect against accidental deletion. All data is encrypted at rest using SSE-S3 (AWS-managed keys).

Cost Optimization

The raw zone implements intelligent tiering: Standard storage for 30 days (frequent access for debugging), Standard-IA for days 31-90 (occasional access), and Glacier for 91+ days (compliance retention). This strategy reduces storage costs from $60K/month to $18K/month, a 70% reduction.
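The tiering schedule above maps directly onto a standard S3 lifecycle configuration. The fragment below is an illustrative sketch (the raw/ prefix and rule ID are assumptions); the transition days mirror the 30/90-day tier boundaries, and the expiration matches the 7-year (2,555-day) retention described above.

```json
{
  "Rules": [
    {
      "ID": "raw-zone-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "raw/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 2555}
    }
  ]
}
```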

Layer 2: Enriched Zone (Cleaned & Standardized)

Characteristics:

·        Format: Parquet with Snappy compression (columnar storage optimized for analytics)

·        Partitioning: By date and region for optimal query performance

·        Schema: Standardized, consistent across all sources

·        Compression: Snappy compression (balance between speed and size, 4:1 compression ratio)

·        Data Quality: Validated, deduplicated, flattened, type-converted

·        Size: 800 TB (67% reduction from raw through compression and deduplication)

Data Transformation Techniques:

·        JSON Flattening for Nested Structures:

o   The enriched layer applies systematic flattening of nested JSON structures from customer profiles. Source data contains multi-level nesting with personal information objects, address arrays containing multiple addresses, and preference objects with nested notification settings.

o   The flattening process extracts nested personal information fields to top-level columns. Address arrays are exploded and pivoted to create separate columns for billing and shipping addresses (billing_city, billing_street, shipping_city, shipping_street). Deeply nested preference settings are flattened to individual boolean columns (email_notifications_enabled, sms_notifications_enabled).

·        Result: Queries that previously required complex JSON parsing functions now use simple column references. Query performance improved 40x for customer profile joins. Data is now compatible with BI tools that do not support nested structures.
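As a minimal illustration of this flattening approach, the Python sketch below recursively lifts nested objects into top-level columns and pivots an address array into per-type columns. Field names are assumed for illustration; the production pipeline runs the equivalent logic in Glue Spark jobs.

```python
def flatten(record, prefix="", sep="_"):
    """Recursively flatten nested dicts: {"a": {"b": 1}} -> {"a_b": 1}."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out

def pivot_addresses(addresses):
    """Pivot an address array into per-type columns (billing_city, shipping_city, ...)."""
    cols = {}
    for addr in addresses:
        addr_type = addr["type"]
        for field, value in addr.items():
            if field != "type":
                cols[f"{addr_type}_{field}"] = value
    return cols
```

With this shape, "all customers who opted into email notifications" becomes a filter on a single flattened boolean column instead of a three-level JSON traversal.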

·        Data Type Standardization & Casting:

o   Challenge: Source systems use inconsistent data types: transaction amounts stored as strings with currency symbols ($150.00), integers (150), or doubles (150.50). Dates appear as ISO strings (2025-01-02), US format (01/02/2025), or Unix timestamps.

o   Transformation Process:

§  Currency amounts are cleaned by removing dollar signs and commas, then cast to decimal(18,2) for precision. All date fields are standardized to ISO 8601 timestamp format (yyyy-MM-dd HH:mm:ss). Currency codes are uppercased and trimmed (USD, usd, Usd → USD). Boolean fields from various representations (true/false, 1/0, Y/N) are standardized to boolean type.

·        Result: Consistent data types enable proper aggregations (SUM, AVG work correctly on decimal amounts), date arithmetic works reliably, and JOIN operations succeed based on matching formats.
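The casting rules above can be sketched in plain Python (the production versions run as Spark expressions; rounding mode and accepted date formats here are assumptions):

```python
from datetime import datetime, timezone
from decimal import Decimal, ROUND_HALF_UP

def clean_amount(raw):
    """Strip currency symbols and commas, cast to decimal with 2-digit precision."""
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def normalize_date(raw):
    """Accept ISO strings, US-format strings, or Unix timestamps; emit ISO 8601."""
    if isinstance(raw, (int, float)):
        return datetime.fromtimestamp(raw, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_bool(raw):
    """Map true/false, 1/0, Y/N variants to a real boolean."""
    return str(raw).strip().lower() in ("true", "1", "y", "yes")
```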

·        Deduplication Strategy:

o   Source systems produce duplicate records due to Kafka retries (2-3% duplication rate), change data capture capturing the same change multiple times, and overlapping batch extracts.

o   Deduplication Logic:

§  Duplicates are identified using composite keys (transaction_id + merchant_id + timestamp). When duplicates exist, the latest record is retained based on ingestion timestamp. A window function partitions by composite key and orders by ingestion timestamp descending, keeping only row number 1.

·        Result: 2.8% of records identified as duplicates and removed. Storage savings of 70 TB in enriched layer. Analytics queries return correct counts without manual DISTINCT operations.
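The keep-latest-per-composite-key logic is implemented in production as a Spark window function (partition by key, order by ingestion timestamp descending, keep row 1); a minimal plain-Python equivalent, with field names assumed for illustration:

```python
def deduplicate(records):
    """Keep the latest record per composite key, judged by ingestion timestamp."""
    latest = {}
    for rec in records:
        key = (rec["transaction_id"], rec["merchant_id"], rec["timestamp"])
        # Retain the record with the highest ingestion timestamp for each key
        if key not in latest or rec["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = rec
    return list(latest.values())
```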

·        Data Quality Validation & Cleansing

o   Validation Rules Implemented:

§  Column value ranges are enforced (transaction amounts between 0.01 and 1,000,000). Required fields are validated as not null (customer_id, merchant_id, transaction_date). Valid enumerations are checked (currency must be in USD, EUR, GBP, JPY, CAD). Business logic validation ensures transaction_date is after customer_registration_date.

o   Data Cleansing Actions:

§  Invalid records are quarantined to a separate S3 path for manual review. Correctable issues are auto-fixed (trim whitespace, uppercase currency codes). Critical violations cause job failure and alert data engineering team. Data quality metrics are published to CloudWatch for monitoring.

·        Result: Data quality improved from 85% to 99.5%. Invalid data reduced from 15% to 0.5%. Business users trust analytics without manual verification.
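A minimal sketch of the validation and quarantine split described above (the concrete rule set and field names are taken from the examples in the text; error-message wording is an assumption):

```python
VALID_CURRENCIES = {"USD", "EUR", "GBP", "JPY", "CAD"}
REQUIRED_FIELDS = ("customer_id", "merchant_id", "transaction_date")

def validate(record):
    """Return a list of rule violations; an empty list means the record is clean."""
    errors = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")
    amount = record.get("amount")
    if amount is not None and not (0.01 <= amount <= 1_000_000):
        errors.append(f"amount out of range: {amount}")
    # Currency codes are uppercased first (an auto-fixable issue per the text)
    if record.get("currency", "").strip().upper() not in VALID_CURRENCIES:
        errors.append(f"invalid currency: {record.get('currency')}")
    return errors

def partition_records(records):
    """Split records into (valid, quarantined), mirroring the quarantine S3 path."""
    valid, quarantined = [], []
    for rec in records:
        (quarantined if validate(rec) else valid).append(rec)
    return valid, quarantined
```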

·        Column Renaming & Standardization

o   Source systems use inconsistent naming conventions: customer_id vs customerId vs cust_id, transaction_date vs txn_date vs date_of_transaction.

o   Standardization Rules:

§  All column names converted to snake_case (customer_id, transaction_date). Abbreviations expanded to full words (txn → transaction, cust → customer). Standard naming conventions applied (all dates suffixed with _date, all IDs suffixed with _id).

·        Result: Cross-source joins simplified. Data catalog is self-documenting. New team members understand data without extensive documentation.
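The renaming rules can be sketched as a small normalizer (the abbreviation map below covers only the examples given in the text):

```python
import re

ABBREVIATIONS = {"txn": "transaction", "cust": "customer"}

def to_snake_case(name):
    """camelCase/PascalCase -> snake_case, then expand known abbreviations."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()
    parts = [ABBREVIATIONS.get(part, part) for part in snake.split("_")]
    return "_".join(parts)
```

Applied across a schema, customer_id, customerId, and cust_id all converge on customer_id, so cross-source joins no longer need per-source aliasing.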

·        Enriched Layer Structure:

o   The enriched layer contains transactions partitioned by year, month, day, and region. Customer profiles are flattened with snapshot date partitioning. Merchant dimensions are maintained as slowly changing dimension type 2. Risk intelligence data is normalized from nested JSON to relational format.

Layer 3: Curated Zone (Business-Ready Analytics)

Characteristics:

·        Purpose: Pre-aggregated, business-logic applied datasets optimized for consumption

·        Consumers: BI tools (Tableau, QuickSight), Redshift, data science teams

·        Schema: Denormalized for query performance, star schema design

·        Refresh Frequency: Daily for batch aggregations, hourly for near-real-time metrics

·        Size: 120 TB (highly aggregated from enriched layer)

Data Aggregation Techniques:

AWS Modern Data Platform Architecture

DATA IMPLEMENTATION

1. Data Lake Architecture & Curated Layer

A. Customer Analytics Aggregations

Customer Lifetime Value (CLV) Calculation

The curated layer pre-computes customer lifetime metrics including:

  • Total transaction count
  • Total transaction value
  • Average transaction value
  • First transaction date
  • Most recent transaction date
  • Customer tenure in days
  • Transaction frequency (transactions per month)

Window Functions for Time-Series Analysis

  • Running totals: Calculated using cumulative sum windows partitioned by customer
  • Moving averages: Use 30-day and 90-day rolling windows for trend analysis
  • Rank functions: Identify top customers by revenue within each region and merchant category
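The running-total and moving-average windows reduce to cumulative sums and rolling means over date-ordered daily values; a minimal sketch:

```python
from collections import deque
from itertools import accumulate

def running_totals(values):
    """Cumulative sum, as a window function ordered by date would produce."""
    return list(accumulate(values))

def moving_average(values, window):
    """Rolling mean over a fixed window (e.g. 30 or 90 days of daily totals)."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```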

Result: CLV queries that took 12 minutes on enriched data now return in under 2 seconds from curated layer. Dashboards load instantly with pre-aggregated metrics.

B. Merchant Analytics & KPIs

Merchant Performance Metrics

Pre-aggregated merchant KPIs include:

  • Daily/weekly/monthly revenue totals
  • Transaction count by merchant and category
  • Average transaction value by merchant
  • Refund rates and chargeback rates
  • Customer acquisition counts (new vs returning)

Trend Analysis

  • Month-over-month growth rates calculated and stored
  • Year-over-year comparisons pre-computed for seasonal analysis
  • Merchant rank by revenue updated daily for leaderboards

Result: Merchant dashboards load 50x faster. Complex trend queries execute in under 3 seconds vs 2+ minutes previously.

C. Time-Series Rollups & Aggregations

Multi-Granularity Aggregations

Transaction data aggregated at multiple time granularities:

  • Hourly rollups for real-time monitoring
  • Daily summaries for operational reporting
  • Weekly and monthly aggregations for executive reporting
  • Yearly summaries for long-term trend analysis

Partitioning Strategy

Each granularity level partitioned differently:

  • Hourly data: partitioned by date and hour
  • Daily summaries: date partitioning only
  • Monthly aggregations: year and month partitioning

Result: Reports select the appropriate granularity level for their timeframe. Query performance improved 80x by avoiding re-aggregation.

D. Compliance & Regulatory Reporting Datasets

Audit Trail Datasets

  • Point-in-time snapshots of customer data for regulatory compliance
  • Transaction audit trails capture all state changes
  • Data deletion logs track right-to-be-forgotten requests

Regulatory Report Tables

Pre-formatted datasets match regulatory reporting requirements:

  • Anti-money laundering transaction monitoring

Result: Regulatory report generation reduced from 40 hours to 2 hours per quarter. Audit requests fulfilled in minutes vs days.

Curated Layer Structure

The curated zone contains:

  • Customer analytics (aggregated metrics by customer)
  • Merchant analytics (KPIs and trends)
  • Daily transaction summaries (pre-aggregated facts)
  • Revenue reports (business intelligence datasets)
  • Compliance snapshots (regulatory reporting data)

2. AWS Glue ETL & Data Transformation

Purpose: Managed ETL service for scalable data transformation, schema evolution, and data quality management.

Glue Job Architecture

Job Types Implemented

1. Glue Python Shell Jobs (Lightweight)

Used for small-scale transformations:

  • API data ingestion and lightweight formatting
  • Metadata updates to Glue Catalog
  • Data quality report generation
  • Simple CSV to Parquet conversions

Configuration: 1 DPU (Data Processing Unit), runs in under 5 minutes, cost-effective for small data volumes (<1 GB).

2. Glue Spark Jobs (Distributed Processing)

Used for large-scale transformations:

  • Flattening nested JSON from customer profiles (50K records/day)
  • Joining transactions with customer and merchant data
  • Deduplication across 2M+ daily transactions
  • Complex aggregations for curated layer

Configuration: 10-50 DPUs with auto-scaling, processes 8 TB daily data in 15-20 minutes, Spark 3.3 with Glue 4.0 runtime.

3. Glue Streaming Jobs (Real-Time)

Used for near-real-time processing:

  • Kafka to S3 ingestion with micro-batching (5-minute windows)
  • Real-time data quality validation
  • Streaming aggregations for monitoring dashboards

Configuration: Continuous run with checkpoint management, processes 50K events per second during peak, writes to enriched layer every 5 minutes.

Glue Data Catalog Integration

Purpose: Centralized metadata repository for all data assets, enabling schema discovery and query federation.

Catalog Organization

Databases:

  • raw_zone: schemas for source data as-is
  • enriched_zone: standardized, cleaned schemas
  • curated_zone: aggregated, business-ready schemas
  • external_sources: third-party API data

Tables:

Each S3 prefix registered as a table:

  • fact_transactions: 400M+ rows, partitioned by date and region
  • dim_customers: 5M rows, SCD type 2
  • dim_merchants: 50K rows
  • customer_analytics: pre-aggregated metrics

Partitions:

Tables partitioned to enable partition pruning for faster queries:

  • Transaction tables have 18,000+ partitions (5 years × 365 days × 10 regions)
  • Glue Crawler automatically discovers new partitions daily
  • Partition projection used for predictable date-based partitions to avoid crawler overhead

Schema Evolution Handling

  • Glue automatically detects schema changes during crawls
  • New columns added without breaking existing queries
  • Column type changes trigger data engineering alerts
  • Schema versions tracked for audit trail

Result: Athena queries automatically use Glue Catalog for schema information. Redshift Spectrum federated queries access data lake through catalog. No manual schema management required.

Data Quality Framework

Glue Data Quality Rules

Built-in Glue Data Quality feature validates data during ETL:

  • Completeness rules: check for null values in required fields
  • Uniqueness rules: validate primary key constraints
  • Statistical rules: detect anomalies in distributions
  • Referential integrity rules: ensure foreign key relationships

Great Expectations Integration

Custom data quality checks implemented using Great Expectations:

  • Schema validation: ensures all expected columns exist
  • Value range checks: validate business logic (amounts > 0)
  • Distribution checks: detect data skew or anomalies
  • Custom rules: validate complex business logic

Quality Metrics Dashboard

  • Data quality scores published to CloudWatch for each table
  • Quality trends tracked over time
  • Alerts trigger when quality drops below 99% threshold
  • Quality reports generated daily for data stewards

Result: Data quality issues detected immediately vs days later. Automated remediation for common issues. Trust in analytics increased across organization.

Job Orchestration via Airflow

DAG Structure

Glue jobs orchestrated by Apache Airflow running on EKS. A typical daily ETL DAG includes:

  1. Trigger Glue Crawler to discover new source data
  2. Run Glue ETL jobs to process raw to enriched
  3. Execute data quality validation
  4. Trigger EMR Serverless jobs for curated aggregations
  5. Load Redshift from curated layer
  6. Send notifications on success or failure

Dependency Management

  • Tasks wait for upstream completion
  • Glue jobs run in parallel where possible (10+ concurrent jobs)
  • Failed jobs retry 3 times with exponential backoff
  • Critical path jobs have priority scheduling
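Airflow exposes the retry policy above through task-level retry settings; as a framework-independent illustration, 3 attempts with exponential backoff reduce to (the injectable sleep parameter is for testability and is an assumption of this sketch):

```python
import time

def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a failing job up to max_attempts times, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure to the scheduler
            sleep(base_delay * 2 ** (attempt - 1))
```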

Result: End-to-end pipeline orchestration with clear visibility. Pipeline runtime reduced from 5.5 hours to 15-20 minutes.

3. EMR Serverless for Large-Scale Spark Processing

Purpose: Execute distributed Spark workloads for complex transformations and aggregations without managing clusters.

EMR Serverless Architecture

Application Configuration

Pre-Initialized Capacity:

  • Minimum workers: 5 (warm pool for fast job startup)
  • Maximum workers: 200 (auto-scales based on workload)
  • Worker type: 4 vCPU and 16 GB RAM per worker
  • Startup time: Sub-minute (vs 15 minutes for on-demand EMR clusters)
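These capacity settings correspond roughly to an EMR Serverless application definition. The sketch below is illustrative only (application name and release label are assumptions) and follows the shape of the CreateApplication request; the maximum capacity reflects 200 workers at 4 vCPU / 16 GB each:

```json
{
  "name": "neowise-analytics",
  "releaseLabel": "emr-6.10.0",
  "type": "SPARK",
  "initialCapacity": {
    "DRIVER":   {"workerCount": 1, "workerConfiguration": {"cpu": "4 vCPU", "memory": "16 GB"}},
    "EXECUTOR": {"workerCount": 5, "workerConfiguration": {"cpu": "4 vCPU", "memory": "16 GB"}}
  },
  "maximumCapacity": {"cpu": "800 vCPU", "memory": "3200 GB"},
  "autoStopConfiguration": {"enabled": true, "idleTimeoutMinutes": 5}
}
```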

Auto-Scaling Behavior:

  • Workers scale up when Spark tasks are pending
  • Scale down after 5 minutes of idle time
  • Scale proportional to data volume being processed
  • During Black Friday, system scaled from 5 to 180 workers automatically

Cost Optimization:

  • EMR Serverless charges only for resources used during job execution
  • No charges for idle time between jobs
  • 70% cost reduction vs running persistent EMR clusters
  • Monthly cost reduced from $45K (on-demand EMR) to $13K (serverless)

Spark Job Types & Scheduling

Job Category 1: Customer Aggregations

Purpose: Calculate customer-level metrics for analytics and business intelligence.

Processing Logic:

  • Aggregate total transaction value per customer
  • Count unique transactions per customer
  • Calculate average transaction value
  • Identify first and last transaction dates
  • Compute customer lifetime value metrics
  • Calculate transaction frequency patterns

Data Volume:

  • Processes 2M transactions daily
  • Joins with 5M customer records
  • Outputs 800K customer metrics records

Schedule:

  • Runs daily at 2 AM (off-peak hours)
  • Duration: 8-12 minutes with auto-scaling
  • Outputs written to s3://neowise-analytics/curated/customer_analytics/

Result: Customer dashboards updated by 7 AM, roughly 30x faster than previous Hadoop processing (from 6 hours to 12 minutes).

Job Category 2: Merchant Performance Analytics

Purpose: Generate merchant KPIs and performance trends for merchant-facing dashboards.

Processing Logic:

  • Aggregate revenue by merchant and time period
  • Calculate transaction counts and average ticket size
  • Compute refund and chargeback rates
  • Identify top-performing products per merchant
  • Analyze customer retention by merchant
  • Generate month-over-month growth metrics

Data Volume:

  • 2M transactions across 50K merchants daily
  • Outputs 50K merchant metric records

Schedule:

  • Runs daily at 3 AM after customer aggregations
  • Duration: 5-8 minutes
  • Outputs to s3://neowise-analytics/curated/merchant_analytics/

Result: Merchant dashboards load instantly, enabling self-service analytics for merchant partners.

Job Category 3: Time-Series Rollups

Purpose: Pre-aggregate data at multiple time granularities for fast reporting.

Processing Logic:

  • Create hourly transaction summaries from minute-level data
  • Roll up hourly data to daily summaries
  • Aggregate daily summaries to weekly and monthly
  • Calculate running totals and moving averages
  • Partition by region for geographic reporting

Data Volume:

  • Input: 15 GB raw transactions daily
  • Output: 2 GB multi-granularity aggregations

Schedule:

  • Hourly jobs for real-time metrics (run at :05 past each hour)
  • Daily rollup jobs at 4 AM
  • Weekly rollups on Monday mornings
  • Monthly rollups on 1st of each month

Result: Reports select optimal granularity, query performance improved 80x, storage efficiency improved through pre-aggregation.

Job Category 4: Complex Join Operations

Purpose: Create denormalized datasets joining transactions with dimensions for analytics.

Processing Logic:

  • Join transactions with customer profiles (5M records)
  • Enrich with merchant metadata (50K merchants)
  • Add risk intelligence data from third-party sources
  • Flatten nested structures from all sources
  • Apply business logic transformations

Data Volume:

  • 2M+ transaction records joined with multiple dimension tables daily

Schedule:

  • Runs daily at 1 AM before other aggregations
  • Duration: 15-18 minutes (longest running job)
  • Outputs to s3://neowise-analytics/enriched/transactions_enriched/

Result: Downstream jobs work with pre-joined data, eliminating repeated join operations and improving overall pipeline efficiency.

EMR Spark UI for Logging & Monitoring

Purpose: Persistent Spark UI provides detailed visibility into job execution, stage performance, and resource utilization for debugging and optimization.

Spark UI Configuration

Persistent History Server:

  • EMR Serverless automatically configures Spark to write event logs to S3 (s3://neowise-analytics/logs/spark-events/)
  • Event logs retained for 30 days for analysis
  • Spark UI accessible through EMR Console even after jobs complete
  • Historical job runs can be analyzed weeks after execution

Logging to CloudWatch:

  • Driver logs capture application-level logging and errors
  • Executor logs show task-level execution details
  • Stderr and stdout streams automatically sent to CloudWatch Logs
  • Logs organized by application ID and job run ID for easy filtering

Key Metrics Captured

Job-Level Metrics:

  • Total job duration and time breakdown by stage
  • Number of tasks executed
  • Data read from S3 and data written to S3
  • Shuffle read and shuffle write volumes
  • Executor memory and CPU utilization

Stage-Level Analysis:

  • Spark UI shows detailed stage DAG visualization
  • Each stage displays task count, duration, and data volume
  • Slow stages identified through duration heatmaps
  • Skewed partitions visible in task duration metrics

Task-Level Details:

  • Individual task execution times
  • Task locality (PROCESS_LOCAL vs NODE_LOCAL vs RACK_LOCAL)
  • Garbage collection time per task
  • Shuffle read/write per task
  • Failures and retry counts

Usage for Debugging

Performance Optimization Example:

  • Issue: Customer aggregation job taking 25 minutes, exceeding 15-minute SLA
  • Analysis via Spark UI: Stage 3 (groupBy operation) taking 18 minutes with severe data skew. Task duration varied from 30 seconds to 15 minutes
  • Solution: Added salting to distribute hot keys evenly
  • Result: Job duration reduced to 9 minutes, meeting SLA
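
The salting fix above can be sketched in a few lines — each row of a hot key is deterministically assigned to one of N sub-keys so its rows spread across partitions instead of piling onto a single straggler task. The salt count and key names are illustrative:

```python
NUM_SALTS = 8

def salted_key(key: str, row_id: int) -> str:
    # Each row of a hot key lands deterministically in one of NUM_SALTS buckets.
    return f"{key}#{row_id % NUM_SALTS}"

# 1,000 rows for a single hot customer now spread across 8 partition keys;
# a first aggregation runs per salted key, then a second pass merges the
# partial results back to the original key.
keys = {salted_key("hot-customer", i) for i in range(1000)}
```

The trade-off is a two-stage aggregation (partial then merge), which is almost always cheaper than one skewed stage.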

Data Quality Investigation Example:

  • Issue: Merchant analytics showing inconsistent record counts
  • Spark UI logs revealed: NullPointerException in 2% of tasks during join operation
  • Root cause: Merchant dimension table had null IDs breaking join
  • Solution: Added null handling in join logic
  • Result: 100% task success rate, accurate record counts

Resource Optimization Example:

  • Spark UI memory metrics: Showed executors with 80% idle memory
  • Action: Reduced over-provisioned worker memory from 16 GB to 8 GB
  • Cost savings: 40% reduction in EMR Serverless costs with no performance impact

Alerting & Monitoring

  • CloudWatch alarms trigger when job duration exceeds thresholds (e.g., customer aggregation > 15 minutes)
  • Task failure rate alerts when >5% of tasks fail
  • Data volume anomaly detection alerts when input data varies by >30% from rolling average
  • Custom metrics track business KPIs (records processed per minute, data quality score)
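
As an illustration of the SLA-breach alarm above, the following builds the parameter set that boto3's `cloudwatch.put_metric_alarm(**params)` accepts. The namespace, metric name, and SNS topic ARN are assumptions; the call itself is shown commented out so the sketch stays self-contained:

```python
# Illustrative CloudWatch alarm for the 15-minute customer-aggregation SLA.
params = {
    "AlarmName": "customer-aggregation-duration-sla",
    "Namespace": "Neowise/DataPipeline",           # assumed custom namespace
    "MetricName": "JobDurationMinutes",
    "Dimensions": [{"Name": "JobName", "Value": "customer-aggregation"}],
    "Statistic": "Maximum",
    "Period": 300,                                 # evaluate in 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 15.0,                             # the 15-minute SLA
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:data-oncall"],  # assumed ARN
}
# boto3.client("cloudwatch").put_metric_alarm(**params)  # not executed here
```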

Result: MTTR for data pipeline issues reduced from 3 hours to 15 minutes. Proactive optimization based on Spark UI metrics. Clear audit trail for compliance and debugging.

Data Storage After Processing

Storage Format

  • All EMR Serverless outputs written in Parquet format with Snappy compression
  • Columnar storage provides 10x faster query performance vs row-based
  • Compression reduces storage by 75% vs uncompressed JSON

Partitioning Strategy

  • Transaction data partitioned by year, month, day, and region
  • Customer analytics partitioned by snapshot_date
  • Merchant analytics partitioned by calculation_date
  • Partitioning enables partition pruning, reducing query scans by 90%
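
The Hive-style layout this partitioning produces can be made explicit with a small helper. In PySpark the same structure falls out of `df.write.partitionBy("year", "month", "day", "region")`; the bucket path below matches the one named earlier in the text:

```python
from datetime import date

def partition_prefix(base: str, d: date, region: str) -> str:
    # Hive-style key=value directories enable partition pruning in Glue/Spark.
    return (f"{base}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/region={region}/")

prefix = partition_prefix("s3://neowise-analytics/enriched/transactions",
                          date(2026, 1, 18), "eu")
# → s3://neowise-analytics/enriched/transactions/year=2026/month=01/day=18/region=eu/
```

A query filtered to one day and one region scans only that directory, which is where the 90% scan reduction comes from.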

Data Retention

  • Enriched layer: 2 years in Standard storage, then Glacier
  • Curated layer: 1 year in Standard-IA, then Glacier
  • Logs and event data: 90 days retention
  • Compliance-required data: 7 years in Glacier
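
The enriched layer's 2-year Standard-to-Glacier transition above could be expressed as an S3 lifecycle rule like the following — the structure matches what `s3.put_bucket_lifecycle_configuration` expects, though the rule ID and prefix are assumptions:

```python
# Illustrative lifecycle rule: enriched data moves to Glacier after 2 years.
lifecycle = {
    "Rules": [{
        "ID": "enriched-to-glacier",
        "Filter": {"Prefix": "enriched/"},          # assumed key prefix
        "Status": "Enabled",
        "Transitions": [{"Days": 730, "StorageClass": "GLACIER"}],  # ~2 years
    }]
}
# s3.put_bucket_lifecycle_configuration(
#     Bucket="neowise-analytics", LifecycleConfiguration=lifecycle)
```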

Data Transparency

All data includes metadata columns:

  • ingestion_timestamp: when data entered system
  • processing_timestamp: when transformation occurred
  • data_source: origin system identifier
  • job_run_id: Glue or EMR job identifier
  • data_quality_score: validation result 0-100
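
Stamping these lineage columns can be sketched as below — in the real jobs this would be a handful of `withColumn` calls in Spark, but the logic is the same. Function and argument names are illustrative:

```python
from datetime import datetime, timezone

def stamp(record: dict, source: str, job_run_id: str, quality: int) -> dict:
    """Attach the five lineage metadata columns to a record."""
    now = datetime.now(timezone.utc).isoformat()
    return {**record,
            "ingestion_timestamp": record.get("ingestion_timestamp", now),
            "processing_timestamp": now,          # when this transform ran
            "data_source": source,                # origin system identifier
            "job_run_id": job_run_id,             # Glue or EMR job identifier
            "data_quality_score": quality}        # validation result, 0-100

row = stamp({"txn_id": "T-1"}, "kafka-transactions", "jr-42", 100)
```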

Result: Complete data lineage from source to consumption. Ability to replay specific time periods. Regulatory audit requirements easily fulfilled.

4. Amazon Redshift Data Warehouse

Purpose: Columnar data warehouse optimized for analytical queries and BI tool integration.

Cluster Configuration

  • Node Type: ra3.4xlarge (128 GB RAM, 12 vCPUs, managed storage)
  • Cluster Size: 4 nodes (512 GB total RAM)
  • Storage: 16 TB managed storage (separated from compute)
  • Concurrency: 50 concurrent queries supported via WLM

Table Design

Fact Table: fact_transactions

  • Contains 400M rows (2 years of transaction history)
  • Distributed by customer_id for efficient joins
  • Sorted by transaction_date for range queries
  • Compressed with AZ64 encoding (native Redshift compression)

Dimension Tables:

  • dim_customers: 5M rows, full copy stored on every node (ALL distribution style)
  • dim_merchants: 50K rows, replicated to all nodes
  • dim_dates: 3,650 rows (10 years), replicated

Aggregation Tables:

  • fact_daily_aggregates: Pre-aggregated daily summaries (2K rows per day)
  • fact_customer_lifetime_value: Customer-level rollups (5M rows)
  • fact_merchant_monthly: Merchant performance by month (600K rows)
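
The fact table design above might translate into DDL along these lines. Column names beyond the keys mentioned in the text are assumptions, and the statement is held as a string rather than executed; note AZ64 applies to numeric/date columns, while character columns use their own encodings:

```python
# Illustrative Redshift DDL for the fact table described above.
fact_transactions_ddl = """
CREATE TABLE fact_transactions (
    transaction_id   VARCHAR(64),
    customer_id      BIGINT        ENCODE az64,
    merchant_id      BIGINT        ENCODE az64,
    amount           DECIMAL(18,2) ENCODE az64,
    currency         CHAR(3),
    transaction_date DATE          ENCODE az64,
    status           VARCHAR(16)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (transaction_date);
"""
```

Distributing on `customer_id` collocates fact rows with customer-keyed joins, and the `transaction_date` sort key is what makes the range queries fast.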

Data Loading Strategy

Incremental Loads:

  • Daily loads from curated layer via COPY command
  • Load only new partitions (yesterday's data)
  • Runs at 5 AM after EMR jobs complete
  • Duration: 5-8 minutes for 2M daily transactions
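
The incremental COPY above can be sketched as a small SQL builder that targets only yesterday's partition. The IAM role ARN and exact curated-layer prefix are assumptions:

```python
from datetime import date, timedelta

def daily_copy_sql(run_date: date) -> str:
    """Build a COPY statement loading only yesterday's Parquet partition."""
    d = run_date - timedelta(days=1)               # yesterday's partition
    prefix = (f"s3://neowise-analytics/curated/transactions/"
              f"year={d.year}/month={d.month:02d}/day={d.day:02d}/")
    return (f"COPY fact_transactions FROM '{prefix}' "
            f"IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "  # assumed ARN
            f"FORMAT AS PARQUET;")

sql = daily_copy_sql(date(2026, 1, 19))
```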

Full Refresh Tables:

  • Dimension tables fully refreshed weekly
  • Uses DELETE then INSERT pattern
  • Maintains slowly changing dimension history

Query Performance

Before Redshift (PostgreSQL):

  • Daily revenue query: 45 seconds
  • Customer lifetime value: 12 minutes
  • 6-month trend analysis: timeout (30+ minutes)

After Redshift:

  • Daily revenue query: 0.8 seconds (56x faster)
  • Customer lifetime value: 14 seconds (51x faster)
  • 6-month trend analysis: 18 seconds (100x faster, previously timeout)

Result: Interactive analytics possible, dashboards load in under 3 seconds, 50+ concurrent users supported.

5. Apache Airflow Orchestration (EKS)

Purpose: Coordinate data pipeline execution across Glue, EMR, Redshift, and validation steps.

Airflow Deployment on EKS

Configuration:

  • 2 scheduler pods for high availability
  • 5-10 worker pods with auto-scaling based on task queue
  • 1 webserver pod for UI access
  • PostgreSQL backend for metadata (RDS Aurora)

DAG Structure: Daily Analytics Pipeline

Stage 1: Data Discovery (6:00 AM)

  • Trigger Glue Crawler to discover new partitions in raw zone
  • Update Glue Data Catalog with new schemas
  • Validate all expected data files present

Stage 2: Data Transformation (6:10 AM)

  • Run Glue ETL jobs in parallel (10+ concurrent jobs)
  • Process raw to enriched layer
  • Apply flattening, deduplication, standardization
  • Duration: 10-12 minutes

Stage 3: Data Quality Validation (6:25 AM)

  • Execute Great Expectations validation suite
  • Check row counts, schema compliance, business rules
  • Fail pipeline if quality <99%
  • Duration: 2-3 minutes
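
The fail-fast gate in this stage reduces to a simple check: stop the pipeline if the validation pass rate drops below 99%. The sketch below is a simplified stand-in for the Great Expectations suite, with the threshold taken from the text:

```python
class DataQualityError(Exception):
    """Raised when validation results fall below the pipeline threshold."""

def quality_gate(passed: int, total: int, threshold: float = 0.99) -> float:
    score = passed / total
    if score < threshold:
        # Failing here prevents bad data from reaching the curated layer.
        raise DataQualityError(f"quality {score:.2%} below {threshold:.0%}")
    return score
```

Downstream stages (EMR aggregations, Redshift load) only run when the gate returns, so a bad upstream day never reaches dashboards.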

Stage 4: Advanced Analytics (6:30 AM)

  • Submit EMR Serverless jobs for aggregations
  • Customer analytics job (8 mins)
  • Merchant analytics job (5 mins)
  • Time-series rollups (6 mins)
  • Total duration: 8 mins (parallel execution)

Stage 5: Data Warehouse Load (6:40 AM)

  • COPY data from curated layer to Redshift
  • Update fact tables incrementally
  • Refresh aggregation tables
  • Duration: 5-8 minutes

Stage 6: Validation & Notification (6:50 AM)

  • Run reconciliation queries (source vs target counts)
  • Send success notification to Slack
  • Publish pipeline metrics to CloudWatch
  • Generate data quality report

Total Pipeline Duration: 45-50 minutes (vs 5.5 hours previously)

Error Handling

  • Failed tasks retry 3 times with exponential backoff
  • Critical failures trigger PagerDuty alerts
  • Partial failures allow downstream tasks to proceed with available data
  • Clear error messages in Airflow UI for debugging
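
The retry policy above — 3 retries with exponential backoff — can be sketched as follows. In Airflow this is per-task configuration (`retries=3`, `retry_delay`, `retry_exponential_backoff=True`); the 60-second base delay here is an assumption:

```python
def backoff_delays(retries: int = 3, base_seconds: int = 60) -> list[int]:
    # Delay doubles on each successive retry attempt.
    return [base_seconds * 2 ** n for n in range(retries)]

delays = backoff_delays()  # → [60, 120, 240]
```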

Result: Predictable, reliable daily pipeline. Data ready by 7 AM for business users. 99.9% success rate vs 85% previously.

DEVOPS IMPLEMENTATION

1. GitLab CI/CD Pipeline

Purpose: Automated build, test, security scan, and deployment orchestration for all infrastructure and application components.

Pipeline Stages

Stage 1: Lint & Validate

  • Terraform validate for infrastructure code
  • Helm lint for Kubernetes manifests
  • Pylint for Python/PySpark code
  • SQL linting for Glue scripts
  • Duration: 2-3 minutes

Stage 2: Build

  • Build Docker images for Airflow, custom operators, data quality jobs
  • Tag with git commit SHA for traceability
  • Cache layers for faster builds
  • Duration: 3-5 minutes

Stage 3: Test

  • Unit tests for data transformation logic
  • Integration tests against test data
  • Schema validation tests
  • Data quality rule tests
  • Duration: 5-7 minutes

Stage 4: Security Scan

  • Trivy scans for CVEs in Docker images
  • Snyk scans for dependency vulnerabilities
  • Fail build if critical CVEs found
  • GHAS code scanning for secrets
  • Duration: 2-3 minutes

Stage 5: Push to ECR

  • Push Docker images to Amazon ECR
  • Apply image tags (commit SHA, version, latest)
  • Enable image scanning in ECR
  • Sign images with Docker Content Trust
  • Duration: 1-2 minutes

Stage 6: Deploy to Dev

  • Trigger ArgoCD sync for dev environment
  • Automated deployment
  • Validate health checks
  • Run smoke tests
  • Duration: 2-3 minutes

Stage 7: Deploy to Production

  • Require manual approval in GitLab
  • Trigger ArgoCD sync for production
  • Blue-green deployment strategy
  • Automated rollback on failure
  • Duration: 2-3 minutes

Total Pipeline: 15-20 minutes from commit to production

Result: Deployment frequency increased from quarterly to daily. Deployment failure rate reduced from 28% to 0.3%. Full audit trail in git history.

2. Amazon ECR (Container Registry)

Purpose: Centralized, secure Docker image management with automated scanning and lifecycle policies.

Configuration

Image Scanning:

  • Scan on push enabled for all images
  • Integration with Amazon Inspector for CVE detection
  • Fail pipeline if critical vulnerabilities found
  • Weekly rescans of production images

Image Immutability:

  • Tag immutability enabled to prevent overwrites
  • Semantic versioning enforced (v1.2.3)
  • Production tags cannot be reused

Lifecycle Policies:

  • Keep last 30 tagged images per repository
  • Delete untagged images after 7 days
  • Reduce storage from 800 images ($3K/month) to 120 images ($450/month)
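
The two retention rules above could be expressed as an ECR lifecycle policy like the following — this is the JSON you would pass to `ecr.put_lifecycle_policy(lifecyclePolicyText=...)`. The `v` tag prefix is an assumption matching the semantic-versioning scheme described later:

```python
import json

# Illustrative ECR lifecycle policy: expire untagged images after 7 days,
# keep only the last 30 tagged releases.
policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire untagged images after 7 days",
            "selection": {"tagStatus": "untagged",
                          "countType": "sinceImagePushed",
                          "countUnit": "days", "countNumber": 7},
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": "Keep only the last 30 tagged images",
            "selection": {"tagStatus": "tagged", "tagPrefixList": ["v"],
                          "countType": "imageCountMoreThan", "countNumber": 30},
            "action": {"type": "expire"},
        },
    ]
}
policy_text = json.dumps(policy)
```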

Access Control:

  • IAM policies restrict ECR push to CI/CD pipelines only
  • EKS pods use IRSA for pull access
  • Cross-account access for DR region

Result: Zero CVE incidents in production, 85% storage cost reduction, full image lineage for compliance.

3. Amazon EKS (Kubernetes Orchestration)

Purpose: Production-grade container orchestration for Airflow, Kafka consumers, and custom data processing jobs.

Workloads

Apache Airflow:

  • 2 scheduler pods (HA)
  • 5-10 worker pods (auto-scaled)
  • 1 webserver
  • Resource requests: 4 CPU, 8 GB RAM per scheduler

Kafka Consumers:

  • 5-10 replicas processing transaction stream
  • Auto-scaled by HPA based on consumer lag
  • 2 CPU, 4 GB RAM per pod

Data Validation Jobs:

  • 3-5 scheduled pods running Great Expectations
  • 1 CPU, 2 GB RAM each

Security

  • RBAC policies with minimal permissions per namespace
  • Pod security standards (no privileged containers)
  • Network policies restrict pod-to-pod communication
  • Secrets managed via AWS Secrets Manager with IRSA

Result: 99.99% uptime for Airflow, seamless scaling during peak loads, infrastructure costs optimized through right-sizing.

4. ArgoCD (GitOps Deployment)

Purpose: Declarative, git-driven application deployment with continuous reconciliation and automated rollback.

Deployment Flow

  1. Developer updates Helm values in git (change image tag)
  2. Commits to main branch
  3. ArgoCD detects git change within 3 minutes
  4. Computes diff between desired and actual state
  5. If auto-sync enabled, deploys new pods in parallel (blue-green)
  6. Waits for health checks (readiness probes)
  7. Gradually shifts traffic to new pods
  8. Automatically rolls back if health checks fail

Benefits

  • Deployment time: under 2 minutes (vs 4+ hours manual)
  • Error rate: 0.3% (vs 28% manual)
  • Rollback time: 30 seconds (git revert + commit)
  • Full audit trail in git (who, what, when, why)

Self-Healing

  • ArgoCD continuously monitors cluster state
  • Automatically recreates deleted pods
  • Reverts manual changes to match git
  • Alerts on drift detection

Result: Zero-downtime deployments, instant rollback capability, immutable deployment history.

5. Infrastructure as Code (Terraform)

Purpose: Version-controlled, reproducible infrastructure provisioning across all AWS resources.

Terraform Modules

Networking Module:

  • VPC with public/private subnets
  • NAT gateways
  • Security groups
  • VPC peering for cross-region DR

Data Lake Module:

  • S3 buckets with encryption
  • Lifecycle policies
  • IAM policies for cross-service access

EMR Serverless Module:

  • EMR application configuration
  • IAM roles for Glue and S3
  • CloudWatch log groups

Redshift Module:

  • Cluster configuration
  • Parameter groups
  • Subnet groups
  • Snapshot schedules

EKS Module:

  • Cluster creation
  • Node groups with auto-scaling
  • IAM roles for service accounts (IRSA)

State Management

  • Terraform state stored in S3 with DynamoDB locking
  • State file encrypted
  • Versioning enabled for rollback
  • Separate state files per environment

Deployment Process

  1. terraform plan generates preview of changes
  2. Review in pull request
  3. terraform apply executed via CI/CD
  4. Drift detection runs weekly via AWS Config

Result: Infrastructure changes in minutes vs days, consistent environments (dev = staging = prod), disaster recovery environment rebuilt in under 1 hour.

6. Observability: CloudWatch, VPC Flow Logs

CloudWatch Logs

  • Airflow task logs automatically streamed
  • EMR Spark driver/executor logs centralized
  • Glue job logs with job run ID tagging
  • EKS pod logs via Fluent Bit DaemonSet
  • Retention: 90 days for operational logs, 7 years for audit logs

CloudWatch Metrics

Custom metrics for data pipeline health:

  • Records processed
  • Job duration
  • Data quality score
  • EMR Serverless metrics (worker count, task duration)
  • EKS pod metrics (CPU, memory, restarts)
  • Redshift query performance (execution time, queue time)

Alerting

  • Pipeline SLA breach alerts (e.g., daily ETL exceeds 60 minutes)
  • Data quality alerts (score below 99%)
  • Resource utilization alerts (Redshift CPU >80%)
  • Error rate thresholds (>5% task failures)

VPC Flow Logs

  • All VPC traffic logged to S3 for security analysis
  • Used to diagnose connectivity issues
  • Analyzed for anomaly detection
  • Retention: 30 days

Result: MTTR reduced from 2.5 hours to under 15 minutes. Proactive issue detection before user impact. Complete audit trail for compliance.

BEST PRACTICES IMPLEMENTED

Data Analytics Best Practices

1. Layered Data Architecture

  • Separate raw, enriched, and curated zones for clear data processing stages and lineage tracking

2. Partition Strategy

  • Partition by date and region for query performance optimization and cost reduction

3. Columnar Storage

  • Use Parquet format for 10x query performance and 75% storage reduction

4. Data Quality Gates

  • Validate data between pipeline stages, fail fast on quality issues

5. Schema Evolution

  • Use Glue Data Catalog for automatic schema discovery and version tracking

6. Idempotent Pipelines

  • Ensure Spark jobs produce identical results on re-run for safe replay

7. Data Lifecycle Management

  • Automate transitions to cold storage (S3 Glacier) for cost optimization

8. Metadata Management

  • Centralize in Glue Data Catalog for discovery and governance

9. Pre-Aggregation Strategy

  • Create curated datasets at multiple granularities for fast reporting

10. Incremental Processing

  • Process only new data to reduce compute costs and latency

DevOps Best Practices

11. Infrastructure as Code

  • Version all infrastructure in git, treat IaC as production code

12. GitOps Deployment

  • Use git as single source of truth for all deployments

13. Container Immutability

  • Use semantic versioning, prevent tag overwrites in ECR

14. CVE Scanning

  • Scan all images before deployment, fail builds on critical vulnerabilities

15. Secrets Management

  • Use AWS Secrets Manager, never hardcode credentials

16. Least Privilege IAM

  • Scope roles to specific resources and actions, use IRSA for EKS workloads

17. Automated Rollbacks

  • Configure health checks for automatic rollback on deployment failures

18. Blue-Green Deployments

  • Deploy new versions in parallel, instant traffic switching

19. Observability First

  • Comprehensive logging, metrics, and tracing from day one

20. Chaos Engineering

  • Regularly test failure scenarios to ensure resilience

OUTCOMES & METRICS

Data Analytics Improvements

| Metric | Before | After | Improvement |
|---|---|---|---|
| Data Processing Latency | 4–6 hours | 15–20 minutes | 15–18× faster |
| Data Quality Accuracy | 85% | 99.5% | 17% improvement |
| Query Performance (Redshift) | 12 min (CLV query) | 14 seconds | 51× faster |
| Dashboard Load Time | 45+ seconds | <3 seconds | 15× faster |
| Data Lake Storage | 2.5 PB | 2.5 PB | Same data, optimized format |
| Storage Costs | $60K/month | $18K/month | 70% reduction |
| Pipeline Success Rate | 85% | 99.9% | Near-perfect reliability |
| Analytics Availability | 10 AM | 7 AM | 3 hours earlier |

DevOps Improvements

| Metric | Before | After | Impact |
|---|---|---|---|
| Deployment Time | 4–5 hours | <2 minutes | 120–150× faster |
| Deployment Success Rate | 72% | 99.7% | 99% reduction in failures |
| Rollback Time | 1–3 hours | <30 seconds | 100× faster recovery |
| MTTR (Mean Time to Resolution) | 2.5 hours | <15 minutes | 10× faster |
| Infrastructure Provisioning | 2–3 days | 10–15 minutes | Enables rapid environment creation |
| Compliance Audit Time | 2 weeks | 2 days | 7× faster audits |

Cost Optimization

| Category | Before | After | Annual Savings |
|---|---|---|---|
| EMR/Hadoop Infrastructure | $180K/quarter | $52K/quarter | $512K/year |
| Storage (S3 lifecycle) | $720K/year | $216K/year | $504K/year |
| ECR Storage (cleanup) | $36K/year | $5.4K/year | $30.6K/year |
| Redshift vs PostgreSQL | $300K/year | $180K/year | $120K/year |
| Incident Response Labor | $200K/year | $20K/year | $180K/year |
| Compliance Audit Labor | $80K/cycle | $5K/cycle | $225K/year (3 cycles) |
| Total Annual Savings | | | $1.57M/year |

Business Impact

Operational Excellence

  • Reports available 3 hours earlier (7 AM vs 10 AM), enabling timely decision-making.
  • Pipeline reliability improved to 99.9%, reducing business disruptions.
  • Self-service analytics enabled for 50+ business users.

Scalability

  • Platform handles 35% YoY growth without infrastructure upgrades.
  • Peak load processing (Black Friday) succeeds without manual intervention.
  • New market expansion supported without infrastructure changes.

Compliance

  • Complete audit trail for all data transformations.

Innovation

  • Data scientists spend 60% less time on data cleaning, 60% more on analysis.
  • Ad-hoc analytics possible on full historical dataset.
  • New data sources onboarded in days vs months.

Conclusion

Neowise Digital's transformation from a manually-managed, batch-oriented infrastructure to a cloud-native, serverless data analytics platform demonstrates the power of:

Data Analytics Excellence

  • Multi-layer data lake architecture for organized data management.
  • AWS Glue for managed ETL and schema management.
  • EMR Serverless for elastic, cost-effective Spark processing.
  • Amazon Redshift for high-performance analytical queries.
  • Comprehensive data quality framework with automated validation.
  • EMR Spark UI for deep observability into job execution.

DevOps Automation

  • Infrastructure as Code for reproducible environments.
  • GitOps with ArgoCD for zero-downtime deployments.
  • Container orchestration on EKS for scalable workloads.
  • GitLab CI/CD for automated testing and deployment.
  • Comprehensive observability with CloudWatch and Spark UI.

Impact Achieved

  • 99.9% pipeline reliability.
  • 15-18x faster data processing.
  • $1.57M annual cost savings.
  • 100% compliance with financial regulations.
  • 10x faster incident response.

This case study provides a blueprint for fintech companies and other data-intensive organizations seeking to modernize their data analytics infrastructure and accelerate time-to-value for analytics initiatives while maintaining strong DevOps practices for operational excellence.
