ENTERPRISE DATA ANALYTICS PLATFORM FOR FINTECH

2026-01-19
Data Analytics

EXECUTIVE SUMMARY

Neowise Digital operates a high-throughput payment processing platform handling 2M+ daily transactions across 50,000+ merchants globally. The organization faced critical challenges in data ingestion, transformation, storage, and analytics delivery, including fragmented data pipelines, inconsistent data quality, lack of centralized data governance, and manual deployment processes.

This case study outlines how a data analytics-focused architecture leveraging AWS EMR Serverless, AWS Glue, Amazon S3, Amazon Redshift, and containerized orchestration on EKS with GitOps practices resolved data processing bottlenecks, enabled scalable ETL workflows, and delivered enterprise-grade analytics capabilities.

Key Outcomes:

·        Data Processing Latency: Reduced from 4-6 hours (batch) to 15-20 minutes (near real-time)

·        Data Quality: Improved from 85% accuracy to 99.5% through automated validation

·        Analytics Query Performance: 50x improvement through Redshift optimization

·        Deployment Time: Reduced from 4+ hours to <2 minutes via GitOps

·        Cost Optimization: 40% savings through serverless EMR and lifecycle policies

·        Data Lake Storage: 2.5 PB of structured financial data with 99.99% durability

Data Types and Sources

Neowise processes diverse financial data types requiring different processing strategies:

1. Transactional Data (Structured)

·        Volume: 2M records/day (15 GB/day)

·        Schema: Transaction ID, merchant ID, customer ID, amount, currency, timestamp, status, metadata

·        Source: Apache Kafka (real-time stream)

·        Retention: 7 years (regulatory requirement)

·        Format: JSON in Kafka → Parquet in S3 → Redshift tables

2. Customer Profile Data (Semi-Structured)

·        Volume: 50K updates/day

·        Schema: Customer demographics, account details, preferences, nested address objects, notification settings

·        Source: PostgreSQL databases (20+ instances)

·        Complexity: Nested JSON fields requiring flattening (addresses array, preferences object)

·        Format: PostgreSQL → change data capture via DMS → S3 (JSON) → Glue ETL → Parquet

3. Merchant Metadata (Structured)

·        Volume: 50,000 merchants, daily updates

·        Schema: Business details, categories, locations, payment terms, revenue metrics

·        Source: MySQL operational database

·        Format: MySQL → Glue Crawler → S3 → Redshift dimension tables

4. Historical Transaction Archives (Unstructured)

·        Volume: 2.5 PB accumulated over 5 years

·        Format: CSV files on SFTP servers (legacy)

·        Challenge: Inconsistent schemas across years, missing values, encoding issues (UTF-8/Latin1 mix)

·        Processing: One-time migration + transformation to standardized Parquet

5. External Risk Intelligence (Semi-Structured)

·        Volume: API responses, 50K lookups/day

·        Format: JSON with nested arrays (risk factors, historical incidents)

·        Source: Third-party REST APIs (fraud databases, KYC services)

·        Latency: 500-2000ms per API call

·        Processing: Lambda ingestion → S3 → Glue transformation

6. Log and Event Data (Unstructured)

·        Volume: 500 GB/day

·        Types: Application logs, infrastructure logs, audit trails

·        Format: Plain text, JSON logs

·        Storage: CloudWatch Logs → S3 (long-term archival)

Existing Technology Stack (Legacy)

Component         | Technology                                | Status
------------------|-------------------------------------------|------------------------------------------
Data Ingestion    | Apache Kafka (on‑premises, 5 clusters)    | Siloed, no monitoring
Batch Processing  | Hadoop (8‑node on‑premises)               | Slow, manual scaling
Data Storage      | PostgreSQL (20+ instances), CSV on SFTP   | Fragmented, inconsistent schemas
Data Warehouse    | On‑premises PostgreSQL                    | Cannot handle analytical queries at scale
ETL Orchestration | Apache Airflow (dev only) + shell scripts | Limited orchestration, manual execution
Data Quality      | Manual SQL checks                         | No automated validation framework
Deployment        | Manual SSH + shell scripts                | Error‑prone, 4+ hours per release

PAIN POINTS: DATA ANALYTICS PERSPECTIVE

1: Data Pipeline Fragmentation & Quality Issues

Data exists in 7+ disconnected systems with no unified ingestion, transformation, or quality validation framework. Inconsistent data schemas, missing values, and lack of standardization prevent reliable analytics.

Impact:

·        Data Inconsistency: 15% of transactions missing merchant category codes due to failed joins

·        Analytics Delays: Data scientists spend 60% of time on data cleaning vs. analysis

·        Report Inaccuracy: Monthly revenue reports require 2-3 iterations due to data quality issues

·        Storage Waste: 400 GB of duplicate records across systems costing $12K annually

·        Root Cause: 

o   No centralized data lake, no standardized ETL framework, no automated data quality validation, no schema registry for consistency.

2: Scalability & Performance Bottlenecks

On-premises Hadoop cluster cannot scale for growing data volumes. Batch processing takes 4-6 hours, preventing timely analytics. No auto-scaling, no spot instance optimization.

Impact:

·        Delayed Analytics: Business reports available by 10 AM instead of the 7 AM target

·        Scalability Ceiling: Cannot handle 35% YoY growth without hardware procurement (9-month lead time)

·        Peak Load Failures: Black Friday batch jobs timeout after 8 hours, causing 12-hour delays

·        Cost Inefficiency: $180K/quarter in on-premises hardware with <40% average utilization

·        Innovation Blocked: Data scientists cannot experiment with ad-hoc queries on full dataset due to resource constraints

Root Cause:

·        No elastic compute (EMR Serverless), no distributed processing framework at scale, no auto-scaling for peak loads, no spot instance utilization.

3: Data Storage & Query Performance

No purpose-built data warehouse for analytics. Using PostgreSQL for both operational and analytical workloads causes resource contention. No columnar storage, no query optimization, no partitioning strategy.

PostgreSQL Warehouse (On-Premises):

·        Storage: 8 TB transaction data, row-based storage (inefficient for analytics)

·        Query Performance:

o   Simple aggregation (daily revenue): 45 seconds

o   Complex join (customer lifetime value): 12 minutes

o   Historical trend analysis (6-month): Timeout after 30 minutes

o   Month-end reporting: 4+ hours

·        Concurrency: 5 simultaneous users before performance degradation

·        Data Access Pattern: 80% of queries scan full tables due to no partitioning

·        Indexing: Over-indexed (47 indexes per table) causing slow writes

Impact:

·        Slow Dashboards: Executive dashboards timeout during business hours, requiring off-peak scheduling

·        Resource Contention: Operational queries delayed by 30+ seconds when analytical workloads run

·        Limited Analytics: Cannot run complex queries on full historical data (5+ years)

·        Cost Inefficiency: Over-provisioned database instances ($25K/month) to handle occasional heavy queries

·        User Frustration: Analysts wait 20+ minutes for query results, reducing productivity by 40%

Root Cause: 

No columnar data warehouse (Redshift), no data partitioning strategy, no separation of analytical and operational workloads, no query result caching.

4: Data Transformation & ETL Complexity

ETL processes implemented as fragile shell scripts with no dependency management, error handling, or data lineage tracking. Complex nested JSON structures not properly flattened for analytics.

Customer Profile Data Challenges

Source Data Structure:

·        Customer records contain nested personal information objects, an array of multiple addresses (billing, shipping, historical), and deeply nested preference settings including notification configurations. This multi-level nesting creates significant challenges for analytical queries.

Current Processing Issues:

·        Nested addresses array contains 1-5 addresses per customer but not flattened, preventing joins with transaction data on specific address types

·        Preferences and notification settings are 3 levels deep, making it impossible to query "all customers who opted into email notifications"

·        No standardized field extraction logic across different data sources

·        Manual Python scripts fail on malformed JSON (12% error rate)

Manual ETL Process:

The current process exports JSON from PostgreSQL (20 minutes), runs Python scripts to flatten the JSON structures (40 minutes, with frequent failures on schema variations), writes results to CSV (10 minutes), and loads the warehouse (30 minutes). Total time is approximately 100 minutes with a 12% failure rate.

Impact:

·        Data Quality: 12% of customer records fail flattening due to schema variations

·        Maintenance Burden: ETL scripts require weekly updates for new schema changes

·        No Reusability: Each data source has custom flattening logic, creating 40+ disparate scripts

·        Debugging Difficulty: No lineage tracking when data quality issues arise, taking 2-3 hours per investigation

·        Incomplete Analytics: Cannot analyze customer preferences due to unflattened nested structures

Root Cause:

·        No managed ETL service (Glue), no standardized data transformation framework, no automated schema evolution handling, no data quality checkpoints.

5: Monitoring & Observability for Data Pipelines

No visibility into data pipeline health, job execution status, or data quality metrics. Failures discovered hours after occurrence through user reports rather than proactive monitoring.

Monitoring Gaps:

Component          | Current State                      | Impact
-------------------|------------------------------------|--------------------------------------------------------------------------
ETL Jobs           | No execution logs, manual checking | Cannot debug failed jobs, 3–4 hour MTTR
Data Quality       | Manual SQL checks run weekly       | Issues discovered days later, affecting 50+ downstream reports
Pipeline Latency   | No metrics tracked                 | SLA breaches undetected, discovered by business users
Data Completeness  | No validation of row counts        | Missing data in reports (discovered monthly during reconciliation)
EMR Jobs           | Logs on HDFS (not centralized)     | Cannot correlate failures across jobs, no Spark UI access after job completion
Kafka Consumer Lag | Manual checks via CLI              | Lag spikes go unnoticed, causing hours of data delays

Impact:

·        High MTTR: 3-4 hours to diagnose ETL failures due to log searching across 15+ systems

·        Silent Data Loss: Missing records discovered during monthly reconciliation, too late to recover

·        No Proactive Alerts: Pipeline failures detected by downstream users filing tickets, not by monitoring

·        Compliance Risk: Cannot produce audit trail of data transformations for regulatory reviews

·        Resource Waste: Cannot identify bottlenecks in Spark jobs, leading to over-provisioning

Root Cause:

·        No centralized logging (CloudWatch Logs), no EMR Spark UI persistence, no custom metrics for data quality, no distributed tracing of data flow.

Proposed Architecture:

The redesigned architecture implements a cloud-native, serverless data analytics platform with:

Unified Data Lake: S3-based multi-layer architecture (raw/enriched/curated)

Managed ETL: AWS Glue for scalable data transformation and cataloging

Serverless Processing: EMR Serverless for large-scale Spark workloads with auto-scaling

Data Warehouse: Amazon Redshift for high-performance analytical queries

Orchestration: Apache Airflow on EKS for pipeline coordination

Data Governance: Glue Data Catalog for centralized metadata and schema management

Observability: EMR Spark UI with persistent logs, CloudWatch metrics, and custom data quality dashboards

SOLUTION COMPONENTS:

1. Multi-Layer Data Lake Architecture (Amazon S3)
Organize data by processing stage with clear separation of concerns, enabling efficient data governance and lifecycle management.

Layer 1: Raw Data (Immutable Source Data)

Characteristics:

·        Immutability: Never modified after ingestion, preserving original data for audit and replay

·        Format: Original format preserved (JSON, CSV, Parquet as-is from source)

·        Retention: 7 years (regulatory requirement for financial data)

·        Partitioning: By ingestion date for efficient lifecycle management

·        Compression: gzip compression to reduce storage costs by 70%

·        Size: 2.5 PB total, 8 TB daily ingestion

Data Organization:

The raw zone is organized into subdirectories by source system. Transaction data is partitioned by year, month, and day with hourly files. Customer profile data uses snapshot date partitioning. Merchant metadata includes extract date partitioning. Historical archives are organized by year ranges for legacy data migration.

Storage Policies:

Lifecycle management transitions data to S3 Glacier after 90 days for cost optimization (85% cost reduction). Data is retained for 7 years to meet audit requirements and regional financial regulations. Versioning is enabled to protect against accidental deletion. All data is encrypted at rest using SSE-S3 (AWS-managed keys).

Cost Optimization

The raw zone implements intelligent tiering: Standard storage for 30 days (frequent access for debugging), Standard-IA for days 31-90 (occasional access), and Glacier for 91+ days (compliance retention). This strategy reduces storage costs from $60K/month to $18K/month, a 70% reduction.
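The tiering schedule above maps directly onto a standard S3 lifecycle configuration. The fragment below is an illustrative sketch (the raw/ prefix and rule ID are assumptions); the transition days mirror the 30/90-day tier boundaries, and the expiration matches the 7-year (2,555-day) retention described above.

```json
{
  "Rules": [
    {
      "ID": "raw-zone-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "raw/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 2555}
    }
  ]
}
```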

Layer 2: Enriched Zone (Cleaned & Standardized)

Characteristics:

·        Format: Parquet with Snappy compression (columnar storage optimized for analytics)

·        Partitioning: By date and region for optimal query performance

·        Schema: Standardized, consistent across all sources

·        Compression: Snappy compression (balance between speed and size, 4:1 compression ratio)

·        Data Quality: Validated, deduplicated, flattened, type-converted

·        Size: 800 TB (67% reduction from raw through compression and deduplication)

Data Transformation Techniques:

·        JSON Flattening for Nested Structures:

o   The enriched layer applies systematic flattening of nested JSON structures from customer profiles. Source data contains multi-level nesting with personal information objects, address arrays containing multiple addresses, and preference objects with nested notification settings.

o   The flattening process extracts nested personal information fields to top-level columns. Address arrays are exploded and pivoted to create separate columns for billing and shipping addresses (billing_city, billing_street, shipping_city, shipping_street). Deeply nested preference settings are flattened to individual boolean columns (email_notifications_enabled, sms_notifications_enabled).

·        Result: Queries that previously required complex JSON parsing functions now use simple column references. Query performance improved 40x for customer profile joins. Data is now compatible with BI tools that do not support nested structures.
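As a minimal illustration of this flattening approach, the Python sketch below recursively lifts nested objects into top-level columns and pivots an address array into per-type columns. Field names are assumed for illustration; the production pipeline runs the equivalent logic in Glue Spark jobs.

```python
def flatten(record, prefix="", sep="_"):
    """Recursively flatten nested dicts: {"a": {"b": 1}} -> {"a_b": 1}."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out

def pivot_addresses(addresses):
    """Pivot an address array into per-type columns (billing_city, shipping_city, ...)."""
    cols = {}
    for addr in addresses:
        addr_type = addr["type"]
        for field, value in addr.items():
            if field != "type":
                cols[f"{addr_type}_{field}"] = value
    return cols
```

With this shape, "all customers who opted into email notifications" becomes a filter on a single flattened boolean column instead of a three-level JSON traversal.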

·        Data Type Standardization & Casting:

o   Challenge: Source systems use inconsistent data types: transaction amounts stored as strings with currency symbols ($150.00), integers (150), or doubles (150.50). Dates appear as ISO strings (2025-01-02), US format (01/02/2025), or Unix timestamps.

o   Transformation Process:

§  Currency amounts are cleaned by removing dollar signs and commas, then cast to decimal(18,2) for precision. All date fields are standardized to ISO 8601 timestamp format (yyyy-MM-dd HH:mm:ss). Currency codes are uppercased and trimmed (USD, usd, Usd → USD). Boolean fields from various representations (true/false, 1/0, Y/N) are standardized to boolean type.

·        Result: Consistent data types enable proper aggregations (SUM, AVG work correctly on decimal amounts), date arithmetic works reliably, and JOIN operations succeed based on matching formats.
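The casting rules above can be sketched in plain Python (the production versions run as Spark expressions; rounding mode and accepted date formats here are assumptions):

```python
from datetime import datetime, timezone
from decimal import Decimal, ROUND_HALF_UP

def clean_amount(raw):
    """Strip currency symbols and commas, cast to decimal with 2-digit precision."""
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def normalize_date(raw):
    """Accept ISO strings, US-format strings, or Unix timestamps; emit ISO 8601."""
    if isinstance(raw, (int, float)):
        return datetime.fromtimestamp(raw, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_bool(raw):
    """Map true/false, 1/0, Y/N variants to a real boolean."""
    return str(raw).strip().lower() in ("true", "1", "y", "yes")
```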

·        Deduplication Strategy:

o   Source systems produce duplicate records due to Kafka retries (2-3% duplication rate), change data capture capturing the same change multiple times, and overlapping batch extracts.

o   Deduplication Logic:

§  Duplicates are identified using composite keys (transaction_id + merchant_id + timestamp). When duplicates exist, the latest record is retained based on ingestion timestamp. A window function partitions by composite key and orders by ingestion timestamp descending, keeping only row number 1.

·        Result: 2.8% of records identified as duplicates and removed. Storage savings of 70 TB in enriched layer. Analytics queries return correct counts without manual DISTINCT operations.
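The keep-latest-per-composite-key logic is implemented in production as a Spark window function (partition by key, order by ingestion timestamp descending, keep row 1); a minimal plain-Python equivalent, with field names assumed for illustration:

```python
def deduplicate(records):
    """Keep the latest record per composite key, judged by ingestion timestamp."""
    latest = {}
    for rec in records:
        key = (rec["transaction_id"], rec["merchant_id"], rec["timestamp"])
        # Retain the record with the highest ingestion timestamp for each key
        if key not in latest or rec["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = rec
    return list(latest.values())
```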

·        Data Quality Validation & Cleansing

o   Validation Rules Implemented:

§  Column value ranges are enforced (transaction amounts between 0.01 and 1,000,000). Required fields are validated as not null (customer_id, merchant_id, transaction_date). Valid enumerations are checked (currency must be in USD, EUR, GBP, JPY, CAD). Business logic validation ensures transaction_date is after customer_registration_date.

o   Data Cleansing Actions:

§  Invalid records are quarantined to a separate S3 path for manual review. Correctable issues are auto-fixed (trim whitespace, uppercase currency codes). Critical violations cause job failure and alert data engineering team. Data quality metrics are published to CloudWatch for monitoring.

·        Result: Data quality improved from 85% to 99.5%. Invalid data reduced from 15% to 0.5%. Business users trust analytics without manual verification.
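A minimal sketch of the validation and quarantine split described above (the concrete rule set and field names are taken from the examples in the text; error-message wording is an assumption):

```python
VALID_CURRENCIES = {"USD", "EUR", "GBP", "JPY", "CAD"}
REQUIRED_FIELDS = ("customer_id", "merchant_id", "transaction_date")

def validate(record):
    """Return a list of rule violations; an empty list means the record is clean."""
    errors = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")
    amount = record.get("amount")
    if amount is not None and not (0.01 <= amount <= 1_000_000):
        errors.append(f"amount out of range: {amount}")
    # Currency codes are uppercased first (an auto-fixable issue per the text)
    if record.get("currency", "").strip().upper() not in VALID_CURRENCIES:
        errors.append(f"invalid currency: {record.get('currency')}")
    return errors

def partition_records(records):
    """Split records into (valid, quarantined), mirroring the quarantine S3 path."""
    valid, quarantined = [], []
    for rec in records:
        (quarantined if validate(rec) else valid).append(rec)
    return valid, quarantined
```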

·        Column Renaming & Standardization

o   Source systems use inconsistent naming conventions: customer_id vs customerId vs cust_id, transaction_date vs txn_date vs date_of_transaction.

o   Standardization Rules:

§  All column names converted to snake_case (customer_id, transaction_date). Abbreviations expanded to full words (txn → transaction, cust → customer). Standard naming conventions applied (all dates suffixed with _date, all IDs suffixed with _id).

·        Result: Cross-source joins simplified. Data catalog is self-documenting. New team members understand data without extensive documentation.
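The renaming rules can be sketched as a small normalizer (the abbreviation map below covers only the examples given in the text):

```python
import re

ABBREVIATIONS = {"txn": "transaction", "cust": "customer"}

def to_snake_case(name):
    """camelCase/PascalCase -> snake_case, then expand known abbreviations."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()
    parts = [ABBREVIATIONS.get(part, part) for part in snake.split("_")]
    return "_".join(parts)
```

Applied across a schema, customer_id, customerId, and cust_id all converge on customer_id, so cross-source joins no longer need per-source aliasing.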

·        Enriched Layer Structure:

o   The enriched layer contains transactions partitioned by year, month, day, and region. Customer profiles are flattened with snapshot date partitioning. Merchant dimensions are maintained as slowly changing dimension type 2. Risk intelligence data is normalized from nested JSON to relational format.

Layer 3: Curated Zone (Business-Ready Analytics)

Characteristics:

·        Purpose: Pre-aggregated, business-logic applied datasets optimized for consumption

·        Consumers: BI tools (Tableau, QuickSight), Redshift, data science teams

·        Schema: Denormalized for query performance, star schema design

·        Refresh Frequency: Daily for batch aggregations, hourly for near-real-time metrics

·        Size: 120 TB (highly aggregated from enriched layer)

Data Aggregation Techniques:

AWS Modern Data Platform Architecture

DATA IMPLEMENTATION

1. Data Lake Architecture & Curated Layer

A. Customer Analytics Aggregations

Customer Lifetime Value (CLV) Calculation

The curated layer pre-computes customer lifetime metrics including:

  • Total transaction count
  • Total transaction value
  • Average transaction value
  • First transaction date
  • Most recent transaction date
  • Customer tenure in days
  • Transaction frequency (transactions per month)

Window Functions for Time-Series Analysis

  • Running totals: Calculated using cumulative sum windows partitioned by customer
  • Moving averages: Use 30-day and 90-day rolling windows for trend analysis
  • Rank functions: Identify top customers by revenue within each region and merchant category
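The running-total and moving-average windows reduce to cumulative sums and rolling means over date-ordered daily values; a minimal sketch:

```python
from collections import deque
from itertools import accumulate

def running_totals(values):
    """Cumulative sum, as a window function ordered by date would produce."""
    return list(accumulate(values))

def moving_average(values, window):
    """Rolling mean over a fixed window (e.g. 30 or 90 days of daily totals)."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```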

Result: CLV queries that took 12 minutes on enriched data now return in under 2 seconds from curated layer. Dashboards load instantly with pre-aggregated metrics.

B. Merchant Analytics & KPIs

Merchant Performance Metrics

Pre-aggregated merchant KPIs include:

  • Daily/weekly/monthly revenue totals
  • Transaction count by merchant and category
  • Average transaction value by merchant
  • Refund rates and chargeback rates
  • Customer acquisition counts (new vs returning)

Trend Analysis

  • Month-over-month growth rates calculated and stored
  • Year-over-year comparisons pre-computed for seasonal analysis
  • Merchant rank by revenue updated daily for leaderboards

Result: Merchant dashboards load 50x faster. Complex trend queries execute in under 3 seconds vs 2+ minutes previously.

C. Time-Series Rollups & Aggregations

Multi-Granularity Aggregations

Transaction data aggregated at multiple time granularities:

  • Hourly rollups for real-time monitoring
  • Daily summaries for operational reporting
  • Weekly and monthly aggregations for executive reporting
  • Yearly summaries for long-term trend analysis

Partitioning Strategy

Each granularity level partitioned differently:

  • Hourly data: partitioned by date and hour
  • Daily summaries: date partitioning only
  • Monthly aggregations: year and month partitioning

Result: Reports select the appropriate granularity level for their timeframe. Query performance improved 80x by avoiding re-aggregation.

D. Compliance & Regulatory Reporting Datasets

Audit Trail Datasets

  • Point-in-time snapshots of customer data for regulatory compliance
  • Transaction audit trails capture all state changes
  • Data deletion logs track right-to-be-forgotten requests

Regulatory Report Tables

Pre-formatted datasets match regulatory reporting requirements:

  • Anti-money laundering transaction monitoring

Result: Regulatory report generation reduced from 40 hours to 2 hours per quarter. Audit requests fulfilled in minutes vs days.

Curated Layer Structure

The curated zone contains:

  • Customer analytics (aggregated metrics by customer)
  • Merchant analytics (KPIs and trends)
  • Daily transaction summaries (pre-aggregated facts)
  • Revenue reports (business intelligence datasets)
  • Compliance snapshots (regulatory reporting data)

2. AWS Glue ETL & Data Transformation

Purpose: Managed ETL service for scalable data transformation, schema evolution, and data quality management.

Glue Job Architecture

Job Types Implemented

1. Glue Python Shell Jobs (Lightweight)

Used for small-scale transformations:

  • API data ingestion and lightweight formatting
  • Metadata updates to Glue Catalog
  • Data quality report generation
  • Simple CSV to Parquet conversions

Configuration: 1 DPU (Data Processing Unit), runs in under 5 minutes, cost-effective for small data volumes (<1 GB).

2. Glue Spark Jobs (Distributed Processing)

Used for large-scale transformations:

  • Flattening nested JSON from customer profiles (50K records/day)
  • Joining transactions with customer and merchant data
  • Deduplication across 2M+ daily transactions
  • Complex aggregations for curated layer

Configuration: 10-50 DPUs with auto-scaling, processes 8 TB daily data in 15-20 minutes, Spark 3.3 with Glue 4.0 runtime.

3. Glue Streaming Jobs (Real-Time)

Used for near-real-time processing:

  • Kafka to S3 ingestion with micro-batching (5-minute windows)
  • Real-time data quality validation
  • Streaming aggregations for monitoring dashboards

Configuration: Continuous run with checkpoint management, processes 50K events per second during peak, writes to enriched layer every 5 minutes.

Glue Data Catalog Integration

Purpose: Centralized metadata repository for all data assets, enabling schema discovery and query federation.

Catalog Organization

Databases:

  • raw_zone: schemas for source data as-is
  • enriched_zone: standardized, cleaned schemas
  • curated_zone: aggregated, business-ready schemas
  • external_sources: third-party API data

Tables:

Each S3 prefix registered as a table:

  • fact_transactions: 400M+ rows, partitioned by date and region
  • dim_customers: 5M rows, SCD type 2
  • dim_merchants: 50K rows
  • customer_analytics: pre-aggregated metrics

Partitions:

Tables partitioned to enable partition pruning for faster queries:

  • Transaction tables have 18,000+ partitions (5 years × 365 days × 10 regions)
  • Glue Crawler automatically discovers new partitions daily
  • Partition projection used for predictable date-based partitions to avoid crawler overhead

Schema Evolution Handling

  • Glue automatically detects schema changes during crawls
  • New columns added without breaking existing queries
  • Column type changes trigger data engineering alerts
  • Schema versions tracked for audit trail

Result: Athena queries automatically use Glue Catalog for schema information. Redshift Spectrum federated queries access data lake through catalog. No manual schema management required.

Data Quality Framework

Glue Data Quality Rules

Built-in Glue Data Quality feature validates data during ETL:

  • Completeness rules: check for null values in required fields
  • Uniqueness rules: validate primary key constraints
  • Statistical rules: detect anomalies in distributions
  • Referential integrity rules: ensure foreign key relationships

Great Expectations Integration

Custom data quality checks implemented using Great Expectations:

  • Schema validation: ensures all expected columns exist
  • Value range checks: validate business logic (amounts > 0)
  • Distribution checks: detect data skew or anomalies
  • Custom rules: validate complex business logic

Quality Metrics Dashboard

  • Data quality scores published to CloudWatch for each table
  • Quality trends tracked over time
  • Alerts trigger when quality drops below 99% threshold
  • Quality reports generated daily for data stewards

Result: Data quality issues detected immediately vs days later. Automated remediation for common issues. Trust in analytics increased across organization.

Job Orchestration via Airflow

DAG Structure

Glue jobs orchestrated by Apache Airflow running on EKS. A typical daily ETL DAG includes:

  1. Trigger Glue Crawler to discover new source data
  2. Run Glue ETL jobs to process raw to enriched
  3. Execute data quality validation
  4. Trigger EMR Serverless jobs for curated aggregations
  5. Load Redshift from curated layer
  6. Send notifications on success or failure

Dependency Management

  • Tasks wait for upstream completion
  • Glue jobs run in parallel where possible (10+ concurrent jobs)
  • Failed jobs retry 3 times with exponential backoff
  • Critical path jobs have priority scheduling
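Airflow exposes the retry policy above through task-level retry settings; as a framework-independent illustration, 3 attempts with exponential backoff reduce to (the injectable sleep parameter is for testability and is an assumption of this sketch):

```python
import time

def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a failing job up to max_attempts times, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure to the scheduler
            sleep(base_delay * 2 ** (attempt - 1))
```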

Result: End-to-end pipeline orchestration with clear visibility. Pipeline runtime reduced from 5.5 hours to 15-20 minutes.

3. EMR Serverless for Large-Scale Spark Processing

Purpose: Execute distributed Spark workloads for complex transformations and aggregations without managing clusters.

EMR Serverless Architecture

Application Configuration

Pre-Initialized Capacity:

  • Minimum workers: 5 (warm pool for fast job startup)
  • Maximum workers: 200 (auto-scales based on workload)
  • Worker type: 4 vCPU and 16 GB RAM per worker
  • Startup time: Sub-minute (vs 15 minutes for on-demand EMR clusters)
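These capacity settings correspond roughly to an EMR Serverless application definition. The sketch below is illustrative only (application name and release label are assumptions) and follows the shape of the CreateApplication request; the maximum capacity reflects 200 workers at 4 vCPU / 16 GB each:

```json
{
  "name": "neowise-analytics",
  "releaseLabel": "emr-6.10.0",
  "type": "SPARK",
  "initialCapacity": {
    "DRIVER":   {"workerCount": 1, "workerConfiguration": {"cpu": "4 vCPU", "memory": "16 GB"}},
    "EXECUTOR": {"workerCount": 5, "workerConfiguration": {"cpu": "4 vCPU", "memory": "16 GB"}}
  },
  "maximumCapacity": {"cpu": "800 vCPU", "memory": "3200 GB"},
  "autoStopConfiguration": {"enabled": true, "idleTimeoutMinutes": 5}
}
```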

Auto-Scaling Behavior:

  • Workers scale up when Spark tasks are pending
  • Scale down after 5 minutes of idle time
  • Scale proportional to data volume being processed
  • During Black Friday, system scaled from 5 to 180 workers automatically

Cost Optimization:

  • EMR Serverless charges only for resources used during job execution
  • No charges for idle time between jobs
  • 70% cost reduction vs running persistent EMR clusters
  • Monthly cost reduced from $45K (on-demand EMR) to $13K (serverless)

Spark Job Types & Scheduling

Job Category 1: Customer Aggregations

Purpose: Calculate customer-level metrics for analytics and business intelligence.

Processing Logic:

  • Aggregate total transaction value per customer
  • Count unique transactions per customer
  • Calculate average transaction value
  • Identify first and last transaction dates
  • Compute customer lifetime value metrics
  • Calculate transaction frequency patterns

Data Volume:

  • Processes 2M transactions daily
  • Joins with 5M customer records
  • Outputs 800K customer metrics records

Schedule:

  • Runs daily at 2 AM (off-peak hours)
  • Duration: 8-12 minutes with auto-scaling
  • Outputs written to s3://neowise-analytics/curated/customer_analytics/

Result: Customer dashboards updated by 7 AM, roughly 30x faster than previous Hadoop processing (from 6 hours to 12 minutes).

Job Category 2: Merchant Performance Analytics

Purpose: Generate merchant KPIs and performance trends for merchant-facing dashboards.

Processing Logic:

  • Aggregate revenue by merchant and time period
  • Calculate transaction counts and average ticket size
  • Compute refund and chargeback rates
  • Identify top-performing products per merchant
  • Analyze customer retention by merchant
  • Generate month-over-month growth metrics

Data Volume:

  • 2M transactions across 50K merchants daily
  • Outputs 50K merchant metric records

Schedule:

  • Runs daily at 3 AM after customer aggregations
  • Duration: 5-8 minutes
  • Outputs to s3://neowise-analytics/curated/merchant_analytics/

Result: Merchant dashboards load instantly, enabling self-service analytics for merchant partners.

Job Category 3: Time-Series Rollups

Purpose: Pre-aggregate data at multiple time granularities for fast reporting.

Processing Logic:

  • Create hourly transaction summaries from minute-level data
  • Roll up hourly data to daily summaries
  • Aggregate daily summaries to weekly and monthly
  • Calculate running totals and moving averages
  • Partition by region for geographic reporting

Data Volume:

  • Input: 15 GB raw transactions daily
  • Output: 2 GB multi-granularity aggregations

Schedule:

  • Hourly jobs for real-time metrics (run at :05 past each hour)
  • Daily rollup jobs at 4 AM
  • Weekly rollups on Monday mornings
  • Monthly rollups on 1st of each month

Result: Reports select optimal granularity, query performance improved 80x, storage efficiency improved through pre-aggregation.

Job Category 4: Complex Join Operations

Purpose: Create denormalized datasets joining transactions with dimensions for analytics.

Processing Logic:

  • Join transactions with customer profiles (5M records)
  • Enrich with merchant metadata (50K merchants)
  • Add risk intelligence data from third-party sources
  • Flatten nested structures from all sources
  • Apply business logic transformations

Data Volume:

  • 2M+ transaction records joined with multiple dimension tables daily

Schedule:

  • Runs daily at 1 AM before other aggregations
  • Duration: 15-18 minutes (longest running job)
  • Outputs to s3://neowise-analytics/enriched/transactions_enriched/

Result: Downstream jobs work with pre-joined data, eliminating repeated join operations and improving overall pipeline efficiency.

EMR Spark UI for Logging & Monitoring

Purpose: Persistent Spark UI provides detailed visibility into job execution, stage performance, and resource utilization for debugging and optimization.

Spark UI Configuration

Persistent History Server:

  • EMR Serverless automatically configures Spark to write event logs to S3 (s3://neowise-analytics/logs/spark-events/)
  • Event logs retained for 30 days for analysis
  • Spark UI accessible through EMR Console even after jobs complete
  • Historical job runs can be analyzed weeks after execution

Logging to CloudWatch:

  • Driver logs capture application-level logging and errors
  • Executor logs show task-level execution details
  • Stderr and stdout streams automatically sent to CloudWatch Logs
  • Logs organized by application ID and job run ID for easy filtering

Key Metrics Captured

Job-Level Metrics:

  • Total job duration and time breakdown by stage
  • Number of tasks executed
  • Data read from S3 and data written to S3
  • Shuffle read and shuffle write volumes
  • Executor memory and CPU utilization

Stage-Level Analysis:

  • Spark UI shows detailed stage DAG visualization
  • Each stage displays task count, duration, and data volume
  • Slow stages identified through duration heatmaps
  • Skewed partitions visible in task duration metrics

Task-Level Details:

  • Individual task execution times
  • Task locality (PROCESS_LOCAL vs NODE_LOCAL vs RACK_LOCAL)
  • Garbage collection time per task
  • Shuffle read/write per task
  • Failures and retry counts

Usage for Debugging

Performance Optimization Example:

  • Issue: Customer aggregation job taking 25 minutes, exceeding 15-minute SLA
  • Analysis via Spark UI: Stage 3 (groupBy operation) taking 18 minutes with severe data skew. Task duration varied from 30 seconds to 15 minutes
  • Solution: Added salting to distribute hot keys evenly
  • Result: Job duration reduced to 9 minutes, meeting SLA
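
The salting fix above can be sketched in a few lines — each row of a hot key is deterministically assigned to one of N sub-keys so its rows spread across partitions instead of piling onto a single straggler task. The salt count and key names are illustrative:

```python
NUM_SALTS = 8

def salted_key(key: str, row_id: int) -> str:
    # Each row of a hot key lands deterministically in one of NUM_SALTS buckets.
    return f"{key}#{row_id % NUM_SALTS}"

# 1,000 rows for a single hot customer now spread across 8 partition keys;
# a first aggregation runs per salted key, then a second pass merges the
# partial results back to the original key.
keys = {salted_key("hot-customer", i) for i in range(1000)}
```

The trade-off is a two-stage aggregation (partial then merge), which is almost always cheaper than one skewed stage.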

Data Quality Investigation Example:

  • Issue: Merchant analytics showing inconsistent record counts
  • Spark UI logs revealed: NullPointerException in 2% of tasks during join operation
  • Root cause: Merchant dimension table had null IDs breaking join
  • Solution: Added null handling in join logic
  • Result: 100% task success rate, accurate record counts

Resource Optimization Example:

  • Spark UI memory metrics: Showed executors with 80% idle memory
  • Action: Reduced over-provisioned worker memory from 16 GB to 8 GB
  • Cost savings: 40% reduction in EMR Serverless costs with no performance impact

Alerting & Monitoring

  • CloudWatch alarms trigger when job duration exceeds thresholds (e.g., customer aggregation > 15 minutes)
  • Task failure rate alerts when >5% of tasks fail
  • Data volume anomaly detection alerts when input data varies by >30% from rolling average
  • Custom metrics track business KPIs (records processed per minute, data quality score)
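
As an illustration of the SLA-breach alarm above, the following builds the parameter set that boto3's `cloudwatch.put_metric_alarm(**params)` accepts. The namespace, metric name, and SNS topic ARN are assumptions; the call itself is shown commented out so the sketch stays self-contained:

```python
# Illustrative CloudWatch alarm for the 15-minute customer-aggregation SLA.
params = {
    "AlarmName": "customer-aggregation-duration-sla",
    "Namespace": "Neowise/DataPipeline",           # assumed custom namespace
    "MetricName": "JobDurationMinutes",
    "Dimensions": [{"Name": "JobName", "Value": "customer-aggregation"}],
    "Statistic": "Maximum",
    "Period": 300,                                 # evaluate in 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 15.0,                             # the 15-minute SLA
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:data-oncall"],  # assumed ARN
}
# boto3.client("cloudwatch").put_metric_alarm(**params)  # not executed here
```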

Result: MTTR for data pipeline issues reduced from 3 hours to 15 minutes. Proactive optimization based on Spark UI metrics. Clear audit trail for compliance and debugging.

Data Storage After Processing

Storage Format

  • All EMR Serverless outputs written in Parquet format with Snappy compression
  • Columnar storage provides 10x faster query performance vs row-based
  • Compression reduces storage by 75% vs uncompressed JSON

Partitioning Strategy

  • Transaction data partitioned by year, month, day, and region
  • Customer analytics partitioned by snapshot_date
  • Merchant analytics partitioned by calculation_date
  • Partitioning enables partition pruning, reducing query scans by 90%
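
The Hive-style layout this partitioning produces can be made explicit with a small helper. In PySpark the same structure falls out of `df.write.partitionBy("year", "month", "day", "region")`; the bucket path below matches the one named earlier in the text:

```python
from datetime import date

def partition_prefix(base: str, d: date, region: str) -> str:
    # Hive-style key=value directories enable partition pruning in Glue/Spark.
    return (f"{base}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/region={region}/")

prefix = partition_prefix("s3://neowise-analytics/enriched/transactions",
                          date(2026, 1, 18), "eu")
# → s3://neowise-analytics/enriched/transactions/year=2026/month=01/day=18/region=eu/
```

A query filtered to one day and one region scans only that directory, which is where the 90% scan reduction comes from.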

Data Retention

  • Enriched layer: 2 years in Standard storage, then Glacier
  • Curated layer: 1 year in Standard-IA, then Glacier
  • Logs and event data: 90 days retention
  • Compliance-required data: 7 years in Glacier
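
The enriched layer's 2-year Standard-to-Glacier transition above could be expressed as an S3 lifecycle rule like the following — the structure matches what `s3.put_bucket_lifecycle_configuration` expects, though the rule ID and prefix are assumptions:

```python
# Illustrative lifecycle rule: enriched data moves to Glacier after 2 years.
lifecycle = {
    "Rules": [{
        "ID": "enriched-to-glacier",
        "Filter": {"Prefix": "enriched/"},          # assumed key prefix
        "Status": "Enabled",
        "Transitions": [{"Days": 730, "StorageClass": "GLACIER"}],  # ~2 years
    }]
}
# s3.put_bucket_lifecycle_configuration(
#     Bucket="neowise-analytics", LifecycleConfiguration=lifecycle)
```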

Data Transparency

All data includes metadata columns:

  • ingestion_timestamp: when data entered system
  • processing_timestamp: when transformation occurred
  • data_source: origin system identifier
  • job_run_id: Glue or EMR job identifier
  • data_quality_score: validation result 0-100
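
Stamping these lineage columns can be sketched as below — in the real jobs this would be a handful of `withColumn` calls in Spark, but the logic is the same. Function and argument names are illustrative:

```python
from datetime import datetime, timezone

def stamp(record: dict, source: str, job_run_id: str, quality: int) -> dict:
    """Attach the five lineage metadata columns to a record."""
    now = datetime.now(timezone.utc).isoformat()
    return {**record,
            "ingestion_timestamp": record.get("ingestion_timestamp", now),
            "processing_timestamp": now,          # when this transform ran
            "data_source": source,                # origin system identifier
            "job_run_id": job_run_id,             # Glue or EMR job identifier
            "data_quality_score": quality}        # validation result, 0-100

row = stamp({"txn_id": "T-1"}, "kafka-transactions", "jr-42", 100)
```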

Result: Complete data lineage from source to consumption. Ability to replay specific time periods. Regulatory audit requirements easily fulfilled.

4. Amazon Redshift Data Warehouse

Purpose: Columnar data warehouse optimized for analytical queries and BI tool integration.

Cluster Configuration

  • Node Type: ra3.4xlarge (128 GB RAM, 12 vCPUs, managed storage)
  • Cluster Size: 4 nodes (512 GB total RAM)
  • Storage: 16 TB managed storage (separated from compute)
  • Concurrency: 50 concurrent queries supported via WLM

Table Design

Fact Table: fact_transactions

  • Contains 400M rows (2 years of transaction history)
  • Distributed by customer_id for efficient joins
  • Sorted by transaction_date for range queries
  • Compressed with AZ64 encoding (native Redshift compression)

Dimension Tables:

  • dim_customers: 5M rows, full copy stored on every node (ALL distribution style)
  • dim_merchants: 50K rows, replicated to all nodes
  • dim_dates: 3,650 rows (10 years), replicated

Aggregation Tables:

  • fact_daily_aggregates: Pre-aggregated daily summaries (2K rows per day)
  • fact_customer_lifetime_value: Customer-level rollups (5M rows)
  • fact_merchant_monthly: Merchant performance by month (600K rows)
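
The fact table design above might translate into DDL along these lines. Column names beyond the keys mentioned in the text are assumptions, and the statement is held as a string rather than executed; note AZ64 applies to numeric/date columns, while character columns use their own encodings:

```python
# Illustrative Redshift DDL for the fact table described above.
fact_transactions_ddl = """
CREATE TABLE fact_transactions (
    transaction_id   VARCHAR(64),
    customer_id      BIGINT        ENCODE az64,
    merchant_id      BIGINT        ENCODE az64,
    amount           DECIMAL(18,2) ENCODE az64,
    currency         CHAR(3),
    transaction_date DATE          ENCODE az64,
    status           VARCHAR(16)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (transaction_date);
"""
```

Distributing on `customer_id` collocates fact rows with customer-keyed joins, and the `transaction_date` sort key is what makes the range queries fast.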

Data Loading Strategy

Incremental Loads:

  • Daily loads from curated layer via COPY command
  • Load only new partitions (yesterday's data)
  • Runs at 5 AM after EMR jobs complete
  • Duration: 5-8 minutes for 2M daily transactions
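
The incremental COPY above can be sketched as a small SQL builder that targets only yesterday's partition. The IAM role ARN and exact curated-layer prefix are assumptions:

```python
from datetime import date, timedelta

def daily_copy_sql(run_date: date) -> str:
    """Build a COPY statement loading only yesterday's Parquet partition."""
    d = run_date - timedelta(days=1)               # yesterday's partition
    prefix = (f"s3://neowise-analytics/curated/transactions/"
              f"year={d.year}/month={d.month:02d}/day={d.day:02d}/")
    return (f"COPY fact_transactions FROM '{prefix}' "
            f"IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "  # assumed ARN
            f"FORMAT AS PARQUET;")

sql = daily_copy_sql(date(2026, 1, 19))
```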

Full Refresh Tables:

  • Dimension tables fully refreshed weekly
  • Uses DELETE then INSERT pattern
  • Maintains slowly changing dimension history

Query Performance

Before Redshift (PostgreSQL):

  • Daily revenue query: 45 seconds
  • Customer lifetime value: 12 minutes
  • 6-month trend analysis: timeout (30+ minutes)

After Redshift:

  • Daily revenue query: 0.8 seconds (56x faster)
  • Customer lifetime value: 14 seconds (51x faster)
  • 6-month trend analysis: 18 seconds (100x faster, previously timeout)

Result: Interactive analytics possible, dashboards load in under 3 seconds, 50+ concurrent users supported.

5. Apache Airflow Orchestration (EKS)

Purpose: Coordinate data pipeline execution across Glue, EMR, Redshift, and validation steps.

Airflow Deployment on EKS

Configuration:

  • 2 scheduler pods for high availability
  • 5-10 worker pods with auto-scaling based on task queue
  • 1 webserver pod for UI access
  • PostgreSQL backend for metadata (RDS Aurora)

DAG Structure: Daily Analytics Pipeline

Stage 1: Data Discovery (6:00 AM)

  • Trigger Glue Crawler to discover new partitions in raw zone
  • Update Glue Data Catalog with new schemas
  • Validate all expected data files present

Stage 2: Data Transformation (6:10 AM)

  • Run Glue ETL jobs in parallel (10+ concurrent jobs)
  • Process raw to enriched layer
  • Apply flattening, deduplication, standardization
  • Duration: 10-12 minutes

Stage 3: Data Quality Validation (6:25 AM)

  • Execute Great Expectations validation suite
  • Check row counts, schema compliance, business rules
  • Fail pipeline if quality <99%
  • Duration: 2-3 minutes
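
The fail-fast gate in this stage reduces to a simple check: stop the pipeline if the validation pass rate drops below 99%. The sketch below is a simplified stand-in for the Great Expectations suite, with the threshold taken from the text:

```python
class DataQualityError(Exception):
    """Raised when validation results fall below the pipeline threshold."""

def quality_gate(passed: int, total: int, threshold: float = 0.99) -> float:
    score = passed / total
    if score < threshold:
        # Failing here prevents bad data from reaching the curated layer.
        raise DataQualityError(f"quality {score:.2%} below {threshold:.0%}")
    return score
```

Downstream stages (EMR aggregations, Redshift load) only run when the gate returns, so a bad upstream day never reaches dashboards.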

Stage 4: Advanced Analytics (6:30 AM)

  • Submit EMR Serverless jobs for aggregations
  • Customer analytics job (8 mins)
  • Merchant analytics job (5 mins)
  • Time-series rollups (6 mins)
  • Total duration: 8 mins (parallel execution)

Stage 5: Data Warehouse Load (6:40 AM)

  • COPY data from curated layer to Redshift
  • Update fact tables incrementally
  • Refresh aggregation tables
  • Duration: 5-8 minutes

Stage 6: Validation & Notification (6:50 AM)

  • Run reconciliation queries (source vs target counts)
  • Send success notification to Slack
  • Publish pipeline metrics to CloudWatch
  • Generate data quality report

Total Pipeline Duration: 45-50 minutes (vs 5.5 hours previously)

Error Handling

  • Failed tasks retry 3 times with exponential backoff
  • Critical failures trigger PagerDuty alerts
  • Partial failures allow downstream tasks to proceed with available data
  • Clear error messages in Airflow UI for debugging
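
The retry policy above — 3 retries with exponential backoff — can be sketched as follows. In Airflow this is per-task configuration (`retries=3`, `retry_delay`, `retry_exponential_backoff=True`); the 60-second base delay here is an assumption:

```python
def backoff_delays(retries: int = 3, base_seconds: int = 60) -> list[int]:
    # Delay doubles on each successive retry attempt.
    return [base_seconds * 2 ** n for n in range(retries)]

delays = backoff_delays()  # → [60, 120, 240]
```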

Result: Predictable, reliable daily pipeline. Data ready by 7 AM for business users. 99.9% success rate vs 85% previously.

DEVOPS IMPLEMENTATION

1. GitLab CI/CD Pipeline

Purpose: Automated build, test, security scan, and deployment orchestration for all infrastructure and application components.

Pipeline Stages

Stage 1: Lint & Validate

  • Terraform validate for infrastructure code
  • Helm lint for Kubernetes manifests
  • Pylint for Python/PySpark code
  • SQL linting for Glue scripts
  • Duration: 2-3 minutes

Stage 2: Build

  • Build Docker images for Airflow, custom operators, data quality jobs
  • Tag with git commit SHA for traceability
  • Cache layers for faster builds
  • Duration: 3-5 minutes

Stage 3: Test

  • Unit tests for data transformation logic
  • Integration tests against test data
  • Schema validation tests
  • Data quality rule tests
  • Duration: 5-7 minutes

Stage 4: Security Scan

  • Trivy scans for CVEs in Docker images
  • Snyk scans for dependency vulnerabilities
  • Fail build if critical CVEs found
  • GHAS code scanning for secrets
  • Duration: 2-3 minutes

Stage 5: Push to ECR

  • Push Docker images to Amazon ECR
  • Apply image tags (commit SHA, version, latest)
  • Enable image scanning in ECR
  • Sign images with Docker Content Trust
  • Duration: 1-2 minutes

Stage 6: Deploy to Dev

  • Trigger ArgoCD sync for dev environment
  • Automated deployment
  • Validate health checks
  • Run smoke tests
  • Duration: 2-3 minutes

Stage 7: Deploy to Production

  • Require manual approval in GitLab
  • Trigger ArgoCD sync for production
  • Blue-green deployment strategy
  • Automated rollback on failure
  • Duration: 2-3 minutes

Total Pipeline: 15-20 minutes from commit to production

Result: Deployment frequency increased from quarterly to daily. Deployment failure rate reduced from 28% to 0.3%. Full audit trail in git history.

2. Amazon ECR (Container Registry)

Purpose: Centralized, secure Docker image management with automated scanning and lifecycle policies.

Configuration

Image Scanning:

  • Scan on push enabled for all images
  • Integration with Amazon Inspector for CVE detection
  • Fail pipeline if critical vulnerabilities found
  • Weekly rescans of production images

Image Immutability:

  • Tag immutability enabled to prevent overwrites
  • Semantic versioning enforced (v1.2.3)
  • Production tags cannot be reused

Lifecycle Policies:

  • Keep last 30 tagged images per repository
  • Delete untagged images after 7 days
  • Reduce storage from 800 images ($3K/month) to 120 images ($450/month)
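
The two retention rules above could be expressed as an ECR lifecycle policy like the following — this is the JSON you would pass to `ecr.put_lifecycle_policy(lifecyclePolicyText=...)`. The `v` tag prefix is an assumption matching the semantic-versioning scheme described later:

```python
import json

# Illustrative ECR lifecycle policy: expire untagged images after 7 days,
# keep only the last 30 tagged releases.
policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire untagged images after 7 days",
            "selection": {"tagStatus": "untagged",
                          "countType": "sinceImagePushed",
                          "countUnit": "days", "countNumber": 7},
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": "Keep only the last 30 tagged images",
            "selection": {"tagStatus": "tagged", "tagPrefixList": ["v"],
                          "countType": "imageCountMoreThan", "countNumber": 30},
            "action": {"type": "expire"},
        },
    ]
}
policy_text = json.dumps(policy)
```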

Access Control:

  • IAM policies restrict ECR push to CI/CD pipelines only
  • EKS pods use IRSA for pull access
  • Cross-account access for DR region

Result: Zero CVE incidents in production, 85% storage cost reduction, full image lineage for compliance.

3. Amazon EKS (Kubernetes Orchestration)

Purpose: Production-grade container orchestration for Airflow, Kafka consumers, and custom data processing jobs.

Workloads

Apache Airflow:

  • 2 scheduler pods (HA)
  • 5-10 worker pods (auto-scaled)
  • 1 webserver
  • Resource requests: 4 CPU, 8 GB RAM per scheduler

Kafka Consumers:

  • 5-10 replicas processing transaction stream
  • Auto-scaled by HPA based on consumer lag
  • 2 CPU, 4 GB RAM per pod

Data Validation Jobs:

  • 3-5 scheduled pods running Great Expectations
  • 1 CPU, 2 GB RAM each

Security

  • RBAC policies with minimal permissions per namespace
  • Pod security standards (no privileged containers)
  • Network policies restrict pod-to-pod communication
  • Secrets managed via AWS Secrets Manager with IRSA

Result: 99.99% uptime for Airflow, seamless scaling during peak loads, infrastructure costs optimized through right-sizing.

4. ArgoCD (GitOps Deployment)

Purpose: Declarative, git-driven application deployment with continuous reconciliation and automated rollback.

Deployment Flow

  1. Developer updates Helm values in git (change image tag)
  2. Commits to main branch
  3. ArgoCD detects git change within 3 minutes
  4. Computes diff between desired and actual state
  5. If auto-sync enabled, deploys new pods in parallel (blue-green)
  6. Waits for health checks (readiness probes)
  7. Gradually shifts traffic to new pods
  8. Automatically rolls back if health checks fail

Benefits

  • Deployment time: under 2 minutes (vs 4+ hours manual)
  • Error rate: 0.3% (vs 28% manual)
  • Rollback time: 30 seconds (git revert + commit)
  • Full audit trail in git (who, what, when, why)

Self-Healing

  • ArgoCD continuously monitors cluster state
  • Automatically recreates deleted pods
  • Reverts manual changes to match git
  • Alerts on drift detection

Result: Zero-downtime deployments, instant rollback capability, immutable deployment history.

5. Infrastructure as Code (Terraform)

Purpose: Version-controlled, reproducible infrastructure provisioning across all AWS resources.

Terraform Modules

Networking Module:

  • VPC with public/private subnets
  • NAT gateways
  • Security groups
  • VPC peering for cross-region DR

Data Lake Module:

  • S3 buckets with encryption
  • Lifecycle policies
  • IAM policies for cross-service access

EMR Serverless Module:

  • EMR application configuration
  • IAM roles for Glue and S3
  • CloudWatch log groups

Redshift Module:

  • Cluster configuration
  • Parameter groups
  • Subnet groups
  • Snapshot schedules

EKS Module:

  • Cluster creation
  • Node groups with auto-scaling
  • IAM roles for service accounts (IRSA)

State Management

  • Terraform state stored in S3 with DynamoDB locking
  • State file encrypted
  • Versioning enabled for rollback
  • Separate state files per environment

Deployment Process

  1. terraform plan generates preview of changes
  2. Review in pull request
  3. terraform apply executed via CI/CD
  4. Drift detection runs weekly via AWS Config

Result: Infrastructure changes in minutes vs days, consistent environments (dev = staging = prod), disaster recovery environment rebuilt in under 1 hour.

6. Observability: CloudWatch, VPC Flow Logs

CloudWatch Logs

  • Airflow task logs automatically streamed
  • EMR Spark driver/executor logs centralized
  • Glue job logs with job run ID tagging
  • EKS pod logs via Fluent Bit DaemonSet
  • Retention: 90 days for operational logs, 7 years for audit logs

CloudWatch Metrics

Custom metrics for data pipeline health:

  • Records processed
  • Job duration
  • Data quality score
  • EMR Serverless metrics (worker count, task duration)
  • EKS pod metrics (CPU, memory, restarts)
  • Redshift query performance (execution time, queue time)

Alerting

  • Pipeline SLA breach alerts (e.g., daily ETL exceeds 60 minutes)
  • Data quality alerts (score below 99%)
  • Resource utilization alerts (Redshift CPU >80%)
  • Error rate thresholds (>5% task failures)

VPC Flow Logs

  • All VPC traffic logged to S3 for security analysis
  • Used to diagnose connectivity issues
  • Analyzed for anomaly detection
  • Retention: 30 days

Result: MTTR reduced from 2.5 hours to under 15 minutes. Proactive issue detection before user impact. Complete audit trail for compliance.

BEST PRACTICES IMPLEMENTED

Data Analytics Best Practices

1. Layered Data Architecture

  • Separate raw, enriched, and curated zones for clear data processing stages and lineage tracking

2. Partition Strategy

  • Partition by date and region for query performance optimization and cost reduction

3. Columnar Storage

  • Use Parquet format for 10x query performance and 75% storage reduction

4. Data Quality Gates

  • Validate data between pipeline stages, fail fast on quality issues

5. Schema Evolution

  • Use Glue Data Catalog for automatic schema discovery and version tracking

6. Idempotent Pipelines

  • Ensure Spark jobs produce identical results on re-run for safe replay

7. Data Lifecycle Management

  • Automate transitions to cold storage (S3 Glacier) for cost optimization

8. Metadata Management

  • Centralize in Glue Data Catalog for discovery and governance

9. Pre-Aggregation Strategy

  • Create curated datasets at multiple granularities for fast reporting

10. Incremental Processing

  • Process only new data to reduce compute costs and latency

DevOps Best Practices

11. Infrastructure as Code

  • Version all infrastructure in git, treat IaC as production code

12. GitOps Deployment

  • Use git as single source of truth for all deployments

13. Container Immutability

  • Use semantic versioning, prevent tag overwrites in ECR

14. CVE Scanning

  • Scan all images before deployment, fail builds on critical vulnerabilities

15. Secrets Management

  • Use AWS Secrets Manager, never hardcode credentials

16. Least Privilege IAM

  • Scope roles to specific resources and actions, use IRSA for EKS workloads

17. Automated Rollbacks

  • Configure health checks for automatic rollback on deployment failures

18. Blue-Green Deployments

  • Deploy new versions in parallel, instant traffic switching

19. Observability First

  • Comprehensive logging, metrics, and tracing from day one

20. Chaos Engineering

  • Regularly test failure scenarios to ensure resilience

OUTCOMES & METRICS

Data Analytics Improvements

| Metric | Before | After | Improvement |
|---|---|---|---|
| Data Processing Latency | 4–6 hours | 15–20 minutes | 15–18× faster |
| Data Quality Accuracy | 85% | 99.5% | 17% improvement |
| Query Performance (Redshift) | 12 min (CLV query) | 14 seconds | 51× faster |
| Dashboard Load Time | 45+ seconds | <3 seconds | 15× faster |
| Data Lake Storage | 2.5 PB | 2.5 PB | Same data, optimized format |
| Storage Costs | $60K/month | $18K/month | 70% reduction |
| Pipeline Success Rate | 85% | 99.9% | Near-perfect reliability |
| Analytics Availability | 10 AM | 7 AM | 3 hours earlier |

DevOps Improvements

| Metric | Before | After | Impact |
|---|---|---|---|
| Deployment Time | 4–5 hours | <2 minutes | 120–150× faster |
| Deployment Success Rate | 72% | 99.7% | 99% reduction in failures |
| Rollback Time | 1–3 hours | <30 seconds | 100× faster recovery |
| MTTR (Mean Time to Resolution) | 2.5 hours | <15 minutes | 10× faster |
| Infrastructure Provisioning | 2–3 days | 10–15 minutes | Enables rapid environment creation |
| Compliance Audit Time | 2 weeks | 2 days | 7× faster audits |

Cost Optimization

| Category | Before | After | Annual Savings |
|---|---|---|---|
| EMR/Hadoop Infrastructure | $180K/quarter | $52K/quarter | $512K/year |
| Storage (S3 lifecycle) | $720K/year | $216K/year | $504K/year |
| ECR Storage (cleanup) | $36K/year | $5.4K/year | $30.6K/year |
| Redshift vs PostgreSQL | $300K/year | $180K/year | $120K/year |
| Incident Response Labor | $200K/year | $20K/year | $180K/year |
| Compliance Audit Labor | $80K/cycle | $5K/cycle | $225K/year (3 cycles) |
| Total Annual Savings | | | $1.57M/year |

Business Impact

Operational Excellence

  • Reports available 3 hours earlier (7 AM vs 10 AM), enabling timely decision-making.
  • Pipeline reliability improved to 99.9%, reducing business disruptions.
  • Self-service analytics enabled for 50+ business users.

Scalability

  • Platform handles 35% YoY growth without infrastructure upgrades.
  • Peak load processing (Black Friday) succeeds without manual intervention.
  • New market expansion supported without infrastructure changes.

Compliance

  • Complete audit trail for all data transformations.

Innovation

  • Data scientists spend 60% less time on data cleaning, 60% more on analysis.
  • Ad-hoc analytics possible on full historical dataset.
  • New data sources onboarded in days vs months.

Conclusion

Neowise Digital's transformation from a manually-managed, batch-oriented infrastructure to a cloud-native, serverless data analytics platform demonstrates the power of:

Data Analytics Excellence

  • Multi-layer data lake architecture for organized data management.
  • AWS Glue for managed ETL and schema management.
  • EMR Serverless for elastic, cost-effective Spark processing.
  • Amazon Redshift for high-performance analytical queries.
  • Comprehensive data quality framework with automated validation.
  • EMR Spark UI for deep observability into job execution.

DevOps Automation

  • Infrastructure as Code for reproducible environments.
  • GitOps with ArgoCD for zero-downtime deployments.
  • Container orchestration on EKS for scalable workloads.
  • GitLab CI/CD for automated testing and deployment.
  • Comprehensive observability with CloudWatch and Spark UI.

Impact Achieved

  • 99.9% pipeline reliability.
  • 15-18x faster data processing.
  • $1.57M annual cost savings.
  • 100% compliance with financial regulations.
  • 10x faster incident response.

This case study provides a blueprint for fintech companies and other data-intensive organizations seeking to modernize their data analytics infrastructure and accelerate time-to-value for analytics initiatives while maintaining strong DevOps practices for operational excellence.
