Cloud-Native Fintech Modernization and PCI-Compliant Migration on AWS

2026-02-05
Cloud Modernisation

1. Executive Summary

GetAdvantage is a growing fintech company offering digital payments, lending, and wealth management services to over 800,000 retail and corporate customers across India. Prior to this engagement, the company operated its entire technology stack on Legacy Data Centre (LDC) infrastructure, an aging on-premises environment characterised by hardware constraints, operational fragility, limited scalability, and increasing security and compliance risk.

 

2. About GetAdvantage

2.1 Company Profile

Industry Vertical: Financial Technology (Fintech)
Products & Services: Digital Payments, BNPL, SME Lending, Wealth Management
Customer Base: 800,000+ retail and corporate users
Employee Headcount: 400 (of which 90 in engineering)
Prior Infrastructure: On-premises LDC (Legacy Data Centre)
Migration Completion: 2026

 

2.2 Business Context and Strategic Objectives

GetAdvantage was facing a pivotal inflection point. Its legacy infrastructure, originally designed for a fraction of its current transaction volumes, was struggling to keep pace with user growth, product expansion, and increasingly stringent regulatory requirements. Leadership had identified that technology constraints were directly impeding the company's ability to launch new products, respond to market opportunities, and compete against cloud-native fintech challengers.

 

The board approved a cloud-first transformation strategy with four strategic objectives:

       Eliminate infrastructure bottlenecks that were constraining product velocity

       Reduce total cost of ownership (TCO) while converting capital expenditure to operational expenditure

       Achieve and maintain continuous regulatory compliance (PCI-DSS Level 1, RBI Cloud Guidelines)

       Build a scalable, resilient platform capable of 10x transaction volume growth without architectural rework

 

3. Understanding the Pain Points

3.1 Detailed Pain Points

The discovery exercise surfaced six primary pain point categories, each with measurable business impact:

 

Pain Point 1 — Scalability Constraints and Capacity Ceiling

GetAdvantage's LDC environment was provisioned for a peak capacity of approximately 400 transactions per second (TPS). During festival sale periods, payment campaigns, and month-end billing cycles, the platform regularly exceeded this threshold, triggering application timeouts, failed transactions, and customer-facing errors. Provisioning additional capacity required hardware procurement cycles of 6–8 weeks, by which point the demand spike had already passed.

       Average transaction failure rate during peak periods: 8.4%

       Estimated revenue leakage from failed peak-period transactions: INR 22 crore per annum

       Infrastructure utilisation at off-peak: < 20% (severe overprovisioning for base capacity)

 

Pain Point 2 — Availability and Reliability Deficits

The LDC environment achieved approximately 97.8% uptime, equivalent to roughly 193 hours of downtime per year. A significant portion of this was attributable to planned maintenance windows (typically 4-hour Sunday outages for patching), which required full-service blackouts. The company had no true high-availability architecture; a failure of a primary database server triggered a manual failover taking 45–90 minutes.

       Planned downtime per year: ~96 hours (twice-monthly 4-hour maintenance windows)

       Unplanned outage events in the prior 12 months: 11 incidents (including 4 P1 severity)

       Average MTTR for P1 incidents: 3.2 hours
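
The relationship between an uptime percentage and annual downtime is simple arithmetic, and it is worth checking when comparing availability targets. The sketch below is illustrative only: the function names are ours, and the figure of 24 maintenance windows per year is inferred from the ~96 hours of planned downtime at 4 hours per window.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


def planned_downtime_hours(windows_per_year: int, hours_per_window: float) -> float:
    """Return total planned downtime from recurring maintenance windows."""
    return windows_per_year * hours_per_window


if __name__ == "__main__":
    # Legacy environment figures quoted in this section.
    print(f"97.8% uptime  -> {annual_downtime_hours(97.8):.1f} h/yr of downtime")
    # 24 windows/yr is an inferred value (96 h / 4 h per window).
    print(f"24 x 4 h      -> {planned_downtime_hours(24, 4):.0f} h/yr planned")
    # A four-nines target leaves well under one hour per year.
    print(f"99.99% uptime -> {annual_downtime_hours(99.99) * 60:.0f} min/yr")
```

Running the same function against a target such as 99.99% shows how small the remaining downtime budget becomes, which is why planned maintenance blackouts are incompatible with that tier of availability.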

 

Pain Point 3 — Security Vulnerabilities and Compliance Exposure

The legacy environment accumulated significant security technical debt. Patch management was manual and inconsistent — the team of 3 operations engineers was responsible for patching over 140 servers, resulting in average patch lag of 47 days from release to application. Three servers were found to be running operating systems beyond end-of-life with no vendor security support. The company faced an upcoming PCI-DSS Level 1 audit with a realistic risk of non-compliance findings that could result in payment network sanctions.

       Average patch lag: 47 days

       End-of-life OS instances: 3 production servers

       PCI-DSS scope: entire LDC floor (flat network, broad cardholder data environment)

       Annual cost of manual compliance activities: ~INR 1.8 crore (external auditors + internal effort)

 

Pain Point 4 — Disaster Recovery and Business Continuity Gaps

GetAdvantage's disaster recovery posture was critically inadequate for a regulated fintech. The DR strategy consisted of nightly tape backups stored at a secondary co-location facility 12 km away. In a tested DR drill, the team achieved a recovery time of 31 hours and a recovery point of 24 hours, far outside the 4-hour Recovery Time Objective (RTO) and 1-hour Recovery Point Objective (RPO) mandated by RBI business continuity guidelines for payment system operators.

       Tested RTO: 31 hours | RBI-mandated RTO: 4 hours

       Tested RPO: 24 hours | RBI-mandated RPO: 1 hour

       Last DR drill: 14 months prior to engagement (overdue against the annual drill requirement)
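
As a rough illustration of how drill results can be compared against mandated objectives, the following sketch computes the overshoot for the figures quoted above. The `DrillResult` type and `objective_gaps` helper are hypothetical names introduced for this example, not part of any tooling used in the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillResult:
    recovery_time_hours: float   # time taken to restore service in the drill
    recovery_point_hours: float  # age of the newest data that could be restored


def objective_gaps(result: DrillResult, rto_hours: float, rpo_hours: float) -> dict:
    """Return how far (in hours) a drill overshot each objective; 0.0 means met."""
    return {
        "rto_gap_hours": max(0.0, result.recovery_time_hours - rto_hours),
        "rpo_gap_hours": max(0.0, result.recovery_point_hours - rpo_hours),
    }


if __name__ == "__main__":
    ldc_drill = DrillResult(recovery_time_hours=31.0, recovery_point_hours=24.0)
    # Objectives as quoted in this section: 4-hour RTO, 1-hour RPO.
    print(objective_gaps(ldc_drill, rto_hours=4.0, rpo_hours=1.0))
```

A gap of zero on both measures is the pass condition; anything above zero quantifies how much recovery capability is missing.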

 

Pain Point 5 — Engineering Velocity and Deployment Friction

The deployment process required formal change advisory board (CAB) approval, a 2-week change freeze, manual server configuration, and a 6–8 week hardware lead time for any environment changes. Developers spent significant time maintaining and troubleshooting infrastructure rather than building product features. There was no CI/CD pipeline; deployments were executed manually via SSH. The engineering team estimated that approximately 40% of their time was consumed by infrastructure maintenance activities.

       Deployment cycle: 5 days (software-only changes) to 6–8 weeks (changes requiring new hardware)

       Engineering time spent on infrastructure maintenance: ~40%

       Number of production environments: 1 (no staging parity, limited testing fidelity)

 

Pain Point 6 — Rising Infrastructure Cost and Capital Lock-in

The LDC environment required a hardware refresh cycle every 3–5 years, with the next refresh estimated at INR 8.5 crore in capital expenditure. Annual data centre co-location fees, hardware maintenance contracts, and software licensing totalled approximately INR 10.4 crore per annum. This capital-intensive model prevented investment in product and engineering capability, and created stranded assets with limited residual value.

       Annual LDC total cost (co-location + hardware maintenance + software licensing): ~INR 10.4 crore (~USD 1.2M)

       Upcoming hardware refresh capex: ~INR 8.5 crore

       Hardware utilisation efficiency: < 35% average across fleet
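
To compare the legacy cost base with a pay-as-you-go model, the periodic refresh capex can be spread over the refresh cycle and added to the annual run cost. The sketch below assumes simple straight-line spreading with no financing cost or residual value, which is our simplification rather than the TCO model used in the engagement.

```python
def effective_annual_cost_crore(
    run_cost_crore: float, refresh_capex_crore: float, refresh_cycle_years: int
) -> float:
    """Annual run cost plus refresh capex spread evenly over the refresh cycle."""
    if refresh_cycle_years <= 0:
        raise ValueError("refresh_cycle_years must be positive")
    return run_cost_crore + refresh_capex_crore / refresh_cycle_years


if __name__ == "__main__":
    # Figures quoted above: INR 10.4 crore/yr run cost, INR 8.5 crore refresh.
    for years in (3, 5):
        cost = effective_annual_cost_crore(10.4, 8.5, years)
        print(f"{years}-year refresh cycle: ~INR {cost:.1f} crore per year")
```

On these assumptions the effective annual cost of staying on the LDC falls between roughly INR 12.1 crore and INR 13.2 crore, depending on where in the 3–5 year range the refresh lands.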

 

 

4. Our Solution — What We Offered and Why

4.1 Solution Design Philosophy

Our solution recommendation was built around four non-negotiable principles derived directly from GetAdvantage's business requirements and regulatory context:

 

Principle 1 — Compliance First

Every architectural decision prioritised regulatory compliance. For a PCI-DSS Level 1 and RBI-regulated entity, non-compliant architecture is not a technical deficiency — it is an existential business risk. Security and compliance controls were designed in from day one, not retrofitted.

 

Principle 2 — Modernise While Migrating

Rather than a pure lift-and-shift (which would preserve legacy anti-patterns in a new environment), we designed a phased approach that used the migration as an opportunity to modernise. Applications were re-architected to microservices and containerisation where the benefit justified the effort.

 

Principle 3 — Cloud-Native Resilience

We designed for failure at every layer. Multi-AZ deployments, automated failover, managed services with built-in HA, and cross-region disaster recovery ensured the platform could meet and exceed RBI BCP requirements.

 

Principle 4 — Measurable ROI

Every service selection was justified against a cost model. We leveraged AWS MAP funding (credits and professional services), Reserved Instance pricing, Savings Plans, and right-sizing to ensure the migration delivered a clear and quantifiable financial benefit within 12 months.

4.2 Migration Strategy — The 7Rs Framework Applied

We applied the AWS 7Rs migration strategy to GetAdvantage's application portfolio, categorising each of the 47 identified workloads:

| Strategy | Workloads | Applications | Rationale |
|---|---|---|---|
| Rehost (Lift & Shift) | 18 | Internal tools, monitoring agents, legacy batch jobs | Low migration risk; benefit from cloud infra immediately |
| Replatform | 12 | MySQL databases → Aurora, Tomcat → EKS managed containers | Minimal code change; significant operational improvement |
| Refactor / Re-architect | 9 | Core payments API, lending engine, auth service | Highest business value; microservices unlock scalability |
| Repurchase | 5 | CRM → Salesforce, HR system → Workday, email → SES | SaaS replacement eliminates maintenance overhead |
| Retire | 3 | Redundant reporting tools, legacy notification service | No active users; safe decommission |
| Retain | 0 | N/A | All workloads assessed as cloud-suitable |
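
A quick consistency check on the portfolio split is sketched below: it confirms that the per-strategy counts add up to the 47 identified workloads and shows each strategy's share. The dictionary simply restates the workload counts from the table.

```python
# Workload counts per migration strategy, as listed in the table above.
portfolio = {
    "Rehost": 18,
    "Replatform": 12,
    "Refactor / Re-architect": 9,
    "Repurchase": 5,
    "Retire": 3,
    "Retain": 0,
}

total_workloads = sum(portfolio.values())

# Percentage of the portfolio handled by each strategy, to one decimal place.
shares = {
    strategy: round(100 * count / total_workloads, 1)
    for strategy, count in portfolio.items()
}

if __name__ == "__main__":
    print(f"Total workloads: {total_workloads}")
    for strategy, share in shares.items():
        print(f"{strategy:<25} {portfolio[strategy]:>3}  {share:>5.1f}%")
```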

 

4.3 Why We Recommended These Specific AWS Services

Service selection was based on three criteria: fitness-for-purpose for fintech workloads, compliance certification coverage (PCI-DSS, SOC 2, ISO 27001), and total cost of ownership. The following explains our rationale for key service choices:

 

| AWS Service | Selection Rationale |
|---|---|
| Amazon Aurora PostgreSQL (Multi-AZ) | Chosen over self-managed PostgreSQL on EC2 because Aurora offers up to 3x the throughput of standard PostgreSQL (per AWS's published figures), automated Multi-AZ failover typically in < 30 seconds, automated patching, and built-in encryption. For a fintech processing real-money transactions, Aurora's ACID guarantees and < 35ms read latency were essential. |
| Amazon EKS (Kubernetes) | Chosen over EC2 Auto Scaling groups because GetAdvantage's roadmap required microservices. EKS provides a managed Kubernetes control plane, native integration with ALB and IAM, and support for GitOps-based deployment workflows. The container abstraction also improved development environment parity. |
| AWS Lambda + Step Functions | Chosen for batch processing (settlement runs, statement generation) because these workloads are inherently event-driven and irregular. Lambda eliminates idle compute cost; Step Functions provides durable, auditable workflow orchestration, which is critical for financial batch processes that must be recoverable on failure. |
| Amazon MSK (Managed Kafka) | Chosen for event streaming because GetAdvantage required an immutable, ordered, replayable audit trail of all transactions, a key PCI-DSS and RBI requirement. MSK provides Kafka without the operational burden of managing ZooKeeper, broker scaling, and partition rebalancing. |
| AWS KMS + Secrets Manager | PCI-DSS Requirement 3 requires stored cardholder data to be protected with strong cryptography and sound key management. KMS provides FIPS 140-2 validated HSMs and optional automatic key rotation (annual by default). Secrets Manager eliminates hardcoded credentials, a significant vulnerability found during the security audit of the LDC environment. |
| Amazon GuardDuty + Security Hub | Replaces manual log review with ML-based continuous threat detection. GuardDuty analyses VPC Flow Logs, DNS logs, and CloudTrail events. Security Hub aggregates findings across services and maps them to compliance frameworks (PCI-DSS, CIS Benchmarks), providing a single audit-ready security posture view. |
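
To make the "recoverable on failure" point for Lambda + Step Functions concrete, the sketch below builds a minimal Amazon States Language definition with a retry policy and a failure path. It is an illustration only: the state names, retry values, and function ARNs are hypothetical and do not reflect GetAdvantage's actual workflows.

```python
import json


def settlement_state_machine(settlement_fn_arn: str, notify_fn_arn: str) -> dict:
    """Build an illustrative state machine: run a settlement task with retries,
    and route to a notification step and a Fail state if retries are exhausted."""
    return {
        "Comment": "Illustrative settlement batch workflow (hypothetical)",
        "StartAt": "RunSettlement",
        "States": {
            "RunSettlement": {
                "Type": "Task",
                "Resource": settlement_fn_arn,
                # Retry transient task failures with exponential backoff.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                # Anything still failing after retries goes to the failure path.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": notify_fn_arn,
                "Next": "SettlementFailed",
            },
            "SettlementFailed": {
                "Type": "Fail",
                "Error": "SettlementError",
                "Cause": "Settlement run failed after retries",
            },
        },
    }


if __name__ == "__main__":
    definition = settlement_state_machine(
        # Placeholder ARNs using the AWS documentation example account ID.
        "arn:aws:lambda:ap-south-1:111122223333:function:run-settlement",
        "arn:aws:lambda:ap-south-1:111122223333:function:notify-ops",
    )
    print(json.dumps(definition, indent=2))
```

Because every state transition is recorded in the execution history, a failed run leaves an auditable trail of which step failed and how many retries were attempted.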

 

4.4 Why We Did NOT Recommend Alternative Approaches

During the solution design phase, several alternative approaches were evaluated and rejected. We document these decisions transparently for the auditor's review:

 

| Alternative Considered | Reason for Rejection |
|---|---|
| Pure Lift-and-Shift (all Rehost) | Rejected because it would have reproduced the scalability and availability anti-patterns of the LDC environment on EC2. Peak-period failures and deployment rigidity would have persisted. The cost savings would have been lower (no managed service efficiency gains), and compliance scope would have remained broad. |
| Multi-Cloud (AWS + Azure) | Rejected because GetAdvantage's team lacked multi-cloud operational expertise, and distributing workloads across clouds would have increased operational complexity without proportionate benefit. A single-cloud strategy enables consistent security controls, unified billing, and cohesive IAM, all critical for a regulated entity. |
| Private Cloud (VMware / OpenStack on new hardware) | Rejected after TCO modelling showed that a new private cloud build would require INR 12–15 crore in capital expenditure with a 5-year depreciation cycle, and would not resolve the scalability or DR shortcomings. It would also not provide the managed compliance tooling (GuardDuty, Config, Security Hub) that significantly reduces audit effort. |
| Colocation Upgrade (remain in LDC, upgrade hardware) | Rejected because it addresses only the hardware refresh need and leaves all other pain points (scalability ceiling, DR, compliance, deployment velocity) unresolved. The INR 8.5 crore capex would have been committed with no architectural improvement. |
| Serverless-First Architecture | Evaluated but scoped only to appropriate workloads (batch processing, event-driven functions). Core banking APIs were not recommended for full serverless migration because the cold-start latency characteristics of Lambda are unsuitable for synchronous payment APIs requiring < 100ms P99 response times. |
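
The cold-start concern can be illustrated with a small percentile calculation. The sketch below uses made-up latencies (20 ms for a warm request, 800 ms for a cold start) purely to show the mechanism: once more than about 1% of requests hit a slow path, the P99 is determined entirely by that slow path.

```python
import math


def percentile_nearest_rank(samples: list, pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def simulate_latencies(total: int, cold_starts: int, warm_ms: float, cold_ms: float) -> list:
    """Build a latency sample with a given number of cold-start requests."""
    return [warm_ms] * (total - cold_starts) + [cold_ms] * cold_starts


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the effect.
    for cold in (5, 20):  # 0.5% and 2% of 1,000 requests
        sample = simulate_latencies(1000, cold, warm_ms=20, cold_ms=800)
        p99 = percentile_nearest_rank(sample, 99)
        print(f"{cold / 10:.1f}% cold starts -> P99 = {p99} ms")
```

With 0.5% cold starts the P99 stays at the warm latency; at 2% it jumps to the cold-start latency, which would breach a 100 ms P99 target regardless of how fast the warm path is.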

 

5. Target Architecture on AWS

5.1 Architecture Overview

The target architecture implements a defence-in-depth, multi-layered design across three AWS Availability Zones within the ap-south-1 (Mumbai) primary region, with a warm standby configuration in ap-southeast-1 (Singapore) for disaster recovery. All data classified as sensitive financial data is encrypted at rest (AES-256 via KMS) and in transit (TLS 1.2+).

 

The architecture is organised into four logical tiers: Public (internet-facing), Application (compute), Data (storage and databases), and Management (security, observability, compliance). All inter-tier communication traverses private subnets within the VPC; no application tier component has a public IP address.

 

5.2 Complete Architecture Component Reference

| Layer | Component | AWS Service | Purpose |
|---|---|---|---|
| Internet | Client Apps / Web Portal | Amazon CloudFront + WAF | CDN, DDoS protection, geo-restriction |
| DNS | Domain Routing | Amazon Route 53 | Latency-based DNS, health checks, failover |
| Load Balancing | Traffic Distribution | Application Load Balancer (ALB) | Layer 7 routing, SSL termination |
| Compute | Core Banking APIs | Amazon EKS (Kubernetes) | Containerised microservices, auto-scaling |
| Compute | Batch Processing | AWS Lambda + Step Functions | Serverless transaction processing, workflows |
| Data (Primary) | OLTP Transactions | Amazon Aurora PostgreSQL (Multi-AZ) | ACID-compliant, <35ms latency, 99.99% SLA |
| Data (Cache) | Session & Rate Data | Amazon ElastiCache (Redis) | Sub-millisecond cache, rate-limit counters |
| Data (Analytics) | Reporting & BI | Amazon Redshift + QuickSight | DWH, dashboards, regulatory reports |
| Storage | Documents & Statements | Amazon S3 + Glacier | Object storage, tiered archival (7-year retention) |
| Messaging | Event Streaming | Amazon MSK (Kafka) | Real-time transaction events, audit trail |
| Security | Secrets & Keys | AWS KMS + Secrets Manager | Envelope encryption, PCI-DSS key rotation |
| Security | Identity & Access | AWS IAM + AWS IAM Identity Center (formerly AWS SSO) | Least-privilege, MFA, federated login |
| Security | Threat Detection | Amazon GuardDuty + Security Hub | ML-based anomaly detection, SIEM integration |
| Compliance | Audit Logging | AWS CloudTrail + Config | Immutable API audit trail, drift detection |
| Observability | Monitoring & Alerts | Amazon CloudWatch + X-Ray | APM, distributed tracing, auto-remediation |
| Network | Private Connectivity | AWS VPC + PrivateLink + Direct Connect | Isolated network, private endpoints, dedicated link |
| DR / BCP | Disaster Recovery | AWS Backup + Cross-Region Replication | RPO <15 min, RTO <1 hr, secondary region |

 

5.3 Network Architecture

The network design has six elements: four subnet tiers within the VPC, plus VPC endpoints and Direct Connect for private connectivity:

       Public Subnets (3 AZs): NAT Gateways, Application Load Balancers — no application workloads

       Application Subnets (3 AZs): EKS worker nodes, Lambda functions — no direct internet access

       Data Subnets (3 AZs): Aurora, ElastiCache, MSK — accessible only from Application subnets

       Management Subnets: Bastion hosts (SSM Session Manager preferred), monitoring agents

       VPC Endpoints: S3, DynamoDB, SSM, Secrets Manager, KMS — traffic never leaves AWS network

       AWS Direct Connect: Dedicated 1 Gbps private connectivity from GetAdvantage's offices to AWS
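
A subnet layout like the one listed above can be planned with Python's standard `ipaddress` module. The sketch below is a planning aid under stated assumptions: the 10.0.0.0/16 VPC range, the /20 subnet size, and placing the management subnet in a single AZ are all hypothetical choices, not the addressing used in the actual deployment.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, azs: list, new_prefix: int = 20) -> dict:
    """Carve a VPC CIDR into equal, non-overlapping subnets per tier and AZ."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # iterator of consecutive blocks
    plan = {}
    # Public, application, and data tiers each get one subnet per AZ.
    for tier in ("public", "application", "data"):
        for az in azs:
            plan[(tier, az)] = next(blocks)
    # Assumption: a single management subnet in the first AZ.
    plan[("management", azs[0])] = next(blocks)
    return plan


if __name__ == "__main__":
    azs = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
    for (tier, az), cidr in plan_subnets("10.0.0.0/16", azs).items():
        print(f"{tier:<12} {az:<12} {cidr}")
```

Allocating consecutive equal-sized blocks keeps the tiers easy to reference in NACL and security group rules, and leaves the remaining blocks of the VPC range free for future tiers.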

 

5.4 Security Architecture — Zero Trust Model

Security is implemented as a layered control framework aligned to the AWS Shared Responsibility Model and PCI-DSS requirements:

| Control Layer | Implementation |
|---|---|
| Perimeter Security | AWS WAF with managed rule groups (OWASP Top 10, known bad IPs, SQL injection, XSS). AWS Shield Standard. CloudFront geo-restriction. Rate limiting at ALB and WAF layers. |
| Identity and Access | IAM with least-privilege role-based access. AWS IAM Identity Center (formerly AWS SSO) with MFA enforcement. No long-lived IAM access keys; all compute uses IAM roles. Service accounts use IRSA (IAM Roles for Service Accounts) on EKS. |
| Data Protection | All data encrypted at rest with customer-managed KMS keys. TLS 1.2+ enforced for all in-transit data. RDS encryption, S3 SSE-KMS, and EBS encryption are enabled by default and monitored via AWS Config rules. |
| Threat Detection | GuardDuty enabled across all accounts and regions. CloudTrail with log file integrity validation. VPC Flow Logs. Security Hub with PCI-DSS and CIS Benchmark standards enabled. |
| Vulnerability Management | Amazon Inspector for continuous EC2 and container image vulnerability scanning. Automated patching via AWS Systems Manager Patch Manager. ECR image scanning on push. |
| Incident Response | Security Hub findings integrated with PagerDuty and Slack. Automated remediation playbooks via Lambda for common finding types (e.g., automatically blocking public access on an open S3 bucket). SOC team has Security Hub SIEM integration. |
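
As one example of how the in-transit encryption control can be expressed, the sketch below generates an S3 bucket policy that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The bucket name is a placeholder, and this policy on its own enforces TLS but not a minimum TLS version; it is shown as a common pattern rather than the exact policy deployed.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies all S3 actions over non-TLS connections."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself and every object in it.
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            # aws:SecureTransport is "false" for plain-HTTP requests.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


if __name__ == "__main__":
    # "example-statements-bucket" is a hypothetical bucket name.
    print(json.dumps(deny_insecure_transport_policy("example-statements-bucket"), indent=2))
```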

 

5.5 Data Architecture and Compliance Boundaries

Cardholder Data Environment (CDE) scope has been dramatically reduced compared to the LDC. In the LDC, the flat network meant the entire data centre floor was in scope for PCI-DSS. On AWS, the CDE is limited to specific Aurora database instances and the microservices that process payment card data, all within dedicated PCI-scope subnets with additional NACLs and Security Group restrictions. All other workloads are out of PCI scope, reducing audit surface by approximately 75%.

 

6. Migration Execution — The MAP Journey

6.1 MAP Phase Overview

| MAP Phase | Duration | Key Activities | Deliverable |
|---|---|---|---|
| Phase 1: Assess | Weeks 1–4 | Discovery, portfolio analysis, TCO modelling, readiness scoring | Migration Readiness Assessment (MRA), TCO Report |
| Phase 2: Mobilise | Weeks 5–12 | Landing zone setup, AWS Control Tower, IaC development, team training | Baseline AWS environment, CI/CD pipelines, runbooks |
| Phase 3: Migrate (Wave 1) | Weeks 13–20 | Non-production workloads, dev/test environments, rehost workloads | 18 workloads migrated, validated in AWS |
| Phase 4: Migrate (Wave 2) | Weeks 21–30 | Core business applications, database migration, replatform workloads | Payments API, lending engine, Aurora migration |
| Phase 5: Migrate (Wave 3) | Weeks 31–36 | Final production cutover, LDC decommission, refactored applications | 100% workloads on AWS, LDC terminated |
| Phase 6: Optimise | Ongoing | Cost optimisation, rightsizing, Savings Plans, performance tuning | Monthly Well-Architected Reviews, cost reports |

 

6.2 Cutover Strategy

Production database cutover used AWS Database Migration Service (DMS) with Change Data Capture (CDC) to achieve near-zero-downtime migration. The cutover window for each database was under 45 minutes, with DMS maintaining replication until the application was validated on Aurora. The payments API cutover used a blue-green deployment with weighted Route 53 routing — traffic was shifted 10% → 25% → 50% → 100% over 4 hours with automated rollback triggers based on error rate CloudWatch alarms.
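
The staged traffic shift with automated rollback can be sketched as a simple control loop. In the real cutover the weights were Route 53 weighted records and the rollback signal came from CloudWatch alarms; in this illustration a plain callable stands in for the error-rate measurement, and the 1% threshold is an assumed value, not the one used in production.

```python
from typing import Callable, List, Sequence, Tuple


def shift_traffic(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Advance through traffic weights; roll back to 0% if the error rate
    observed at any step exceeds the threshold.

    Returns the final weight on the new environment and the observed history.
    """
    history = []
    for weight in steps:
        rate = error_rate_at(weight)  # stand-in for reading an alarm metric
        history.append((weight, rate))
        if rate > max_error_rate:
            return 0, history  # rollback: all traffic returns to the old stack
    return steps[-1], history


if __name__ == "__main__":
    steps = [10, 25, 50, 100]  # the shift schedule described above

    # Hypothetical scenario 1: the new stack stays healthy throughout.
    print(shift_traffic(steps, lambda w: 0.001, max_error_rate=0.01))

    # Hypothetical scenario 2: errors spike once half the traffic arrives.
    print(shift_traffic(steps, lambda w: 0.05 if w >= 50 else 0.001, max_error_rate=0.01))
```

Checking the error rate before advancing to each larger weight limits the blast radius: a fault that only appears under load is caught while a fraction of customers are affected.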

 

7. Benefits Realised — Before and After Comparison

7.1 Quantitative Benefits

| Metric | Before (LDC) | After (AWS) | Impact |
|---|---|---|---|
| Infrastructure Cost | ~$1.2M/yr (co-location, maintenance, licensing) | ~$480K/yr opex | ~60% cost reduction; no stranded hardware |
| System Uptime / Availability | 97.8% (approx. 193 hrs downtime/yr) | 99.99% (< 1 hr/yr) | Eliminated planned maintenance windows |
| Deployment Cycle | 6–8 weeks | Same-day CI/CD | 10x faster feature delivery |
| Provisioning Time | 6–8 weeks (hardware procurement) | < 15 minutes (IaC) | Rapid scaling for product launches |
| Disaster Recovery RTO | 24–48 hours (tape restore) | < 1 hour (automated) | Business continuity for critical fintech ops |
| Disaster Recovery RPO | 24 hours (nightly backup) | < 15 minutes | Near-zero data loss exposure |
| Security Incidents | 4 critical incidents/yr (patching lag) | 0 critical incidents post-migration | Automated patching, GuardDuty ML detection |
| Transaction Processing Latency | 380–600 ms (avg) | < 50 ms (avg) | Roughly 8x faster or better; improved customer experience |
| Regulatory Compliance Coverage | Manual audits, 60+ days prep | Continuous (AWS Config + CloudTrail) | Audit-ready posture; PCI-DSS & RBI compliant |
| Engineering Productivity | 40% time on infra maintenance | < 10% infra overhead | Freed 30% capacity for product innovation |
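
The headline ratios in the table can be reproduced from the before and after figures. The sketch below is a simple check using helper names of our own; because the post-migration latency is quoted as an upper bound (< 50 ms), the computed speed-up should be read as a lower bound.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage by which a value fell from before to after."""
    return (before - after) / before * 100


def speedup_factor(before: float, after: float) -> float:
    """Return how many times smaller the after value is than the before value."""
    return before / after


if __name__ == "__main__":
    # Infrastructure cost: ~$1.2M/yr before, ~$480K/yr after.
    print(f"Cost reduction: {percent_reduction(1_200_000, 480_000):.0f}%")
    # Latency: 380-600 ms before, under 50 ms after (so these are lower bounds).
    print(f"Latency speed-up: {speedup_factor(380, 50):.1f}x to {speedup_factor(600, 50):.0f}x")
```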

 

7.2 Qualitative and Strategic Benefits

Regulatory Confidence

GetAdvantage successfully passed its post-migration PCI-DSS Level 1 assessment with zero non-compliance findings — the first clean audit in the company's history. The continuous compliance posture provided by AWS Config rules, CloudTrail, and Security Hub means the company is now audit-ready at all times rather than scrambling for 60 days prior to each assessment.

 

Product and Engineering Velocity

The migration to EKS with full CI/CD pipelines transformed the engineering organisation. The team shipped 3 new product features in the first 90 days post-migration — more than the total shipped in the prior 12 months. Developer satisfaction scores (measured via internal survey) improved from 54 to 81 out of 100. The ability to spin up isolated feature environments on demand using Terraform eliminated the 'works on my machine' class of production incidents entirely.

 

Customer Experience

Lower transaction latency translated directly into higher customer-facing payment success rates. The mobile app payment completion rate increased from 91.6% to 99.3% within 60 days of the payments API cutover. App store ratings improved from 3.8 to 4.6 stars, with customers specifically citing improved speed and reliability in reviews. Customer-reported payment failures dropped by over 90%.

 

Resilience and Business Continuity

GetAdvantage has now successfully completed two DR drills since migration, achieving a recovery time of 47 minutes and a recovery point of 8 minutes, both well within the RTO and RPO required under RBI BCP guidelines. The multi-AZ architecture absorbed two AZ-level network disruptions in the ap-south-1 region during the observation period without any customer-facing impact, demonstrating real-world resilience that was impossible on the single-site LDC.

 
