GetAdvantage is a growing fintech company offering digital payments, lending, and wealth management services to over 800,000 retail and corporate customers across India. Prior to this engagement, the company operated its entire technology stack on Legacy Data Centre (LDC) infrastructure, an aging on-premises environment characterised by hardware constraints, operational fragility, limited scalability, and increasing security and compliance risk.
2. About GetAdvantage
Attribute | Detail |
|---|---|
Industry Vertical | Financial Technology (Fintech) |
Products & Services | Digital Payments, BNPL, SME Lending, Wealth Management |
Customer Base | 800,000+ retail and corporate users |
Employee Headcount | 400 (of which 90 in engineering) |
Prior Infrastructure | On-premises LDC (Legacy Data Centre) |
Migration Completion | 2026 |
GetAdvantage was at an inflection point. Its legacy infrastructure, originally designed for a fraction of its current transaction volumes, was struggling to keep pace with user growth, product expansion, and increasingly stringent regulatory requirements. Leadership had identified that technology constraints were directly impeding the company's ability to launch new products, respond to market opportunities, and compete against cloud-native fintech challengers.
The board approved a cloud-first transformation strategy with four strategic objectives:
• Eliminate infrastructure bottlenecks that were constraining product velocity
• Reduce total cost of ownership (TCO) while converting capital expenditure to operational expenditure
• Achieve and maintain continuous regulatory compliance (PCI-DSS Level 1, RBI Cloud Guidelines)
• Build a scalable, resilient platform capable of 10x transaction volume growth without architectural rework
3. Understanding the Pain Points
The discovery exercise surfaced six primary pain point categories, each with measurable business impact:
Pain Point 1 — Scalability Constraints and Capacity Ceiling
GetAdvantage's LDC environment was provisioned for a peak capacity of approximately 400 transactions per second (TPS). During festival sale periods, payment campaigns, and month-end billing cycles, the platform regularly exceeded this threshold, triggering application timeouts, failed transactions, and customer-facing errors. Provisioning additional capacity required hardware procurement cycles of 6–8 weeks, by which point the demand spike had already passed.
• Average transaction failure rate during peak periods: 8.4%
• Estimated revenue leakage from failed peak-period transactions: INR 22 crore per annum
• Infrastructure utilisation at off-peak: < 20% (severe overprovisioning for base capacity)
Pain Point 2 — Availability and Reliability Deficits
The LDC environment achieved approximately 97.8% uptime, equivalent to roughly 193 hours of downtime per year (2.2% of 8,760 hours). A significant portion of this was attributable to planned maintenance windows (typically 4-hour Sunday outages for patching), which required full-service blackouts. The company had no true high-availability architecture; a failure of a primary database server required a manual failover taking 45–90 minutes.
• Planned downtime per year: ~96 hours (bi-weekly 4-hour maintenance windows)
• Unplanned outage events in the prior 12 months: 11 incidents (including 4 P1 severity)
• Average MTTR for P1 incidents: 3.2 hours
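The link between an availability percentage and annual downtime is simple arithmetic. The sketch below shows the conversion, assuming a 365-day year; the function name is ours, for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def annual_downtime_hours(availability_pct: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Compare the legacy availability with common cloud targets.
    for pct in (97.8, 99.9, 99.99):
        print(f"{pct}% uptime -> {annual_downtime_hours(pct):.1f} h downtime/year")
```

The same function can be used to sanity-check any availability target quoted in an SLA.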
Pain Point 3 — Security Vulnerabilities and Compliance Exposure
The legacy environment accumulated significant security technical debt. Patch management was manual and inconsistent — the team of 3 operations engineers was responsible for patching over 140 servers, resulting in average patch lag of 47 days from release to application. Three servers were found to be running operating systems beyond end-of-life with no vendor security support. The company faced an upcoming PCI-DSS Level 1 audit with a realistic risk of non-compliance findings that could result in payment network sanctions.
• Average patch lag: 47 days
• End-of-life OS instances: 3 production servers
• PCI-DSS scope: entire LDC floor (flat network, broad cardholder data environment)
• Annual cost of manual compliance activities: ~INR 1.8 crore (external auditors + internal effort)
Pain Point 4 — Disaster Recovery and Business Continuity Gaps
GetAdvantage's disaster recovery posture was critically inadequate for a regulated fintech. The DR strategy consisted of nightly tape backups stored at a secondary co-location facility 12 km away. In a DR drill, the team achieved an actual recovery time of 31 hours and a recovery point of 24 hours, far outside the 4-hour Recovery Time Objective (RTO) and 1-hour Recovery Point Objective (RPO) mandated by RBI business continuity guidelines for payment system operators.
• Tested RTO: 31 hours | RBI-mandated RTO: 4 hours
• Tested RPO: 24 hours | RBI-mandated RPO: 1 hour
• Last DR drill: 14 months prior to engagement (outside the annual drill requirement)
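The gap between drill results and the mandated objectives can be expressed as a small check. This is a minimal sketch using the figures above; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DrillResult:
    """Measured outcome of a DR drill, in hours."""
    recovery_time_h: float
    data_loss_window_h: float


def dr_gaps(result: DrillResult, rto_h: float, rpo_h: float) -> Dict[str, float]:
    """Return how far (in hours) a drill exceeds each objective; 0.0 means the objective is met."""
    return {
        "rto_gap_h": max(0.0, result.recovery_time_h - rto_h),
        "rpo_gap_h": max(0.0, result.data_loss_window_h - rpo_h),
    }


# LDC drill figures versus the RBI-mandated objectives quoted above.
ldc_drill = DrillResult(recovery_time_h=31.0, data_loss_window_h=24.0)
gaps = dr_gaps(ldc_drill, rto_h=4.0, rpo_h=1.0)
print(gaps)  # prints {'rto_gap_h': 27.0, 'rpo_gap_h': 23.0}
```

A result with both gaps at 0.0 indicates that the drill met the stated objectives.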
Pain Point 5 — Engineering Velocity and Deployment Friction
The deployment process required formal change advisory board (CAB) approval, a 2-week change freeze, manual server configuration, and a 6–8 week hardware lead time for any environment changes. Developers spent significant time maintaining and troubleshooting infrastructure rather than building product features. There was no CI/CD pipeline; deployments were executed manually via SSH. The engineering team estimated that approximately 40% of their time was consumed by infrastructure maintenance activities.
• Deployment cycle: 6–8 weeks (hardware) to 5 days (software-only changes)
• Engineering time spent on infrastructure maintenance: ~40%
• Number of production environments: 1 (no staging parity, limited testing fidelity)
Pain Point 6 — Rising Infrastructure Cost and Capital Lock-in
The LDC environment required a hardware refresh cycle every 3–5 years, with the next refresh estimated at INR 8.5 crore in capital expenditure. Data centre co-location fees, hardware maintenance contracts, and software licensing totalled approximately INR 10.4 crore per annum. This capital-intensive model prevented investment in product and engineering capability, and created stranded assets with limited residual value.
• Annual LDC total cost (co-location + hardware maintenance + software licensing): ~INR 10.4 crore (~USD 1.2M)
• Upcoming hardware refresh capex: ~INR 8.5 crore
• Hardware utilisation efficiency: < 35% average across fleet
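To compare the LDC cost base with a pure-opex model, the refresh capex can be folded into an annual figure. The sketch below assumes straight-line spreading of the refresh cost over the refresh interval, which is our simplification and not the TCO model used in the engagement.

```python
ANNUAL_RUN_COST_CR = 10.4  # INR crore per year (co-location, maintenance, licensing)
REFRESH_CAPEX_CR = 8.5     # INR crore per hardware refresh


def annualised_ldc_cost(refresh_interval_years: int) -> float:
    """Annual run cost plus refresh capex spread evenly over the refresh interval."""
    if refresh_interval_years <= 0:
        raise ValueError("refresh_interval_years must be positive")
    return ANNUAL_RUN_COST_CR + REFRESH_CAPEX_CR / refresh_interval_years


if __name__ == "__main__":
    # The refresh cycle is quoted as 3 to 5 years, so show both ends of the range.
    for years in (3, 5):
        print(f"{years}-year refresh: ~INR {annualised_ldc_cost(years):.2f} crore/year")
```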
4. Our Solution — What We Offered and Why
Our solution recommendation was built around four non-negotiable principles derived directly from GetAdvantage's business requirements and regulatory context:
Principle 1 — Compliance First
Every architectural decision prioritised regulatory compliance. For a PCI-DSS Level 1 and RBI-regulated entity, non-compliant architecture is not a technical deficiency — it is an existential business risk. Security and compliance controls were designed in from day one, not retrofitted.
Principle 2 — Modernise While Migrating
Rather than a pure lift-and-shift (which would preserve legacy anti-patterns in a new environment), we designed a phased approach that used the migration as an opportunity to modernise. Applications were re-architected to microservices and containerisation where the benefit justified the effort.
Principle 3 — Cloud-Native Resilience
We designed for failure at every layer. Multi-AZ deployments, automated failover, managed services with built-in HA, and cross-region disaster recovery ensured the platform could meet and exceed RBI BCP requirements.
Principle 4 — Measurable ROI
Every service selection was justified against a cost model. We leveraged AWS Migration Acceleration Program (MAP) funding (credits and professional services), Reserved Instance pricing, Savings Plans, and right-sizing to ensure the migration delivered a clear and quantifiable financial benefit within 12 months.
We applied the AWS 7Rs migration strategy to GetAdvantage's application portfolio, categorising each of the 47 identified workloads:
Strategy | Workloads | Applications | Rationale |
|---|---|---|---|
Rehost (Lift & Shift) | 18 | Internal tools, monitoring agents, legacy batch jobs | Low migration risk; benefit from cloud infra immediately |
Replatform | 12 | MySQL databases → Aurora, Tomcat → EKS managed containers | Minimal code change; significant operational improvement |
Refactor / Re-architect | 9 | Core payments API, lending engine, auth service | Highest business value; microservices unlock scalability |
Repurchase | 5 | CRM → Salesforce, HR system → Workday, email → SES | SaaS replacement eliminates maintenance overhead |
Retire | 3 | Redundant reporting tools, legacy notification service | No active users; safe decommission |
Retain | 0 | N/A | All workloads assessed as cloud-suitable |
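The portfolio split above can be checked and summarised with a few lines of code. This is an illustrative tally of the table's workload counts; the helper name is ours.

```python
from typing import Dict

# Workload counts per migration strategy, taken from the table above.
PORTFOLIO: Dict[str, int] = {
    "Rehost": 18,
    "Replatform": 12,
    "Refactor": 9,
    "Repurchase": 5,
    "Retire": 3,
    "Retain": 0,
}
TOTAL_WORKLOADS = 47


def strategy_shares(portfolio: Dict[str, int]) -> Dict[str, float]:
    """Return each strategy's share of the portfolio as a percentage (one decimal)."""
    total = sum(portfolio.values())
    return {name: round(count / total * 100, 1) for name, count in portfolio.items()}


# Guard against a miscounted portfolio before reporting shares.
assert sum(PORTFOLIO.values()) == TOTAL_WORKLOADS
for name, share in strategy_shares(PORTFOLIO).items():
    print(f"{name:<11} {PORTFOLIO[name]:>2} workloads ({share}%)")
```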
Service selection was based on three criteria: fitness-for-purpose for fintech workloads, compliance certification coverage (PCI-DSS, SOC 2, ISO 27001), and total cost of ownership. The following explains our rationale for key service choices:
AWS Service | Selection Rationale |
|---|---|
Amazon Aurora PostgreSQL (Multi-AZ) | Chosen over self-managed PostgreSQL on EC2 because Aurora offers up to 3x the throughput of standard PostgreSQL (per AWS's published figures), automated Multi-AZ failover in < 30 seconds, automated patching, and built-in encryption. For a fintech processing real-money transactions, Aurora's ACID guarantees and < 35ms read latency were essential. |
Amazon EKS (Kubernetes) | Chosen over EC2 Auto Scaling groups because GetAdvantage's roadmap required microservices. EKS provides managed Kubernetes control plane, native integration with ALB and IAM, and supports GitOps-based deployment workflows. The container abstraction also improved development environment parity. |
AWS Lambda + Step Functions | Chosen for batch processing (settlement runs, statement generation) because these workloads are inherently event-driven and irregular. Lambda eliminates idle compute cost; Step Functions provides durable, auditable workflow orchestration — critical for financial batch processes that must be recoverable on failure. |
Amazon MSK (Managed Kafka) | Chosen for event streaming because GetAdvantage required an immutable, ordered, replayable audit trail of all transactions, a key PCI-DSS and RBI requirement. MSK provides Kafka without the operational burden of managing ZooKeeper, broker scaling, and partition rebalancing. |
AWS KMS + Secrets Manager | PCI-DSS Requirement 3.4 requires strong cryptographic key management. KMS provides FIPS 140-2 validated HSMs and supports automatic key rotation (annual by default). Secrets Manager eliminates hardcoded credentials, a significant vulnerability found during the security audit of the LDC environment. |
Amazon GuardDuty + Security Hub | Replaces manual log review with ML-based continuous threat detection. GuardDuty analyses VPC Flow Logs, DNS logs, and CloudTrail events. Security Hub aggregates findings across services and maps them to compliance frameworks (PCI-DSS, CIS Benchmarks), providing a single audit-ready security posture view. |
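To illustrate how Secrets Manager replaces hardcoded credentials, the sketch below fetches database credentials at runtime. It assumes the secret is stored as a JSON string with `username` and `password` keys; the secret name and the in-memory fake client are hypothetical and exist only so the example runs without AWS access. In a real deployment the client would be `boto3.client("secretsmanager")`, whose `get_secret_value` call returns the payload under `SecretString`.

```python
import json
from typing import Any, Dict


def load_db_credentials(secrets_client: Any, secret_id: str) -> Dict[str, str]:
    """Fetch a JSON secret at runtime instead of embedding credentials in code or config."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {"username": secret["username"], "password": secret["password"]}


class FakeSecretsClient:
    """In-memory stand-in for the Secrets Manager client (illustration only)."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = secrets

    def get_secret_value(self, SecretId: str) -> Dict[str, str]:
        return {"SecretString": self._secrets[SecretId]}


# Hypothetical secret name and placeholder values.
fake = FakeSecretsClient(
    {"prod/payments/db": json.dumps({"username": "app_user", "password": "example-only"})}
)
creds = load_db_credentials(fake, "prod/payments/db")
print(creds["username"])  # prints app_user
```

Because the application only holds a secret identifier, rotating the credential does not require a code change or redeployment.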
During the solution design phase, several alternative approaches were evaluated and rejected. We document these decisions transparently for the auditor's review:
Alternative Considered | Reason for Rejection |
|---|---|
Pure Lift-and-Shift (all Rehost) | Rejected because it would have reproduced the scalability and availability anti-patterns of the LDC environment on EC2. Peak-period failures and deployment rigidity would have persisted. The cost savings would have been lower (no managed service efficiency gains), and compliance scope would have remained broad. |
Multi-Cloud (AWS + Azure) | Rejected because GetAdvantage's team lacked multi-cloud operational expertise, and distributing workloads across clouds would have increased operational complexity without proportionate benefit. A single-cloud strategy enables consistent security controls, unified billing, and cohesive IAM, all critical for a regulated entity. |
Private Cloud (VMware / OpenStack on new hardware) | Rejected after TCO modelling showed that a new private cloud build would require INR 12–15 crore in capital expenditure with a 5-year depreciation cycle, and would not resolve the scalability or DR shortcomings. It would also not provide the managed compliance tooling (GuardDuty, Config, Security Hub) that significantly reduces audit effort. |
Colocation Upgrade (remain in LDC, upgrade hardware) | Rejected because it addresses only the hardware refresh need and leaves all other pain points (scalability ceiling, DR, compliance, deployment velocity) unresolved. The INR 8.5 crore capex would have been committed with no architectural improvement. |
Serverless-First Architecture | Evaluated but scoped only to appropriate workloads (batch processing, event-driven functions). Core banking APIs were not recommended for full serverless migration because the cold-start latency characteristics of Lambda are unsuitable for synchronous payment APIs requiring < 100ms P99 response times. |
5. Target Architecture on AWS
The target architecture implements a defence-in-depth, multi-layered design across three AWS Availability Zones within the ap-south-1 (Mumbai) primary region, with a warm standby configuration in ap-southeast-1 (Singapore) for disaster recovery. All data classified as sensitive financial data is encrypted at rest (AES-256 via KMS) and in transit (TLS 1.2+).
The architecture is organised into four logical tiers: Public (internet-facing), Application (compute), Data (storage and databases), and Management (security, observability, compliance). All inter-tier communication traverses private subnets within the VPC; no application tier component has a public IP address.
Layer | Component | AWS Service | Purpose |
|---|---|---|---|
Internet | Client Apps / Web Portal | Amazon CloudFront + WAF | CDN, DDoS protection, geo-restriction |
DNS | Domain Routing | Amazon Route 53 | Latency-based DNS, health checks, failover |
Load Balancing | Traffic Distribution | Application Load Balancer (ALB) | Layer 7 routing, SSL termination |
Compute | Core Banking APIs | Amazon EKS (Kubernetes) | Containerised microservices, auto-scaling |
Compute | Batch Processing | AWS Lambda + Step Functions | Serverless transaction processing, workflows |
Data — Primary | OLTP Transactions | Amazon Aurora PostgreSQL (Multi-AZ) | ACID-compliant, <35ms latency, 99.99% SLA |
Data — Cache | Session & Rate Data | Amazon ElastiCache (Redis) | Sub-millisecond cache, rate-limit counters |
Data — Analytics | Reporting & BI | Amazon Redshift + QuickSight | DWH, dashboards, regulatory reports |
Storage | Documents & Statements | Amazon S3 + Glacier | Object storage, tiered archival (7-year retention) |
Messaging | Event Streaming | Amazon MSK (Kafka) | Real-time transaction events, audit trail |
Security | Secrets & Keys | AWS KMS + Secrets Manager | Envelope encryption, PCI-DSS key rotation |
Security | Identity & Access | AWS IAM + IAM Identity Center (formerly AWS SSO) | Least-privilege, MFA, federated login |
Security | Threat Detection | Amazon GuardDuty + Security Hub | ML-based anomaly detection, SIEM integration |
Compliance | Audit Logging | AWS CloudTrail + Config | Immutable API audit trail, drift detection |
Observability | Monitoring & Alerts | Amazon CloudWatch + X-Ray | APM, distributed tracing, auto-remediation |
Network | Private Connectivity | AWS VPC + PrivateLink + Direct Connect | Isolated network, private endpoints, dedicated link |
DR / BCP | Disaster Recovery | AWS Backup + Cross-Region Replication | RPO <15 min, RTO <1 hr, secondary region |
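For the batch-processing row above, recoverability on failure is typically expressed in the Step Functions state machine definition itself. The following is a minimal sketch of an Amazon States Language definition built as a Python dictionary; the state names, retry settings, and Lambda ARNs are hypothetical and do not reflect GetAdvantage's actual workflows.

```python
import json
from typing import Any, Dict


def settlement_state_machine(validate_arn: str, settle_arn: str, notify_arn: str) -> Dict[str, Any]:
    """Build an Amazon States Language definition with retry and failure handling."""
    return {
        "Comment": "Hypothetical settlement run with retries and a failure handler",
        "StartAt": "ValidateBatch",
        "States": {
            "ValidateBatch": {"Type": "Task", "Resource": validate_arn, "Next": "SettleBatch"},
            "SettleBatch": {
                "Type": "Task",
                "Resource": settle_arn,
                # Retry failed attempts with exponential backoff: 30 s, 60 s, 120 s.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                # If retries are exhausted, route to an explicit failure path.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {"Type": "Task", "Resource": notify_arn, "Next": "MarkFailed"},
            "MarkFailed": {
                "Type": "Fail",
                "Error": "SettlementFailed",
                "Cause": "Settlement batch failed after retries",
            },
        },
    }


# Placeholder ARNs (the account ID and function names are made up).
definition = settlement_state_machine(
    "arn:aws:lambda:ap-south-1:123456789012:function:validate-batch",
    "arn:aws:lambda:ap-south-1:123456789012:function:settle-batch",
    "arn:aws:lambda:ap-south-1:123456789012:function:notify-failure",
)
print(json.dumps(definition, indent=2))
```

Keeping retries and the failure path in the definition means every execution leaves a step-by-step history that can be reviewed during an audit.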
The VPC is segmented into four subnet tiers across three AZs, complemented by VPC endpoints and a Direct Connect link:
• Public Subnets (3 AZs): NAT Gateways, Application Load Balancers — no application workloads
• Application Subnets (3 AZs): EKS worker nodes, Lambda functions — no direct internet access
• Data Subnets (3 AZs): Aurora, ElastiCache, MSK — accessible only from Application subnets
• Management Subnets: Bastion hosts (SSM Session Manager preferred), monitoring agents
• VPC Endpoints: S3, DynamoDB, SSM, Secrets Manager, KMS — traffic never leaves AWS network
• AWS Direct Connect: Dedicated 1 Gbps private connectivity from GetAdvantage's offices to AWS
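The tiered subnet layout can be planned programmatically before it is codified in infrastructure-as-code. The sketch below carves a VPC range into one subnet per tier per AZ. The CIDR block `10.20.0.0/16`, the /20 subnet size, and the choice to spread the management tier across all three AZs are assumptions made for illustration.

```python
import ipaddress
from typing import Dict, List, Tuple

VPC_CIDR = "10.20.0.0/16"  # assumed address range
TIERS = ["public", "application", "data", "management"]
AZS = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]

SubnetPlan = Dict[Tuple[str, str], ipaddress.IPv4Network]


def plan_subnets(vpc_cidr: str, tiers: List[str], azs: List[str], new_prefix: int = 20) -> SubnetPlan:
    """Assign one non-overlapping subnet to every (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    slots = [(tier, az) for tier in tiers for az in azs]
    if len(slots) > len(blocks):
        raise ValueError("VPC range is too small for the requested subnets")
    return dict(zip(slots, blocks))


plan = plan_subnets(VPC_CIDR, TIERS, AZS)
for (tier, az), cidr in plan.items():
    print(f"{tier:<12} {az}  {cidr}")
```

With these assumptions the plan uses 12 of the 16 available /20 blocks, leaving headroom for additional tiers such as dedicated PCI-scope subnets.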
Security is implemented as a layered control framework aligned to the AWS Shared Responsibility Model and PCI-DSS requirements:
Control Layer | Implementation |
|---|---|
Perimeter Security | AWS WAF with managed rule groups (OWASP Top 10, known bad IPs, SQL injection, XSS). AWS Shield Standard. CloudFront geo-restriction. Rate limiting at ALB and WAF layers. |
Identity and Access | IAM with least-privilege role-based access. AWS IAM Identity Center (formerly AWS SSO) with MFA enforcement. No long-lived IAM access keys; all compute uses IAM roles. Service accounts use IRSA (IAM Roles for Service Accounts) on EKS. |
Data Protection | All data encrypted at rest with customer-managed KMS keys. TLS 1.2+ enforced for all in-transit data. RDS encryption, S3 SSE-KMS, and EBS encryption are enabled by default, with AWS Config rules flagging any unencrypted resource. |
Threat Detection | GuardDuty enabled across all accounts and regions. CloudTrail with log file integrity validation. VPC Flow Logs. Security Hub with PCI-DSS and CIS Benchmark standards enabled. |
Vulnerability Management | Amazon Inspector for continuous EC2 and container image vulnerability scanning. Automated patching via AWS Systems Manager Patch Manager. ECR image scanning on push. |
Incident Response | Security Hub findings integrated with PagerDuty and Slack. Automated remediation playbooks via Lambda for common finding types (e.g., open S3 bucket — auto-block). SOC team has Security Hub SIEM integration. |
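The automated remediation playbook for an open S3 bucket can be sketched as a Lambda handler. The example assumes the function is triggered by a Security Hub finding event whose `detail.findings[].Resources[]` entries carry the bucket ARN; the handler name and the injected client parameter are our own conventions, added so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, List, Optional

S3_ARN_PREFIX = "arn:aws:s3:::"


def bucket_names_from_event(event: Dict[str, Any]) -> List[str]:
    """Extract S3 bucket names from the resources listed in a finding event."""
    names: List[str] = []
    for finding in event.get("detail", {}).get("findings", []):
        for resource in finding.get("Resources", []):
            arn = resource.get("Id", "")
            if resource.get("Type") == "AwsS3Bucket" and arn.startswith(S3_ARN_PREFIX):
                names.append(arn[len(S3_ARN_PREFIX):])
    return names


def block_public_access(s3_client: Any, bucket: str) -> None:
    """Turn on all four S3 Block Public Access settings for one bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


def handler(event: Dict[str, Any], context: Any, s3_client: Optional[Any] = None) -> Dict[str, List[str]]:
    """Lambda entry point: remediate every S3 bucket named in the finding."""
    if s3_client is None:
        import boto3  # imported lazily so the sketch can be tested with a fake client

        s3_client = boto3.client("s3")
    buckets = bucket_names_from_event(event)
    for bucket in buckets:
        block_public_access(s3_client, bucket)
    return {"remediated": buckets}
```

A production playbook would also record the action against the finding and notify the on-call channel; those steps are omitted here.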
Cardholder Data Environment (CDE) scope has been dramatically reduced compared to the LDC. In the LDC, the flat network meant the entire data centre floor was in scope for PCI-DSS. On AWS, the CDE is limited to specific Aurora database instances and the microservices that process payment card data, all within dedicated PCI-scope subnets with additional NACLs and Security Group restrictions. All other workloads are out of PCI scope, reducing audit surface by approximately 75%.
6. Migration Execution — The MAP Journey
MAP Phase | Duration | Key Activities | Deliverable |
|---|---|---|---|
Phase 1: Assess | Weeks 1–4 | Discovery, portfolio analysis, TCO modelling, readiness scoring | Migration Readiness Assessment (MRA), TCO Report |
Phase 2: Mobilise | Weeks 5–12 | Landing zone setup, AWS Control Tower, IaC development, team training | Baseline AWS environment, CI/CD pipelines, runbooks |
Phase 3: Migrate (Wave 1) | Weeks 13–20 | Non-production workloads, dev/test environments, rehost workloads | 18 workloads migrated, validated in AWS |
Phase 4: Migrate (Wave 2) | Weeks 21–30 | Core business applications, database migration, replatform workloads | Payments API, lending engine, Aurora migration |
Phase 5: Migrate (Wave 3) | Weeks 31–36 | Final production cutover, LDC decommission, refactored applications | 100% workloads on AWS, LDC terminated |
Phase 6: Optimise | Ongoing | Cost optimisation, rightsizing, Savings Plans, performance tuning | Monthly Well-Architected Reviews, cost reports |
Production database cutover used AWS Database Migration Service (DMS) with Change Data Capture (CDC) to achieve near-zero-downtime migration. The cutover window for each database was under 45 minutes, with DMS maintaining replication until the application was validated on Aurora. The payments API cutover used a blue-green deployment with weighted Route 53 routing — traffic was shifted 10% → 25% → 50% → 100% over 4 hours with automated rollback triggers based on error rate CloudWatch alarms.
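The staged traffic shift with automated rollback follows a simple control loop, sketched below. The two callables stand in for the Route 53 weight update and the CloudWatch error-rate reading; the 1% rollback threshold and the no-op soak period are assumptions for illustration.

```python
from typing import Callable, List, Sequence


def shift_traffic(
    set_green_weight: Callable[[int], None],
    current_error_rate: Callable[[], float],
    steps: Sequence[int] = (10, 25, 50, 100),
    max_error_rate: float = 0.01,  # assumed rollback threshold (1%)
    soak: Callable[[], None] = lambda: None,  # placeholder for the wait between steps
) -> int:
    """Shift traffic to the new (green) stack step by step.

    Returns the final green weight: the last step on success, 0 after a rollback.
    """
    final = 0
    for pct in steps:
        set_green_weight(pct)
        soak()
        if current_error_rate() > max_error_rate:
            set_green_weight(0)  # rollback: send all traffic back to the old (blue) stack
            return 0
        final = pct
    return final


# Simulated run: the error rate spikes at the third step, triggering a rollback.
applied: List[int] = []
readings = iter([0.001, 0.002, 0.05])
result = shift_traffic(applied.append, lambda: next(readings))
print(applied, result)  # prints [10, 25, 50, 0] 0
```

Keeping the decision logic separate from the AWS calls makes the rollback behaviour easy to rehearse before a live cutover.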
7. Benefits Realised — Before and After Comparison
Metric | Before (LDC) | After (AWS) | Impact |
|---|---|---|---|
Infrastructure Cost | ~$1.2M/yr (co-location, maintenance, licensing) plus periodic hardware refresh capex | ~$480K/yr opex | ~60% cost reduction; no stranded hardware |
System Uptime / Availability | 97.8% (approx. 193 hrs downtime/yr) | 99.99% (< 1 hr/yr) | Eliminated planned maintenance windows |
Deployment Cycle | 6–8 weeks | Same-day CI/CD | 10x faster feature delivery |
Provisioning Time | 6–8 weeks (hardware procurement) | < 15 minutes (IaC) | Rapid scaling for product launches |
Disaster Recovery RTO | 24–48 hours (tape restore) | < 1 hour (automated) | Business continuity for critical fintech ops |
Disaster Recovery RPO | 24 hours (nightly backup) | < 15 minutes | Near-zero data loss exposure |
Security Incidents | 4 critical incidents/yr (patching lag) | 0 critical incidents post-migration | Automated patching, GuardDuty ML detection |
Transaction Processing Latency | 380–600 ms (avg) | < 50 ms (avg) | At least ~8x faster; improved customer experience |
Regulatory Compliance Coverage | Manual audits, 60+ days prep | Continuous — AWS Config + CloudTrail | Audit-ready posture; PCI-DSS & RBI compliant |
Engineering Productivity | 40% time on infra maintenance | < 10% infra overhead | Freed roughly 30 percentage points of engineering time for product innovation |
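The headline ratios in the table can be reproduced from the before and after values. The sketch below uses the table's cost and latency figures; because the post-migration latency is quoted as an upper bound (< 50 ms), the computed speed-ups are lower bounds.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


cost_cut = pct_reduction(1_200_000, 480_000)  # annual infrastructure cost, USD
latency_low = speedup(380, 50)   # best-case legacy latency vs 50 ms
latency_high = speedup(600, 50)  # worst-case legacy latency vs 50 ms

print(f"Cost reduction: {cost_cut:.0f}%")
print(f"Latency speed-up: {latency_low:.1f}x to {latency_high:.1f}x")
```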
Regulatory Confidence
GetAdvantage successfully passed its post-migration PCI-DSS Level 1 assessment with zero non-compliance findings — the first clean audit in the company's history. The continuous compliance posture provided by AWS Config rules, CloudTrail, and Security Hub means the company is now audit-ready at all times rather than scrambling for 60 days prior to each assessment.
Product and Engineering Velocity
The migration to EKS with full CI/CD pipelines transformed the engineering organisation. The team shipped 3 new product features in the first 90 days post-migration — more than the total shipped in the prior 12 months. Developer satisfaction scores (measured via internal survey) improved from 54 to 81 out of 100. The ability to spin up isolated feature environments on demand using Terraform eliminated the 'works on my machine' class of production incidents entirely.
Customer Experience
Transaction latency improvements directly improved customer-facing payment success rates. The mobile app payment completion rate increased from 91.6% to 99.3% within 60 days of the payments API cutover. App store ratings improved from 3.8 to 4.6 stars, with customers specifically citing improved speed and reliability in reviews. Customer-reported payment failures dropped by over 90%.
Resilience and Business Continuity
GetAdvantage has now successfully completed two DR drills since migration, achieving a recovery time of 47 minutes and a recovery point of 8 minutes, well within RBI BCP requirements. The multi-AZ architecture absorbed two AZ-level network disruptions in the ap-south-1 region during the observation period without any customer-facing impact, demonstrating real-world resilience that was impossible on the single-site LDC.