Executive Summary
This ETL solution is built for a customer's EdTech Learning Management System. It delivers AI-enabled content based on the study material prescribed by an Indian education board. The platform is tutor-free, self-paced, and gamified, and it adapts to a student's learning style, mood, and attention span. The solution uses student engagement data: daily logins, content usage against schedules and timelines, quiz scores, objective and subjective test scores, and overall activity logs.
The primary objective is to deliver adaptive learning, so that each student learns according to their individual grasping ability, with an AI chatbot integrated into the platform as a constant companion.
The Challenge
The earlier EdTech system provided one-size-fits-all lesson plans, so every student was given the same videos, worksheets, and quizzes regardless of their assessment performance. The system struggled to address the gap in learning pace between fast and slow learners, and teachers had to group students manually or create multiple difficulty versions of the same content. It relied on static videos and slides with minimal interactivity, and it had no real-time or near-real-time feedback mechanism for gauging student attention and engagement. The content team built lessons, quizzes, and summaries for every subject by hand.
Solution
The solution is designed to overcome these challenges by adding a centralized Performance Analytics system to the EdTech learning platform, using student engagement data such as daily logins, quiz scores, content usage, and overall activity logs. AWS Glue handles data ingestion, transformation, cataloging, management, and integration.
AWS Glue collects the raw data from the EdTech platform, which is stored on Amazon S3. The architecture relies on a unified AWS Glue Data Catalog to maintain a single source of truth for schemas, thereby avoiding data silos. To optimize query performance and cost, the Catalog tables are partitioned by school_id, academic_year, and class_id, so that Athena queries filtering on those columns scan only the matching partitions.
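The partitioning scheme above can be sketched as follows. This is a minimal illustration, assuming Hive-style `key=value` folder names on S3; the bucket, table name, and the `student_id` and `quiz_score` columns are hypothetical and do not come from the actual solution.

```python
# Partition columns named in the text, in the order they nest on S3.
PARTITION_KEYS = ("school_id", "academic_year", "class_id")


def _require_all_keys(values: dict) -> None:
    """Fail early if any partition column has no value."""
    missing = [key for key in PARTITION_KEYS if key not in values]
    if missing:
        raise ValueError(f"missing partition values: {missing}")


def partition_prefix(base_uri: str, **values: str) -> str:
    """Build a Hive-style S3 prefix, e.g. .../school_id=S001/academic_year=2024/..."""
    _require_all_keys(values)
    parts = [f"{key}={values[key]}" for key in PARTITION_KEYS]
    return base_uri.rstrip("/") + "/" + "/".join(parts) + "/"


def pruned_query(table: str, **values: str) -> str:
    """Build an Athena SQL string that filters on every partition column.

    Column names in the SELECT list are hypothetical. Values are escaped
    by doubling single quotes, which is enough for this illustration only.
    """
    _require_all_keys(values)
    predicates = " AND ".join(
        "{} = '{}'".format(key, values[key].replace("'", "''"))
        for key in PARTITION_KEYS
    )
    return f"SELECT student_id, quiz_score FROM {table} WHERE {predicates}"


if __name__ == "__main__":
    keys = {"school_id": "S001", "academic_year": "2024", "class_id": "10A"}
    print(partition_prefix("s3://example-bucket/processed/engagement", **keys))
    print(pruned_query("engagement", **keys))
```

Because every predicate in the WHERE clause is on a partition column, Athena can skip all S3 prefixes that do not match, which is where the reduction in scanned data comes from.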
Schema discovery for dynamic EdTech datasets is achieved using Glue Crawlers. A multi-crawler strategy provides performance isolation and cost control, leveraging Amazon EventBridge for event-based triggers and run-on-demand execution. Crawlers use explicit S3 paths and enable partition discovery to automatically register new partitions created by ETL jobs.
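A multi-crawler setup with explicit S3 paths might look like the sketch below. It only builds the keyword arguments for the boto3 Glue client's `create_crawler` call, one crawler per dataset; the bucket, dataset names, role ARN, database name, and naming convention are assumptions made for illustration, and the EventBridge wiring is left out.

```python
def crawler_definitions(bucket, datasets, role_arn, database):
    """Return one create_crawler keyword-argument dict per dataset.

    Each crawler targets a single explicit S3 prefix, so a slow or failing
    crawl on one dataset does not hold up the others.
    """
    definitions = []
    for dataset in datasets:
        definitions.append(
            {
                "Name": f"edtech-{dataset}-crawler",  # hypothetical naming scheme
                "Role": role_arn,
                "DatabaseName": database,
                "Targets": {
                    "S3Targets": [{"Path": f"s3://{bucket}/processed/{dataset}/"}]
                },
            }
        )
    return definitions


def register_crawlers(definitions):
    """Create the crawlers in AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above runs without boto3

    glue = boto3.client("glue")
    for definition in definitions:
        glue.create_crawler(**definition)


if __name__ == "__main__":
    for definition in crawler_definitions(
        bucket="example-bucket",
        datasets=["daily_logins", "quiz_scores"],
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="edtech_analytics",
    ):
        print(definition["Name"], definition["Targets"]["S3Targets"][0]["Path"])
```

A crawler created this way can then be run on demand with the Glue client's `start_crawler(Name=...)` call, or started from an event-driven trigger.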
Transformation logic is implemented with several Glue job types, which convert the data to Parquet format and write it to the processed data store on S3. AWS Glue Studio is used to design and deploy visual ETL jobs that transform raw data into a unified schema. For complex joins and advanced performance metrics, custom PySpark scripts are developed in AWS Glue Notebooks using the Spark DataFrame APIs. Functions such as file type conversion and writing GZIP-compressed output are handled by PySpark Glue jobs.
To keep daily runs efficient, AWS Glue job bookmarks track the student engagement files that have already been processed, enabling incremental data loading. Job performance is tuned by continuously monitoring metrics such as bytes read and written and adjusting DPU capacity when necessary (for example, by switching between the G.1X and G.2X worker types). Glue jobs write the processed data to S3 and then use the Amazon Redshift COPY command to load it from S3 into Redshift.
Amazon Bedrock and other machine learning services are used for adaptive tutoring and automated content creation. They generate natural-language summaries, quiz questions, and interactive tutoring conversations.
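The job settings and Redshift load described above could be expressed roughly as follows. This is a sketch under stated assumptions, not the project's actual configuration: the job name, script location, role ARNs, table name, Glue version, and worker count are all hypothetical, and the processed files are assumed to be Parquet.

```python
def glue_job_definition(name, script_uri, role_arn, worker_type="G.1X", workers=10):
    """Build keyword arguments for the boto3 Glue client's create_job call.

    Job bookmarks are switched on through the --job-bookmark-option
    argument; worker_type is the capacity knob mentioned in the text.
    """
    # Only the two worker types named in the text are accepted in this sketch.
    if worker_type not in ("G.1X", "G.2X"):
        raise ValueError(f"unsupported worker type in this sketch: {worker_type}")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_uri,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumed version
        "WorkerType": worker_type,
        "NumberOfWorkers": workers,
        "DefaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
    }


def redshift_copy_sql(table, s3_prefix, iam_role_arn):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    job = glue_job_definition(
        name="edtech-engagement-daily",
        script_uri="s3://example-bucket/scripts/engagement_job.py",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        worker_type="G.2X",
    )
    print(job["WorkerType"], job["DefaultArguments"])
    print(
        redshift_copy_sql(
            table="analytics.student_engagement",
            s3_prefix="s3://example-bucket/processed/engagement/",
            iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
        )
    )
```

Enabling the bookmark option is only half of the setup. Inside the job script, each source read also needs a `transformation_ctx` value, and the script must end with `job.commit()` so that the bookmark state is saved. The COPY statement carries no separate compression option because the codec is recorded inside each Parquet file.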
Architecture Diagram:
The Results & Benefits
Summary
The AWS Glue solution established a centralized Performance Analytics system for an EdTech platform using student engagement data. A unified Glue Data Catalog maintains a single source of truth for schemas and uses partitions to minimize data scanning and optimize Athena query performance. Glue job bookmarks enable incremental data loads, accelerating daily runs. Jobs are tuned by monitoring metrics and adjusting DPU capacity. Finally, the processed, GZIP-compressed data is loaded from S3 into Amazon Redshift, with Amazon QuickSight used for dashboarding.