talk-data.com

Speaker

Satej Kumar Sahu

Principal Data Engineer Zalando SE

Satej is a Principal Data Engineer with over 14 years of industry experience. He has worked with renowned organizations such as Boeing, Adidas, and Honeywell, specializing in architecture, big data, and machine learning use cases. With a strong track record of architecting scalable and efficient systems, Satej has successfully delivered data-driven and applied ML solutions. He is also the author of two programming books, "Building Secure PHP Applications" and "PHP 8 Basics: For Programming and Web Development", published by Apress / Springer, with another in the pipeline on Generative AI Applications in Java.

Bio from: Data + AI Summit 2025

Talks & appearances

Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data

Testing Spark jobs in local environments is often difficult because suitable datasets are lacking, especially under tight timelines. As a result, jobs may work in development clusters but fail in production, or run locally but encounter issues in staging clusters because of inadequate documentation or checks. In this session, we’ll discuss how these challenges can be overcome by leveraging Generative AI to create custom synthetic datasets for local testing. A testing framework that incorporates variations and sampling can address some of these challenges by generating realistic data for performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance, and recognize scalability issues while keeping costs low. This methodology fosters better deployment practices and enhances the reliability of Spark jobs across environments.
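
The abstract does not describe the framework itself, so the following is only a minimal sketch of the "variations and sampling" idea, not the speaker's implementation. It swaps the Generative AI step for a seeded random generator so that it runs locally with no external service. The schema, the function names (`synthetic_orders`, `sample_rows`), and the parameters (`null_rate`, `hot_key_share`) are all hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical schema values for illustration; the talk does not specify a schema.
COUNTRIES = ["DE", "FR", "IT", "ES"]


def synthetic_orders(n, seed=0, null_rate=0.05, hot_key_share=0.5):
    """Generate n synthetic order rows that include deliberate edge cases.

    null_rate:      probability that a row has a missing amount (variation).
    hot_key_share:  probability that a row uses customer_id 1, which mimics
                    the key skew that often surfaces only at production scale.
    """
    rng = random.Random(seed)  # seeded so test data is reproducible
    rows = []
    for i in range(n):
        if rng.random() < hot_key_share:
            customer_id = 1  # skewed "hot" key
        else:
            customer_id = rng.randint(2, 1000)
        if rng.random() < null_rate:
            amount = None  # missing value
        else:
            amount = round(rng.uniform(1, 500), 2)
        rows.append(
            {
                "order_id": i,
                "customer_id": customer_id,
                "country": rng.choice(COUNTRIES),
                "amount": amount,
            }
        )
    return rows


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random sample for quick local test runs."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


full = synthetic_orders(1000, seed=42)          # larger set for load-style checks
small = sample_rows(full, fraction=0.1, seed=42)  # small slice for fast unit tests
```

Rows in this shape could then be loaded into a local Spark session, for example with `spark.createDataFrame`, so the same job can be exercised against both the small sample and the larger skewed set before it reaches a shared cluster.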