talk-data.com

Topic: AWS (Amazon Web Services)

Tags: cloud, cloud provider, infrastructure services

837 activities tagged

Activity Trend: 190 peak/qtr (2020-Q1 to 2026-Q1)

Activities

837 activities · Newest first

The Definitive Guide to OpenSearch

Learn how to harness the power of OpenSearch effectively with 'The Definitive Guide to OpenSearch'. This book explores installation, configuration, query building, and visualization, guiding readers through practical use cases and real-world implementations. Whether you're building search experiences or analyzing data patterns, this guide equips you thoroughly.

What this book will help me do:
- Understand core OpenSearch principles, architecture, and the mechanics of its search and analytics capabilities.
- Learn how to perform data ingestion, execute advanced queries, and produce insightful visualizations on OpenSearch Dashboards.
- Implement scaling strategies and optimal configurations for high-performance OpenSearch clusters.
- Explore real-world case studies that demonstrate OpenSearch applications in diverse industries.
- Gain hands-on experience through practical exercises and tutorials for mastering OpenSearch functionality.

Author(s): Jon Handler, Soujanya Konka, and Prashant Agrawal, celebrated experts in search technologies and big data analysis, bring their years of experience at AWS and other domains to this book. Their collective expertise ensures that readers receive both core theoretical knowledge and practical applications to implement directly.

Who is it for? This book is aimed at developers, data professionals, engineers, and systems operators who work with search systems or analytics platforms. It is especially suitable for individuals in roles handling large-scale data who want to improve their skills or deploy OpenSearch in production environments. Early learners and seasoned experts alike will find valuable insights.
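As a taste of the ingestion-and-query workflow covered in books like this, here is a minimal sketch using the opensearch-py client; the host, credentials, index name, and document fields are illustrative assumptions, not examples from the book.

```python
# Minimal sketch of indexing and querying with opensearch-py.
# Host, credentials, index name, and fields are illustrative assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),  # replace with your credentials
    use_ssl=True,
    verify_certs=False,
)

# Ingest a document into a hypothetical "products" index.
client.index(
    index="products",
    id="1",
    body={"name": "trail shoe", "category": "footwear", "price": 89.0},
    refresh=True,
)

# Run a simple full-text query against the same index.
response = client.search(
    index="products",
    body={"query": {"match": {"name": "shoe"}}},
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_source"])
```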

AWS Certified Data Engineer Associate Study Guide

There's no better time to become a data engineer. And acing the AWS Certified Data Engineer Associate (DEA-C01) exam will help you tackle the demands of modern data engineering and secure your place in the technology-driven future. Authors Sakti Mishra, Dylan Qu, and Anusha Challa equip you with the knowledge and sought-after skills necessary to effectively manage data and excel in your career. Whether you're a data engineer, data analyst, or machine learning engineer, you'll discover the in-depth guidance, practical exercises, sample questions, and expert advice you need to leverage AWS services effectively and achieve certification.

By reading, you'll learn how to:
- Ingest, transform, and orchestrate data pipelines effectively
- Select the ideal data store, design efficient data models, and manage data lifecycles
- Analyze data rigorously and maintain high data quality standards
- Implement robust authentication, authorization, and data governance protocols
- Prepare thoroughly for the DEA-C01 exam with targeted strategies and practices

The Data Hackers News is on the air!! The hottest topics of the week, with the main news from the Data, AI, and Technology space, which you can also find in our weekly newsletter, now on the Data Hackers podcast!! Press play and listen to this week's Data Hackers News now!

To stay on top of everything happening in the data space, subscribe to the weekly newsletter: https://www.datahackers.news/

Meet our Data Hackers News commentators: Monique Femme

Stories/topics discussed: Itaú Meetup event · Netflix story · AWS CEO story

Other Data Hackers channels: Site · LinkedIn · Instagram · TikTok · YouTube

Combining LLMs with enterprise knowledge bases is creating powerful new agents that can transform business operations. These systems dramatically improve on traditional chatbots by understanding context, following conversations naturally, and accessing up-to-date information. But how do you effectively manage the knowledge that powers these agents? What governance structures need to be in place before deployment? And as we look toward a future with physical AI and robotics, what fundamental computing challenges must we solve to ensure these technologies enhance rather than complicate our lives?

Jun Qian is an accomplished technology leader with extensive experience in artificial intelligence and machine learning. Serving as Vice President of Generative AI Services at Oracle since May 2020, Jun founded and leads the Engineering and Science group, focusing on the creation and enhancement of Generative AI services and AI Agents. Previous roles include Vice President of AI Science and Development at Oracle, Head of AI and Machine Learning at Sift, and Principal Group Engineering Manager at Microsoft, where Jun co-founded Microsoft Power Virtual Agents. Jun's career also includes significant contributions as the Founding Manager of Amazon Machine Learning at AWS and as a Principal Investigator at Verizon.

In the episode, Richie and Jun explore the evolution of AI agents, the unique features of ChatGPT, the challenges and advancements in chatbot technology, the importance of data management and security in AI, the future of AI in computing and robotics, and much more.

Links mentioned in the show: Oracle · Connect with Jun · Course: Introduction to AI Agents · Jun at DataCamp RADAR · Related episode: A Framework for GenAI App and Agent Development with Jerry Liu, CEO at LlamaIndex · Rewatch RADAR AI

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for Business.

In this season of the Analytics Engineering podcast, Tristan is deep into the world of developer tools and databases. If you're following us here, you've almost definitely used Amazon S3 and its blob storage siblings. They form the foundation for nearly all data work in the cloud. In many ways, it was the innovations that happened inside of S3 that have unlocked all of the progress in cloud data over the last decade. In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks. They close out by talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Extreme weather events threaten industries and economic stability. NOAA’s National Centers for Environmental Information (NCEI) addresses this through the Industry Proving Grounds (IPG), which modernizes data delivery by collaborating with sectors like re/insurance and retail to develop practical, data-driven solutions. This presentation explores IPG’s technical innovations, including implementing Polars for efficient data processing, AWS for scalability, and CI/CD pipelines for streamlined deployment. These tools enhance data accessibility, reduce latency, and support real-time decision-making. By integrating scientific computing, cloud technology, and DevOps, NCEI improves climate resilience and provides a model for leveraging open-source tools to address global challenges.
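To give a flavor of the kind of efficient, lazy columnar processing Polars enables in pipelines like IPG's, here is a minimal sketch; the file name, columns, and aggregation are illustrative assumptions, not NCEI's actual schema.

```python
# Minimal sketch of lazy, columnar processing with Polars.
# File name and column names are illustrative assumptions, not IPG's schema.
import polars as pl

# Scan lazily so only the needed columns and rows are read from disk.
lazy = (
    pl.scan_parquet("storm_events.parquet")
    .filter(pl.col("event_type") == "Hail")
    .group_by("state")
    .agg(
        pl.len().alias("event_count"),
        pl.col("damage_usd").sum().alias("total_damage_usd"),
    )
    .sort("total_damage_usd", descending=True)
)

# The query plan is optimized and executed only at collect() time.
print(lazy.collect())
```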

In this keynote, Peeyush Rai and Vikram Koka will walk through how Airflow is being used as part of an Agentic AI platform serving insurance companies, which runs on all the major public clouds and leverages models from OpenAI, Google (Gemini), and AWS (Claude via Bedrock). This talk walks through the details of the actual end-user business workflow, including gathering relevant financial data to make a decision, as well as the tricky challenge of handling AI hallucinations with new Airflow capabilities such as “Human in the loop”. This talk offers something for both business and technical audiences. Business users will get a clear view of what it takes to bring an AI application into production and how to align their operations and business teams with an AI-enabled workflow. Meanwhile, technical users will walk away with practical insights on how to orchestrate complex business processes, enabling seamless collaboration between Airflow, AI agents, and humans in the loop.

Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation, and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization. This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15-minute execution limit, and how to mitigate this using multi-executor configuration. Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
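As a rough illustration of the multi-executor mitigation mentioned above, here is a hedged sketch in the style of Airflow 2.10+ multi-executor support: a short task runs on the default (Lambda) executor while a long-running task is routed to a fallback executor. The Lambda executor module path in the comment is an assumption to verify against the Amazon provider documentation.

```python
# Hedged sketch: routing tasks between the AWS Lambda Executor and a fallback
# executor using Airflow's multi-executor support (Airflow 2.10+).
# The executor module path below is an assumption; check the Amazon provider docs.
#
# airflow.cfg (illustrative):
#   [core]
#   executor = airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor,CeleryExecutor
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def quick_transform():
    print("finishes well inside Lambda's 15-minute limit")


def long_backfill():
    print("may run for hours, so it goes to the fallback executor")


with DAG(dag_id="hybrid_execution_demo", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    # Short task: the default executor (Lambda) gives near-local startup latency.
    PythonOperator(task_id="quick_transform", python_callable=quick_transform)

    # Long task: explicitly route to the second configured executor.
    PythonOperator(
        task_id="long_backfill",
        python_callable=long_backfill,
        executor="CeleryExecutor",  # per-task executor override (Airflow 2.10+)
        execution_timeout=timedelta(hours=4),
    )
```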

Tekmetric is the largest cloud-based auto shop management system in the United States. We process vast amounts of data from various integrations with internal and external systems. Data quality and governance are crucial for both our internal operations and the success of our customers. We leverage multi-step data processing pipelines using AWS services and Airflow. While we utilize traditional data pipeline workflows to manage and move data, we go beyond standard orchestration. After data is processed, we apply tailored quality checks for schema validation, record completeness, freshness, duplication, and more. In this talk, we’ll explore how Airflow allows us to enhance data observability. We’ll discuss how Airflow’s flexibility enables seamless integration and monitoring across different teams and datasets, ensuring reliable and accurate data at every stage. This session will highlight how Tekmetric uses data quality governance and observability practices to drive business success through trusted data.
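To make the idea of post-processing quality checks concrete, here is a minimal, hypothetical sketch using the common.sql provider's check operators; the connection id, table, columns, and thresholds are illustrative assumptions, not Tekmetric's implementation.

```python
# Hypothetical sketch of post-processing quality checks with the common.sql provider.
# Connection id, table, columns, and thresholds are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import (
    SQLColumnCheckOperator,
    SQLTableCheckOperator,
)

with DAG(
    dag_id="repair_orders_quality",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Column-level checks: completeness and duplication.
    column_checks = SQLColumnCheckOperator(
        task_id="column_checks",
        conn_id="warehouse",
        table="analytics.repair_orders",
        column_mapping={
            "order_id": {"null_check": {"equal_to": 0}, "unique_check": {"equal_to": 0}},
            "shop_id": {"null_check": {"equal_to": 0}},
        },
    )

    # Table-level check: the latest load actually produced rows.
    row_count_check = SQLTableCheckOperator(
        task_id="row_count_check",
        conn_id="warehouse",
        table="analytics.repair_orders",
        checks={"row_count_check": {"check_statement": "COUNT(*) > 0"}},
    )

    column_checks >> row_count_check
```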

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:
- Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.
- Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, SageMaker), integrating their outputs seamlessly into Airflow workflows.
- Real-World Impact: A case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states.
This talk not only provides a deep dive into the Political Tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
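As a hedged sketch of the serverless-offload pattern described above (not INTRVL's actual pipeline), here is a minimal DAG that waits for a raw extract in the data lake and then invokes an AWS Lambda function for a transformation step; the bucket, function name, connection id, and payload are illustrative assumptions.

```python
# Hedged sketch of offloading a transformation step to AWS Lambda from Airflow.
# Bucket, function name, AWS connection id, and payload are illustrative assumptions.
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import LambdaInvokeFunctionOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="serverless_elt_sketch",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Wait for the day's raw extract to land in the data lake.
    raw_landed = S3KeySensor(
        task_id="wait_for_raw_extract",
        bucket_name="example-data-lake",
        bucket_key="raw/votes/{{ ds }}/extract.json",
        aws_conn_id="aws_default",
    )

    # Offload the heavy transformation to a Lambda function.
    transform = LambdaInvokeFunctionOperator(
        task_id="transform_votes",
        function_name="transform-early-votes",
        payload=json.dumps({"date": "{{ ds }}"}),
        aws_conn_id="aws_default",
    )

    raw_landed >> transform
```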

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
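A minimal sketch of the idea of typing a table with Pydantic and generating mock rows from that type; the model, fields, and helper below are illustrative assumptions, not the library described in the talk.

```python
# Minimal sketch: describe a (hypothetical) Glue table with Pydantic and
# generate mock rows that a SQL test can inject in place of the base table.
# Names, fields, and the VALUES-clause helper are illustrative assumptions.
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field


class OrdersTable(BaseModel):
    """One row of the hypothetical `analytics.orders` table."""
    order_id: int
    customer_id: int
    order_date: date
    amount: Decimal = Field(ge=0)
    coupon_code: Optional[str] = None


def mock_rows() -> list[OrdersTable]:
    """Hand-built mock data; validation fails fast if it drifts from the schema."""
    return [
        OrdersTable(order_id=1, customer_id=10, order_date=date(2025, 1, 1), amount=Decimal("19.99")),
        OrdersTable(order_id=2, customer_id=11, order_date=date(2025, 1, 2), amount=Decimal("5.00"), coupon_code="WELCOME"),
    ]


def to_values_clause(rows: list[OrdersTable]) -> str:
    """Render mock rows as a VALUES clause a test can substitute for the base table."""
    rendered = ", ".join(
        f"({r.order_id}, {r.customer_id}, DATE '{r.order_date}', {r.amount}, "
        + (f"'{r.coupon_code}'" if r.coupon_code else "NULL") + ")"
        for r in rows
    )
    return f"(VALUES {rendered}) AS orders(order_id, customer_id, order_date, amount, coupon_code)"


print(to_values_clause(mock_rows()))
```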

In this talk, we’ll share our journey and lessons learned from developing a new open-source Airflow operator that integrates a newly-launched AWS service with the Airflow ecosystem. This real-world case study will illuminate the complete lifecycle of building an Airflow operator, from initial design to successful community contribution. We’ll dive deep into the practical challenges and solutions encountered throughout the journey, including:
- Evaluating when to build a new operator versus extending existing ones
- Navigating the Apache Airflow open-source contribution process
- Best practices for operator design and implementation
- Key learnings and common pitfalls to avoid during the testing and release process
Whether you’re looking to contribute to Apache Airflow or build custom operators, this session will provide valuable insights into the development process, common pitfalls to avoid, and best practices when contributing to and collaborating with the Apache Airflow community. Expect to leave with a practical roadmap for your own contributions and the confidence to successfully engage with the Apache Airflow ecosystem.
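To ground the operator-design point, here is a generic, hedged skeleton of a custom Airflow operator built on an AWS hook; the service, parameters, and placeholder API call are illustrative assumptions rather than the operator discussed in the talk.

```python
# Generic skeleton of a custom Airflow operator wrapping an AWS API call.
# Service, parameters, and the placeholder boto3 call are illustrative assumptions.
from __future__ import annotations

from typing import Any

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook


class StartExampleJobOperator(BaseOperator):
    """Skeleton operator that would start a job in a hypothetical AWS service."""

    template_fields = ("job_name",)  # allow Jinja templating, e.g. "job-{{ ds }}"

    def __init__(self, *, job_name: str, aws_conn_id: str = "aws_default", **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.job_name = job_name
        self.aws_conn_id = aws_conn_id

    def execute(self, context: Any) -> str:
        # AwsBaseHook resolves credentials from the Airflow connection.
        hook = AwsBaseHook(aws_conn_id=self.aws_conn_id, client_type="stepfunctions")
        client = hook.get_conn()
        # Placeholder call showing how a hook-provided boto3 client is used;
        # a real operator would target the new service's API instead.
        response = client.list_state_machines(maxResults=1)
        self.log.info("Prepared job %s (sample response keys: %s)", self.job_name, list(response))
        return self.job_name
```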

MWAA is an AWS-managed service that simplifies the deployment and maintenance of the open-source Apache Airflow data orchestration platform. MWAA has recently introduced several new features to enhance the experience for data engineering teams. Features such as the Graceful Worker Replacement Strategy, which enables seamless MWAA environment updates with zero downtime, IPv6 support, and in-place minor Airflow version downgrades are among the many improvements MWAA has brought to its users in 2025. Last but not least, the release of Airflow 3.0 support brings the latest open-source features, including a new web-server UI and better isolation and security for environments. These enhancements demonstrate Amazon’s continued investment in making Airflow more accessible and scalable for enterprises through the MWAA service.

As organizations increasingly rely on data-driven applications, managing the diverse tools, data, and teams involved can create challenges. Amazon SageMaker Unified Studio addresses this by providing an integrated, governed platform to orchestrate end-to-end data and AI/ML workflows. In this workshop, we’ll explore how to leverage Amazon SageMaker Unified Studio to build and deploy scalable Apache Airflow workflows that span the data and AI/ML lifecycle. We’ll walk through real-world examples showcasing how this AWS service brings together familiar Airflow capabilities with SageMaker’s data processing, model training, and inference features - all within a unified, collaborative workspace. Key topics covered:
- Authoring and scheduling Airflow DAGs in SageMaker Unified Studio
- Understanding how Apache Airflow powers workflow orchestration under the hood
- Leveraging SageMaker capabilities like Notebooks, Data Wrangler, and Models
- Implementing centralized governance and workflow monitoring
- Enhancing productivity through unified development environments
Join us to transform your ML workflow experience from complex and fragmented to streamlined and efficient.
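As a hedged sketch of the "Airflow DAG driving SageMaker" pattern in the topic list above (not workshop material), here is a minimal DAG that launches a SageMaker training job; the role ARN, image URI, bucket names, and instance type are illustrative assumptions.

```python
# Hedged sketch of an Airflow DAG kicking off a SageMaker training job.
# Role ARN, image URI, buckets, and instance type are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

TRAINING_CONFIG = {
    "TrainingJobName": "demo-train-{{ ds_nodash }}",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-image:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/demo-sagemaker-role",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://example-bucket/train/"}
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/models/"},
    "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

with DAG(
    dag_id="sagemaker_training_sketch",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Submit the training job and wait for it to finish before marking the task done.
    SageMakerTrainingOperator(
        task_id="train_model",
        config=TRAINING_CONFIG,
        aws_conn_id="aws_default",
        wait_for_completion=True,
    )
```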

At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
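As a generic illustration of the DAG Factory pattern (not OLX's implementation), here is a minimal sketch in which teams declare pipelines as configuration and a central factory constructs the DAGs, enforcing naming conventions; the team names, schedules, and placeholder tasks are illustrative assumptions.

```python
# Generic sketch of a DAG Factory: pipelines are declared as config, and the
# factory owns DAG construction, naming conventions, and default policies.
# Team names, schedules, and placeholder tasks are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

PIPELINES = [
    {"team": "ads", "name": "daily_clicks", "schedule": "@daily"},
    {"team": "payments", "name": "settlements", "schedule": "@hourly"},
]


def build_dag(spec: dict) -> DAG:
    # Enforce a global naming convention so team namespaces stay predictable.
    dag_id = f"{spec['team']}__{spec['name']}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2025, 1, 1),
        schedule=spec["schedule"],
        catchup=False,
        tags=[spec["team"], "dag-factory"],
    ) as dag:
        # Placeholder tasks; a real factory would wire in vetted operators only.
        EmptyOperator(task_id="extract") >> EmptyOperator(task_id="load")
    return dag


# Register each generated DAG in the module namespace so the scheduler discovers it.
for _spec in PIPELINES:
    _dag = build_dag(_spec)
    globals()[_dag.dag_id] = _dag
```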

On March 13th, 2025, Amazon Web Services announced General Availability of Amazon SageMaker Unified Studio, bringing together AWS machine learning and analytics capabilities. At the heart of this next generation of Amazon SageMaker sits Apache Airflow. All SageMaker Unified Studio users have a personal, open-source Airflow deployment, running alongside their Jupyter notebook, enabling those users to easily develop Airflow DAGs that have unified access to all of their data. In this talk, I will go into details around the motivations for choosing Airflow for this capability, the challenges with incorporating Airflow into such a large and diverse experience, the key role that open-source plays, how we’re leveraging GenAI to make that open source development experience better, and the goals for the future of Airflow in SageMaker Unified Studio. Attendees will leave with a better understanding of the considerations they need to make when choosing Airflow as a component of their enterprise project, and a greater appreciation of how Airflow can power advanced capabilities.

Many SRE teams still rely on manual intervention for incident handling; automation can improve response times and reduce toil. We will cover:
- Setting up comprehensive observability: Cloud Logging, Cloud Monitoring, and OpenTelemetry
- Incident automation strategies: runbooks, auto-healing, and ChatOps
- Lessons from AWS CloudWatch and Azure Monitor applied to GCP
- Case study: reducing MTTR (Mean Time to Resolution) through automated detection and remediation
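As a small, hedged sketch of the observability piece, here is a remediation step instrumented with OpenTelemetry tracing; the service name and console exporter are illustrative assumptions (a real setup would export to Cloud Monitoring or another backend).

```python
# Minimal sketch of tracing an auto-remediation step with OpenTelemetry.
# Service name and console exporter are illustrative; swap in a real backend exporter.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("incident-autoremediation")


def restart_unhealthy_service(service: str) -> None:
    # Wrap the remediation in a span so detection-to-resolution time can be measured.
    with tracer.start_as_current_span("auto_heal") as span:
        span.set_attribute("service.name", service)
        print(f"restarting {service} ...")  # placeholder for the real remediation call


restart_unhealthy_service("checkout-api")
```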

Welcome Lakehouse: from a DWH transformation to M&A data sharing

At DXC, we helped our customer Fastweb with their "Welcome Lakehouse" project - a data warehouse transformation from on-premises to Databricks on AWS. But the implementation became something more. Thanks to features such as Lakehouse Federation and Delta Sharing, from the first day of the Fastweb+Vodafone merger we have been able to connect two different platforms with ease and let the business focus on the value of data rather than on IT integration. This session will feature our customer Alessandro Gattolin of Fastweb, who will talk about the experience.
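As a generic illustration of the Delta Sharing mechanism mentioned above (not the actual Fastweb/Vodafone setup), here is a minimal sketch of consuming a shared table with the delta-sharing Python client; the profile file and share/schema/table names are illustrative assumptions.

```python
# Minimal sketch of consuming a Delta Sharing table from another organization.
# The profile file and share/schema/table names are illustrative assumptions.
import delta_sharing

# The provider sends a profile file containing the sharing endpoint and a bearer token.
profile_file = "merger.share"

# List everything the provider has shared with us.
client = delta_sharing.SharingClient(profile_file)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table into pandas without copying data between platforms.
url = f"{profile_file}#example_share.billing.daily_revenue"
df = delta_sharing.load_as_pandas(url)
print(df.head())
```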