AWS re:Invent 2024 - Innovations in AWS analytics: Data processing (ANT346)

2024-12-05 · AWS re:Invent 2024 Watch

video

by William Vambenepe (AWS) , Kinshuk Pahare (AWS) , Craig Suchanec (Bridgewater Associates)

Agile/Scrum Analytics Athena AWS Amazon EMR Big Data Cloud Computing

Join this session for an in-depth look at new capabilities to optimize data processing with AWS analytics services. Learn more about using Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for powerful querying, and Amazon MWAA for supporting complex workflows. Whether you’re looking to improve performance, reduce costs, or streamline your data pipeline, this session provides valuable insights into the latest functionality and tools needed to enhance your data processing capabilities.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - Scale with self-service analytics on AWS (ANT334)

2024-12-05 · AWS re:Invent 2024 Watch

video

by Srikanth Sopirala (AWS) , Raghavarao Sodabathina (AWS) , Utkarsh Mittal (AWS)

Agile/Scrum Airflow Analytics Athena AWS Cloud Computing Redshift SQL

Self-service analytics empowers users to independently access, explore, and analyze data to accelerate decision-making. It’s a foundational step for organizations to democratize the use of data for business growth with easy-to-use, simple-to-understand, and quick-to-deliver tools. Interactive querying, SQL analytics, data preparation, data transformation, data workflow orchestration and search analytics are some of the key day-to-day functions that data users need from self-service solutions. Learn how AWS analytics services like Amazon Athena, AWS Glue, Amazon Redshift, Amazon Managed Workflows for Apache Airflow, and Amazon OpenSearch Service enable data-driven decision-making through self-serve analytics.

AWS re:Invent 2024 - A practitioner’s guide to data for generative AI (DAT319)

2024-12-03 · AWS re:Invent 2024 Watch

video

by Jonathan Katz (Amazon Redshift) , Siva Raghupathy (AWS)

Agile/Scrum AI/ML AWS Aurora Kinesis Cloud Computing Data Lake Data Quality GenAI RAG Data Streaming

In this session, gain the skills needed to deploy end-to-end generative AI applications using your most valuable data. While this session focuses on the Retrieval Augmented Generation (RAG) process, the concepts also apply to other methods of customizing generative AI applications. Discover best practice architectures using AWS database services like Amazon Aurora, Amazon OpenSearch Service, or Amazon MemoryDB along with data processing services like AWS Glue and streaming data services like Amazon Kinesis. Learn data lake, governance, and data quality concepts and how Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and other features tie solution components together.
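As a rough illustration of the retrieval step in RAG mentioned above, the sketch below ranks stored passages by cosine similarity to a query embedding and assembles a prompt from the top matches. Everything here is an assumption for illustration: the embeddings are hard-coded toy vectors, the `store` list stands in for a vector-capable database such as the ones the abstract names, and `retrieve` and `build_prompt` are hypothetical helper names, not part of any AWS API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, store, k=2):
    """Return the text of the k passages most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc: cosine(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]


def build_prompt(question, passages):
    """Combine retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n{context}\n\nQuestion: {question}"


# Toy in-memory "vector store"; real embeddings would come from a model.
store = [
    {"text": "Refunds are processed within 5 business days.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Refund requests need an order number.", "embedding": [0.8, 0.3, 0.1]},
]
query_embedding = [1.0, 0.2, 0.0]  # hypothetical embedding of the question

passages = retrieve(query_embedding, store, k=2)
prompt = build_prompt("How long do refunds take?", passages)
print(prompt)
```

The prompt would then be sent to a generative model; that call is left out because it depends on the service used.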

Data Engineering with AWS Cookbook

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Viquar Khan , Trâm Ngọc Phạm , Gonzalo Herreros González , Huda Nofal

Analytics Athena AWS Amazon EMR Big Data Cloud Computing Data Engineering Data Lake ETL/ELT QuickSight Redshift data +1 more

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects.

What this Book will help me do

Gain the skills to design centralized data lake solutions and manage them securely at scale.
Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR.
Learn to implement and automate governance, orchestration, and monitoring for data platforms.
Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight.
Effectively plan and execute data migrations to AWS from on-premises infrastructure.

Author(s)

Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions.

Who is it for?

This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Michael Toland - How to Make Data Products Work in the Enterprise

2024-08-19 · Straight Data Talk Listen

podcast_episode

by Michael Toland (Pathfinder Product) , Yuliia Tkachova (Masthead Data)

Michael Toland is a Product Management Consultant and blog contributor with Test Double, residing in Columbus, OH. His experience spans 8 formal years of internal Product Management, with a few additional years of doing Product Management without even knowing what the field really was. In this episode, Michael shared how a data-empowered company the size of Verizon was able to drastically reduce time-to-market metrics, experiment, and run data product MVPs in production. The reference data became a cornerstone of Verizon's go-to-market strategy and a glue for different teams and departments. One of the key takeaways is that to deliver value with data products and architect them effectively, one does not need to be a data wizard but rather have a passion for solving problems. Michael is also the author of an infrequently updated product satire site, Dignified Product.

#58 Maximizing Productivity: Bookmarklets, Q Command-Line, RouteLLM, and DuckDB Extensions

2024-07-12 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

AI/ML CSV DuckDB LLM SQL

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

Bookmarklet Maker: Discover how to automate tasks with the Bookmarklet Maker, a tool for turning scripts into handy browser bookmarks.

RouteLLM Framework: Explore the RouteLLM framework by LMSys and Anyscale, designed to optimize the cost-performance ratio of LLM routers. Learn more about this collaboration at LMSys and Anyscale.

Q for SQL on CSV/TSV: Meet Q, a command-line tool that lets you run SQL queries directly on CSV or TSV files, simplifying data exploration from your terminal.

DuckDB Community Extensions: Check out the latest updates in DuckDB's community extensions and see how this database system is evolving.

Apple Intelligence and AI Maximalism: Explore Apple's AI strategy, their avoidance of chat UIs, risk management with OpenAI, and the shift of compute costs to users.

Being Glue: Delve into the challenges of being "Glue" at work. Explore why women are more likely to take on non-promotable work and how this affects career progression and workplace dynamics.
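To make the "SQL on CSV" idea from the Q item concrete, here is a minimal sketch of the same concept using only Python's standard library. It is a stand-in for the idea, not Q itself: the `query_csv` helper, the table name `data`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one SQL query on it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    # Column names come from the CSV header; values are stored as text.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical sample data standing in for a CSV file on disk.
sample = "city,sales\nGhent,10\nLeuven,25\nGhent,5\n"
totals = query_csv(
    sample,
    "SELECT city, SUM(CAST(sales AS INTEGER)) FROM data GROUP BY city ORDER BY city",
)
print(totals)
```

Because the values are loaded as text, numeric columns need an explicit `CAST` before aggregation in this sketch.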

Data Engineering with AWS - Second Edition

2023-10-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Gareth Eagar

Analytics AWS Data Engineering Data Governance QuickSight Redshift S3 Cyber Security data data-engineering

Learn data engineering and modern data pipeline design with AWS in this comprehensive guide! You will explore key AWS services like S3, Glue, Redshift, and QuickSight to ingest, transform, and analyze data, and you'll gain hands-on experience creating robust, scalable solutions.

What this Book will help me do

Understand and implement data ingestion and transformation processes using AWS tools.
Optimize data for analytics with advanced AWS-powered workflows.
Build end-to-end modern data pipelines leveraging cutting-edge AWS technologies.
Design data governance strategies using AWS services for security and compliance.
Visualize data and extract insights using Amazon QuickSight and other tools.

Author(s)

Gareth Eagar is a Senior Data Architect with over 25 years of experience in designing and implementing data solutions across various industries. He combines his deep technical expertise with a passion for teaching, aiming to make complex concepts approachable for learners at all levels.

Who is it for?

This book is intended for current or aspiring data engineers, data architects, and analysts seeking to leverage AWS for data engineering. It suits beginners with a basic understanding of data concepts who want to gain practical experience as well as intermediate professionals aiming to expand into AWS-based systems.

Community driven: The dbt Labs and AWS partnership - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by David Nalley (Amazon Web Services)

Athena AWS dbt Marketing Redshift Cyber Security

When companies work together through open source development, good things happen. Open source contributions lead to strong relationships between engineers across company lines, and positive outcomes for customers whether through improved functionality, performance, or supply chain security. In this keynote, learn about the power of open source in driving innovation, how AWS approaches open source collaboration, and some of the key improvements for Amazon Redshift, AWS Glue, and Amazon Athena customers and dbt users resulting from our partnership.

Speaker: David Nalley, Director, Open Source Strategy and Marketing, Amazon Web Services

Register for Coalesce at https://coalesce.getdbt.com/

An API for Deep Learning Inferencing on Apache Spark™

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Lee Yang

API Big Data Databricks ETL/ELT LLM MLOps PySpark Spark

Apache Spark is a popular distributed framework for big data processing. It is commonly used for ETL (extract, transform and load) across large datasets. Today, the transform stage can often include the application of deep learning models on the data. For example, common models can be used for classification of images, sentiment analysis of text, language translation, anomaly detection, and many other use cases. Applying these models within Spark can be done today with the combination of PySpark, pandas_udf, and a lot of glue code. Often, that glue code can be difficult to get right, because it requires expertise across multiple domains: deep learning frameworks, PySpark APIs, pandas_udf internal behavior, and performance optimization.

In this session, we introduce a new, simplified API for deep learning inferencing on Spark, introduced in SPARK-40264 as a collaboration between NVIDIA and Databricks, which seeks to standardize and open source this glue code to make deep learning inference integrations easier for everyone. We discuss its design and demonstrate its usage across multiple deep learning frameworks and models.
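For readers who want a feel for the API described above, the following is a minimal sketch assuming Spark 3.4 or later, where `pyspark.ml.functions.predict_batch_udf` is available. The "model" is a hypothetical stand-in (a fixed weight vector rather than a real deep learning framework), and `build_udf` is an illustrative helper name, not part of Spark.

```python
import numpy as np


def make_predict_fn():
    """Load the model once per Python worker and return a batch predict function."""
    # Hypothetical stand-in for loading a real deep learning model.
    weights = np.array([0.5, -1.0, 2.0])

    def predict(inputs: np.ndarray) -> np.ndarray:
        # inputs has shape (batch_size, 3); return one score per row.
        return inputs @ weights

    return predict


def build_udf():
    """Wrap make_predict_fn as a Spark UDF (requires pyspark 3.4+)."""
    # Imported lazily so the predict function can be tried without Spark.
    from pyspark.ml.functions import predict_batch_udf
    from pyspark.sql.types import DoubleType

    return predict_batch_udf(
        make_predict_fn,
        return_type=DoubleType(),
        batch_size=64,
        input_tensor_shapes=[[3]],  # each row is an array column of 3 floats
    )


# Local check of the predict function on a small batch, no Spark needed.
scores = make_predict_fn()(np.array([[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]))
print(scores)
```

On a cluster, the result of `build_udf()` would be applied to an array column, for example `df.withColumn("score", udf("features"))`, where `df` and `features` are placeholder names.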

Talk by: Lee Yang

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Noritaka Sekiyama (Amazon Web Services (AWS)) , Akira Ajisaka

Athena AWS Amazon EMR Amazon RDS Cloud Computing Data Lake Data Lakehouse Databricks Delta DWH DynamoDB MongoDB +3 more

Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud storage. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud.

There are many use cases for Delta tables on AWS. AWS has invested a lot in this technology, and now Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from multiple data sources such as on-prem databases, Amazon RDS, DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without expertise in coding.

This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue and querying them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.
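As a hedged sketch of what getting started can look like, the snippet below builds the job parameters that, to the best of my understanding of the AWS Glue documentation, enable Delta Lake in a Glue Spark job, plus a small helper that appends a Spark DataFrame to a Delta table. The helper names and the S3 path are hypothetical, and the parameter values should be checked against the current Glue documentation for your Glue version.

```python
def delta_glue_job_arguments():
    """Job parameters intended to enable Delta Lake in an AWS Glue Spark job."""
    spark_conf = (
        "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
        " --conf spark.sql.catalog.spark_catalog="
        "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    )
    return {
        "--datalake-formats": "delta",  # ask Glue to load its Delta Lake libraries
        "--conf": spark_conf,           # register the Delta SQL extension and catalog
    }


def append_to_delta(df, s3_path):
    """Append a Spark DataFrame to a Delta table at s3_path (hypothetical location)."""
    df.write.format("delta").mode("append").save(s3_path)


job_args = delta_glue_job_arguments()
for key, value in job_args.items():
    print(key, value)
```

Inside the job, `append_to_delta(df, "s3://example-bucket/delta/orders/")` would write the table; the bucket name is a placeholder.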

Talk by: Noritaka Sekiyama and Akira Ajisaka

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Airflow at GoDaddy: From on-prem to cloud to PaaS

2023-07-01 · Airflow Summit 2023

session

by Amit Kumar (Google Cloud)

Airflow Cloud Computing Data Lake GitHub

Discover the transformation of Airflow at GoDaddy: from its initial deployment on-prem to its migration to the cloud, and finally to a Single Pane Orchestration Model. This evolution has streamlined our Data Platform and improved governance. Our experience will be beneficial for anyone seeking to optimize their Airflow implementation and simplify their orchestration processes.

History and Use-cases
Design, Organization decisions, and Governance: Examining the decision-making process and governance structure.
Migration to Cloud: Process of transitioning Airflow from on-premises to the cloud.
Data Processing engines used with Airflow for Data Processing.
Challenges: Obstacles faced during and after migration and how they were overcome.
Demonstrating how Airflow can be integrated with a central Glue Catalog and Data Lake Mesh model.
Single Pane Orchestration (PAAS) and custom re-usable Github Actions: Examining benefits of using a Single Pane Orchestration model.
Monitoring

What Happens When The Abstractions Leak On Your Data

2023-05-15 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey

AI/ML Airbyte API AWS BigQuery CDP Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Modelling +10 more

Summary

All of the advancements in our technology are based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow.

Interview

Introduction
Impact of community tech debt
  Hive metastore
  New work being done but not widely adopted
Tensions between automation and correctness
Data type mapping
  Integer types
  Complex types
  Naming things (keys/column names from APIs to databases)
Disaggregated databases: pros and cons
  Flexibility and cost control
  Not as much tooling invested vs. Snowflake/BigQuery/Redshift
Data modeling
  Dimensional modeling vs. answering today's questions
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform?
When is ELT the wrong choice?
What do you have planned for the future of your data platform?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

dbt
Airbyte (Podcast Episode)
Dagster (Podcast Episode)
Trino (Podcast Episode)
ELT
Data Lakehouse
Snowflake
BigQuery
Redshift
Technical Debt
Hive Metastore
AWS Glue

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: RudderStack

RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.

RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.

RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.

Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.

Beyond the buzz: 20 real metadata use cases in 20 minutes with Atlan and dbt Labs

2022-10-25 · dbt Coalesce 2022 Watch

video

by Prukalpa Sankar (Atlan)

Analytics Databricks dbt Fivetran Looker Snowflake Tableau

for a few use cases like static and passive data catalogs. However, active metadata can be the key to unlock a variety of use cases, acting as the glue that binds together our diverse modern data stacks (e.g. dbt, Snowflake, Fivetran, Databricks, Looker, and Tableau) and diverse teams (e.g. analytics engineers, data analysts, data engineers, and business users)! At Atlan, we’ve worked closely with modern data teams like WeWork, Plaid, PayU, SnapCommerce, and Bestow. In this session, we’ll lay out all our learnings about how real-life data teams are using metadata to drive powerful use cases like column-level lineage, programmatic governance, root cause analysis, proactive upstream alerts, dynamic pipeline optimization, cost optimization, data deprecation, automated quality control, metrics management, and more. P.S. We’ll also reveal how active metadata and the dbt Semantic Layer can work together to transform the way your team works with metrics!

Check the slides here: https://docs.google.com/presentation/d/1xrC9yhHOQ00qWt-gVlgbakRELg2FzEPt-RwMsUWzdZA/edit?usp=sharing

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

Serverless ETL and Analytics with AWS Glue

2022-08-30 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Albert Quiroga , Subramanya Vajiraya , Vishal Pathak , Ishan Gaur , Noritaka Sekiyama (Amazon Web Services (AWS)) , Tomohiro Tanaka

AI/ML Analytics AWS Cloud Computing Data Analytics Data Engineering Data Lake Data Management ETL/ELT Amazon SageMaker Cyber Security data +2 more

Discover how to harness AWS Glue for your ETL and data analysis workflows with "Serverless ETL and Analytics with AWS Glue." This comprehensive guide introduces readers to the capabilities of AWS Glue, from building data lakes to performing advanced ETL tasks, allowing you to create efficient, secure, and scalable data pipelines with serverless technology.

What this Book will help me do

Understand and utilize various AWS Glue features for data lake and ETL pipeline creation.
Leverage AWS Glue Studio and DataBrew for intuitive data preparation workflows.
Implement effective storage optimization techniques for enhanced data analytics.
Apply robust data security measures, including encryption and access control, to protect data.
Integrate AWS Glue with machine learning tools like SageMaker to build intelligent models.

Author(s)

The authors of this book include experts across the fields of data engineering and AWS technologies. With backgrounds in data analytics, software development, and cloud architecture, they bring a depth of practical experience. Their approach combines hands-on tutorials with conceptual clarity, ensuring a blend of foundational knowledge and actionable insights.

Who is it for?

This book is designed for ETL developers, data engineers, and data analysts who are familiar with data management concepts and want to extend their skills into serverless cloud solutions. If you're looking to master AWS Glue for building scalable and efficient ETL pipelines or are transitioning existing systems to the cloud, this book is ideal for you.

Airflow and _____: A discussion around utilizing Airflow with other data tools

2022-07-01 · Airflow Summit 2022

session

by Alessandro Pregnolato , Jitendra Shah , Brad Kirn , Sarah Johnson

Airflow Azure ADF Databricks

Come hang with Airflow practitioners from around the world using Airflow AND other data tools to power their data practice. From Databricks to Glue to Azure Data Factory, smart businesses make the right decision to standardize on Airflow for what it’s best at while using the other systems for what they are best at.

How AI has Changed Manufacturing with Ranga Ramesh

2022-01-18 · Leaders of Analytics Listen

podcast_episode

by Ranga Ramesh (Georgia-Pacific) , Jonas Christensen

AI/ML Analytics Data Science

Data science and machine learning are integral parts of most large-scale product manufacturing processes and are used to understand customer needs, detect quality issues, automate repetitive tasks and optimise supply chains. It’s an invisible glue that helps us produce more things for less, and in a timely fashion. To learn more about this fascinating topic, I recently spoke to Ranga Ramesh who is Senior Director, Quality Innovation and Transformation at Georgia-Pacific. Georgia-Pacific is one of the world’s largest manufacturers of consumer paper products and uses AI technologies throughout their manufacturing process. In this episode of Leaders of Analytics, we explore how computer vision and machine learning can be used to classify tissue paper softness and instantly detect quality issues that could otherwise render large volumes of product useless. Ranga’s work is featured as a case study in our recently published book, Demystifying AI for the Enterprise.

Serverless Analytics with Amazon Athena

2021-11-19 · O'Reilly Data Science Books O'Reilly Amazon

book

by Aaron Wishnick , Anthony Virtuoso , Mert Turkay Hocanin

AI/ML Analytics Athena AWS BI Cloud Computing Data Analytics Data Engineering ETL/ELT S3 Cyber Security SQL +3 more

Delve into the serverless world of Amazon Athena with the comprehensive book 'Serverless Analytics with Amazon Athena'. This guide introduces you to the power of Athena, showing you how to efficiently query data in Amazon S3 using SQL without the hassle of managing infrastructure. With clear instructions and practical examples, you'll master querying structured, unstructured, and semi-structured data seamlessly.

What this Book will help me do

Effectively query and analyze both structured and unstructured data stored in S3 using Amazon Athena.
Integrate Athena with other AWS services to create powerful, secure, and cost-efficient data workflows.
Develop ETL pipelines and machine learning workflows leveraging Athena's compatibility with AWS Glue.
Monitor and troubleshoot Athena queries for consistent performance and build scalable serverless data solutions.
Implement security best practices and optimize costs when managing your Athena-driven data solutions.

Author(s)

Anthony Virtuoso, along with co-authors Mert Turkay Hocanin and Aaron Wishnick, brings a wealth of experience in cloud solutions, serverless technologies, and data engineering. They excel in demystifying complex technical topics and have a passion for empowering readers with practical skills and knowledge.

Who is it for?

This book is tailored for business intelligence analysts, application developers, and system administrators who want to harness Amazon Athena for seamless, cost-efficient data analytics. It suits individuals with basic SQL knowledge looking to expand their capabilities in querying and processing data. Whether you're managing growing datasets or building data-driven applications, this book provides the know-how to get it right.

Airflow: The Power of Stitching Services Together

2021-07-01 · Airflow Summit 2021

session

by Rafal Biegacz , Filip Knapik

Airflow Cloud Computing Cloud Composer

Apache Airflow is known to be a great orchestration tool that enables use cases that would not be possible otherwise. One of Airflow's great features is the ability to “glue” together totally separate services to build larger functionality. In this talk you will learn about various Airflow usages that let Airflow users automate their critical company processes and even establish businesses. The examples provided will be based on Airflow used in the context of Cloud Composer, which is a managed service to provision and manage Airflow instances.

Putting Airflow Into Production With James Meickle - Episode 43

2018-08-13 · Data Engineering Podcast Listen

podcast_episode

by James Meickle , Tobias Macey

Airflow Ansible API Astronomer AWS CloudFormation Data Engineering Data Management Data Science DevOps ETL/ELT GitHub +7 more

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.
Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation.

Interview

Introduction How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter
Website
eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian
Harvard Brain Science Initiative
DevOps Days Boston
Google Maps API
Cron
ETL (Extract, Transform, Load)
Azkaban
Luigi
AWS Glue
Airflow
Pachyderm (Podcast Interview)
AirBnB
Python
YAML
Ansible
REST (Representational State Transfer)
SAML (Security Assertion Markup Language)
RBAC (Role-Based Access Control)
Maxime Beauchemin (Medium Blog)
Celery
Dask (Podcast Interview)
PostgreSQL (Podcast Interview)
Redis
Cloudformation
Jupyter Notebook
Qubole
Astronomer (Podcast Interview)
Gunicorn
Kubernetes
Airflow Improvement Proposals
Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

A First Course in Statistics, 12th Edition

2016-01-07 · O'Reilly Data Science Books O'Reilly Amazon

book

by James T. McClave , Terry T. Sincich

data data-science data-science-tasks statistics

For courses in introductory statistics.

A Contemporary Classic

Classic, yet contemporary; theoretical, yet applied: McClave & Sincich’s A First Course in Statistics gives you the best of both worlds. This text offers a trusted, comprehensive introduction to statistics that emphasizes inference and integrates real data throughout. The authors stress the development of statistical thinking, the assessment of credibility, and the value of the inferences made from data. This new edition is extensively revised with an eye on clearer, more concise language throughout the text and in the exercises. Ideal for one- or two-semester courses in introductory statistics, this text assumes a mathematical background of basic algebra. Flexibility is built in for instructors who teach a more advanced course, with optional footnotes about calculus and the underlying theory.

Also available with MyStatLab

MyStatLab™ is an online homework, tutorial, and assessment program designed to work with this text to engage students and improve results. Within its structured environment, students practice what they learn, test their understanding, and pursue a personalized study plan that helps them absorb course material and understand difficult concepts. For this edition, MyStatLab offers 30% new and updated exercises.

Note: You are purchasing a standalone product; MyLab™ & Mastering™ does not come packaged with this content. Students, if interested in purchasing this title with MyLab & Mastering, ask your instructor for the correct package ISBN and Course ID. Instructors, contact your Pearson representative for more information.

If you would like to purchase both the physical text and MyLab & Mastering, search for: 0134090438 / 9780134090436 * Statistics Plus New MyStatLab with Pearson eText -- Access Card Package

Package consists of:

0134080211 / 9780134080215 * Statistics
0321847997 / 9780321847997 * MyStatLab Glue-in Access Card
032184839X / 9780321848390 * MyStatLab Inside Sticker for Glue-In Packages

