Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers often need to work with other data or platform engineers to productionize these experiments, due to the complexity of navigating infrastructure and systems. In this talk, we will dive deep into this PR https://github.com/apache/airflow/pull/34840 and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for a variety of heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
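The talk's own operator code lives in the linked PR and is not reproduced here; as a hedged illustration of the general pattern, Airflow's existing PapermillOperator can execute a parameterized notebook as a DAG task. The paths and parameters below are hypothetical, and this is not the PR's implementation.

```python
# Illustrative sketch only: the linked PR defines its own notebook-execution
# mechanics; this uses Airflow's existing PapermillOperator to show the shape
# of a notebook pipeline. Paths and parameters are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="notebook_pipeline_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_notebook = PapermillOperator(
        task_id="run_feature_engineering_nb",
        input_nb="/notebooks/feature_engineering.ipynb",  # hypothetical path
        output_nb="/notebooks/out/feature_engineering_{{ ds }}.ipynb",
        parameters={"sample_fraction": 0.1},  # injected into the notebook
    )
```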
As organizations grow, the task of creating and managing Airflow DAGs efficiently becomes a challenge. In this talk, we will delve into innovative approaches to streamlining Airflow DAG creation using YAML. By leveraging YAML configuration, we allow users to dynamically generate Airflow DAGs without requiring Python expertise or deep knowledge of Airflow primitives. We will showcase the significant benefits of this approach, including eliminating duplicate configurations, simplifying DAG management for a large group of workflows, and ultimately enhancing productivity within large organizations. Join us to learn practical strategies to optimize workflow orchestration, reduce development overhead, and facilitate seamless collaboration across teams.
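To make the pattern concrete, here is a minimal, hypothetical sketch (not the speakers' implementation) of expanding a YAML spec into Airflow DAGs; all file names, keys, and commands are invented for illustration.

```python
# Hypothetical sketch of YAML-driven DAG generation; not the talk's actual code.
# pipelines.yaml might look like:
#   daily_report:
#     schedule: "0 6 * * *"
#     tasks:
#       extract: "python extract.py"
#       load:    "python load.py"
#     dependencies:
#       load: [extract]
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator

with open("pipelines.yaml") as f:
    config = yaml.safe_load(f)

for dag_id, spec in config.items():
    with DAG(dag_id, start_date=datetime(2024, 1, 1),
             schedule=spec.get("schedule"), catchup=False) as dag:
        tasks = {
            name: BashOperator(task_id=name, bash_command=cmd)
            for name, cmd in spec["tasks"].items()
        }
        for child, parents in spec.get("dependencies", {}).items():
            for parent in parents:
                tasks[parent] >> tasks[child]
    globals()[dag_id] = dag  # register the DAG with Airflow's loader
```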
Summary
This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from engineer to founder, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes, and highlights the role of data teams in modern organizations and how Synq is empowering them.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Synq is and the story behind it?
Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?
Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams?
Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?
What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?
How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?
With the focus on sharing ownership beyond the boundaries of the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?
Can you describe how Synq is designed/implemented? How have the scope and goals of the product changed since you first started working on it?
For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?
What are the types of incidents/errors that you are able to identify and alert on?
What does a typical incident/error resolution process look like with Synq?
What are the most interesting, innovative, or unexpected ways that you have seen Synq used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?
When is Synq the wrong choice?
What do you have planned for the future of Synq?
Contact Info
LinkedIn
Substack
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Synq
Incident Management
SLA == Service Level Agreement
Data Governance (Podcast Episode)
PagerDuty
OpsGenie
Clickhouse (Podcast Episode)
dbt (Podcast Episode)
SQLMesh (Podcast Episode)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
This book teaches solutions architects how to effectively design and implement AI/ML solutions utilizing Google Cloud services. Through detailed explanations, examples, and hands-on exercises, you will understand essential AI/ML concepts, tools, and best practices while building advanced applications.
What this Book will help me do
Build robust AI/ML solutions using Google Cloud tools such as TensorFlow, BigQuery, and Vertex AI.
Prepare and process data efficiently for machine learning workloads.
Establish and apply an MLOps framework for automating ML model lifecycle management.
Implement cutting-edge generative AI solutions using best practices.
Address common challenges in AI/ML projects with insights from expert solutions.
Author(s)
Kieran Kavanagh is a seasoned principal architect with nearly twenty years of experience in the tech industry. He has successfully led teams in designing, planning, and governing enterprise cloud strategies, and his wealth of experience is distilled into the practical approaches and insights in this book.
Who is it for?
This book is ideal for IT professionals aspiring to design AI/ML solutions, particularly in the role of solutions architects. It assumes a basic knowledge of Python and foundational AI/ML concepts but is suitable for both beginners and seasoned practitioners. If you're looking to deepen your understanding of state-of-the-art AI/ML applications on Google Cloud, this resource will guide you.
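To give a flavor of the tooling the book covers, here is a minimal, hedged sketch of running a BigQuery query from Python with the official google-cloud-bigquery client; it assumes application-default credentials are configured and uses a public sample dataset.

```python
# Minimal sketch using the google-cloud-bigquery client library.
# Assumes application-default credentials; the table referenced here
# is a public sample dataset.
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default GCP project

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.name, row.total)
```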
A free 90-minute hands-on workshop, led by Harpreet Sahota (Machine Learning Engineer at Voxel51), on leveraging the FiftyOne computer vision toolset.
Part 1: FiftyOne basics (terms, architecture, installation, and general usage); an overview of useful workflows to explore, understand, and curate your data; and how FiftyOne represents and semantically slices unstructured computer vision data.
Part 2: A hands-on introduction to FiftyOne: load datasets from the FiftyOne Dataset Zoo, navigate the FiftyOne App, programmatically inspect attributes of a dataset, add new samples and custom attributes, generate and evaluate model predictions, and save insightful views into the data.
Join us to learn how Alexa's speech recognition can transform your daily routine by creating keyboard-free applications. Discover the core components of Alexa, get started with the Developer Console, and customize skills using Python in a serverless approach. We'll show you how to integrate Alexa into your development workflow, potentially replacing your keyboard with voice commands for automation.
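The talk's own code is not shown here; as a rough, hypothetical sketch of the serverless pattern it describes, an AWS Lambda handler can answer an Alexa skill request with the documented response JSON. The intent name and replies below are invented.

```python
# Hypothetical AWS Lambda handler for a custom Alexa skill, using the raw
# request/response JSON format rather than the ASK SDK. Intent and wording
# are made up for illustration.
def lambda_handler(event, context):
    request_type = event["request"]["type"]

    if request_type == "LaunchRequest":
        speech = "Welcome! Ask me to run your build."
    elif request_type == "IntentRequest" \
            and event["request"]["intent"]["name"] == "RunBuildIntent":
        speech = "Starting the build now."  # trigger your automation here
    else:
        speech = "Sorry, I didn't catch that."

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```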
This talk will cover the essential concepts, use cases, and best practices for implementing generators (and iterators) in Python. We'll delve into how these powerful tools can enhance your coding efficiency, manage memory more effectively, and handle large data sets gracefully.
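For a taste of the topic, a minimal generator example: it streams a large log file line by line, so memory use stays constant no matter the file size.

```python
# A generator processes items lazily: each line is read, filtered, and
# yielded one at a time, so memory use stays constant regardless of file size.
def error_lines(path):
    with open(path) as f:
        for line in f:
            if "ERROR" in line:
                yield line.rstrip()

# Nothing is read until iteration starts; the file is streamed, not loaded.
for entry in error_lines("app.log"):
    print(entry)
```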
Sweet Summer Child Score is an open source library to identify potential AI harms. A truism in tech is that we're good at asking whether we 'can' do something but not whether we 'should'. This library offers a system scan to quickly identify potential harms and to build the capability for relative risk assessment. SSCS does not explore the specifics of your stack or technical implementation; instead it takes a step back to look at the ecosystem your technology will be deployed in, and the implementation choices which define the seam between your system and the broader world.
Unlock the full potential of DuckDB with 'Getting Started with DuckDB,' your guide to mastering data analysis efficiently. By reading this book, you'll discover how to load, transform, and query data using DuckDB, leveraging its unique capabilities for processing large datasets. Gain hands-on experience with SQL, Python, and R to enhance your data science and engineering workflows.
What this Book will help me do
Effectively load and manage various types of data in DuckDB for seamless processing.
Gain hands-on experience writing and optimizing SQL queries tailored for analytical tasks.
Integrate DuckDB capabilities into Python and R workflows for streamlined data analysis.
Understand DuckDB's optimizations and extensions for specialized data applications.
Explore the broader ecosystem of data tools that complement DuckDB's capabilities.
Author(s)
Simon Aubury and Ned Letcher are seasoned experts in the field of data analytics and engineering. With extensive experience in using both SQL and programming languages like Python and R, they bring practical insights into the innovative uses of DuckDB. They have designed this book to provide a hands-on and approachable way to learn DuckDB, making complex concepts accessible.
Who is it for?
This book is well-suited for data analysts aiming to accelerate their data analysis workflows, data engineers looking for effective tools for data processing, and data scientists searching for a versatile library for scalable data manipulation. Prior exposure to SQL and programming in Python or R will be beneficial for readers to maximize their learning.
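As a small, hedged taste of the workflow the book teaches, DuckDB's Python API can query a CSV file in place with SQL; the file name and columns here are placeholders.

```python
# Minimal sketch: query a CSV file in place with DuckDB's Python API.
# The file name and columns are placeholders.
import duckdb

result = duckdb.sql("""
    SELECT category, AVG(price) AS avg_price
    FROM read_csv_auto('sales.csv')
    GROUP BY category
    ORDER BY avg_price DESC
""").fetchall()

for category, avg_price in result:
    print(category, avg_price)
```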
Summary
Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data without…
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Microsoft Fabric is and the story behind it?
Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
What are the elements of Fabric that were engineered specifically for the service?
What are the most interesting/complicated integration challenges?
How has your prior experience with Ahana and Presto informed your current work at Microsoft?
AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
What are the challenges in terms of safety and reliability?
What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
When is Fabric the wrong choice?
What do you have planned for the future of data lake analytics?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Microsoft Fabric
Ahana episode
DB2 Distributed
Spark
Presto
Azure Data
MAD Landscape (Podcast Episode, ML Podcast Episode)
Tableau
dbt
Medallion Architecture
Microsoft Onelake
ORC
Parquet
Avro
Delta Lake
Iceberg (Podcast Episode)
Hudi (Podcast Episode)
Hadoop
PowerBI (Podcast Episode)
Velox
Gluten
Apache XTable
GraphQL
Formula 1
McLaren
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Summary
Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what role Trino and Iceberg play in Stripe's data architecture?
What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?
What were the requirements and selection criteria that led to the selection of that combination of technologies?
What are the other systems that feed into and rely on the Trino/Iceberg service?
What kinds of questions are you answering with table metadata?
What use case/team does that support?
What is the comparative utility of the Iceberg REST catalog? (see the sketch after this list)
What are the shortcomings of Trino and Iceberg?
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
When is a lakehouse on Trino/Iceberg the wrong choice?
What do you have planned for the future of Trino and Iceberg at Stripe?
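To ground the REST catalog question above, here is a minimal, hedged sketch using PyIceberg (listed in the links below as Python Iceberg); the endpoint and table identifier are placeholders, not Stripe's.

```python
# Hypothetical sketch: connect PyIceberg to an Iceberg REST catalog and
# read a table into Arrow. The URI and table name are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",  # placeholder REST catalog endpoint
    },
)

table = catalog.load_table("analytics.events")  # placeholder identifier
arrow_table = table.scan().to_arrow()           # materialize as a PyArrow table
print(arrow_table.schema)
```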
Contact Info
Substack LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Trino
Iceberg
Stripe
Spark
Redshift
Hive Metastore
Python Iceberg
Python Iceberg REST Catalog
Trino Metadata Table
Flink (Podcast Episode)
Tabular (Podcast Episode)
Delta Table (Podcast Episode)
Databricks Unity Catalog
Starburst
AWS Athena
Kevin Trinofest Presentation
Alluxio (Podcast Episode)
Parquet
Hudi
Trino Project Tardigrade
Trino On Ice
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
In this episode, Conor and Bryce chat about how to get started in programming.
Link to Episode 186 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Twitter: ADSP: The Podcast, Conor Hoekstra, Bryce Adelstein Lelbach
Show Notes
Date Recorded: 2024-06-07 & 2024-06-12
Date Released: 2024-06-14
Swift Programming Language
Boost C++ Libraries
Boost Spirit
NDC Oslo Conference
Craft Conf 2024
The Power of Function Composition - NDC Oslo - Conor Hoekstra
CityStrides.com
city-strides-hacking GitHub Repo
HookStar Scrabble Trainer
Beautiful Python Refactoring II - Conor Hoekstra - code::dive 2022 (Scrabble Talk)
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Deep dive into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python.
What this Book will help me do
Deeply understand Apache Spark's core architecture for building big data applications.
Write optimized SQL queries and leverage the Spark DataFrame API for efficient data manipulation.
Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks.
Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions.
Get hands-on preparation for the certification exam with mock tests and practice questions.
Author(s)
Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications.
Who is it for?
This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.
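As a brief, hedged illustration of the Spark DataFrame API the book drills, here is a self-contained PySpark aggregation on made-up data.

```python
# Minimal sketch of the Spark DataFrame API from Python; the data is made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("alice", 7)],
    ["user", "amount"],
)

# Aggregate with built-in functions rather than UDFs where possible:
totals = df.groupBy("user").agg(F.sum("amount").alias("total"))
totals.show()

spark.stop()
```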
The 'Python and SQL Bible' is a comprehensive guide to mastering both Python programming and SQL querying. Starting from the very basics, the book takes readers through advanced techniques, including data manipulation, database management, and integration of Python with SQL, all while offering hands-on examples and real-world exercises.
What this Book will help me do
Gain a strong foundation in Python programming, including control flow, functions, and object-oriented programming.
Learn how to write advanced SQL queries for data extraction, manipulation, and reporting.
Understand how to integrate Python with SQL to form a seamless data manipulation workflow.
Develop data analysis skills using Python and tools such as SQLAlchemy for advanced insights.
Master database administration techniques to efficiently manage and query datasets.
Author(s)
Cuantum Technologies LLC is a renowned tech education provider with a focus on equipping learners with in-demand programming and data management skills. Their training methods blend theory with practice, ensuring students gain hands-on experience applicable in professional environments. Their team of experts crafts content to cater to both beginners and professionals seeking to advance their skill set.
Who is it for?
This book is ideal for beginners who are new to programming and experienced professionals who wish to master Python and SQL for data manipulation and analysis. It is perfect for aspiring data scientists, software developers, and IT professionals looking to unlock new career opportunities. By detailing concepts and providing practical exercises, it accommodates various skill levels and prepares readers for industry demands.
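As a minimal illustration of the Python-plus-SQL workflow the book teaches, the standard library's sqlite3 module lets SQL do the aggregation while Python consumes the results; the schema and rows are invented.

```python
# Minimal sketch of Python + SQL integration using the standard library's
# sqlite3 module; the schema and rows are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, total REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 45.5)],
)

# SQL performs the aggregation; Python iterates over the result set.
for customer, total in con.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer"
):
    print(customer, total)

con.close()
```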
Dive into the fascinating world of graph theory and its applications with 'Modern Graph Theory Algorithms with Python.' Through Python programming and real-world case studies, this book equips you with the tools to transform data into graph structures, apply algorithms, and uncover insights, enabling effective solutions in diverse domains such as finance, epidemiology, and social networks.
What this Book will help me do
Understand how to wrangle a variety of data types into network formats suitable for analysis.
Learn to use graph theory algorithms and toolkits such as NetworkX and igraph in Python.
Apply network theory to predict and analyze trends, from epidemics to stock market dynamics.
Explore the intersection of machine learning and graph theory through advanced neural network techniques.
Gain expertise in database solutions with graph database querying and applications.
Author(s)
Colleen M. Farrelly, an experienced data scientist, and Franck Kalala Mutombo, a seasoned software engineer, bring years of expertise in network science and Python programming to every page of this book. Their professional experience includes working on cutting-edge problems in data analytics, graph theory, and scalable solutions for real-world issues. Combining their practical know-how, they deliver a resource aimed at both learning and applying techniques effectively.
Who is it for?
This book is tailored for data scientists, researchers, and analysts with an interest in using graph-based approaches for solving complex data problems. Ideal for those with basic Python knowledge and familiarity with libraries like pandas and NumPy, the content bridges the gap between theory and application. It also provides insights into broad fields where network science can be impactful, contributing value to both students and professionals.
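For a concrete taste of the toolkits the book covers, here is a small, self-contained NetworkX example on toy data: build a graph from an edge list, rank nodes by degree centrality, and find a shortest path.

```python
# Minimal sketch of turning edge data into a graph and computing a
# centrality measure with NetworkX; the network itself is toy data.
import networkx as nx

edges = [
    ("alice", "bob"), ("bob", "carol"),
    ("carol", "alice"), ("carol", "dave"),
]

G = nx.Graph(edges)

# Degree centrality: the fraction of other nodes each node connects to.
for node, score in sorted(
    nx.degree_centrality(G).items(), key=lambda kv: -kv[1]
):
    print(node, round(score, 2))

print(nx.shortest_path(G, "alice", "dave"))  # ['alice', 'carol', 'dave']
```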
Cognitive Science, Computational Intelligence, and Data Analytics: Methods and Applications with Python introduces readers to the foundational concepts of data analysis, cognitive science, and computational intelligence, including AI and machine learning. The book's focus is on fundamental ideas, procedures, and computational intelligence tools that can be applied to a wide range of data analysis approaches, with applications that include mathematical programming, evolutionary simulation, machine learning, and logic-based models. It offers readers the fundamental and practical aspects of cognitive science and data analysis, exploring data analytics in terms of description, evolution, and applicability to real-life problems.
The authors cover the history and evolution of cognitive analytics, methodological concerns in philosophy, syntax and semantics, understanding of generative linguistics, theory of memory and processing theory, structured and unstructured data, qualitative and quantitative data, measurement of variables, and nominal, ordinal, interval, and ratio scale data. The content is tailored to the reader's needs in terms of both type and fundamentals, including coverage of multivariate analysis, CRISP methodology, and SEMMA methodology. Each chapter provides practical, hands-on learning with real-world applications, including case studies and Python programs related to the key concepts being presented.
The book:
Demystifies the theory of data analytics using a step-by-step approach
Covers the intersection of cognitive science, computational intelligence, and data analytics by providing examples and case studies with applied algorithms, mathematics, and Python programming code
Introduces foundational data analytics techniques such as CRISP-DM, SEMMA, and object detection models in the context of computational intelligence methods and tools
Covers key concepts of multivariate and cognitive data analytics such as factor analysis, principal component analysis, linear regression analysis, logistic regression analysis, and value chain applications
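To make one of the listed techniques concrete, here is a minimal, hedged linear-regression example using NumPy's least-squares solver on synthetic data; it is illustrative only, not code from the book.

```python
# Fit y = w*x + b by ordinary least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 0.5, size=100)  # true w=2.5, b=1.0

# Design matrix with a bias column; lstsq minimizes ||A @ coef - y||.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"estimated slope={w:.2f}, intercept={b:.2f}")
```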
Practice makes perfect pandas! Work out your pandas skills against dozens of real-world challenges, each carefully designed to build an intuitive knowledge of essential pandas tasks. In Pandas Workout you'll learn how to:
Clean your data for accurate analysis
Work with rows and columns for retrieving and assigning data
Handle indexes, including hierarchical indexes
Read and write data in a number of common formats, such as CSV and JSON
Process and manipulate textual data from within pandas
Work with dates and times in pandas
Perform aggregate calculations on selected subsets of data
Produce attractive and useful visualizations that make your data come alive
Pandas Workout hones your pandas skills to a professional level through two hundred exercises, each designed to strengthen your pandas skills. You'll test your abilities against common pandas challenges such as importing and exporting, data cleaning, visualization, and performance optimization. Each exercise utilizes a real-world scenario based on real-world data, from tracking the parking tickets in New York City to working out which country makes the best wines. You'll soon find your pandas skills becoming second nature—no more trips to StackOverflow for what is now a natural part of your skillset.
About the Technology
Python's pandas library can massively reduce the time you spend analyzing, cleaning, exploring, and manipulating data. And the only path to pandas mastery is practice, practice, and, you guessed it, more practice. In this book, Python guru Reuven Lerner is your personal trainer and guide through over 200 exercises guaranteed to boost your pandas skills.
About the Book
Pandas Workout is a thoughtful collection of practice problems, challenges, and mini-projects designed to build your data analysis skills using Python and pandas. The workouts use realistic data from many sources: the New York taxi fleet, Olympic athletes, SAT scores, oil prices, and more. Each can be completed in ten minutes or less. You'll explore pandas' rich functionality for string and date/time handling, complex indexing, and visualization, along with practical tips for every stage of a data analysis project.
What's Inside
Clean data with less manual labor
Retrieving and assigning data
Process and manipulate text
Calculations on selected data subsets
About the Reader
For Python programmers and data analysts.
About the Author
Reuven M. Lerner teaches Python and data science around the world and publishes the "Bamboo Weekly" newsletter. He is the author of Manning's Python Workout (2020).
Quotes
"A carefully crafted tour through the pandas library, jam-packed with wisdom that will help you become a better pandas user and a better data scientist." - Kevin Markham, Founder of Data School, creator of pandas in 30 days
"Will help you apply pandas to real problems and push you to the next level." - Michael Driscoll, RFA Engineering, creator of Teach Me Python
"The explanations, paired with Reuven's storytelling and personal tone, make the concepts simple. I'll never get them wrong again!" - Rodrigo Girão Serrão, Python developer and educator
"The definitive source!" - Kiran Anantha, Amazon
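In the spirit of the book's exercises, here is a small, hedged sketch of a typical clean-and-aggregate pandas step on invented ticket data.

```python
# Minimal sketch of a typical pandas clean-and-aggregate step; data is made up.
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "Boston", "Boston", None],
    "fine": ["115", "65", "90", None, "50"],
})

df = df.dropna(subset=["city", "fine"])  # drop incomplete rows
df["fine"] = df["fine"].astype(float)    # text column -> numeric

# Aggregate per city: count of tickets and mean fine.
summary = df.groupby("city")["fine"].agg(["count", "mean"])
print(summary)
```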