Python

Scaling Python with Ray

2022-11-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Holden Karau (Fight Health Insurance), Boris Lublinsky

data data-science

Serverless computing enables developers to concentrate solely on their applications rather than worry about where they've been deployed. With the Ray general-purpose serverless implementation in Python, programmers and data scientists can hide servers, implement stateful applications, support direct communication between tasks, and access hardware accelerators. In this book, experienced software architecture practitioners Holden Karau and Boris Lublinsky show you how to scale existing Python applications and pipelines, allowing you to stay in the Python ecosystem while reducing single points of failure and manual scheduling. Scaling Python with Ray is ideal for software architects and developers eager to explore successful case studies and learn more about decision and measurement effectiveness. If your data processing or server application has grown beyond what a single computer can handle, this book is for you. You'll explore distributed processing (the pure Python implementation of serverless) and learn how to:

Implement stateful applications with Ray actors
Build workflow management in Ray
Use Ray as a unified system for batch and stream processing
Apply advanced data processing with Ray
Build microservices with Ray
Implement reliable Ray applications

Supporting And Expanding The Arrow Ecosystem For Fast And Efficient Data Processing At Voltron Data

2022-11-28 · Data Engineering Podcast Listen

podcast_episode

by Wes McKinney (Posit), Tobias Macey

AI/ML Airflow Analytics Arrow BI Dashboard Data Engineering Data Management dbt DuckDB ETL/ELT GitHub +18 more

Summary The data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.

Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.

Your host is Tobias Macey and today I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow ecosystem.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Voltron Data and the story behind it?
What is the vision for the broader data ecosystem that you are trying to realize through your investment in Arrow and related projects?
How does your work at Voltron Data contribute to the realization of that vision?
What is the impact on engineer productivity and compute efficiency that gets introduced by the impedance mismatches between language and framework representations of data?
The scope and capabilities of the Arrow project have grown substantially since it was first introduced. Can you give an overview of the current features and extensions to the project?
What are some of the ways that Arrow and its related projects can be integrated with or replace the different elements of a data platform?
Can you describe how Arrow is implemented?
What are the most complex/challenging aspects of the engineering needed to support interoperable data interchange between language runtimes?
How are you balancing the desire to move quickly and improve the Arrow protocol and implementations, with the need to wait for other players in the ecosystem (e.g. database engines, compute frameworks, etc.) to add support?
With the growing application of data formats such as graphs and vectors, what do you see as the role of Arrow and its ideas in those use cases?
For workflows that rely on integrating structured and unstructured data, what are the options for interaction with non-tabular data (e.g. images, documents, etc.)?
With your support-focused business model, how are you approaching marketing and customer education to make it viable and scalable?
What are the most interesting, innovative, or unexpected ways that you have seen Arrow used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arrow and its ecosystem?
When is Arrow the wrong choice?
What do you have planned for the future of Arrow?

Contact Info

Website wesm on GitHub @wesmckinn on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Voltron Data
Pandas

Podcast Episode

Apache Arrow
Partial Differential Equation
FPGA == Field-Programmable Gate Array
GPU == Graphics Processing Unit
Ursa Labs
Voltron (cartoon)
Feature Engineering
PySpark
Substrait
Arrow Flight
Acero
Arrow Datafusion
Velox
Ibis
SIMD == Single Instruction, Multiple Data
Lance
DuckDB

Podcast Episode

Data Threads Conference
Nano-Arrow
Arrow ADBC Protocol
Apache Iceberg

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet

2022-11-21 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey, Salma Bakouk (Sifflet)

Airflow Analytics AWS Azure BigQuery CDP CI/CD Cloud Computing Data Engineering Data Lake Data Management Data Quality +14 more

Summary The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with Sifflet.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Sifflet is and the st

Episode 104: Jason Turner from CppCast! (Part 2)

2022-11-18 · ADSP: Algorithms + Data Structures = Programs Listen

podcast_episode

by Conor Hoekstra, Jason Turner, Bryce Adelstein Lelbach (NVIDIA)

Rust

In this episode, Conor continues his conversation with Jason Turner! Link to Episode 104 on Website

Twitter
ADSP: The Podcast
Conor Hoekstra
Bryce Adelstein Lelbach

About the Guest: Jason is host of the YouTube channel C++ Weekly, co-host emeritus of the podcast CppCast, author of C++ Best Practices, and author of the first casual puzzle books designed to teach C++ fundamentals while having fun!

A list of Jason’s content:
C++ Weekly YouTube Channel
The [Fill in the Blank] Programmer YouTube Channel
C++ Books
Talk Playlist

Show Notes
Date Recorded: 2022-10-26
Date Released: 2022-11-18
Final Episode of CppCast
A talk with Jason Turner: the history of CppCast, and why it was shut down
The [Fill in the Blank] Programmer YouTube Channel
C++ auto
Making C++ Fun, Safe, and Accessible – Jason Turner - C++ on Sea 2022
C++ Weekly - Ep 347 - This PlayStation Jailbreak NEVER SHOULD HAVE HAPPENED
C++ std::unordered_map::operator=
Python defaultdict
C++Now 2019: Peter Sommerlad “How I learned to Stop Worrying and Love the C++ Type System”
C++ explicit specifier
Hoogle Haskell Function Search Engine
Roogle Rust Function Search Engine
CLion Code Completion
Denver C++ Meetup

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Taking A Look Under The Hood At CreditKarma's Data Platform

2022-11-14 · Data Engineering Podcast Listen

podcast_episode

by Vishnu Venkataraman (CreditKarma), Tobias Macey

Airflow Analytics AWS Azure BigQuery CDP CI/CD Cloud Computing Data Engineering Data Lake Data Management Data Quality +14 more

Summary CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design.

Interview

Introduction How did you get involved in the area of data management? Can you describe what CreditKarma is and the role

Applied Machine Learning and AI for Engineers

2022-11-10 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Jeff Prosise

AI/ML Keras Scikit-learn TensorFlow ai-ml data machine-learning

While many introductory guides to AI are calculus books in disguise, this one mostly eschews the math. Instead, author Jeff Prosise helps engineers and software developers build an intuitive understanding of AI to solve business problems. Need to create a system to detect the sounds of illegal logging in the rainforest, analyze text for sentiment, or predict early failures in rotating machinery? This practical book teaches you the skills necessary to put AI and machine learning to work at your company. Applied Machine Learning and AI for Engineers provides examples and illustrations from the AI and ML course Prosise teaches at companies and research institutions worldwide. There's no fluff and no scary equations—just a fast start for engineers and software developers, complete with hands-on examples. This book helps you: Learn what machine learning and deep learning are and what they can accomplish Understand how popular learning algorithms work and when to apply them Build machine learning models in Python with Scikit-Learn, and neural networks with Keras and TensorFlow Train and score regression models and binary and multiclass classification models Build facial recognition models and object detection models Build language models that respond to natural-language queries and translate text to other languages Use Cognitive Services to infuse AI into the apps that you write

Clean Up Your Data Using Scalable Entity Resolution And Data Mastering With Zingg

2022-11-07 · Data Engineering Podcast Listen

podcast_episode

by Sonal Goyal (Nube Technologies), Tobias Macey

AI/ML Airflow Analytics AWS Azure BigQuery CDP CI/CD Cloud Computing Data Engineering Data Lake Data Management +15 more

Summary Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but they have typically required custom development and training machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution frame

#205: Nailing the Data Science / Analytics Job Interview with Jay Feng

2022-11-01 · The Analytics Power Hour Listen

podcast_episode

by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Moe Kiss (Canva), Michael Helbling (Search Discovery), Jay Feng (Interview Query)

Analytics Data Science

So, you finally took that recruiter's call, and then you made it through the initial phone screen. You weren't really expecting that to happen, but now you're facing an actual interview! It sounds intense and, yet, you're not sure what to expect or how to prepare for it. Flash cards with statistical concepts? A crash course in Python? LinkedIn stalking of current employees of the company? Maybe. We asked Jay Feng from Interview Query to join us to discuss strategies and tactics for data scientist and analyst interviews, and we definitely wanted to hire him by the time we were done! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

#111 The Rise of the Julia Programming Language

2022-10-31 · DataFramed Listen

podcast_episode

by Zacharias Voulgaris

AI/ML Analytics Data Science

Python has dominated data science programming for the last few years, but there’s another rising star programming language seeing increased adoption and popularity—Julia.

With Julia now the fourth most popular programming language, many data teams and practitioners are turning their attention toward understanding it and seeing how it could benefit individual careers and business operations and drive increased value across organizations.

Zacharias Voulgaris, PhD joins the show to talk about his experience with the Julia programming language and his perspective on the future of Julia’s widespread adoption. Zacharias is the author of Julia for Data Science. As a Data Science consultant and mentor with 10 years of international experience that includes the role of Chief Science Officer at three startups, Zacharias is an expert in data science, analytics, artificial intelligence, and information systems.

In this episode, we discuss the strengths of Julia, how data scientists can get started using Julia, how team members and leaders alike can transition to Julia, why companies are secretive about adopting Julia, the interoperability of Julia with Python and other popular programming languages, and much more.

Check out this month’s events: https://www.datacamp.com/data-driven-organizations-2022

Take the Introduction to Julia course for free!

https://www.datacamp.com/courses/introduction-to-julia

Analytics Engineering Without The Friction Of Complex Pipeline Development With Optimus and dbt

2022-10-30 · Data Engineering Podcast Listen

podcast_episode

by Nandam Karthik (Optimus), Tobias Macey

Airflow Analytics Analytics Engineering AWS Azure BigQuery CDP CI/CD Cloud Computing Data Analytics Data Engineering Data Lake +16 more

Summary One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Nand

How to leverage dbt Community as the first & ONLY data hire to survive

2022-10-25 · dbt Coalesce 2022 Watch

video

by Fabiyi Opeyemi (Data Culture)

AI/ML Analytics Cloud Computing Data Engineering Data Science dbt Snowflake SQL

As data science and machine learning adoption grew over the last few years, Python moved up the ranks, catching up to SQL in popularity in the world of data processing. SQL and Python are both powerful on their own, but their value in modern analytics is highest when they work together. This was a key motivator for us at Snowflake to build Snowpark for Python: to help modern analytics, data engineering, and data science teams generate insights without complex infrastructure management for separate languages.

Join this session to learn more about how dbt's new support for Python-based models and Snowpark for Python can help polyglot data teams get more value from their data through secure, efficient and performant metrics stores, feature stores, or data factories in the Data Cloud.

Check the slides here: https://docs.google.com/presentation/d/1xJEyfg81azw2hVilhGZ5BptnAQo8q1L7aDLGrnSYoUM/edit?usp=sharing

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

Empowering pythonistas with dbt and snowpark

2022-10-25 · dbt Coalesce 2022 Watch

video

by Eda Johnson (Snowflake)

AI/ML dbt Snowflake

Python model support is one of the hottest topics in the dbt Community Slack right now—but also one of the most intangible. What’s possible? What’s practical? What’s the point? In this session, Eda Johnson (Snowflake) will show how Snowpark Python further enhances a dbt + Snowflake development experience by supporting new workloads, including Machine Learning within your dbt DAGs.

Check the slides here: https://docs.google.com/presentation/d/1IJSeE96bze7DECuDYqsTVv6FaOcNcJ5tTiCWKEku_QQ/edit#slide=id.g14034313943_0_36

dbt Labs and Databricks: best practices and future roadmap

2022-10-25 · dbt Coalesce 2022 Watch

video

by Bilal Aslam (Databricks) , Nana Essuman (Conde Nast)

AI/ML Analytics Cloud Computing Data Lakehouse Databricks dbt SQL

The Databricks Lakehouse Platform unifies the best of data warehouses and data lakes in one simple platform to handle all your data, analytics and AI use cases. Databricks now includes complete support for dbt Core and dbt Cloud, and you will hear from Conde Nast about using dbt and Databricks together to democratize insights. We will also share best practices for developing and productionizing dbt projects containing SQL and Python, governing data with standard SQL, and exciting features on our roadmap such as materialized views for Databricks SQL.

Money, Python, and the Holy Grail: Designing Operational Data Models

2022-10-25 · dbt Coalesce 2022 Watch

video

by Benn Stancil (Mode)

Analytics Analytics Engineering Dashboard

Most analysts don’t become analysts to build dashboards. We don’t become analysts to do data pulls, or clean up messy data, or put together pitch decks. We become analysts to do impactful, strategic analysis. This is our calling; it’s the most valuable work that we do; and it’s why we put up with the rest of our job—for that afternoon with nothing but a big question, a clear calendar, and a trajectory-changing aha moment buried somewhere in our well-prepped datasets.

But the rapid rise of analytics engineering should make us question all of this. Is strategic analysis actually the holy grail of analytics? Is it the most valuable thing we could do? Is it even what we want to do?

In chasing this ambition, Benn Stancil (Mode) thinks we’ve lost sight of something even more important—and potentially, more interesting: Designing operational models. These frameworks, which are a natural extension of the semantic models built by analytics engineers, are often more valuable than any dashboard, any dataset, or any deep dive analysis.

In his talk, Benn will share what these models are, why they’re valuable, and why, in our eternal quest both to quantify our value and to find work we love, they could prove to be the holy grail we’ve always been looking for.

Check the slides here: https://docs.google.com/presentation/d/1lOH6Sb8DQnnlmZkYOlqqHgQeXKkUEQCm_LOxsjBRJlM/edit?usp=sharing

The accidental analytics engineer

2022-10-25 · dbt Coalesce 2022 Watch

video

by Michael Chow (RStudio)

Analytics Analytics Engineering Data Engineering Data Science

There’s a good chance you’re an analytics engineer who just sort of landed in an analytics engineering career. Or made a murky transition from data science/data engineering/software engineering to full-time analytics person. When did you realize you fell into the wild world of analytics engineering?

In this session, Michael Chow (RStudio) draws upon his experience building open source data science tools and working with the data science community to discuss the early signs of a budding analytics engineer, and the small steps these folks can take to keep the best parts of Python and R, all while moving towards engineering best practices.

Check the slides here: https://docs.google.com/presentation/d/1H2fVa-I4D8ibanlqLutIrwPOVypIlXVzEITDUNzzPpU/edit?usp=sharing

Announcing dbt's Second Language: When and Why We Turn to Python

2022-10-25 · dbt Coalesce 2022 Watch

video

by Cody Peterson , Jeremy Cohen (dbt Labs) , Leah Antkiewicz

dbt SQL

For the first time in dbt, you can now run Python models, making it possible to supplement the accessibility of SQL with a new level of power and flexibility.

When is it useful to use Python, and when should you stick with SQL instead? What might a multilingual dbt project look like in practice, and what could it make possible for your team?

Join Jeremy Cohen, Cody Peterson, and Leah Antkiewicz to explore these questions in this interactive session.

Check the slides here: https://docs.google.com/presentation/d/1e3wB7EQ0EXugGhfCjVCp_dDFEbY_uKyVjMqG1o7alnA/edit?usp=sharing

dbt Labs + Snowflake: Why SQL and Python go perfectly well together

2022-10-25 · dbt Coalesce 2022 Watch

video

by Torsten Grabs (Snowflake)

AI/ML Analytics Cloud Computing Data Engineering Data Science dbt Snowflake SQL

As data science and machine learning adoption grew over the last few years, Python moved up the ranks, catching up to SQL in popularity in the world of data processing. SQL and Python are both powerful on their own, but their value in modern analytics is highest when they work together. This was a key motivator for us at Snowflake to build Snowpark for Python: to help modern analytics, data engineering, and data science teams generate insights without complex infrastructure management for separate languages.

Join this session to learn more about how dbt's new support for Python-based models and Snowpark for Python can help polyglot data teams get more value from their data through secure, efficient and performant metrics stores, feature stores, or data factories in the Data Cloud.

Check the Notion document here: https://www.notion.so/6382db82046f41599e9ec39afb035bdb

Workshop: Build your first dbt Python model

2022-10-25 · dbt Coalesce 2022 Watch

video

by Nicholas Yager (dbt Labs) , Wasila Quader (dbt Labs)

dbt SQL

Description: dbt now supports Python models! In this hands-on workshop you’ll learn how to build your first Python models in dbt, alongside SQL at the center of your transformations.

You’ll learn how to:
- Build your Python transformation in a notebook
- Add this transformation as a model in your dbt project
- Decide between building models in SQL or in Python

Prerequisites:
- Basic familiarity with Python and DataFrames
- If you want to use your own warehouse and dbt project, make sure that you have dbt 1.3 installed and have followed the “additional setup” from our docs

Check the slides here: https://docs.google.com/presentation/d/133CVwwAxc5qT80ZJwngQ_ZSikOkCttvzWwGpdZCgOHQ/edit#slide=id.g1693e59a4f4_0_0

The Book of Dash

2022-10-25 · O'Reilly Data Science Books O'Reilly Amazon

book

by Christian Mayer , Adam Schroeder , Ann Marie Ward

AI/ML Dashboard DataViz Git Pandas Plotly dashboards data data-science data-science-tasks data-visualization

A swift and practical introduction to building interactive data visualization apps in Python, known as dashboards. You’ve seen dashboards before; think election result visualizations you can update in real time, or population maps you can filter by demographic. With the Python Dash library you’ll create analytic dashboards that present data in effective, usable, elegant ways in just a few lines of code. The book is fast-paced and caters to those entirely new to dashboards. It will talk you through the necessary software, then get straight into building the dashboards themselves. You’ll learn the basic format of a Dash app by building a Twitter analysis dashboard that maps the number of likes certain accounts gained over time. You’ll build up skills through three more sophisticated projects. The first is a global analysis app that compares country data in three areas: the percentage of a population using the internet, percentage of parliament seats held by women, and CO2 emissions. You’ll then build an investment portfolio dashboard, and an app that allows you to visualize and explore machine learning algorithms.

In this book you will:
- Create and run your first Dash apps
- Use the pandas library to manipulate and analyze social media data
- Use Git to download and build on existing apps written by the pros
- Visualize machine learning models in your apps
- Create and manipulate statistical and scientific charts and maps using Plotly

Dash combines several technologies to get you building dashboards quickly and efficiently. This book will do the same.

SQL Antipatterns, Volume 1

2022-10-24 · O'Reilly SQL Books O'Reilly Amazon

book

by Bill Karwin

Data Modelling Java MySQL RDBMS Cyber Security SQL

SQL is the ubiquitous language for software developers working with structured data. Most developers who rely on SQL are experts in their favorite language (such as Java, Python, or Go), but they're not experts in SQL. They often depend on antipatterns - solutions that look right but become increasingly painful to work with as you uncover their hidden costs. Learn to identify and avoid many of these common blunders. Refactor an inherited nightmare into a data model that really works. Updated for the current versions of MySQL and Python, this new edition adds a dozen brand new mini-antipatterns for quick wins. No matter which platform, framework, or language you use, the database is the foundation of your application, and the SQL database language is the standard for working with it. Antipatterns are solutions that look simple at the surface, but soon mire you down with needless work. Learn to identify these traps, and craft better solutions for the often-asked questions in this book. Avoid the mistakes that lead to poor performance and quality, and master the principles that make SQL a powerful and flexible tool for handling data and logic. Dive deep into SQL and database design, and learn to recognize the most common missteps made by software developers in database modeling, SQL query logic, and code design of data-driven applications. See practical examples of misconceptions about SQL that can lure software projects astray. Find the greatest value in each group of data. Understand why an intersection table may be your new best friend. Store passwords securely and don't reinvent the wheel. Handle NULL values like a pro. Defend your web applications against the security weakness of SQL injection. Use SQL the right way - it can save you from headaches and needless work, and let your application really shine! What You Need: The SQL examples use the MySQL 8.0 flavor, but other popular brands of RDBMS are mentioned. Other code examples use Python 3.9+ or Ruby 2.7+.
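As a rough illustration of the SQL injection point in the blurb above, here is a minimal sketch of a parameterized query. It is not taken from the book: it swaps in Python's built-in `sqlite3` module in place of MySQL so that it runs without a server, and the `users` table, `build_demo_db`, and `find_user` are hypothetical names chosen for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical `users` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user(conn, name):
    """Look up users by name using a bound parameter.

    The "?" placeholder hands `name` to the driver as data, so quote
    characters in the input are never interpreted as SQL syntax.
    """
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_user(conn, "alice"))         # an ordinary lookup
    print(find_user(conn, "x' OR '1'='1"))  # an injection attempt, matched literally
```

With string concatenation, the second input would turn the `WHERE` clause into a condition that is always true; with a bound parameter it is compared as a literal name and matches no rows. Other drivers use different placeholder styles, so the `?` marker here should be read as specific to `sqlite3`.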
