Hadoop

ABN Story: Migrating to Future Proof Data Platform

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Rakesh Singh , Marcel Kramer

Azure Cloud Computing Databricks Microsoft

ABN AMRO Bank is one of the leading banks in the Netherlands. It is the third largest bank in the Netherlands by revenue and number of mortgages held, and its top management supports the objective of becoming a fully data-driven bank. ABN AMRO started its data journey almost seven years ago and built a data platform on-premises with Hadoop technologies. This data platform has been used by more than 200 data providers and 150 data consumers, and holds more than 3000 datasets.

Becoming a fully digital bank and addressing the limitations of the on-premises platform required a future-proof data platform: DIAL (digital integration and access layer). ABN AMRO decided to build an Azure cloud-native data platform with the help of Microsoft and Databricks. Last year this cloud-native platform was ready for our data providers and data consumers. Six months ago we started the journey of migrating all the content from the on-premises data platform to the Azure data platform. This was a very large-scale migration, and it was achieved in six months.

In this session, we will focus on three things:

1. The migration strategy going from on-premises to a cloud-native platform
2. Which Databricks solutions were used in the data platform
3. How the Databricks team assisted in the overall migration

Talk by: Rakesh Singh and Marcel Kramer

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Ten years of building open source standards: From Parquet to Arrow to OpenLineage | Astronomer

2023-05-11 · Data Council 2023 Watch

video

by Julien Le Dem (Astronomer)

AI/ML Analytics Arrow Astronomer Data Engineering Dremio Iceberg Parquet

ABOUT THE TALK: Over the last decade, Julien Le Dem has contributed a few successful open source projects to the data ecosystem. In this talk, he shares the story of those contributions and what made their success possible, from the ideation process and early growth of the Apache Parquet columnar format to how this led to the creation of its in-memory alter ego, Apache Arrow. Julien ends by showing how this experience enabled the success of OpenLineage, an LFAI & Data project that brings observability to the data ecosystem.

ABOUT THE SPEAKER: Julien Le Dem is the Chief Architect of Astronomer and Co-Founder of Datakin. He co-created Apache Parquet and is involved in several open source projects including OpenLineage, Marquez (LFAI & Data), Apache Arrow, Apache Iceberg and a few others. Previously, he was a senior principal at WeWork; principal architect at Dremio; tech lead for Twitter’s data processing tools; and principal engineer working on content platforms at Yahoo, where he received his Hadoop initiation.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Feed The Alligators With the Lights On: How Data Engineers Can See Who Really Uses Data | Stemma

2023-05-11 · Data Council 2023 Watch

video

by Mark Grover (Stemma)

AI/ML Analytics API Data Engineering Grafana Spark

ABOUT THE TALK: At Lyft, Mark Grover built the Amundsen data catalog so data scientists could navigate hundreds of thousands of tables to distinguish trustworthy data from sandboxed, out-of-date data. When he took Amundsen open source, he helped dozens of data teams support a variety of demands to make data discoverable and self-serve. Mark frequently sees processes that seem “good enough” come back to bite data teams. In this talk, Mark takes us deep into query logs and APIs to see where all of that metadata lives, and he'll demonstrate how to use it so you don’t lose any fingers during your next data change.

ABOUT THE SPEAKER: Mark Grover is the co-founder/CEO of Stemma, a modern data catalog for building self-serve data culture used by Grafana, iRobot, SoFi, Convoy and many others. He is the co-creator of the leading open-source data catalog, Amundsen, used by Lyft, Instacart, Square, ING, Snap and many more! Mark was previously a developer on Apache Spark at Cloudera and is a committer and PMC member on a few open-source Apache projects. He is a co-author of Hadoop Application Architectures.

Mapping The Data Infrastructure Landscape As A Venture Capitalist

2023-04-03 · Data Engineering Podcast Listen

podcast_episode

by Matt Turck (FirstMark Capital) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Management Databricks Dataiku dbt DuckDB ETL/ELT GenAI Hudi Iceberg +5 more

Summary

The data ecosystem has been building momentum for several years now. As a venture capital investor Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more.

Your host is Tobias Macey and today I'm interviewing Matt Turck about his annual report on the Machine Learning, AI, & Data landscape and the insights around data infrastructure that he has gained in the process.

Interview

Introduction How did you get involved in the area of data management? Can you describe what the MAD landscape report is and the story behind it?

At a high level, what is your goal in the compilation and maintenance of your landscape document? What are your guidelines for what to include in the landscape?

As the data landscape matures, how have you seen that influence the types of projects/companies that are founded?

What are the product categories that were only viable when capital was plentiful and easy to obtain? What are the product categories that you think will be swallowed by adjacent concerns, and which are likely to consolidate to remain competitive?

The rapid growth and proliferation of data tools helped establish the "Modern Data Stack" as a de-facto architectural paradigm. As we move into this phase of contraction, what are your predictions for how the "Modern Data Stack" will evolve?

Is there a different architectural paradigm that you see as growing to take its place?

How has your presentation and the types of information that you collate in the MAD landscape evolved since you first started it? What are the most interesting, innovative, or unexpected product and positioning approaches that you have seen while tracking data infrastructure as a VC and maintainer of the MAD landscape? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the MAD landscape over the years? What do you have planned for future iterations of the MAD landscape?

Contact Info

Website @mattturck on Twitter MAD Landscape Comments Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

MAD Landscape First Mark Capital Bayesian Learning AI Winter Databricks Cloud Native Landscape LUMA Scape Hadoop Ecosystem Modern Data Stack Reverse ETL Generative AI dbt Transform

Podcast Episode

Snowflake IPO Dataiku Iceberg

Podcast Episode

Hudi

Podcast Episode

DuckDB

Podcast Episode

Trino Y42

Podcast Episode

Mozart Data

Podcast Episode

Keboola MPP Database

The intro and outro music is f

Ami Gal, CEO & Co-founder at SQream. We dive deep into Big SQL analytics powered by GPUs, plus the future of compute.

2023-03-29 · Making Data Simple Listen

podcast_episode

by Ami Gal (SQream) , Al Martin (IBM)

Analytics IBM SQL

Ami Gal, CEO & Co-founder at SQream. We dive deep into Big SQL analytics powered by GPUs, plus the future of compute.

02:20 Meet Ami Gal
04:52 What's in a name? sqream.com
08:10 Problem being solved
13:53 The secret sauce: data flow
16:52 Software or HW for scale
20:47 Secret sauce take 2
25:02 Hadoop, future of
27:52 Hybrid cloud
31:31 Go-to-market
35:09 The next 5 years of compute
39:18 Ok, next 20 years
44:17 For fun

LinkedIn: linkedin.com/in/galami
Website: sqream.com

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Reflecting On The Past 6 Years Of Data Engineering

2023-02-06 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey

AI/ML Airflow Alation Analytics API AWS Lambda BI Big Data Cloud Computing Dagster Data Engineering Data Management +12 more

Summary

This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years

Interview

Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role

Followed on from hype about "data science"

Hadoop era Streaming Lambda and Kappa architectures

Not really referenced anymore

"Big Data" era of capture everything has shifted to focusing on data that presents value

Regulatory environment increases risk, better tools introduce more capability to understand what data is useful

Data catalogs

Amundsen and Alation

Orchestration engine

Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc. Orchestration is now a part of most vertical tools

Cloud data warehouses Data lakes DataOps and MLOps Data quality to data observability Metadata for everything

Data catalog -> data discovery -> active metadata

Business intelligence

Read only reports to metric/semantic layers Embedded analytics and data APIs

Rise of ELT

dbt Corresponding introduction of reverse ETL

What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast? What do you have planned for the future of the podcast?

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Materialize

Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.

Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.

Go to materialize.com

Support Data Engineering Podcast

Pushing the limits of scale/performance for enterprise-wide analytics: A fire-side chat with Akamai

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Analytics Azure Cloud Computing Databricks Microsoft

With the world’s most distributed compute platform, from cloud to edge, Akamai makes it easy for businesses to develop and run applications, while keeping experiences closer to users and threats farther away. So when it was time to scale its legacy Hadoop-like infrastructure, which was reaching its capacity limits, while keeping its global operations running uninterrupted, Akamai partnered with Microsoft and Databricks to migrate to Azure Databricks.

Simplifying Migrations to Lakehouse—the Databricks Way

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Cloud Computing Data Lakehouse Databricks

Customers around the world are experiencing tremendous success migrating from legacy on-premises Hadoop architectures to a modern Databricks Lakehouse in the cloud. At Databricks, we have formulated a migration methodology that helps customers sail through this migration journey with ease. In this talk, we will touch upon some of the key elements that minimize risks and simplify the process of migrating to Databricks, and will walk through some of the customer journeys and use cases.

How AARP Services, Inc. automated SAS transformation to Databricks using LeapLogic

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Analytics Cloud Computing Data Science Databricks Delta DWH ETL/ELT Marketing Python SAS Spark

While SAS has been a standard in analytics and data science use cases, it is not cloud-native and does not scale well. Join us to learn how AARP automated the conversion of hundreds of complex data processing, model scoring, and campaign workloads to Databricks using LeapLogic, an intelligent code transformation accelerator that can transform any and all legacy ETL, analytics, data warehouse and Hadoop to modern data platforms.

In this session, experts from AARP and Impetus will share how they collaborated with Databricks and were able to:

• Automate modernization of SAS marketing analytics based on coding best practices
• Establish a rich library of Spark and Python equivalent functions on Databricks with the same capabilities as SAS procedures, DATA step operations, macros, and functions
• Leverage Databricks-native services like Delta Live Tables to implement waterfall techniques for campaign execution and simplify pipeline monitoring

Swedbank: Enterprise Analytics in Cloud

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Analytics Azure Cloud Computing Data Lake Databricks

Swedbank is the largest bank in Sweden and the third largest in the Nordics. They have about 7-8M customers across retail, mortgage, and investment (pensions). One of the key drivers for the bank was to look at data across all silos and build analytics to drive its ML models, which it could not do at the time. That’s when Swedbank made a strategic decision to go to the cloud and make bets on Databricks, Immuta, and Azure.

Enterprise analytics in cloud is an initiative to move Swedbank’s on-premises Hadoop-based data lake into the cloud to provide improved analytical capabilities at scale. The strategic goals of the “Analytics Data Lake” are:

- Advanced analytics: Improve analytical capabilities in terms of functionality, reduce analytics time to market, and enable better predictive modelling
- A catalyst for sharing data: Make data visible, accessible, understandable, linked, and trusted
- Technical advancements: Future-proof with the ability to add new tools/libraries and support for 3rd party solutions for Deep Learning/AI

To achieve these goals, Swedbank had to migrate existing capabilities and application services to Azure Databricks and implement Immuta as its unified access control plane. A “data discovery” space was created where data scientists can scan (new) data and develop, train, and operationalise ML models. Swedbank also requires dynamic and granular data access controls to mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service data discovery and analytics. Protection of sensitive data is key to enabling Swedbank to support key financial services use cases.

The presentation will focus on this journey, calling out key technical challenges, learning & benefits observed.

"Data systems are just like toddlers" - Rohit Choudhary, Acceldata

2022-04-27 · Making Data Simple Listen

podcast_episode

by Rohit Choudhary (Acceldata) , Al Martin (IBM)

AI/ML IBM

The core hypothesis is that all companies are data driven, all have data complexity, and there is not enough talent out there to solve it.

Show Notes
04:01 Why Start Acceldata
06:02 Core hypothesis for Acceldata
08:59 Will Hadoop survive? And a tip of the hat to mainframes
11:12 Meaning of Data Observability & managing the data pipeline
13:26 The most troublesome data layer
20:10 Acceldata value prop
24:58 A typical observability engagement
29:08 AI is built in
30:38 Early client adoption
33:26 Preventing future data reliability issues
35:10 Acceldata's futures and why now?

Find Rohit: https://www.linkedin.com/in/rconline/
Find Acceldata: https://www.linkedin.com/company/acceldata/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

Abstract
Making Data Simple Podcast is hosted by Al Martin, WW VP Account Technical Leader IBM Technology Sales, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera

2022-03-27 · Data Engineering Podcast Listen

podcast_episode

by Balaji Ganesan (Privacera) , Tobias Macey

Analytics CDP Cloud Computing Data Engineering Data Governance Data Lake Data Management ETL/ELT Kubernetes Modern Data Stack Oracle Presto +7 more

Summary

Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise grade solution for cloud and hybrid data governance built on top of the robust and battle tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog.

Your host is Tobias Macey and today I’m interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Privacera is and the story behind it? What is your working definition of "data governance" and how does that influence your product focus and priorities? What are some of the lessons that you learned from your work on Apache Ranger that helped with your efforts at Privacera? How would you characterize your position in the market for data governance/data security tools? What are the unique constraints and challenges that come into play when managing data in cloud platforms? Can you explain how the Privacera platform is architected?

How have the design and goals of the system changed or evolved since you started working on it?

What is the workflow for an operator integrating Privacera into a data platform?

How do you provide feedback to users about the level of coverage for discovered data assets?

How does Privacera fit into the workflow of the different personas working with data?

What are some of the security and privacy controls that Privacera introduces?

How do you mitigate the potential for anyone to bypass Privacera’s controls by interacting directly with the underlying systems? What are the most interesting, innovative, or unexpected ways that you have seen Privacera used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacera? When is Privacera the wrong choice? What do you have planned for the future of Privacera?

Contact Info

LinkedIn @Balaji_Blog on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Privacera Hadoop Hortonworks Apache Ranger Oracle Teradata Presto/Trino Starburst

Podcast Episode

Ahana

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Simplify Big Data Analytics with Amazon EMR

2022-03-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Sakti Mishra (AWS)

Analytics AWS Amazon EMR Big Data Cloud Computing Data Analytics Data Governance ETL/ELT Java Python Scala Cyber Security +5 more

Simplify Big Data Analytics with Amazon EMR is a thorough guide to harnessing Amazon's EMR service for big data processing and analytics. From distributed computation pipelines to real-time streaming analytics, this book provides hands-on knowledge and actionable steps for implementing data solutions efficiently.

What this Book will help me do

- Understand the architecture and key components of Amazon EMR and how to deploy it effectively.
- Learn to configure and manage distributed data processing pipelines using Amazon EMR.
- Implement security and data governance best practices within the Amazon EMR ecosystem.
- Master batch ETL and real-time analytics techniques using technologies like Apache Spark.
- Apply optimization and cost-saving strategies to scalable data solutions.

Author(s)

Sakti Mishra is a seasoned data professional with extensive expertise in deploying scalable analytics solutions on cloud platforms like AWS. With a background in big data technologies and a passion for teaching, Sakti ensures practical insights accompany every concept. Readers will find his approach thorough, hands-on, and highly informative.

Who is it for?

This book is perfect for data engineers, data scientists, and other professionals looking to leverage Amazon EMR for scalable analytics. If you are familiar with Python, Scala, or Java and have some exposure to Hadoop or AWS ecosystems, this book will empower you to design and implement robust data pipelines efficiently.

Data Analysis with Python and PySpark

2022-03-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jonathan Rioux

AI/ML Analytics API Big Data Cloud Computing Data Science Microsoft Pandas PySpark Python Spark apache-spark +2 more

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:

- Manage your data as it scales across multiple machines
- Scale up your data programs with full confidence
- Read and write data to and from a variety of sources and formats
- Deal with messy data with PySpark’s data manipulation functionality
- Discover new data sets and perform exploratory data analysis
- Build automated data pipelines that transform, summarize, and get insights from data
- Troubleshoot common PySpark errors
- Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.

About the Technology

The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.

About the Book

Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source, whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.

What's Inside

- Organizing your PySpark code
- Managing your data, no matter the size
- Scaling up your data programs with full confidence
- Troubleshooting common data pipeline problems
- Creating reliable long-running jobs

About the Reader

Written for data scientists and data engineers comfortable with Python.

About the Author

As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Quotes

A clear and in-depth introduction for truly tackling big data with Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine

The perfect way to learn how to analyze and master huge datasets. - Gary Bake, Brambles

Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on. - Philippe Van Bergenl, P² Consulting

For beginner to pro, a well-written book to help understand PySpark. - Raushan Kumar Jha, Microsoft

One Database to Rule All Workloads? With Jon "Natty" Natkins of dbt Labs

2022-03-11 · The Analytics Engineering Podcast Listen

podcast_episode

by Jon "Natty" Natkins (dbt Labs) , Julia Schottenstein (dbt labs)

Analytics Analytics Engineering Cloud Computing dbt

Will the dream of a mythical database to handle all workloads (transactional + analytical) ever become a reality, or does it violate the laws of physics? This question sparked a hearty debate internally at dbt Labs, and Jon "Natty" Natkins joins Julia here to continue the conversation. Natty knows databases, and this episode will take you on a historical romp through the rise and fall of Hadoop, the transition to cloud data warehouses, and what's waiting for us next in database-land. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

A Reflection On The Data Ecosystem For The Year 2021

2022-01-02 · Data Engineering Podcast Listen

podcast_episode

by Maura Church (Patreon) , David Wallace (Good Eggs) , Gleb Mezhanskiy (Datafold) , Benn Stancil (Mode) , Tobias Macey

Airbyte BI BigQuery Cloud Computing Dagster Data Engineering Data Management dbt DWH ETL/ELT Fivetran Kubernetes +4 more

Summary This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist, Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.

Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.

Your host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year.

Interview

Introduction

How did you get involved in the area of data management?

What were the main themes that you saw data practitioners and vendors focused on this year?

What is the major bottleneck for Data teams in 2021? Will it be the same in 2022? One of the ways to reason about progress in any domain is to look at what was the primary bottleneck of further progress (data adoption for decision making) at different points in time. In the data domain, we have seen a number of bottlenecks, for example, scaling data platforms, the answer to which was Hadoop and on-prem columnar stores and then cloud data warehouses such as Snowflake & BigQuery. Then the problem was data integration and transformation which was solved by data integration vendors and frameworks such as Fivetran / Airbyte, modern orchestration frameworks such as Dagster & dbt and “reverse-ETL” Hightouch. What is the main challenge now?

Will SQL be challenged as a primary interface to analytical data? In 2020 we’ve seen a few launches of post-SQL languages such as Malloy, Preql, metric layer query languages from Transform and Supergrain.

To what extent does speed matter? Over the past

[COALESCE] How big is this wave? Ft. Martin Casado of a16z

2021-12-07 · The Analytics Engineering Podcast Listen

podcast_episode

by Martin Casado (a16z)

Analytics Analytics Engineering dbt DWH Modern Data Stack

The modern data stack is the third generation of data analysis products to come to prominence since the 90's. The prior waves—data warehouse appliances and then Hadoop—were both big steps forwards but ultimately failed to live up to their initial promise. Is the modern data stack just another iteration in a long string of "trendy technologies" in data––waves that crash upon the shore but ultimately recede? Or is it somehow more permanent? Register to catch the rest of Coalesce, the Analytics Engineering Conference, at https://coalesce.getdbt.com. The Analytics Engineering Podcast is brought to you by dbt Labs.

Exploring Processing Patterns For Streaming Data Integration In Your Data Lake

2021-11-20 · Data Engineering Podcast Listen

podcast_episode

by Ori Rafael (Upsolver) , Tobias Macey

Airflow Analytics AWS Lambda BI CI/CD Data Engineering Data Lake Data Management Data Quality Datafold dbt ETL/ELT +7 more

Summary One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch oriented mindset.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.

Your host is Tobias Macey and today I’m interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by giving an overview of the state of the market for data lakes today?

What are the prevailing architectural and technological patterns that are being used to manage these systems?

Batch and streaming systems have been used in various combinations since the early days of Hadoop. The Lambda architecture has largely been abandoned, so what is the answer for today’s data lakes?

What are the challenges presented by streaming approaches to data transformations?

The batch model for processing is intuitive despite its latency problems. What are the benefits that it provides?

The core concept for data orchestration is the DAG. How does that manifest in a streaming context?

In batch processing idempotent/immutable datasets are created by re-running the entire pipeline when logic changes need to be made. Given that there is no definitive start or end of a stream, what are the options for amending logical errors in transformations?

What are some of the da

[Replay] Understanding Apache Spark with Jean-Georges Perrin

2021-09-15 · Making Data Simple Listen

podcast_episode

by Jean-Georges Perrin (Actian) , Al Martin (IBM)

Big Data IBM Spark

Abstract

Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Jean-Georges Perrin, Director of Engineering at weexperience. Together, they discuss — and compare — Apache Spark and Hadoop, and explain what it means to hold the title of IBM Champion.

Show Notes

02:07 - Connect with Jean-Georges Perrin on LinkedIn and Twitter, and check out his website.
13:14 - Check out Jean-Georges' book on Apache Spark.
24:38 - What does it mean to be an IBM Champion?

Connect with the Team

Producer Kate Brown - LinkedIn.
Producer Steve Templeton - LinkedIn.
Host Al Martin - LinkedIn and Twitter.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Cloudera Data Platform Private Cloud Base with IBM Spectrum Scale

2021-08-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by John Sing , Prashanth Shetty , Wei Gong , Linda Cham

Analytics CDP Cloud Computing Data Lake IBM Spark cloudera data data-engineering

This IBM® Redpaper publication provides guidance on building an enterprise-grade data lake by using IBM Spectrum® Scale and Cloudera Data Platform (CDP) Private Cloud Base for performing in-place Cloudera Hadoop or Cloudera Spark-based analytics. It also covers the benefits of the integrated solution and gives guidance about the types of deployment models and considerations during the implementation of these models. The August 2021 update added CES protocol support in Hadoop environments.
