talk-data.com

Topic

Analytics

data_analysis insights metrics

4552

tagged

Activity Trend

Peak: 398 activities per quarter (2020-Q1 to 2026-Q1)

Activities

4552 activities · Newest first

Presto 101: An Introduction to Open Source Presto

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can perform ad hoc querying of data in place, which helps reduce both the time it takes to discover data and the time it takes to run ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector deliver added benefits around performance, scale, and ecosystem.

In this session, Philip and Rohan will introduce the Presto technology and share why it's becoming so popular – in fact, companies like Facebook, Uber, Twitter, Alibaba, and many more use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. We'll also show a quick demo of getting Presto running in AWS.
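To make the "query data in place" idea concrete, here is a minimal sketch (not from the session) of an ad hoc Presto query issued from Python using the presto-python-client package; the coordinator host, catalog, and the web_events table are hypothetical placeholders.

    import prestodb  # pip install presto-python-client

    # Connect to a Presto coordinator (host, user, catalog, and schema are placeholders).
    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()

    # Ad hoc aggregation directly over data in the lake - no load or ETL step first.
    cur.execute(
        "SELECT event_date, count(*) AS events "
        "FROM web_events "
        "GROUP BY event_date "
        "ORDER BY event_date DESC "
        "LIMIT 7"
    )
    for row in cur.fetchall():
        print(row)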

Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

In recent years, new privacy laws and regulations have brought a fundamental shift in the protection of data and privacy, posing new challenges to data applications. To resolve these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone and Opaque. However, to the best of our knowledge, none of them provide full protection for data pipelines in Spark applications; an adversary may still obtain sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., supporting only Spark SQL. In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum and Intel SGX. It ensures that all Spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java or Python can be migrated to our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/SparkSQL workloads with our solution in an untrusted cloud environment and share real-world use cases for PPMLA.

Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse

John Deere is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture and construction, driving sustainability and ultimately helping to feed the world. The John Deere Data Factory that is built upon the Databricks Lakehouse Platform is at the core of this innovation. It ingests petabytes of data and trillions of records to give data teams fast, reliable access to standardized data sets supporting 100s of ML and analytics use cases across the organization. From IoT sensor-enabled equipment driving proactive alerts that prevent failures, to precision agriculture that maximizes field output, to optimizing operations in the supply chain, finance and marketing, John Deere is providing advanced products, technology and services for customers who cultivate, harvest, transform, enrich, and build upon the land.

ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

Data is the key component of any analytics, AI, or ML platform. Organizations are unlikely to succeed without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build data storage (Redshift) in a form that is easily consumed by AI/ML programs, using AWS services in combination with open source software (Spark) and the Enterprise Edition of Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for over six years, with 400+ nightly batch jobs for about $1,000/month: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (and even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.
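As an illustration of the source-transform-load pattern described above, here is a minimal PySpark sketch (not Capital One's actual code); the S3 path, table names, and Redshift connection details are hypothetical placeholders, and a Redshift JDBC driver is assumed to be available on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("nightly-etl-sketch").getOrCreate()

    # Source: raw files landed in S3 (path is a placeholder).
    raw = spark.read.option("header", "true").csv("s3://example-bucket/hr/raw/")

    # Transform: basic quality checks and standardization at scale.
    clean = (
        raw.dropDuplicates(["employee_id"])
           .filter(F.col("employee_id").isNotNull())
           .withColumn("load_ts", F.current_timestamp())
    )

    # Load: write the curated data to Redshift over JDBC (connection details are placeholders).
    (clean.write.format("jdbc")
          .option("url", "jdbc:redshift://example-cluster:5439/analytics")
          .option("dbtable", "hr.employees_clean")
          .option("user", "etl_user")
          .option("password", "***")
          .mode("append")
          .save())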

Swedbank: Enterprise Analytics in Cloud

Swedbank is the largest bank in Sweden and the third largest in the Nordics. They have about 7-8M customers across retail, mortgage, and investment (pensions). One of the key drivers for the bank was to look at data across all silos and build analytics to drive their ML models - and they couldn't. That's when Swedbank made a strategic decision to go to the cloud and make bets on Databricks, Immuta, and Azure.

Enterprise analytics in cloud is an initiative to move Swedbank's on-premise Hadoop-based data lake into the cloud to provide improved analytical capabilities at scale. The strategic goals of the "Analytics Data Lake" are:

  • Advanced analytics - Improve analytical capabilities in terms of functionality, reduce analytics time to market, and enable better predictive modelling.
  • A Catalyst for Sharing Data - Make data Visible, Accessible, Understandable, Linked, and Trusted.
  • Technical advancements - Future proof, with the ability to add new tools/libraries and support for 3rd party solutions for Deep Learning/AI.

To achieve these goals, Swedbank had to migrate existing capabilities and application services to Azure Databricks and implement Immuta as its unified access control plane. A "data discovery" space was created so that data scientists can come in, scan (new) data, and develop, train and operationalise ML models. Swedbank also requires dynamic and granular data access controls to mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service data discovery and analytics. Protection of sensitive data is key to enabling Swedbank to support key financial services use cases.

The presentation will focus on this journey, calling out key technical challenges, learnings, and benefits observed.

Unlocking the power of data, AI & analytics: Amgen’s journey to the Lakehouse | Kerby Johnson

In this keynote, you will learn more about Amgen's data platform journey from data warehouse to data lakehouse. They'll discuss their decision process and the challenges they faced with legacy architectures, and how they designed and implemented a sustainable platform strategy with Databricks Lakehouse, accelerating their ability to democratize data to thousands of users.
Today, Amgen has implemented 400+ data science and analytics projects covering use cases like clinical trial optimization, supply chain management and commercial sales reporting, with more to come as they complete their digital transformation and unlock the power of data across the company.

US Air Force: Safeguarding Personnel Data at Enterprise Scale

The US Air Force VAULT platform is a cloud-native enterprise data platform designed to provide the Department of the Air Force (DAF) with a robust, interoperable, and secure data environment. The strategic goals of VAULT include:

  • Leading Data Culture - Increase data use and literacy to improve efficiency and effectiveness of decisions, readiness, mission operations, and cybersecurity.
  • A Catalyst for Sharing Data - Make data Visible, Accessible, Understandable, Linked, and Trusted (VAULT).
  • Driving Data Capabilities - Increase access to the right combination of state-of-the-art technologies needed to best utilize data.

To achieve these goals, the VAULT team created a self-service platform to onboard, extract, transform, and load data; perform data analytics, machine learning, and visualization; and govern data. Supporting over 50 tenants across NIPR and SIPR adds complexity to maintaining data security while ensuring data can be shared and utilized for analytics. To meet these goals, VAULT requires dynamic and granular data access controls that mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service analytics. Protection of sensitive data is key to enabling VAULT to support key use cases such as personnel readiness: optimally placing Airmen trainees to meet production goals, increase readiness, and match trainees to their preferences.
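The abstract does not include code, but one common way such dynamic and granular controls are expressed on a Spark-based lakehouse is a dynamic view that masks sensitive columns and filters rows by group membership. The sketch below is an assumption-laden illustration, not VAULT's implementation: is_member() and current_user() are Databricks SQL functions, and the table, column, and group names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # on Databricks a `spark` session already exists

    # Dynamic view: mask a sensitive column and restrict rows by group membership.
    spark.sql("""
        CREATE OR REPLACE VIEW personnel_readiness_secure AS
        SELECT
          trainee_id,
          training_base,
          CASE WHEN is_member('hr_privileged') THEN ssn ELSE 'REDACTED' END AS ssn,
          readiness_score
        FROM personnel_readiness
        WHERE is_member('vault_admins') OR assigned_analyst = current_user()
    """)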

Using Feast Feature Store with Apache Spark for Self-Served Data Sharing and Analysis for Streaming

In this presentation we will talk about how we use available NER-based sensitive data detection methods and automated record-of-activity processing, on top of Spark and Feast, for collaborative intelligent analytics and governed data sharing. Information sharing is key to successful business outcomes, but it is complicated by sensitive information, both user-centric and business-centric.

Our presentation is motivated by the need to share key KPIs and outcomes for health screening data collected from various surveys, in order to improve care and assistance. In particular, collaborative information sharing was needed to help with health data management and with KPIs for early detection and prevention of disease. We will present the framework and approach we have used for these purposes.
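As a rough sketch of the self-served, governed sharing piece, the snippet below shows how point-in-time-correct features might be pulled from a Feast feature store for shared analysis; the feature repo, feature view, and entity names (health_screening, participant_id) are hypothetical and not taken from the presentation.

    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # assumes a configured Feast feature repository

    # Entities to analyze: participant ids plus the timestamps used for point-in-time joins.
    entity_df = pd.DataFrame({
        "participant_id": [101, 102, 103],
        "event_timestamp": pd.to_datetime(["2022-06-01", "2022-06-01", "2022-06-02"]),
    })

    # Retrieve governed, point-in-time-correct features for collaborative analysis.
    shared_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "health_screening:bmi",
            "health_screening:blood_pressure",
            "health_screening:risk_flag",
        ],
    ).to_df()

    print(shared_df.head())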

Summary

There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third party data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you're ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they're good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!

Your host is Tobias M

podcast_episode
by Michael Brisson (Moody's Analytics), Jesse Rogers (Moody's Analytics), Cris deRitis, Mark Zandi (Moody's Analytics), Chris Lafakis, Ryan Sweet, Juan Fuentes (Moody's Analytics)

Mark, Ryan and Cris are joined by a bevy of colleagues to dig deeper into the June Consumer Price Index report and the sources of inflation, including energy, vehicles, food and shelter. Full episode transcript. Follow Mark Zandi @MarkZandi, Ryan Sweet @RealTime_Econ and Cris deRitis @MiddleWayEcon for additional insight.

Questions or comments? Please email us at [email protected]. We would love to hear from you. To stay informed and follow the insights of Moody's Analytics economists, visit Economic View.

As Head of Analytics at Clearbit, Julie serves as a data team of one in a 200+ person company (wow!). In this conversation with Tristan and Julia, Julie dives into how she's helped Clearbit implement data activation throughout the business, and realize the glorious dream of self-serve analytics. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com.  The Analytics Engineering Podcast is sponsored by dbt Labs.

Welcome to the latest episode of SaaS Scaled, where we're talking to Patrick Parker, CEO of SaaS Partners — a company aimed at helping people build and scale their own SaaS businesses.

Patrick talks about his experience in consulting and software security, and what ultimately led him to start building SaaS Partners. We address why new startups so often fail and discuss some of the key challenges they need to overcome. What are some simple things new startup founders can do to boost their chances of success?

We discuss the importance of focusing on one key problem at a time, and Patrick talks about the value of reusing and copying certain successful models of building a startup. We also dive into the impact of emerging technologies on the SaaS space such as AI, machine learning, and the metaverse.

Finally, we talk about Web3, cryptocurrency, and blockchain, what this growing trend means for SaaS and the world as a whole, and how long it will take to become truly mainstream.

This episode is brought to you by Qrvey. The tools you need to take action with your data, on a platform built for maximum scalability, security, and cost efficiencies. If you're ready to reduce complexity and dramatically lower costs, contact us today at qrvey.com. Qrvey, the modern no-code analytics solution for SaaS companies on AWS.

API Strategy for Decision Makers

API Strategy for Decision Makers provides actionable best practices for building winning API products. Whether you're with an API-first company or with an existing enterprise software company looking to expand your API offerings, this report shows you how to grow your API product into a center of revenue, along with metrics to measure against. Authors Mike Amundsen and Derric Gilling draw on real-life examples to guide you through the important considerations for successfully developing and executing on your API strategy. By identifying winning business cases, sharing best practices for adoption and monetization, and showing you how to create a great API experience, this report presents a detailed overview for every C-suite executive and product owner managing an API platform. With this report, you'll:

  • Learn how to apply common product management techniques to better manage your APIs as products
  • Discover how to define a strategy covering initial adoption, monetization, and deprecation of your APIs
  • Determine how to position your API products in your company's portfolio as well as the wider market
  • Dig deep into the step-by-step process of making your API strategy come to life
  • Explore the power of API observability, monitoring, and analytics as a means to quantify, track, and optimize your API strategy

Pro Power BI Dashboard Creation: Building Elegant and Interactive Dashboards with Visually Arresting Analytics

Produce high-quality, visually attractive analysis quickly and effectively with Microsoft's key BI tool. This book teaches analysts, managers, power users, and developers how to harness the power of Microsoft's self-service business intelligence flagship product to deliver compelling and interactive insight with remarkable ease. It then shows you the essential techniques needed to go from source data to dashboards that seize your audience's attention and provide them with clear and accurate information. As well as producing elegant and visually arresting output, you learn how to enhance the user experience by adding polished interactivity. This book shows you how to make interactive dashboards that allow you to guide users through the meaning of the data that they are exploring. Drill-down features are also covered that allow you and your audience to dig deeper and uncover new insights by exploring anomalous and interesting data points. Reading this book builds your skills around creating meaningful and elegant dashboards using a range of compelling visuals. It shows you how to apply simple techniques to convert data into business insight. The book covers tablet and smartphone layouts for delivering business value in today's highly mobile world. You'll learn about formatting for effect to make your data tell its story, and you'll be a master at creating visually arresting output on multiple devices that grabs attention, builds influence, and drives change.

What You Will Learn

  • Produce designer output that will astound your bosses and peers
  • Make new insights as you chop and tweak your data as never before
  • Create high-quality analyses in record time
  • Create interdependent charts, maps, and tables
  • Deliver visually stunning information
  • Drill down through data to provide unique understandings
  • Outshine competing products and enhance existing skills
  • Adapt your dashboard delivery to mobile devices

Who This Book Is For

For any Power BI user who wants to strengthen their ability to deliver compelling analytics via Microsoft's widely adopted analytics platform. For those new to Power BI who want to learn the full extent of what the platform is capable of. For power users such as BI analysts, data architects, IT managers, accountants, and C-suite members who want to drive change in their organizations.

The Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake

Design and implement a modern data lakehouse on the Azure Data Platform using Delta Lake, Apache Spark, Azure Databricks, Azure Synapse Analytics, and Snowflake. This book teaches you the intricate details of the Data Lakehouse Paradigm and how to efficiently design a cloud-based data lakehouse using highly performant and cutting-edge Apache Spark capabilities with Azure Databricks, Azure Synapse Analytics, and Snowflake. You will learn to write efficient PySpark code for batch and streaming ELT jobs on Azure. And you will follow along with practical, scenario-based examples showing how to apply the capabilities of Delta Lake and Apache Spark to optimize performance, and to secure, share, and manage a high volume, high velocity, and high variety of data in your lakehouse with ease. The patterns of success that you acquire from reading this book will help you hone your skills to build high-performing and scalable ACID-compliant lakehouses using flexible and cost-efficient decoupled storage and compute capabilities. Extensive coverage of Delta Lake ensures that you are aware of and can benefit from all that this new, open source storage layer can offer. In addition to the deep examples on Databricks in the book, there is coverage of alternative platforms such as Synapse Analytics and Snowflake so that you can make the right platform choice for your needs. After reading this book, you will be able to implement Delta Lake capabilities, including Schema Evolution, Change Feed, Live Tables, Sharing, and Clones, to enable better business intelligence and advanced analytics on your data within the Azure Data Platform.

What You Will Learn

  • Implement the Data Lakehouse Paradigm on Microsoft's Azure cloud platform
  • Benefit from the new Delta Lake open-source storage layer for data lakehouses
  • Take advantage of schema evolution, change feeds, live tables, and more
  • Write functional PySpark code for data lakehouse ELT jobs
  • Optimize Apache Spark performance through partitioning, indexing, and other tuning options
  • Choose between alternatives such as Databricks, Synapse Analytics, and Snowflake

Who This Book Is For

Data, analytics, and AI professionals at all levels, including data architect and data engineer practitioners. Also for data professionals seeking patterns of success by which to remain relevant as they learn to build scalable data lakehouses for their organizations and customers who are migrating into the modern Azure Data Platform.
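As a brief, hedged taste of two of the Delta Lake capabilities the book highlights, the sketch below shows schema evolution on write and reading the Change Data Feed with PySpark; the table name, input path, and starting version are placeholder assumptions, and the Change Data Feed read assumes a Delta Lake version that supports it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-lakehouse-sketch").getOrCreate()

    new_batch = spark.read.parquet("/mnt/raw/sales/2022-07/")  # placeholder path

    # Schema evolution: allow new columns in the incoming batch to be merged
    # into the existing Delta table's schema on append.
    (new_batch.write.format("delta")
              .mode("append")
              .option("mergeSchema", "true")
              .saveAsTable("lakehouse.sales"))

    # Change Data Feed: enable it on the table, then read row-level changes
    # starting from a given table version.
    spark.sql("ALTER TABLE lakehouse.sales "
              "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

    changes = (spark.read.format("delta")
                    .option("readChangeFeed", "true")
                    .option("startingVersion", 1)
                    .table("lakehouse.sales"))
    changes.show()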

Today I am bringing you a recording of a live interview I did at the TDWI Munich conference for data leaders, and this episode is a bit unique as I’m in the “guest” seat being interviewed by the VP of TDWI Europe, Christoph Kreutz. 

Christoph wanted me to explain the new workshop I was giving later that day, which focuses on helping leaders increase user adoption of data products through design. In our chat, I explained the three main areas I pulled out of my full 4-week seminar to create this new ½-day workshop as well as the hands-on practice that participants would be engaging in. The three focal points for the workshop were: measuring usability via usability studies, identifying the unarticulated needs of stakeholders and users, and sketching in low fidelity to avoid over committing to solutions that users won’t value. 

Christoph also asks about the format of the workshop, and I explain how I believe data leaders will best learn design by doing it. As such, the new workshop was designed to use small group activities, role-playing scenarios, peer review…and minimal lecture! After discussing the differences between the abbreviated workshop and my full 4-week seminar, we talk about my consulting and training business “Designing for Analytics,” and conclude with a fun conversation about music and my other career as a professional musician. 

In a hurry? Skip to: 

  • I summarize the new workshop version of "Designing Human-Centered Data Products" I was premiering at TDWI (4:18)
  • We talk about the format of my workshop (7:32)
  • Christoph and I discuss future opportunities for people to participate in this workshop (9:37)
  • I explain the format of the main 8-week seminar versus the new half-day workshop (10:14)
  • We talk about one-on-one coaching (12:22)
  • I discuss my background, including my formal music training and my other career as a professional musician (14:03)

Quotes from Today's Episode

  • "We spend a lot of time building outputs and infrastructure and pipelines and data engineering and generating stuff, but not always generating outcomes. Users only care about how does this make my life better, my job better, my job easier? How do I look better? How do I get a promotion? How do I make the company more money? Whatever those goals are. And there's a gap there sometimes, between the things that we ship and delivering these outcomes." (4:36)
  • "In order to run a usability study on a data product, you have to come up with some type of learning goals and some kind of scenarios that you're going to give to a user and ask them to go show me how you would do x using the data thing that we built for you." (5:54)
  • "The reality is most data users and stakeholders aren't designers and they're not thinking about the user's workflow and how a solution fits into their job. They don't have that context. So, how do we get the really important requirements out of a user or stakeholder's head? I teach techniques from qualitative UX interviewing, sales, and even hostage negotiation to get unarticulated needs out of people's heads." (6:41)
  • "How do we work in low fidelity to get data leaders on the same page with a stakeholder or a user? How do we design with users instead of for them? Because most of the time, when we communicate visually, it starts to click (or you'll know it's not clicking!)" (7:05)
  • "There's no right or wrong [in the workshop]. [The workshop] is really about the practice of using these design methods and not the final output that comes out of the end of it." (8:14)
  • "You learn design by doing design, so I really like to get data people going by trying it instead of talking about trying it. More design doing and less design thinking!" (8:40)
  • "The tricky thing [for most of my training clients], [and perhaps this is true with any type of adult education] is, 'Yeah, I get the concept of what Brian's talking about, but how do I apply these design techniques to my situation? I work in this really weird domain, or on this particularly hard data space.' Working on an exercise or real project, together, in small groups, is how I like to start to make the conceptual idea of design into a tangible tool for data leaders." (12:26)

Links

Brian's training seminar

Summary

Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you're ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they're good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!

Your host is Tobias Macey and today I'm interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service

Interview

  • Introduction
  • How did

Summary

Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I'm interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?
  • What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?
  • The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent? What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate?
  • As someone with a software engineering background and extensive e

We talked about:

  • Lisa's background
  • Centralized org vs decentralized org
  • Hybrid org (centralized/decentralized)
  • Reporting your results in a data organization
  • Planning in a data organization
  • Having all the moving parts work towards the same goals
  • Which approach Twitter follows (centralized vs decentralized)
  • Pros and cons of a decentralized approach
  • Pros and cons of a centralized approach
  • Finding a common language with all the functions of an org
  • Finding the right approach for companies that want to implement data science
  • How many data scientists does a company need?
  • Who do data scientists report huge findings to?
  • The importance of partnering closely with other functions of the org
  • The role of Product Managers in the org and across functions
  • Who does analytics at Twitter (analysts vs data scientists)
  • The importance of goals, objectives and key results
  • Conflicting objectives
  • The importance of research
  • Finding Lisa online

Links:

LinkedIn: https://www.linkedin.com/in/cohenlisa/
Twitter: https://twitter.com/lisafeig
Medium: https://medium.com/@lisa_cohen
Lisa Cohen's YouTube videos: https://www.youtube.com/playlist?list=PLRhmnnfr2bX7-GAPHzvfUeIEt2iYCbI3w

MLOps Zoomcamp: https://github.com/DataTalksClub/mlops-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

podcast_episode
by Claudia Sahm (Stay-at-Home Macro Consulting), Mark Zandi (Moody's Analytics), Ryan Sweet

Claudia Sahm, founder of Stay-at-Home Macro Consulting, joins Mark and Ryan to discuss the June employment report. They also talk about inflation, monetary policy, and the odds of a recession. Full episode transcript. Follow Mark Zandi @MarkZandi, Ryan Sweet @RealTime_Econ and Cris deRitis on LinkedIn for additional insight.

Questions or comments? Please email us at [email protected]. We would love to hear from you. To stay informed and follow the insights of Moody's Analytics economists, visit Economic View.