talk-data.com

Topic

DWH

Data Warehouse

analytics business_intelligence data_storage

568 activities tagged

Activity Trend

Peak of 35 activities per quarter, 2020-Q1 to 2026-Q1

Activities

568 activities · Newest first

How AARP Services, Inc. automated SAS transformation to Databricks using LeapLogic

While SAS has been a standard in analytics and data science use cases, it is not cloud-native and does not scale well. Join us to learn how AARP automated the conversion of hundreds of complex data processing, model scoring, and campaign workloads to Databricks using LeapLogic, an intelligent code transformation accelerator that can transform any and all legacy ETL, analytics, data warehouse, and Hadoop workloads to modern data platforms.

In this session, experts from AARP and Impetus will share how they collaborated with Databricks and how they were able to:

• Automate modernization of SAS marketing analytics based on coding best practices
• Establish a rich library of Spark and Python equivalent functions on Databricks with the same capabilities as SAS procedures, DATA step operations, macros, and functions
• Leverage Databricks-native services like Delta Live Tables to implement waterfall techniques for campaign execution and simplify pipeline monitoring
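A toy sketch of the kind of function mapping described above, assuming a simple PROC MEANS-style aggregation. This is illustrative plain Python, not LeapLogic output; a real conversion would target PySpark, e.g. `df.groupBy("segment").agg(F.avg("response"))`:

```python
# Toy illustration: a SAS PROC MEANS-style aggregation (mean response
# rate by campaign segment) re-expressed in plain Python. A converter
# like the one described would emit the equivalent PySpark instead.
from collections import defaultdict

rows = [
    {"segment": "A", "response": 1},
    {"segment": "A", "response": 0},
    {"segment": "B", "response": 1},
]

def mean_by_segment(rows):
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r["segment"]] += r["response"]
        counts[r["segment"]] += 1
    return {seg: sums[seg] / counts[seg] for seg in sums}

print(mean_by_segment(rows))  # {'A': 0.5, 'B': 1.0}
```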

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

How to Automate the Modernization and Migration of Your Data Warehousing Workloads to Databricks

The logic in your data is the heartbeat of your organization’s reports, analytics, dashboards and applications. But that logic is often trapped in antiquated technologies that can’t take advantage of the massive scalability in the Databricks Lakehouse.

In this session BladeBridge will show how to automate the conversion of this metadata and code into Databricks PySpark and DBSQL. BladeBridge will demonstrate the flexibility of configuring for N legacy technologies to facilitate an automated path for not just a single modernization project but a factory approach for corporate-wide modernization.

BladeBridge will also present how you can empirically size your migration project to determine the level of effort required.

In this session you will learn:

• What BladeBridge Converter is
• What BladeBridge Analyzer is
• How BladeBridge configures Readers and Writers
• How to size a conversion effort
• How to accelerate adoption of Databricks
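The Reader/Writer configuration model lends itself to a registry sketch. Everything below is hypothetical (invented names, not BladeBridge's actual API), purely to illustrate how one pipeline can fan in N legacy dialects and fan out to targets like DBSQL:

```python
# Hypothetical sketch of the Reader/Writer idea: a parser ("Reader")
# per legacy dialect and an emitter ("Writer") per target, so N legacy
# technologies share a single conversion pipeline (the "factory").
READERS, WRITERS = {}, {}

def reader(name):
    def register(fn):
        READERS[name] = fn
        return fn
    return register

def writer(name):
    def register(fn):
        WRITERS[name] = fn
        return fn
    return register

@reader("teradata_sql")
def read_teradata(sql):
    # A real reader parses to a dialect-neutral model; here, a dict.
    return {"op": "select", "source": sql}

@writer("dbsql")
def write_dbsql(model):
    return f"-- converted to DBSQL\n{model['source']}"

def convert(source, src_dialect, target):
    model = READERS[src_dialect](source)
    return WRITERS[target](model)

print(convert("SELECT 1", "teradata_sql", "dbsql"))
```

Adding support for another legacy technology then means registering one more Reader, not building a new pipeline.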

Day 1 Morning Keynote | Data + AI Summit 2022

Day 1 Morning Keynote | Data + AI Summit 2022
Welcome & "Destination Lakehouse" | Ali Ghodsi
Apache Spark Community Update | Reynold Xin
Streaming Lakehouse | Karthik Ramasamy
Delta Lake | Michael Armbrust
How Adobe migrated to a unified and open data Lakehouse to deliver personalization at unprecedented scale | Dave Weinstein
Data Governance and Sharing on Lakehouse | Matei Zaharia
Analytics Engineering and the Great Convergence | Tristan Handy
Data Warehousing | Shant Hovsepian
Unlocking the power of data, AI & analytics: Amgen’s journey to the Lakehouse | Kerby Johnson

Get insights on how to launch a successful lakehouse architecture in Rise of the Data Lakehouse by Bill Inmon, the father of the data warehouse. Download the ebook: https://dbricks.co/3ER9Y0K

Unlocking the power of data, AI & analytics: Amgen’s journey to the Lakehouse | Kerby Johnson

In this keynote, you will learn more about Amgen's data platform journey from data warehouse to data lakehouse. They’ll discuss their decision process and the challenges they faced with legacy architectures, and how they designed and implemented a sustainable platform strategy with Databricks Lakehouse, accelerating their ability to democratize data for thousands of users.
Today, Amgen has implemented 400+ data science and analytics projects covering use cases like clinical trial optimization, supply chain management and commercial sales reporting, with more to come as they complete their digital transformation and unlock the power of data across the company.

Summary The perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.
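The core trick behind tools like data-diff (cheap checksums over key ranges, with row-level comparison only when a checksum mismatches) can be sketched in a few lines. This is a simplified illustration, not the utility's actual algorithm or API:

```python
import hashlib

def segment_checksum(rows):
    """Hash a sorted run of (primary_key, value) rows."""
    h = hashlib.md5()
    for pk, val in sorted(rows):
        h.update(f"{pk}:{val}".encode())
    return h.hexdigest()

def diff_keys(source, target):
    """Cheap check first; fall back to per-row comparison on mismatch."""
    if segment_checksum(source) == segment_checksum(target):
        return set()  # segment verified without row-by-row transfer
    src, tgt = dict(source), dict(target)
    return {pk for pk in set(src) | set(tgt) if src.get(pk) != tgt.get(pk)}

source = [(1, "a"), (2, "b"), (3, "c")]
target = [(1, "a"), (3, "x")]          # row 2 missing, row 3 drifted
print(diff_keys(source, target))        # {2, 3}
```

In practice the tool splits tables into many key segments so that only mismatching segments pay the per-row cost, which is why it can run quickly across large tables.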

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account; go to dataengineeringpodcast.com/tonic today to give it a try!

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool.

Jordan Tigani is an expert in large-scale data processing, having spent a decade+ in the development and growth of BigQuery, and later SingleStore. Today, Jordan and his team at MotherDuck are in the early days of working on commercial applications for the open source DuckDB OLAP database. In this conversation with Tristan and Julia, Jordan dives into the origin story of BigQuery, why he thinks we should do away with the concept of working in files, and how truly performant "data apps" will require bringing data to an end user's machine (rather than requiring them to query a warehouse directly).

The use of version control and continuous deployment in a data pipeline is one of the biggest features unlocked by the modern data stack. In this talk, I’ll demonstrate how to use Airbyte to pull data into your data warehouse, dbt to generate insights from your data, and Airflow to orchestrate every step of the pipeline. The complete project will be managed by version control and continuously deployed by Github. This talk will share how to achieve a more secure, scalable, and manageable workflow for your data projects.
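Stripped of the tools themselves, the shape of that pipeline is just ordered tasks. A framework-free sketch, with placeholder task names standing in for Airbyte, dbt, and Airflow:

```python
# Minimal stand-in for the pipeline: each stub represents one tool
# named in the talk, and run_pipeline plays the role of the Airflow
# DAG that orders them. Under version control, CI would deploy this
# file itself, which is the workflow the talk advocates.
def extract_load():          # Airbyte: pull sources into the warehouse
    return ["raw_orders"]

def transform(tables):       # dbt: build insight models on raw tables
    return [t.replace("raw_", "mart_") for t in tables]

def run_pipeline():          # Airflow: orchestrate and order the steps
    return transform(extract_load())

print(run_pipeline())  # ['mart_orders']
```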

In a recent conversation with data warehousing legend Bill Inmon, I learned about a new way to structure your data warehouse and self-service BI environment called the Unified Star Schema. The Unified Star Schema is potentially a small revolution for data analysts and business users as it allows them to easily join tables in a data warehouse or BI platform through a bridge. This gives users the ability to spend time and effort on discovering insights rather than dealing with data connectivity challenges and joining pitfalls. Behind this deceptively simple and ingenious invention is author and data modelling innovator Francesco Puppini. Francesco and Bill have co-written the book ‘The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design’ to allow data modellers around the world to take advantage of the Unified Star Schema and its possibilities. Listen to this episode of Leaders of Analytics, where we explore:

- What the Unified Star Schema is and why we need it
- How Francesco came up with the concept of the USS
- Real-life examples of how to use the USS
- The benefits of a USS over a traditional star schema galaxy
- How Francesco sees the USS and data warehousing evolving in the next 5-10 years to keep up with new demands in data science and AI, and much more.

Connect with Francesco:
Francesco on LinkedIn: https://www.linkedin.com/in/francescopuppini/
Francesco's book on the USS: https://www.goodreads.com/author/show/20792240.Francesco_Puppini
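The bridge idea can be demonstrated in miniature with SQLite. The schema below is a hypothetical two-fact example, not Puppini's full design: the bridge unions the keys from every table, and each fact joins to the bridge rather than to another fact, which avoids the fan-out pitfalls of joining facts directly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales   (product_id INT, amount INT);
CREATE TABLE returns (product_id INT, qty INT);
INSERT INTO sales   VALUES (1, 100), (1, 50), (2, 70);
INSERT INTO returns VALUES (1, 2);

-- The bridge holds the keys seen across all tables, so each fact
-- joins to the bridge, never directly to another fact.
CREATE TABLE bridge AS
  SELECT DISTINCT product_id FROM sales
  UNION SELECT DISTINCT product_id FROM returns;
""")

rows = con.execute("""
  SELECT b.product_id,
         (SELECT SUM(amount) FROM sales   s WHERE s.product_id = b.product_id),
         (SELECT SUM(qty)    FROM returns r WHERE r.product_id = b.product_id)
  FROM bridge b ORDER BY b.product_id
""").fetchall()
print(rows)  # [(1, 150, 2), (2, 70, None)]
```

Joining `sales` to `returns` directly on `product_id` would duplicate sales rows for every matching return; routing through the bridge keeps each measure additive.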

Summary The latest generation of data warehouse platforms have brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, they have also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. In order to ensure that you can explore and analyze your data without spending money on inefficient queries Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as identifying improvements that you can make in your queries to reduce their contribution to your bill.
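The kind of analysis described (ranking queries by their contribution to the bill) can be sketched over a mock query history. The field names and cost weights below are invented for illustration and are not Bluesky's method or Snowflake's actual billing formula:

```python
# Rank queries by a rough cost proxy: warehouse-size weight x runtime.
# Size weights and history fields are assumptions for this sketch.
SIZE_CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}

history = [
    {"id": "q1", "warehouse_size": "L",  "seconds": 3600},
    {"id": "q2", "warehouse_size": "XS", "seconds": 120},
    {"id": "q3", "warehouse_size": "M",  "seconds": 1800},
]

def estimated_credits(q):
    return SIZE_CREDITS_PER_HOUR[q["warehouse_size"]] * q["seconds"] / 3600

worst = sorted(history, key=estimated_credits, reverse=True)
for q in worst:
    print(q["id"], round(estimated_credits(q), 3))
```

Even this crude ranking surfaces the intuition behind consumption-model optimization: a handful of long-running queries on large warehouses usually dominate the bill.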

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog

Your host is Tobias Macey and today I’m interviewing Mingsheng Hong and Zheng Shao about Bluesky Data, where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Bluesky is and the story behind it?

What are the platforms/technologies that you are focused on in your current early stage? What are some of the other targets that you are considering once you validate your initial hypothesis?

Cloud cost optimization is an active area for application infrastructures as well. What are the corollaries and differences between compute and storage optimization strategies and what you are doing at Bluesky? How have your experiences at hyperscale companies using various combinations of cloud and on-premise data platforms informed your approach to the cost management problem?

Justin Borgman is the co-founder, Chairman and CEO of Starburst, and has spent almost a decade in senior executive roles building new businesses in the data warehousing and analytics space. In this conversation with Tristan and Julia, Justin dives into the nuts and bolts of Trino, the open source distributed query engine, and explores how teams are adopting a data mesh architecture without making a mess. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

An estimated 80 to 90 percent of the data in an enterprise is text. Sadly, this rich information is mostly neglected for analytical purposes. Textual data is typically full of information, but also very complex to interpret computationally and statistically. Why? Because textual data is both content and context. The same words and sentences can have very different meanings depending on the context. Textual data is truly a goldmine, but how can we mine it without being digital superpowers like Google, Microsoft or Facebook? To answer this question and many more relating to interpretation of textual data, I recently spoke to Bill Inmon. Bill is the Founder, Chairman and CEO of Forest Rim Technology and author of more than 60 books on data warehousing. He is often described as the Father of Data Warehousing due to his pioneering efforts in making data and data technologies available to organisations across all industries and sizes. In this episode of Leaders of Analytics, we discuss:

- How Bill became the Father of Data Warehousing
- The history of data warehousing and the most exciting developments in this space today
- The typical challenges holding us back from extracting value from textual data
- The concept of the “Textual ETL” and its benefits over other text data storage and analytics approaches
- Why NLP is not the best approach for textual data analytics
- The biggest opportunities for textual analytics today and in the future, and much more.

Connect with Bill:
Forest Rim Technology: https://www.forestrimtech.com/
Bill on LinkedIn: https://www.linkedin.com/in/billinmon/
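The "Textual ETL" idea, turning free text into rows a warehouse can store, can be sketched with a toy extractor. The pattern and fields below are invented for illustration and say nothing about Forest Rim's actual product:

```python
import re

# Toy "textual ETL": pull (drug, dose) facts out of free-text notes so
# they can land in an ordinary warehouse table. Handling context, the
# hard part discussed in the episode, is out of scope for this sketch.
NOTE = "Patient prescribed aspirin 81 mg daily; metformin 500 mg twice daily."

PATTERN = re.compile(r"([a-z]+)\s+(\d+)\s*mg", re.IGNORECASE)

def extract_doses(text):
    return [{"drug": d.lower(), "dose_mg": int(mg)}
            for d, mg in PATTERN.findall(text)]

print(extract_doses(NOTE))
# [{'drug': 'aspirin', 'dose_mg': 81}, {'drug': 'metformin', 'dose_mg': 500}]
```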

Summary Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog

Your host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Yotpo is and the role that data plays in the organization? What are the core data types and sources that you are working with?

What kinds of data assets are being produced and how do those get consumed and re-integrated into the business?

What are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with?

What is the size of your team and how is it structured?

You recently posted about the current architecture of your data platform. What was the starting point on your platform journey?

What did the early stages of feature and platform evolution look like? What was the catalyst for making a concerted effort to integrate your systems into a cohesive platform?

What was the scope and directive of the project for building a platform?

What are the metrics and capabilities that you are optimizing for in the structure of your data platform? What are the organizational or regulatory constraints that you needed to account for?

What are some of the early decisions that affected your available choices in later stages of the project? What does the current state of your architecture look like?

How long did it take to get to where you are today?

What were the factors that you considered in the various build vs. buy decisions?

How did you manage cost modeling to understand the true savings on either side of that decision?

If you were to start from scratch on a new data platform today what might you do differently?

What are the decisions that proved helpful in the later stages of your platform development?

What are the most interesting, innovative, or unexpected ways that you have seen your platform used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform?

What do you have planned for the future of your platform infrastructure?

Contact Info

Doron

LinkedIn

Liran

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Yotpo

Data Platform Architecture Blog Post

Greenplum
Databricks
Metorikku
Apache Hive
CDC == Change Data Capture
Debezium

Podcast Episode

Apache Hudi

Podcast Episode

Upsolver

Podcast Episode

Spark
PrestoDB
Snowflake

Podcast Episode

Druid
Rockset

Podcast Episode

dbt

Podcast Episode

Acryl

Podcast Episode

Atlan

Podcast Episode

OpenLineage

Podcast Episode

Okera
Shopify Data Warehouse Episode
Redshift
Delta Lake

Podcast Episode

Iceberg

Podcast Episode

Outbox Pattern
Backstage
Roadie
Nomad
Kubernetes
Deequ
Great Expectations

Podcast Episode

LakeFS

Podcast Episode

2021 Recap Episode
Monte Carlo

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary The flexibility of software oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!

Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment.

Interview

Introduction How did you get involved in the area of data management? Can you describe what you are building at Coalesce?

Nothing has galvanized the data community more in recent months than two new architectural paradigms for managing enterprise data. On one side there is the data fabric: a centralized architecture that runs a variety of analytic services and applications on top of a layer of universal connectivity. On the other side, is a data mesh: a decentralized architecture that empowers domain owners to manage their own data according to enterprise standards and make it available to peers as they desire.

Most data leaders are still trying to ferret out the implications of both approaches for their own data environments. One of those is Srinivasan Sankar, the enterprise data & analytics leader at Hanover Insurance Group. In this wide-ranging, back-and-forth discussion, Sankar and Eckerson explore the suitability of the data mesh for Hanover, how the data fabric might support a data mesh, whether a data mesh obviates the need for a data warehouse, and practical steps Hanover might take to implement a data mesh built on top of a data fabric.

Key Takeaways:
- What is the essence of a data mesh?
- How does it relate to the data fabric?
- Does the data mesh require a cultural transformation?
- Does the data mesh obviate the need for a data warehouse?
- How does data architecture as a service fit with the data mesh?
- What is the best way to roll out a data mesh?
- What's the role of a data catalog?
- What is a suitable roadmap for full implementation?


Snowflake Access Control: Mastering the Features for Data Privacy and Regulatory Compliance

Understand the different access control paradigms available in the Snowflake Data Cloud and learn how to implement access control in support of data privacy and compliance with regulations such as GDPR, APPI, CCPA, and SOX. The information in this book will help you and your organization adhere to privacy requirements that are important to consumers and becoming codified in the law. You will learn to protect your valuable data from those who should not see it while making it accessible to the analysts whom you trust to mine the data and create business value for your organization. Snowflake is increasingly the choice for companies looking to move to a data warehousing solution, and security is an increasing concern due to recent high-profile attacks. This book shows how to use Snowflake's wide range of features that support access control, making it easier to protect data access from the data origination point all the way to the presentation and visualization layer.Reading this book helps you embrace the benefits of securing data and provide valuable support for data analysis while also protecting the rights and privacy of the consumers and customers with whom you do business. 
What You Will Learn
• Identify data that is sensitive and should be restricted
• Implement access control in the Snowflake Data Cloud
• Choose the right access control paradigm for your organization
• Comply with CCPA, GDPR, SOX, APPI, and similar privacy regulations
• Take advantage of recognized best practices for role-based access control
• Prevent upstream and downstream services from subverting your access control
• Benefit from access control features unique to the Snowflake Data Cloud

Who This Book Is For
Data engineers, database administrators, and engineering managers who want to improve their access control model; those whose access control model is not meeting privacy and regulatory requirements; those new to Snowflake who want to benefit from access control features that are unique to the platform; and technology leaders in organizations that have just gone public and are now required to conform to SOX reporting requirements.
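The role-based access control paradigm the book centers on can be illustrated with a small sketch. This is a toy model, not Snowflake's implementation — Snowflake expresses grants in SQL (CREATE ROLE, GRANT) — and all role and column names here are hypothetical:

```python
# Toy illustration of role-based access control (RBAC): each role is
# granted a set of readable columns, and everything else is masked.
# Roles and columns are invented for illustration only.

ROLE_GRANTS = {
    "analyst":   {"region", "order_total"},                   # aggregate-friendly fields
    "marketing": {"region", "email"},                         # contact fields for campaigns
    "auditor":   {"region", "order_total", "email", "ssn"},   # full visibility
}

def visible_row(row: dict, role: str) -> dict:
    """Return the row with only the granted columns visible; mask the rest."""
    granted = ROLE_GRANTS.get(role, set())
    return {col: (val if col in granted else "***MASKED***")
            for col, val in row.items()}

record = {"region": "EU", "order_total": 120.5,
          "email": "a@b.com", "ssn": "123-45-6789"}

print(visible_row(record, "analyst"))
# → {'region': 'EU', 'order_total': 120.5, 'email': '***MASKED***', 'ssn': '***MASKED***'}
```

The key property — that sensitive columns such as the hypothetical `ssn` are invisible to all but explicitly granted roles — is the same one Snowflake enforces natively with roles, grants, and masking policies.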

Summary

The life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow, so all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.

Your host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system.

Interview

Introduction (see Guy’s bio below)
How did you get involved in the area of data management?
Can you describe what Immunai is and the story behind it?
What are some of the categories of information that you are working with?

What kinds of insights are you trying to power/questions that you are trying to answer with that data?

Who are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?
What are some of the challenges unique to the biological data domain that you have had to address?

What are some of the limitations in the off-the-shelf tools when applied to biological data?
How have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?

Can

In my opinion, any organisation with respect for its data should have a Chief Data & Analytics Officer (CDAO) as part of their C-suite. Although the CDAO role is still nascent, business leaders across many industries are starting to appreciate the need for a data and analytics voice at board and executive level. So, what does a CDAO do? How should they spend their time to balance strategic influence with operational delivery of data products?

To answer these questions and many more related to the principal analytics role, I recently spoke to Kshira Saagar, who is the Chief Data Officer at Latitude Financial. As the CDO at one of Australia’s largest consumer financial services firms, Kshira is responsible for the end-to-end journey of data through the organisation, from extraction to value creation through data products. He leads a large team of Data Scientists, Data Analysts, Data Architects, Data Engineers, Machine Learning Engineers, Data Warehouse Developers, BI Developers and Data Governance experts, who are responsible for bringing the company’s data and analytics strategy to life.

In this episode of Leaders of Analytics, we discuss:
What a week in the role of a CDAO looks like
How to secure strategic support and executive sponsorship for analytics projects
What’s required of CDAOs and their teams to foster a data literate organisation
How to structure data and analytics functions for success
The future of the CDAO role, and much more.

Learn more about Kshira at https://www.kshirasaagar.com/

Summary

Data quality control is a requirement for being able to trust the various reports and machine learning models that rely on the information that you curate. Rule-based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo are building a machine learning powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.
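The kind of check such a platform automates can be sketched as a robust outlier test on a simple warehouse metric, such as a table's daily row count. This is an illustrative technique with made-up numbers, not Anomalo's actual model:

```python
# Sketch of automated data quality monitoring: flag days whose table row
# count deviates sharply from recent history, using a robust z-score
# (median and MAD rather than mean and stddev, so the outliers being
# hunted do not distort the baseline). Numbers are invented.
import statistics

def anomalous_days(daily_counts, threshold=3.5):
    """Return indices of days whose count is a robust-z-score outlier."""
    median = statistics.median(daily_counts)
    # Median absolute deviation: resistant to the very anomalies we seek.
    mad = statistics.median(abs(c - median) for c in daily_counts)
    if mad == 0:
        # Degenerate case: all typical days identical; flag any deviation.
        return [i for i, c in enumerate(daily_counts) if c != median]
    # 1.4826 rescales MAD to be comparable to a standard deviation.
    return [i for i, c in enumerate(daily_counts)
            if abs(c - median) / (1.4826 * mad) > threshold]

counts = [10_120, 10_340, 9_980, 10_210, 10_105, 1_250, 10_300]  # day 5: failed load
print(anomalous_days(counts))  # → [5]
```

A production system would layer seasonality handling, many more metrics, and alert deduplication on top of a core like this, but the principle — learn a baseline from the data itself instead of hand-writing rules — is the same.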

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.

Your host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Anomalo is and the story behind it?
Managing data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?

What are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?

types of data quality issues identified

utility of automated vs programmatic tests

Can you describe how the Anomalo system is designed and implemented?

How have the design and goals of the platform changed or evolved since you started working on it?

What is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?

model training/customization process
statistical model seasonality/windowing
CI/CD

With any monitoring system the most challenging thing to do i