talk-data.com

Topic

Data Modelling

data_governance data_quality metadata_management

355 tagged

Activity Trend

Peak of 18 activities per quarter (2020-Q1 to 2026-Q1)

Activities

355 activities · Newest first

Field-level lineage with dbt, ANTLR, and Snowflake

Lineage is a critical component of any root-cause analysis, impact analysis, and overall analytics health assessment workflow. But it hasn't always been easy to create, particularly at the field level. In this session, Mei Tao, Helena Munoz, and Xuanzi Han (Monte Carlo) tackle this challenge head-on by leveraging some of the most popular tools in the modern data stack, including dbt, Airflow, Snowflake, and ANother Tool for Language Recognition (ANTLR). Learn how they designed the data model, query parser, and larger database design for field-level lineage, highlighting learnings, wrong turns, and best practices developed along the way.
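A rough sketch of the core idea (not Monte Carlo's actual implementation, which uses an ANTLR-generated parser): parse each query into an AST and map every output field back to the source columns it reads. Here sqlglot stands in for the parser, and the query and table names are invented.

```python
# Hypothetical example: derive field-level lineage from one SELECT.
import sqlglot
from sqlglot import exp

sql = """
SELECT o.order_id, o.amount * fx.rate AS amount_usd
FROM orders o
JOIN fx_rates fx ON o.currency = fx.currency
"""

parsed = sqlglot.parse_one(sql, read="snowflake")

# Each projection in the SELECT list maps one output field to the
# source columns that appear in its expression tree.
for projection in parsed.selects:
    sources = sorted({col.sql() for col in projection.find_all(exp.Column)})
    print(f"{projection.alias_or_name} <- {sources}")
```

Running this prints `amount_usd <- ['fx.rate', 'o.amount']`, which is exactly the edge a field-level lineage graph needs.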

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

Building turnkey dashboards for core financial metrics with dbt: A Little Modeling Goes a Long Way

Let's get down to business! Most business users don't want to be bogged down in the data modeling and complexities that we data folk work so hard to accomplish and overcome. Instead, business users and leadership want the dashboards and numbers they care about. In this session, Matthew Hoss (Element Biosciences) shares his four-step approach to modeling and creating turnkey cost dashboards, built on a NetSuite/Fivetran/Snowflake/dbt/Tableau data stack, that help business users get the answers they need quickly.

Check the slides here: https://docs.google.com/presentation/d/1VVZwm2Kloy1aeewqbB--7WfxIifnpIZflx9V8Q2N-x0/edit?usp=sharing


Driving actionable insights

See how visual data modeling and dbt combine to improve interaction and understanding between analytics engineering practitioners, product owners, and business partners. We will demonstrate conceptual and logical modeling techniques and diagrams that establish common understanding, enhance collaboration with business partners, improve the translation of requirements, and ultimately complement analytics engineering within dbt to improve time to value. We will show how to pair data modeling concepts (conceptual, logical, physical) and tools (SqlDBM) to engage your customers and inform analytics engineering with dbt and Snowflake, and how this workbench complements the analytics lifecycle for engineers and data consumers alike. The workbench includes dbt, a visual modeling tool, and the phData Toolkit CLI.


Check the slides here: https://docs.google.com/presentation/d/1fJhaMGvD7TvVft4nEJYhMRhyanQTw3lbzLrgZFsmj-0/edit?usp=sharing


Preparing for the Next Wave: Data Apps

Data apps are the next wave in analytics engineering. The explosion of data volume and variety, combined with increasing consumer demand for analytics and a leap in cloud data technologies, has triggered an evolution of traditional analytics into the realm of modern data apps. The question is: how do you prepare for this wave? In this session we'll explore real-world examples of modern data apps and how the modern data stack is advancing to support sub-second, high-concurrency analytics to meet this new wave of demand. We will cover performance challenges, semi-structured data, data freshness, data modeling, and toolsets.

Check the slides here: https://docs.google.com/presentation/d/1MC18SgT_ZHOJePjYizz_WT7dVveaycNw/edit?usp=sharing&ouid=110293204340061069659&rtpof=true&sd=true


How Entity Modeling Accelerates Product Led Growth

The gap between engineering and business teams is widening. The better engineering teams get at iterating to support new features, the harder it is for business teams to keep up with the nuance of a rapidly evolving customer journey. In this session, Rachel Bradley-Haas (BigTimeData.io) takes a step back to explain why defining entities, relationships, and properties helps build a scalable, cohesive data model that business users can act on to accelerate product-led growth (PLG) motions.
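A tiny sketch of what "defining entities, relationships, and properties" can look like before any tables exist; the entity names here are hypothetical, not from the talk.

```python
# Hypothetical entity definitions: each entity lists its properties, and
# relationships are expressed as keys pointing at other entities.
from dataclasses import dataclass


@dataclass
class Account:
    account_id: str
    plan_tier: str        # property: free / team / enterprise


@dataclass
class User:
    user_id: str
    account_id: str       # relationship: a User belongs to an Account
    is_activated: bool    # property a PLG team can act on


@dataclass
class Event:
    event_id: str
    user_id: str          # relationship: an Event is performed by a User
    event_type: str       # e.g. "invited_teammate", "upgraded_plan"
```

Once these definitions are agreed on, each entity maps cleanly to a warehouse table and the relationships become the join keys business users rely on.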

Check the slides here: https://docs.google.com/presentation/d/1genfMH9v8mZgZBCW-yMhvRla4DwVwoeD6fqSf5L9RGU/edit?usp=sharing


SQL Antipatterns, Volume 1

SQL is the ubiquitous language for software developers working with structured data. Most developers who rely on SQL are experts in their favorite language (such as Java, Python, or Go), but they're not experts in SQL. They often depend on antipatterns: solutions that look right but become increasingly painful to work with as you uncover their hidden costs. Learn to identify and avoid many of these common blunders. Refactor an inherited nightmare into a data model that really works. Updated for the current versions of MySQL and Python, this new edition adds a dozen brand new mini-antipatterns for quick wins.

No matter which platform, framework, or language you use, the database is the foundation of your application, and the SQL database language is the standard for working with it. Antipatterns are solutions that look simple on the surface but soon mire you down with needless work. Learn to identify these traps and craft better solutions for the often-asked questions in this book. Avoid the mistakes that lead to poor performance and quality, and master the principles that make SQL a powerful and flexible tool for handling data and logic.

Dive deep into SQL and database design, and learn to recognize the most common missteps made by software developers in database modeling, SQL query logic, and code design of data-driven applications. See practical examples of misconceptions about SQL that can lure software projects astray. Find the greatest value in each group of data. Understand why an intersection table may be your new best friend. Store passwords securely and don't reinvent the wheel. Handle NULL values like a pro. Defend your web applications against the security weakness of SQL injection. Use SQL the right way; it can save you from headaches and needless work, and let your application really shine!

What You Need: The SQL examples use the MySQL 8.0 flavor, but other popular brands of RDBMS are mentioned. Other code examples use Python 3.9+ or Ruby 2.7+.
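One of the blunders above, SQL injection, comes down to string-building. A minimal sketch using the standard library's sqlite3 (the book's examples use MySQL; the table here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

user_input = "alice'; DROP TABLE users; --"

# Antipattern: interpolating input directly into the SQL text.
dangerous = f"SELECT * FROM users WHERE name = '{user_input}'"

# Fix: a placeholder passes the value outside the query text, so the
# malicious string is treated as a plain value, never as SQL.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- the input matched nothing and executed nothing
```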

Summary

Agile methodologies have been adopted by a majority of teams building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work, so that you can move faster and provide more value to the business while building systems that are maintainable and adaptable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Prefect is the modern dataflow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.

Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what AgileData is and the story behind it?
What are the main industries and/or use cases that you are focused on supporting?
The data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don't map well to data engineering/analysis?
One of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?

How do you design in affordances for refactoring of the data models without breaking downstream assets?

Another aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products?
What are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products?
In order to have a useful target to work toward it's necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what's hard vs. easy, etc.)

How do you work with the "customers" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?

What are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets?
What are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
When is agile the wrong choice for a data project?
What do you have planned for the future of AgileData?

Contact Info

LinkedIn
@shagility on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

AgileData
OptimalBI
How To Make Toast
Data Mesh
Information Product Canvas
DataKitchen (Podcast Episode)
Great Expectations (Podcast Episode)
Soda Data (Podcast Episode)
Google DataStore
Unfix.work
Activity Schema (Podcast Episode)
Data Vault (Podcast Episode)
Star Schema
Lean Methodology
Scrum
Kanban

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Atlan

Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?

Our friends at Atlan started out as a data team themselves, faced all of this collaboration chaos firsthand, and began building Atlan as an internal tool. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.

Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.

Prefect

Prefect is the modern dataflow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit…

Today I'm chatting with Vin Vashishta, Founder of V Squared. Vin believes that with methodical strategic planning, companies can prepare for continuous transformation by removing the silos that exist between leadership, data, AI, and product teams. How can these barriers be overcome, and what is the impact of doing so? Vin answers those questions and more, explaining why process disruption is necessary for long-term success and giving real-world examples of companies that are adopting these strategies.

Highlights/ Skip to:

What the AI 'Last Mile' Problem is (03:09)
Why Vin sees so many businesses reevaluating their offerings and realigning with their core business model (09:01)
Why every company today is struggling to figure out how to bridge the gap between data, product, and business value (14:25)
How the skillsets needed for success are evolving for data, product, and business leaders (14:40)
Vin's process when he's helping a team with a data strategy, and what the end result looks like (21:53)
Why digital transformation is dead, and how to reframe what business transformation means in today's day and age (25:03)
How Airbnb used data to inform their overall strategy to survive during a time of massive industry disruption, and how those strategies can be used by others as a preventative measure (29:03)
Unpacking how a data strategy leader can work backward from a high-level business strategy to determine actionable steps and use cases for ML and analytics (32:52)
Who (what roles) are ultimately responsible in an ideal strategy planning session? (34:41)
How the C-Suite can bridge business & data strategy and the impact the world's largest companies are seeing as a result (36:01)

Quotes from Today's Episode

"And when you have that [core business & technology strategy] disconnect, technology goes in one direction, what the business needs and what customers need sort of lives outside of the silo." – Vin Vashishta (06:06)

“Why are we doing data and not just traditional software development? Why are we doing data science and not analytics? There has to be a justification because each one of these is more expensive than the last, each one is, you know, less certain.” – Vin Vashishta (10:36)

"[The right people to train] are smart about the technology, but have also lived with the users, have some domain expertise, and the interest in making a bigger impact. Let's put them in strategy roles." – Vin Vashishta (18:58)

"You know, this is never going to end. Transformation is continuous. I don't call it digital transformation anymore because that's making you think that this thing is somehow a once-in-a-generation change. It's not. It's once every five years now." – Vin Vashishta (25:03)

"When do you want to have those [business] opportunities done by? When do you want to have those objectives completed by? Well, then that tells you how fast you have to transform if you want to use each one of these different technologies." – Vin Vashishta (25:37)

"You've got to disrupt the process. Strategy planning is not the same anymore. Look at how Amazon does it. ... They are destroying their competitors because their strategy planning process is both expert and data model-driven." – Vin Vashishta (33:44)

"And one of the critical things for CDOs to do is tell stories with data to the board. When they sit in and talk to the board, they need to tell those stories about how one data point hit this one use case and the company made $4 million." – Vin Vashishta (39:33)

Links

HumblePod: https://humblepod.com
V Squared: https://datascience.vin
LinkedIn: https://www.linkedin.com/in/vineetvashishta/
Twitter: https://twitter.com/v_vashishta
YouTube channel: https://www.youtube.com/c/TheHighROIDataScientist
Substack: https://vinvashishta.substack.com/

SAP HANA Cloud in a Nutshell: Design, Develop, and Deploy Data Models using SAP HANA Cloud

This book introduces SAP HANA Cloud and helps you develop an understanding of its key features, including technology, architecture, and data modeling. SAP HANA Cloud in a Nutshell will help you develop the skills needed to use the core features of the completely managed, in-memory, cloud-based data foundation available in the SAP Business Technology Platform. The book covers modern modeling concepts and equips you with practical knowledge to make the best use of SAP HANA Cloud.

As you progress, you will learn how to provision your own SAP HANA Cloud instance, understand how to work with different roles, and work with data modeling for analytical and transactional use cases. Additionally, you will learn how to pilot SAP BTP Cockpit and work with entitlements, quotas, account structure, spaces, instances, and cloud providers. You will learn how to perform administration tasks such as stopping and starting an SAP HANA Cloud instance and making it available for use. To fully leverage the knowledge this book offers, you will find practical step-by-step instructions for how to establish a cloud account model and create your first SAP HANA Cloud artifacts. The book is an important prerequisite for those who want to take full advantage of SAP HANA Cloud.

What You Will Learn:
Master the concepts and terminology of SAP Business Technology Platform (BTP) and SAP HANA Cloud
Understand the key roles of an SAP HANA Cloud implementation
Become familiar with the key tools used by administrators, architects, and application developers
Upgrade an SAP HANA Cloud database
Understand how to work with SAP HANA Cloud modeling supporting analytical and transactional use cases

Who This Book Is For: SAP consultants, cloud engineers, and architects; application consultants and developers; and project stakeholders
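For a sense of the "first artifacts" step, connecting to an instance from Python takes only a few lines with SAP's hdbcli driver. This snippet is illustrative rather than from the book, and the host and credentials are placeholders.

```python
from hdbcli import dbapi

# Placeholder endpoint for an SAP HANA Cloud instance (TLS on port 443).
conn = dbapi.connect(
    address="<instance>.hanacloud.ondemand.com",
    port=443,
    user="DBADMIN",
    password="<password>",
)

cursor = conn.cursor()
cursor.execute("SELECT CURRENT_DATE FROM DUMMY")  # DUMMY: HANA's built-in one-row table
print(cursor.fetchone())
cursor.close()
conn.close()
```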

Mastering MongoDB 6.x - Third Edition

Mastering MongoDB 6.x is your complete guide to understanding MongoDB in depth and fully leveraging its capabilities. Learn to design, develop, and administer MongoDB databases that are high-performing, scalable, and secure. From schema modeling to using MongoDB Atlas tools, this book ensures you are well equipped to build robust applications backed by MongoDB.

What this book will help me do:
Understand and apply advanced data modeling techniques for MongoDB to optimize data access.
Utilize advanced querying capabilities, including aggregation, indexing, and transactions.
Implement scalable and distributed systems using MongoDB features like replication and sharding.
Administer MongoDB databases securely and efficiently using monitoring and backup tools.
Master cloud-based solutions with MongoDB Atlas tools such as Serverless, Atlas Search, and Compass.

Author(s): Alex Giamas, the author of Mastering MongoDB 6.x, is a seasoned expert in database systems and software engineering. With a deep knowledge of MongoDB gained through years of practical experience, Alex has contributed to numerous projects that use MongoDB to power large-scale applications. Passionate about sharing knowledge, Alex creates thorough, accessible guides to empower developers and administrators alike.

Who is it for? This book is perfect for MongoDB developers and database administrators seeking to deepen their skills. If you're involved in designing, deploying, or managing greenfield or existing projects using MongoDB, this book is invaluable. Basic familiarity with MongoDB, shell commands, and database design concepts is recommended to fully benefit from the insights provided.
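As a taste of the querying material, here is a small PyMongo aggregation pipeline; the connection string, collection, and fields are made up for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Total revenue per customer for completed orders, highest first.
pipeline = [
    {"$match": {"status": "complete"}},
    {"$group": {"_id": "$customer_id", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]
for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["revenue"])
```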

Pro Data Mashup for Power BI: Powering Up with Power Query and the M Language to Find, Load, and Transform Data

This book provides all you need to find data from external sources and load and transform that data into Power BI, where you can mine it for business insights and a competitive edge. This ranges from connecting to corporate databases such as Azure SQL and SQL Server to file-based, cloud-based, and web-based data sources. The book also explains the use of DirectQuery and Live Connect to establish instant connections to databases and data warehouses and avoid loading data.

The book provides detailed guidance on techniques for transforming inbound data into normalized data sets that are easy to query and analyze. This covers data cleansing, data modification, and standardization, as well as merging source data into robust data structures that can feed into your data model. You will learn how to pivot and transpose data and extrapolate missing values, as well as harness external programs such as R and Python in a Power Query data flow. You also will see how to handle errors in source data and extend basic data ingestion to create robust and parameterized data load and transformation processes. Everything in this book is aimed at helping you deliver compelling and interactive insight with remarkable ease using Power BI's built-in data load and transformation tools.

What You Will Learn:
Connect Power BI to a range of external data sources
Prepare data from external sources for easy analysis in Power BI
Cleanse data of duplicates, outliers, and other bad values
Make live connections from which to refresh data quickly and easily
Apply advanced techniques to interpolate missing data

Who This Book Is For: All Power BI users, from beginners to super users. Any user of the world's leading dashboarding tool can leverage the techniques explained in this book to turbo-charge their data preparation skills and learn how a wide range of external data sources can be harnessed and loaded into Power BI to drive their analytics. No previous knowledge of working with data, databases, or external data sources is required, merely the need to find, transform, and load data into Power BI.
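Power Query's transformation steps are written in M, but it can also hand the current table to a Python script as a pandas DataFrame named `dataset`. A hedged sketch of the kind of cleansing step the book describes, with invented columns:

```python
import pandas as pd

# In Power BI this frame is injected by Power Query; defined here so the
# sketch runs standalone.
dataset = pd.DataFrame(
    {"region": ["EU", "eu ", None], "sales": ["1,200", "950", "300"]}
)

# Standardize a text column and coerce a numeric one.
dataset["region"] = dataset["region"].str.strip().str.upper().fillna("UNKNOWN")
dataset["sales"] = dataset["sales"].str.replace(",", "").astype(float)

print(dataset)  # Power Query picks up the resulting DataFrame
```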

Exam Ref PL-300 Microsoft Power BI Data Analyst

Prepare for Microsoft Exam PL-300 and help demonstrate your real-world ability to deliver actionable insights with Power BI by leveraging available data and domain expertise; to provide meaningful business value through clear data visualizations; to enable others to perform self-service analytics; and to deploy and configure solutions for consumption. Designed for data analysts, business users, and other professionals, this Exam Ref focuses on the critical thinking and decision-making acumen needed for success at the Microsoft Certified: Power BI Data Analyst Associate level.

Focus on the expertise measured by these objectives:
Prepare the data
Model the data
Visualize and analyze the data
Deploy and maintain assets

This Microsoft Exam Ref:
Organizes its coverage by exam objectives
Features strategic, what-if scenarios to challenge you
Assumes you are a data analyst, business intelligence professional, report creator, or other professional seeking to validate your skills and knowledge in analyzing data with Power BI

About the Exam: Exam PL-300 focuses on knowledge needed to get data from different data sources; clean, transform, and load data; design and develop data models; create model calculations with DAX; optimize model performance; create reports and dashboards; enhance reports for usability and storytelling; identify patterns and trends; and manage files, datasets, and workspaces.

About Microsoft Certification: Passing this exam fulfills your requirements for the Microsoft Certified: Power BI Data Analyst Associate certification, demonstrating your understanding of data repositories and data processes, and your skills in designing and building scalable data models, cleaning and transforming data, enabling advanced analytic capabilities to provide meaningful business value, and collaborating with key stakeholders to deliver relevant insights based on identified business requirements. See full details at: microsoft.com/learn ...

Simplifying Data Engineering and Analytics with Delta

This book will guide you through mastering Delta, a robust and versatile protocol for data engineering and analytics. You'll discover how Delta simplifies data workflows, supports both batch and streaming data, and is optimized for analytics applications in various industries. By the end, you will know how to create high-performing, analytics-ready data pipelines.

What this book will help me do:
Understand Delta's unique offering for unifying batch and streaming data processing.
Learn approaches to address data governance, reliability, and scalability challenges.
Gain technical expertise in building data pipelines optimized for analytics and machine learning use.
Master core concepts like data modeling, distributed computing, and Delta's schema evolution features.
Develop and deploy production-grade data engineering solutions leveraging Delta for business intelligence.

Author(s): Anindita Mahapatra is an experienced data engineer and author with years of expertise in working on Delta and data-driven solutions. Her hands-on approach to explaining complex data concepts makes this book an invaluable resource for professionals in data engineering and analytics.

Who is it for? Ideal for data engineers, data analysts, and anyone involved in AI/BI workflows, this book suits learners with some basic knowledge of SQL and Python. Whether you're an experienced professional or looking to upgrade your skills with Delta, this book will provide practical insights and actionable knowledge.
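Delta's schema evolution, one of the core concepts listed above, can be shown in a few lines of PySpark; this assumes a Spark session configured with the delta-spark package, and the path is made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Version 1 of the table has two columns.
df_v1 = spark.createDataFrame([(1, "a")], ["id", "label"])
df_v1.write.format("delta").mode("overwrite").save("/tmp/events")

# Version 2 adds a `score` column; mergeSchema evolves the table schema
# instead of rejecting the append.
df_v2 = spark.createDataFrame([(2, "b", 0.9)], ["id", "label", "score"])
(df_v2.write.format("delta")
      .option("mergeSchema", "true")
      .mode("append")
      .save("/tmp/events"))
```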

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating DWH Development

Traditional data warehouses typically struggle when it comes to handling large volumes of data and traffic, particularly with unstructured data. In contrast, data lakes overcome such issues and have become the central hub for storing data. We outline how to enable Kimball-style BI data modelling in a Lakehouse environment.

We present how we built a Spark-based framework to modernize DWH development with a data lake as the central storage, assuring high data quality and scalability. The framework was implemented at over 15 enterprise data warehouses across Europe.

We present how one can use Spark and Delta Lake to implement data warehouse principles like surrogate, foreign, and business keys, and SCD types 1 and 2. Additionally, we share our experiences on how such a unified data modelling framework can bridge BI with modern-day use cases, such as machine learning and real-time analytics. The session outlines the original challenges, the steps taken, and the technical hurdles we faced.
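To make the SCD type 2 point concrete, here is a hedged sketch using Delta Lake's MERGE from PySpark; it is not the framework's actual code, and the paths and columns are invented. It closes the current row when an attribute changes and inserts rows for brand-new business keys (re-inserting the new version of changed rows is typically done in a second pass or via a staged union).

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-demo").getOrCreate()

dim = DeltaTable.forPath(spark, "/tmp/dim_customer")
updates = spark.read.format("delta").load("/tmp/staging_customer")

(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_bk = s.customer_bk AND t.is_current = true")
    # Business key matched and an attribute changed: close the old row.
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "valid_to": "current_timestamp()"},
    )
    # Business key not present yet: insert the first version.
    .whenNotMatchedInsert(
        values={
            "customer_bk": "s.customer_bk",
            "address": "s.address",
            "is_current": "true",
            "valid_from": "current_timestamp()",
            "valid_to": "null",
        },
    )
    .execute())
```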

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

dbt and Python—Better Together

Drew Banin is the co-founder of dbt Labs and one of the maintainers of dbt Core, the open source standard in data modeling and transformation. In this talk, he will demonstrate an approach to unifying SQL and Python workloads under a single dbt execution graph, illustrating the powerful, flexible nature of dbt running on Databricks.
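The mechanism behind this is dbt's Python models: a `model(dbt, session)` function that returns a DataFrame and participates in the same DAG as SQL models via `dbt.ref()`. A minimal sketch (the upstream model and columns are invented):

```python
# models/daily_revenue.py -- a hypothetical dbt Python model.
import pyspark.sql.functions as F


def model(dbt, session):
    # dbt.ref() resolves the upstream model and wires the dependency
    # into the same execution graph as SQL models.
    orders = dbt.ref("stg_orders")

    return (
        orders.groupBy("order_date")
              .agg(F.sum("amount").alias("revenue"))
    )
```

On Databricks, `dbt.ref()` returns a Spark DataFrame, so SQL and Python models can be interleaved freely in one project.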


Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

Modern data lake architectures rely on object storage as the single source of truth. We use them to store an increasing amount of data, which is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility needed for data quality and resiliency.

lakeFS, an open source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.

In this talk you'll learn about the challenges with using object storage for data lakes and how lakeFS enables you to solve them.

By the end of the session you'll understand how lakeFS scales its Git-like data model to petabytes of data, across billions of objects, without affecting throughput or performance. We will also demo branching, writing data using Spark, and merging it on a billion-object repository.
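One way to see the Git-like model in action without any special client: lakeFS exposes an S3-compatible gateway in which the bucket is the repository and the first path segment is the branch. A hedged sketch with boto3 (the endpoint, repository, and keys are placeholders):

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # your lakeFS endpoint
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

# The same object read from two branches; each branch sees its own version.
for branch in ("main", "experiment-1"):
    obj = s3.get_object(Bucket="my-repo", Key=f"{branch}/tables/users.parquet")
    print(branch, obj["ContentLength"])
```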


In a recent conversation with data warehousing legend Bill Inmon, I learned about a new way to structure your data warehouse and self-service BI environment called the Unified Star Schema. The Unified Star Schema is potentially a small revolution for data analysts and business users, as it allows them to easily join tables in a data warehouse or BI platform through a bridge. This gives users the ability to spend time and effort on discovering insights rather than dealing with data connectivity challenges and joining pitfalls. Behind this deceptively simple and ingenious invention is author and data modelling innovator Francesco Puppini. Francesco and Bill have co-written the book 'The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design' to allow data modellers around the world to take advantage of the Unified Star Schema and its possibilities.

Listen to this episode of Leaders of Analytics, where we explore:

What the Unified Star Schema is and why we need it
How Francesco came up with the concept of the USS
Real-life examples of how to use the USS
The benefits of a USS over a traditional star schema galaxy
How Francesco sees the USS and data warehousing evolving in the next 5-10 years to keep up with new demands in data science and AI, and much more.

Connect with Francesco
Francesco on LinkedIn: https://www.linkedin.com/in/francescopuppini/
Francesco's book on the USS: https://www.goodreads.com/author/show/20792240.Francesco_Puppini
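As a toy illustration of the bridge idea (my own sketch, not an excerpt from the book): the bridge unions the keys of every source table, tagged by stage, so BI tools join each table to the bridge once instead of joining tables to each other.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ana", "Bo"]})

# One bridge row per source row, carrying every join key it knows about.
bridge = pd.concat(
    [
        orders.assign(stage="Orders")[["stage", "order_id", "customer_id"]],
        customers.assign(stage="Customers", order_id=pd.NA)[
            ["stage", "order_id", "customer_id"]
        ],
    ],
    ignore_index=True,
)

# Analysts join dimensions to the bridge, never table-to-table.
print(bridge.merge(customers, on="customer_id", how="left"))
```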

We talked about:

Jeff's background
Getting feedback to become a better teacher
Going from engineering to teaching
Jeff on becoming a curriculum writer
Creating a curriculum that reinforces learning
Jeff on starting his own data engineering bootcamp
Shifting from teaching ML and data science to teaching data engineering
Making sure that students get hired
Screening bootcamp applicants
Knowing when it's time to apply for jobs
The curriculum of JigsawLabs.io
The market demand of Spark, Kafka, and Kubernetes (or lack thereof)
Advice for data analysts that want to move into data engineering
The market demand of ETL/ELT and dbt (or lack thereof)
The importance of Python, SQL, and data modeling for data engineering roles
Interview expectations
How to get started in teaching
The challenges of being a one-person company
Teaching fundamentals vs the "shiny new stuff"
JigsawLabs.io
Finding Jeff online

Links: 

Jigsaw Labs: https://www.jigsawlabs.io/free
Teaching my mom to code: https://www.youtube.com/watch?v=OfWwfTXGjBM
Getting a Data Engineering Job Webinar with Jeff Katz: https://www.eventbrite.de/e/getting-a-data-engineering-job-tickets-310270877547

MLOps Zoomcamp: https://github.com/DataTalksClub/mlops-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

CockroachDB: The Definitive Guide

Get the lowdown on CockroachDB, the distributed SQL database built to handle the demands of today's data-driven cloud applications. In this hands-on guide, software developers, architects, and DevOps/SRE teams will learn how to use CockroachDB to create applications that scale elastically and provide seamless delivery for end users while remaining indestructible. Teams will also learn how to migrate existing applications to CockroachDB's performant, cloud-native data architecture. If you're familiar with distributed systems, you'll quickly discover the benefits of strong data correctness and consistency guarantees as well as optimizations for delivering ultra-low latencies to globally distributed end users.

You'll learn how to:
Design and build applications for distributed infrastructure, including data modeling and schema design
Migrate data into CockroachDB
Read and write data and run ACID transactions across distributed infrastructure
Plan a CockroachDB deployment for resiliency across single-region and multi-region clusters
Secure, monitor, and optimize your CockroachDB deployment
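CockroachDB runs transactions at serializable isolation, so its docs recommend a client-side retry loop around each transaction; since it speaks the PostgreSQL wire protocol, plain psycopg2 works. A hedged sketch (the DSN and table are placeholders):

```python
import psycopg2
import psycopg2.errorcodes


def transfer(conn, src, dst, amount):
    """One ACID transaction, retried on serialization conflicts (40001)."""
    while True:
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, src),
                )
                cur.execute(
                    "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, dst),
                )
            conn.commit()
            return
        except psycopg2.Error as exc:
            conn.rollback()
            # Retry only on serialization_failure; re-raise anything else.
            if exc.pgcode != psycopg2.errorcodes.SERIALIZATION_FAILURE:
                raise


conn = psycopg2.connect("postgresql://root@localhost:26257/bank")
transfer(conn, src=1, dst=2, amount=100)
```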

Summary

Data and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimized for low-latency, highly concurrent queries on rapidly updating datasets. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high-throughput data workloads today.
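For a feel of the developer surface, Pinot's broker answers standard SQL over HTTP, and the pinotdb DB-API client wraps it; the host, port, and table below are placeholders for your own cluster.

```python
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()

# A typical user-facing analytics query: per-user counts over fresh events.
cur.execute(
    """
    SELECT userId, COUNT(*) AS views
    FROM pageviews
    WHERE tsMillis > ago('PT1H')
    GROUP BY userId
    ORDER BY views DESC
    LIMIT 10
    """
)
for row in cur:
    print(row)
```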

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show!

So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up, and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan.

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product today at dataengineeringpodcast.com/acryl.

Your host is Tobias Macey and today I'm interviewing Kishore Gopalakrishna and Xiang Fu about Apache Pinot and its applications for powering user-facing analytics

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Pinot is and the story behind it?
What are the primary use cases that Pinot is designed to support?
There are numerous OLAP engines available with varying tradeoffs and optimal use cases. What are the cases where Pinot is the preferred choice?

How does it compare to systems such as Clickhouse (for OLAP) or CubeJS/GoodData (for embedded analytics)?

How do the operational needs of a database engine change as you move from serving internal stakeholders to external end-users?

Can you describe how Pinot is architected?

What were the key design elements that were necessary to support low-latency queries with high concurrency?

Can you describe a typical end-to-end architecture where Pinot will be used for embedded analytics?

What are some of the tools/technologies/platforms/design patterns that Pinot might replace or obviate?

What are some of the useful lessons related to data modeling that users of Pinot should consider?

What are some edge cases that they might encounter due to details of how the storage layer is architected? (e.g. data