talk-data.com

Topic: Git

Tags: version_control, source_code_management, collaboration

136 tagged activities

Activity Trend: peak of 16 activities per quarter (2020-Q1 to 2026-Q1)

Activities

136 activities · Newest first

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to analysis requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and GitHub Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.
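
A minimal PySpark sketch of the pattern described above (an Auto Loader ingest, H3 indexing with Databricks Spatial SQL, and a Delta target that can use Liquid Clustering) might look like the following. This is not the GWSC implementation: the paths, column names and H3 resolution are illustrative assumptions, and spark is the session a Databricks notebook provides.

```python
# Hedged sketch (not the GWSC code): stream point data in with Auto Loader, tag each row
# with an H3 cell, and land it in a Delta table created with Liquid Clustering.
# Paths, schema and resolution 7 are assumptions for illustration.
from pyspark.sql import functions as F

points = (
    spark.readStream.format("cloudFiles")             # Databricks Auto Loader
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/raster_points")
    .load("s3://example-bucket/raster-points/")        # hypothetical landing zone
)

indexed = points.withColumn(
    "h3_cell",
    F.expr("h3_longlatash3(longitude, latitude, 7)")   # Databricks Spatial SQL H3 function
)

(
    indexed.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/raster_points")
    .trigger(availableNow=True)
    .toTable("geo.points_h3")  # target table defined with CLUSTER BY (h3_cell) for Liquid Clustering
)
```

Pre-computing indexed tables like this is what turns a multi-day raster scan into a seconds-long query for the analysts downstream.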

At Zillow, we have increased both the volume and the quality of our dashboards by leveraging a modern SDLC with version control and CI/CD. In the past three months, we have released 32 production-grade dashboards and shared them securely across the organization while cutting error rates in half over that span. In this session, we will provide an overview of how we utilize Databricks Asset Bundles and GitLab CI/CD to create performant dashboards that can be confidently used for mission-critical operations. As a concrete example, we'll then explore how Zillow's Data Platform team used this approach to automate our on-call support analysis, leveraging our dashboard development strategy alongside Databricks LLM offerings to create a comprehensive view that provides actionable performance metrics alongside AI-generated insights and action items from the hundreds of requests that make up our support workload.

Deploying Databricks Asset Bundles (DABs) at Scale

This session is repeated. Managing data and AI workloads in Databricks can be complex. Databricks Asset Bundles (DABs) simplify this by enabling declarative, Git-driven deployment workflows for notebooks, jobs, Lakeflow Declarative Pipelines, dashboards, ML models and more. Join the DABs team for a deep dive and learn about:

  • The basics: understand Databricks Asset Bundles; declare, define and deploy assets, follow best practices, use templates and manage dependencies
  • CI/CD & governance: automate deployments with GitHub Actions/Azure DevOps, manage dev vs. prod differences, and ensure reproducibility
  • What's new and what's coming up: AI/BI dashboard support, Databricks Apps support, a Pythonic interface and workspace-based deployment

If you're a data engineer, ML practitioner or platform architect, this talk will provide practical insights to improve reliability, efficiency and compliance in your Databricks workflows.
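
For readers who have not wired DABs into CI yet, here is a hedged sketch of a deployment step driven from Python. It only shells out to the Databricks CLI bundle commands; the target names and argument handling are assumptions, not anything prescribed by the session.

```python
# Sketch of a CI step that validates and deploys a Databricks Asset Bundle.
# Assumes the Databricks CLI is on PATH and authentication is configured via
# environment variables; the "dev"/"prod" target names come from databricks.yml.
import subprocess
import sys

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def deploy(target: str) -> None:
    run(["databricks", "bundle", "validate", "--target", target])
    run(["databricks", "bundle", "deploy", "--target", target])

if __name__ == "__main__":
    # e.g. a GitHub Actions or Azure DevOps job passes the target as an argument
    deploy(sys.argv[1] if len(sys.argv) > 1 else "dev")
```

In practice the same calls run once per target workspace, which is how the dev vs. prod differences mentioned above are kept declarative rather than hand-managed.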

SQL-Based ETL: Options for SQL-Only Databricks Development

Using SQL for data transformation is a powerful way for an analytics team to create their own data pipelines. However, relying on SQL often comes with tradeoffs such as limited functionality, hard-to-maintain stored procedures or skipping best practices like version control and data tests. Databricks supports building high-performing SQL ETL workloads. Attend this session to hear how Databricks supports SQL for data transformation jobs as a core part of your Data Intelligence Platform. In this session we will cover four options to use Databricks with SQL syntax to create Delta tables:

  • Lakeflow Declarative Pipelines: a declarative ETL option to simplify batch and streaming pipelines
  • dbt: an open-source framework to apply engineering best practices to SQL-based data transformations
  • SQLMesh: an open-core product to easily build high-quality and high-performance data pipelines
  • SQL notebook jobs: a combination of Databricks Workflows and parameterized SQL notebooks (a minimal example is sketched below)
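
As a flavor of the last option, a notebook cell might simply render a job parameter into a CREATE TABLE statement. The catalog, table and column names below are made up for illustration; spark and dbutils are provided by the Databricks notebook environment.

```python
# Illustrative sketch of the "SQL notebook + Workflows" option: a notebook parameter is
# rendered into a SQL statement that (re)builds a Delta table. Names are assumptions.
run_date = dbutils.widgets.get("run_date")  # parameter supplied by the Databricks Workflows job

spark.sql(f"""
    CREATE OR REPLACE TABLE analytics.daily_orders AS
    SELECT customer_id,
           CAST(order_ts AS DATE) AS order_date,
           SUM(amount)            AS order_total
    FROM   raw.orders
    WHERE  CAST(order_ts AS DATE) = DATE '{run_date}'
    GROUP  BY customer_id, CAST(order_ts AS DATE)
""")
```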

Boosting Data Science and AI Productivity With Databricks Notebooks

This session is repeated. Want to accelerate your team's data science workflow? This session reveals how Databricks Notebooks can transform your productivity through an optimized environment designed specifically for data science and AI work. Discover how notebooks serve as a central collaboration hub where code, visualizations, documentation and results coexist seamlessly, enabling faster iteration and development. Key takeaways:

  • Leveraging interactive coding features including multi-language support, command-mode shortcuts and magic commands
  • Implementing version control best practices through Git integration and notebook revision history
  • Maximizing collaboration through commenting, sharing and real-time co-editing capabilities
  • Streamlining ML workflows with built-in MLflow tracking and experiment management

You'll leave with practical techniques to enhance your notebook-based workflow and deliver AI projects faster with higher-quality results.
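
As a small, generic illustration of the MLflow tracking takeaway, a notebook cell along these lines logs parameters, metrics and the model itself. The dataset and model choice are arbitrary and not taken from the session.

```python
# Generic MLflow experiment-tracking example; dataset and model are placeholders.
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")   # the run, params, metric and model show up in the experiment UI
```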

The course is designed to cover advanced concepts and workflows in machine learning operations. It starts by introducing participants to continuous integration (CI) and continuous delivery (CD) workflows within machine learning projects, guiding them through the deployment of a sample CI/CD workflow using Databricks in the first section. Moving on to the second part, participants delve into data and model testing, where they actively create tests and automate CI/CD workflows. Finally, the course concludes with an exploration of model monitoring concepts, demonstrating the use of Lakehouse Monitoring to oversee machine learning models in production settings.
Pre-requisites: Familiarity with Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. intermediate-level knowledge of traditional ML concepts, development with CI/CD, the use of Python and Git for ML projects with popular platforms like GitHub)
Labs: Yes
Certification Path: Databricks Certified Machine Learning Professional
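
To make the "data and model testing" idea concrete, a CI job might run a pytest check like the sketch below before promoting a model. The registered model name, stage and accuracy threshold are placeholders, not course material, and the test assumes an MLflow tracking server is reachable from CI.

```python
# Hypothetical model quality gate for a CI workflow; model URI and threshold are placeholders.
import mlflow
import pytest
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

MODEL_URI = "models:/iris_classifier/Staging"   # hypothetical registered model

@pytest.fixture(scope="module")
def model():
    return mlflow.pyfunc.load_model(MODEL_URI)

def test_accuracy_above_threshold(model):
    X, y = load_iris(return_X_y=True, as_frame=True)
    predictions = model.predict(X)
    assert accuracy_score(y, predictions) >= 0.90, "staged model fell below the agreed threshold"
```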

In this course, you’ll learn how to apply patterns to securely store and delete personal information for data governance and compliance on the Data Intelligence Platform. We’ll cover topics like storing sensitive data appropriately to simplify granting access and processing deletes, processing deletes to ensure compliance with the right to be forgotten, performing data masking, and configuring fine-grained access control to grant appropriate privileges on sensitive data.
Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from Git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.), and beginner experience with Lakeflow Declarative Pipelines and streaming workloads.
Labs: Yes
Certification Path: Databricks Certified Data Engineer Professional
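
A hedged sketch of two of those patterns as they might be written from a notebook: propagating right-to-be-forgotten deletes collected in a control table, and masking a sensitive column for users outside a privileged group. All catalog, table, column and group names are placeholders.

```python
# Placeholders throughout; `spark` is the Databricks notebook session.

# 1) Process deletion requests gathered in a control table (right to be forgotten).
spark.sql("""
    DELETE FROM crm.customers
    WHERE customer_id IN (SELECT customer_id FROM compliance.erasure_requests)
""")

# 2) Column masking: only members of the `pii_readers` group see raw email addresses.
spark.sql("""
    CREATE OR REPLACE FUNCTION crm.mask_email(email STRING)
    RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '***' END
""")
spark.sql("ALTER TABLE crm.customers ALTER COLUMN email SET MASK crm.mask_email")
```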

In this course, you’ll learn how to optimize workloads and physical layout with Spark and Delta Lake and analyze the Spark UI to assess performance and debug applications. We’ll cover topics like streaming, liquid clustering, data skipping, caching, Photon, and more.
Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from Git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), and intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.).
Labs: Yes
Certification Path: Databricks Certified Data Engineer Professional
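
A couple of illustrative commands for the layout topics listed above (liquid clustering and file compaction); the table name and clustering columns are placeholders, not course code.

```python
# Placeholders throughout; `spark` is the Databricks notebook session.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.events (
        event_id   BIGINT,
        event_date DATE,
        region     STRING,
        payload    STRING
    )
    CLUSTER BY (event_date, region)   -- liquid clustering in place of static partitioning
""")

spark.sql("OPTIMIZE sales.events")                               # compact files; clustering is applied incrementally
spark.sql("DESCRIBE DETAIL sales.events").show(truncate=False)   # inspect the resulting file layout
```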

In this course, you’ll learn how to incrementally process data to power analytic insights with Structured Streaming and Auto Loader, and how to apply design patterns for designing workloads to perform ETL on the Data Intelligence Platform with Lakeflow Declarative Pipelines. First, we’ll cover topics including ingesting raw streaming data, enforcing data quality, implementing CDC, and exploring and tuning state information. Then, we’ll cover options to perform a streaming read on a source, requirements for end-to-end fault tolerance, options to perform a streaming write to a sink, and creating an aggregation and watermark on a streaming dataset.
Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from Git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.), and beginner experience with streaming workloads and familiarity with Lakeflow Declarative Pipelines.
Labs: No
Certification Path: Databricks Certified Data Engineer Professional
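
A minimal Structured Streaming sketch of those pieces (an Auto Loader read, then a watermarked windowed aggregation written to a sink) is shown below. The paths, column names and thresholds are assumptions, and event_time is assumed to be a timestamp column.

```python
# Hedged sketch: Auto Loader source, watermarked hourly aggregation, Delta sink.
from pyspark.sql import functions as F

raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/demo/etl/_schemas/clicks")
    .load("/Volumes/demo/etl/landing/clicks")
)

hourly = (
    raw.withWatermark("event_time", "30 minutes")            # bound state kept for late data
       .groupBy(F.window("event_time", "1 hour"), "page")
       .count()
)

(
    hourly.writeStream
    .outputMode("append")                                     # emit only finalized windows
    .option("checkpointLocation", "/Volumes/demo/etl/_checkpoints/clicks_hourly")
    .toTable("demo.etl.clicks_hourly")
)
```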

Code changing lives? Absolutely. We're diving into Python's power to deploy cutting-edge solutions for lung cancer diagnosis and treatment in medical and surgical robotics. Expect demos showcasing algorithms, data analysis, and real-world impact—bridging MedTech innovation and life-changing solutions. Ready to see Python revolutionize lung health? Join us. Let's code a healthier future together!

Smarter Demand Planning: How to Build a No-Code Forecasting App | The Data Apps Conference

Sales forecasting and demand planning are critical business processes, but most organizations still rely on spreadsheets—leading to version control issues, fragmented approvals, and lack of historical tracking. Traditional BI tools struggle to solve these problems because they don’t allow for cell-level edits, inline comments, and structured approval workflows in a governed way.

In this session, Ian Reed will demonstrate how to:

  • Enable real-time forecasting by replacing manual spreadsheets with a structured, cloud-based data app
  • Allow for cell-level edits and inline commentary so teams can capture assumptions behind forecast changes
  • Implement an automated approval workflow with proper governance
  • Integrate seamlessly with live data sources for continuous updates and visibility into actual vs. forecasted performance
  • Track historical changes and maintain audit trails of all modifications

With Sigma, demand planning is no longer a fragmented, error-prone process—it’s a seamless, governed workflow that scales with business growth. Join this session for a demo and a step-by-step walkthrough of how this app was built, proving that anyone can create a highly customizable, enterprise-grade demand planning system in Sigma without deep technical expertise.


Streamlining Contract Management: TEKsystems’ Journey to Automation | The Data Apps Conference

Managing contracts in spreadsheets was creating significant challenges for TEKsystems: crashes, version control issues, and no connection to enterprise data were impacting 20+ users trying to collaborate on thousands of records.

In this session, Riley Owens (Controller) and Chase Brookmyer (Business Reporting Analyst) will demonstrate how they:

  • Built a contract lifecycle management app with no formal development background
  • Transformed from an unstable spreadsheet to a centralized data app
  • Created streamlined workflows and reporting capabilities
  • Accomplished the build in approximately 10 hours of learning and development

By leveraging Sigma’s flexible architecture, TEKsystems was able to build this solution in a matter of hours, demonstrating how teams without formal development experience can create powerful workflow automation tools. Watch this session to see how they transformed contract management and what’s next for their data-driven approach.


Shift Left with Apache Iceberg Data Products to Power AI | Andrew Madson | Shift Left Data Conference 2025

High-quality, governed, and performant data from the outset is vital for agile, trustworthy enterprise AI systems. Traditional approaches delay addressing data quality and governance, causing inefficiencies and rework. Apache Iceberg, a modern table format for data lakes, empowers organizations to "Shift Left" by integrating data management best practices earlier in the pipeline to enable successful AI systems.

This session covers how Iceberg's schema evolution, time travel, ACID transactions, and Git-like data branching allow teams to validate, version, and optimize data at its source. Attendees will learn to create resilient, reusable data assets, streamline engineering workflows, enforce governance efficiently, and reduce late-stage transformations—accelerating analytics, machine learning, and AI initiatives.
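
A hedged Spark SQL illustration of those capabilities (schema evolution, time travel, and a Git-like branch write), assuming a Spark session with Iceberg's SQL extensions and a catalog named lake; every table, branch name and timestamp below is made up.

```python
# Assumes an Iceberg-enabled Spark session with a catalog named `lake`; names are illustrative.
spark.sql("ALTER TABLE lake.db.orders ADD COLUMNS (discount DOUBLE)")        # schema evolution

spark.sql("""
    SELECT * FROM lake.db.orders TIMESTAMP AS OF '2025-01-01 00:00:00'       -- time travel
""").show()

spark.sql("ALTER TABLE lake.db.orders CREATE BRANCH etl_audit")              # Git-like branch
spark.sql("""
    INSERT INTO lake.db.orders.branch_etl_audit                              -- write to the branch
    SELECT * FROM lake.db.orders_staging
""")
```

Writing to a branch is what lets a pipeline validate new data "at the source" and only fast-forward the main branch once checks pass.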

Shifting Left with Data DevOps | Chad Sanderson | Shift Left Data Conference 2025

Data DevOps applies rigorous software development practices—such as version control, automated testing, and governance—to data workflows, empowering software engineers to proactively manage data changes and address data-related issues directly within application code. By adopting a "shift left" approach with Data DevOps, SWE teams become more aware of data requirements, dependencies, and expectations early in the software development lifecycle, significantly reducing risks, improving data quality, and enhancing collaboration.

This session will provide practical strategies for integrating Data DevOps into application development, enabling teams to build more robust data products and accelerate adoption of production AI systems.
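
As one concrete flavor of what shifting data testing left can look like inside application code, the sketch below checks that an event producer still matches the schema downstream analytics expects. The contract, fields and producer function are invented for illustration.

```python
# Invented example of a data contract test that runs in the application's CI, not in the warehouse.
import json

EVENT_CONTRACT = {"order_id": int, "customer_id": int, "amount": float, "currency": str}

def build_order_event(order_id: int, customer_id: int, amount: float) -> str:
    """Stand-in for the application code that emits analytics events."""
    return json.dumps({"order_id": order_id, "customer_id": customer_id,
                       "amount": amount, "currency": "USD"})

def test_order_event_matches_contract():
    event = json.loads(build_order_event(1, 42, 19.99))
    assert set(event) == set(EVENT_CONTRACT), "field added or removed without updating the contract"
    for field, expected_type in EVENT_CONTRACT.items():
        assert isinstance(event[field], expected_type), f"{field} changed type"

if __name__ == "__main__":
    test_order_event_matches_contract()
    print("event matches contract")
```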

Data Insight Foundations: Step-by-Step Data Analysis with R

This book is an essential guide designed to equip you with the vital tools and knowledge needed to excel in data science. Master the end-to-end process of data collection, processing, validation, and imputation using R, and understand fundamental theories to achieve transparency with literate programming, renv, and Git, and much more. Each chapter is concise and focused, rendering complex topics accessible and easy to understand. Data Insight Foundations caters to a diverse audience, including web developers, mathematicians, data analysts, and economists, and its flexible structure enables you to explore chapters in sequence or navigate directly to the topics most relevant to you. While examples are primarily in R, a basic understanding of the language is advantageous but not essential. Many chapters, especially those focusing on theory, require no programming knowledge at all. Dive in and discover how to manipulate data, ensure reproducibility, conduct thorough literature reviews, collect data effectively, and present your findings with clarity.

What You Will Learn:

  • Data Management: Master the end-to-end process of data collection, processing, validation, and imputation using R.
  • Reproducible Research: Understand fundamental theories and achieve transparency with literate programming, renv, and Git.
  • Academic Writing: Conduct scientific literature reviews and write structured papers and reports with Quarto.
  • Survey Design: Design well-structured surveys and manage data collection effectively.
  • Data Visualization: Understand data visualization theory and create well-designed and captivating graphics using ggplot2.

Who This Book Is For:

Career professionals such as research and data analysts transitioning from academia to a professional setting where production quality significantly impacts career progression. Some familiarity with data analytics processes and an interest in learning R or Python are ideal.

In today’s episode of The Pragmatic Engineer, I’m joined by Jonas Tyroller, one of the developers behind Thronefall, a minimalist indie strategy game that blends tower defense and kingdom-building, now available on Steam. Jonas takes us through the journey of creating Thronefall from start to finish, offering insights into the world of indie game development. We explore:

  • Why indie developers often skip traditional testing and how they find bugs
  • The developer workflow using Unity, C# and Blender
  • The two types of prototypes game developers build
  • Why Jonas spent months building game prototypes in 1-2 days
  • How Jonas uses ChatGPT to build games
  • Jonas’s tips on making games that sell
  • And more!

Timestamps:
(00:00) Intro
(02:07) Building in Unity
(04:05) What the shader tool is used for
(08:44) How a Unity build is structured
(11:01) How game developers write and debug code
(16:21) Jonas’s Unity workflow
(18:13) Importing assets from Blender
(21:06) The size of Thronefall and how it can be so small
(24:04) Jonas’s thoughts on code review
(26:42) Why practices like code review and source control might not be relevant for all contexts
(30:40) How Jonas and Paul ensure the game is fun
(32:25) How Jonas and Paul used beta testing feedback to improve their game
(35:14) The mini-games in Thronefall and why they are so difficult
(38:14) The struggle to find the right level of difficulty for the game
(41:43) Porting to Nintendo Switch
(45:11) The prototypes Jonas and Paul made to get to Thronefall
(46:59) The challenge of finding something you want to build that will sell
(47:20) Jonas’s ideation process and how they figure out what to build
(49:35) How Thronefall evolved from a mini-game prototype
(51:50) How long you spend on prototyping
(52:30) A lesson in failing fast
(53:50) The gameplay prototype vs. the art prototype
(55:53) How Jonas and Paul distribute work
(57:35) Next steps after having the play prototype and art prototype
(59:36) How a launch on Steam works
(1:01:18) Why pathfinding was the most challenging part of building Thronefall
(1:08:40) Gen AI tools for building indie games
(1:09:50) How Jonas uses ChatGPT for editing code and as a translator
(1:13:25) The pros and cons of being an indie developer
(1:15:32) Jonas’s advice for software engineers looking to get into indie game development
(1:19:32) What to look for in a game design school
(1:22:46) How luck figures into success and Jonas’s tips for building a game that sells
(1:26:32) Rapid fire round

The Pragmatic Engineer deepdives relevant for this episode:
  • Game development basics: https://newsletter.pragmaticengineer.com/p/game-development-basics
  • Building a simple game using Unity: https://newsletter.pragmaticengineer.com/p/building-a-simple-game

See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast


Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi have dramatically transformed the data management landscape by enabling high-speed operations on massive datasets stored in object stores while maintaining ACID guarantees.

In this talk, we will explore the evolution and future of dataset versioning in the context of open table formats. Open table formats introduced the concept of table-level versioning and have become widely adopted standards. Data versioning systems that have emerged more recently, bringing best practices from software engineering into the data ecosystem, enable the management of multiple datasets within a large-scale data repository using Git-like semantics. Data versioning systems operate at the file level and are compatible with any open table format. On top of this, new catalogs that support these table formats and add a layer of access control are becoming the standard way to manage tabular datasets.

Despite these advancements, there remains a significant gap between current data versioning practices and the requirements for effective tabular dataset versioning.

The session will introduce the concept of a versioned catalog as a solution, demonstrating how it provides comprehensive data and metadata versioning for tables.

We’ll cover key requirements of tabular dataset management, including:

  • Capturing multi-table changes as single logical operations
  • Enabling seamless rollbacks without identifying each affected table
  • Implementing table format-aware versioning operations such as diff and merge

Join us to explore the future of dataset versioning in the era of open table formats and evolving data management practices!
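
The following toy model (deliberately not any real product's API) shows what the first two requirements above mean in practice: several tables change under one catalog version, and a rollback restores all of them at once without enumerating the affected tables.

```python
# Toy in-memory model of multi-table commits and catalog-level rollback; illustration only.
from copy import deepcopy

class VersionedCatalog:
    def __init__(self):
        self.tables: dict[str, list] = {}
        self.history: list[dict] = [deepcopy(self.tables)]   # version 0 = empty catalog

    def commit(self, changes: dict[str, list]) -> int:
        """Apply appends to many tables atomically and record one new catalog version."""
        for table, rows in changes.items():
            self.tables.setdefault(table, []).extend(rows)
        self.history.append(deepcopy(self.tables))
        return len(self.history) - 1

    def rollback(self, version: int) -> None:
        """Restore every table to its state at `version` in a single operation."""
        self.tables = deepcopy(self.history[version])
        self.history.append(deepcopy(self.tables))

catalog = VersionedCatalog()
v1 = catalog.commit({"orders": [{"id": 1}], "order_items": [{"order_id": 1, "sku": "A"}]})
catalog.commit({"orders": [{"id": 2}]})
catalog.rollback(v1)          # both tables return to the v1 state together
```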

Learn how to track file changes using version control with Git and GitHub. Source code is usually under version control with snapshots taken over time. It is a powerful feature with many benefits. For example, programmers can go back in time if needed, be it to address and reverse problems caused by the latest changes or to look up past implementation decisions in order to inform the making of new ones. You're going to need the Git program installed on your machine. In addition, create an account on the GitHub website.
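
A minimal sketch of that snapshot workflow, driving the Git command line from Python; the repository and file names are just examples, and the config lines are only needed if Git has no identity set globally.

```python
# Minimal sketch of "snapshots over time": initialize a repo, commit a change, view history.
# Assumes the `git` executable is installed and on PATH; names are illustrative.
import subprocess
from pathlib import Path

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

repo = Path("demo-repo")
repo.mkdir(exist_ok=True)
(repo / "notes.txt").write_text("first draft\n")

git("-C", str(repo), "init")
git("-C", str(repo), "config", "user.name", "Demo User")          # only needed if unset globally
git("-C", str(repo), "config", "user.email", "demo@example.com")
git("-C", str(repo), "add", "notes.txt")
git("-C", str(repo), "commit", "-m", "Take the first snapshot")
git("-C", str(repo), "log", "--oneline")                          # the history of snapshots
# Publishing to GitHub then adds: git remote add origin <url>  and  git push -u origin main
```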


At Wix, more often than not, business analysts build workflows themselves to avoid data engineers becoming a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to easily create thousands of DAGs. To bridge this gap we have built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “schedule” button. During the talk we will go through the problems of building a reliable and extendable DAG-generating tool, why we preferred Airflow over Apache Oozie, and also the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single cluster in Kubernetes.
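
Quix's internals are not shown in this abstract, but the underlying pattern (generating one Airflow DAG per user-defined SQL workflow from a small spec) can be sketched roughly as follows; the spec format, schedule and the stand-in Trino call are all assumptions for illustration.

```python
# Not Quix itself: a generic sketch of dynamically generating Airflow DAGs from SQL specs.
# Uses the Airflow 2.4+ `schedule` argument; older versions use `schedule_interval`.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

WORKFLOWS = {  # in a tool like Quix this would come from the notebooks users "schedule"
    "daily_sessions": {"schedule": "0 3 * * *",
                       "sql": "INSERT INTO marts.daily_sessions SELECT ... FROM raw.events"},
}

def run_trino_sql(sql: str) -> None:
    # Placeholder: in practice this would submit the statement to Trino (e.g. via the trino client).
    print(f"Would execute on Trino: {sql}")

for name, spec in WORKFLOWS.items():
    with DAG(dag_id=f"quix_{name}",
             start_date=datetime(2024, 1, 1),
             schedule=spec["schedule"],
             catchup=False) as dag:
        PythonOperator(task_id="run_sql",
                       python_callable=run_trino_sql,
                       op_kwargs={"sql": spec["sql"]})
    globals()[dag.dag_id] = dag   # expose each generated DAG to the Airflow scheduler
```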