talk-data.com
Topic: Git (136 tagged)
Top Events
Snowflake ML enables efficient development and deployment of advanced models without any data movement. With multi-GPU support, MLOps integration and Git-based workflows, Container Runtime provides a scalable environment for training, and Snowflake ML’s products such as Model Registry and Model Serving make it easy to deploy these models in production. This session explores best practices for scalable ML workflows and the creation of production-ready ML pipelines in Snowflake.
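For readers who want to picture that deployment path, here is a minimal, hedged sketch of logging a scikit-learn model to the Snowflake Model Registry with the snowflake-ml-python package; the connection parameters and model names are placeholders, and the exact API surface may differ between library versions.

```python
# A hedged sketch, not official Snowflake example code; connection parameters
# and names below are placeholders.
from snowflake.snowpark import Session
from snowflake.ml.registry import Registry
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumed connection details for illustration only.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "database": "ML_DB",
    "schema": "PUBLIC",
    "warehouse": "ML_WH",
}).create()

X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=500).fit(X, y)

# Log the model so it can be versioned and served from Snowflake.
registry = Registry(session=session)
model_version = registry.log_model(
    model,
    model_name="IRIS_CLASSIFIER",
    version_name="V1",
    sample_input_data=X.head(10),
)
print(model_version.show_functions())  # e.g. predict / predict_proba
```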
Master the art of data transformation with the second edition of this trusted guide to dbt. Building on the foundation of the first edition, this updated volume offers a deeper, more comprehensive exploration of dbt’s capabilities—whether you're new to the tool or looking to sharpen your skills. It dives into the latest features and techniques, equipping you with the tools to create scalable, maintainable, and production-ready data transformation pipelines.
Unlocking dbt, Second Edition introduces key advancements, including the semantic layer, which allows you to define and manage metrics at scale, and dbt Mesh, empowering organizations to orchestrate decentralized data workflows with confidence. You’ll also explore more advanced testing capabilities, expanded CI/CD and deployment strategies, and enhancements in documentation—such as the newly introduced dbt Catalog. As in the first edition, you’ll learn how to harness dbt’s power to transform raw data into actionable insights, while incorporating software engineering best practices like code reusability, version control, and automated testing. From configuring projects with the dbt Platform or open source dbt to mastering advanced transformations using SQL and Jinja, this book provides everything you need to tackle real-world challenges effectively.
What You Will Learn:
- Understand dbt and its role in the modern data stack
- Set up projects using both the cloud-hosted dbt Platform and open source project
- Connect dbt projects to cloud data warehouses
- Build scalable models in SQL and Python
- Configure development, testing, and production environments
- Capture reusable logic with Jinja macros
- Incorporate version control with your data transformation code
- Seamlessly connect your projects using dbt Mesh
- Build and manage a semantic layer using dbt
- Deploy dbt using CI/CD best practices
Who This Book Is For:
Current and aspiring data professionals, including architects, developers, analysts, engineers, data scientists, and consultants who are beginning the journey of using dbt as part of their data pipeline’s transformation layer. Readers should have a foundational knowledge of writing basic SQL statements, development best practices, and working with data in an analytical context such as a data warehouse.
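To make the "models in SQL and Python" point concrete, here is a minimal sketch of a dbt Python model; the model and column names are hypothetical, and the DataFrame you get back from dbt.ref() depends on your adapter (Snowpark on Snowflake, PySpark on Databricks).

```python
# models/orders_cleaned.py — a minimal dbt Python model sketch (hypothetical names).
def model(dbt, session):
    # Materialize the result as a table in the warehouse.
    dbt.config(materialized="table")

    # Reference an upstream dbt model; returns an adapter-specific DataFrame.
    orders = dbt.ref("stg_orders")

    # Keep only orders with a positive total (this filter syntax works for
    # both Snowpark and PySpark DataFrames).
    return orders.filter(orders["order_total"] > 0)
```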
Large‑language‑model agents are only as useful as the context and tools they can reach.
Anthropic’s Model Context Protocol (MCP) proposes a universal, bidirectional interface that turns every external system—SQL databases, Slack, Git, web browsers, even your local file‑system—into first‑class “context providers.”
In just 30 minutes we’ll step from high‑level buzzwords to hands‑on engineering details:
- How MCP’s JSON‑RPC message format, streaming channels, and version negotiation work under the hood (see the sketch after this list).
- Why per‑tool sandboxing via isolated client processes hardens security (and what happens when an LLM tries rm ‑rf /).
- Techniques for hierarchical context retrieval that stretch a model’s effective window beyond token limits.
- Real‑world patterns for accessing multiple tools—Postgres, Slack, GitHub—and plugging MCP into GenAI applications.
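As a rough illustration of the wire format, the snippet below builds the kind of JSON-RPC 2.0 request an MCP client might send to invoke a server-side tool; the tool name and its arguments are hypothetical, so treat this as a sketch of the message shape rather than a normative example.

```python
import json

# Sketch of a JSON-RPC 2.0 "tools/call" request as used by MCP clients;
# the tool name and its arguments below are made up for illustration.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "query_database",  # hypothetical tool exposed by an MCP server
        "arguments": {"sql": "SELECT count(*) FROM orders"},
    },
}

print(json.dumps(tool_call_request, indent=2))
```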
Expect code snippets and lessons from early adoption.
You’ll leave ready to wire your own services into any MCP‑aware model and level‑up your GenAI applications—without the N×M integration nightmare.
Zopa outgrew stored procedures: opaque logic, poor documentation, and inefficient pipelines. Moving to dbt Core brought modular models, tests, and version control, cutting change risk and spreading siloed knowledge. The dbt Platform then simplified onboarding, performance, and ownership, with live docs and lineage that boosted adoption and trust across teams, and supported regulatory reporting. Learn the tactics behind Zopa’s hub-and-spoke model, faster onboarding, and reusable definitions via dbt Mesh.
Ever been burned by a mysterious slowdown in your data pipeline? In this session, we'll reveal how a stealthy performance regression in the Polars DataFrame library was hunted down and squashed. Using git bisect, Bash scripting, and uv, we automated commit compilation and benchmarking across two repos to pinpoint a commit that degraded multi-file Parquet loading. This led to challenging assumptions and rethinking performance monitoring for the Python data science library Polars.
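A sketch of the kind of probe script that makes this automation possible: git bisect run treats a zero exit code as "good" and non-zero as "bad", so a small Polars benchmark can drive the search. The paths and threshold below are assumptions, not the actual setup from the talk.

```python
# bench_parquet.py — hypothetical probe for `git bisect run python bench_parquet.py`.
import sys
import time

import polars as pl

THRESHOLD_S = 2.0  # assumed acceptable wall-clock time for this dataset

start = time.perf_counter()
df = pl.scan_parquet("data/*.parquet").collect()  # multi-file Parquet load under test
elapsed = time.perf_counter() - start

print(f"loaded {df.height} rows in {elapsed:.2f}s")

# Exit 0 marks the commit "good", exit 1 marks it "bad" for git bisect.
sys.exit(0 if elapsed < THRESHOLD_S else 1)
```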
Most enterprise AI initiatives don’t fail because of bad models. They fail because of bad data. As organizations rush to integrate LLMs and advanced analytics into production, they often hit a roadblock: datasets that are messy, constantly evolving, and nearly impossible to manage at scale.
This session reveals why data is the Achilles’ heel of enterprise AI and how data version control can turn that weakness into a strength. You’ll learn how data version control transforms the way teams manage training datasets, track ML experiments, and ensure reproducibility across complex, distributed systems.
We’ll cover the fundamentals of data versioning, its role in modern enterprise AI architecture, and real-world examples of teams using it to build scalable, trustworthy AI systems.
Whether you’re an ML engineer, data architect, or AI leader, this talk will help you identify critical data challenges before they stall your roadmap, and provide you with a proven framework to overcome them.
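The talk discusses data version control generically; as one concrete illustration, here is a minimal sketch using DVC's Python API to read a dataset pinned to a Git tag. The repository URL, file path, and tag are hypothetical.

```python
# Illustrative only: DVC is one possible data-versioning tool; the repo, path,
# and rev below are placeholders.
import dvc.api

with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2.0",  # Git tag that pins the exact dataset version
) as f:
    print(f.readline())  # peek at the header of the versioned file
```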
Marimo - a reactive, git-friendly Python notebook
Security teams rely on telemetry—the continuous stream of logs, events, metrics, and signals that reveal what’s happening across systems, endpoints, and cloud services. But that data doesn’t organize itself. It has to be collected, normalized, enriched, and secured before it becomes useful. That’s where data engineering comes in. In this hands-on guide, cybersecurity engineer James Bonifield teaches you how to design and build scalable, secure data pipelines using free, open source tools such as Filebeat, Logstash, Redis, Kafka, and Elasticsearch. You’ll learn how to collect telemetry from Windows (including Sysmon and PowerShell events), Linux files and syslog, and streaming data from network and security appliances. You’ll then transform it into structured formats, secure it in transit, and automate your deployments using Ansible.
You’ll also learn how to:
- Encrypt and secure data in transit using TLS and SSH
- Centrally manage code and configuration files using Git
- Transform messy logs into structured events
- Enrich data with threat intelligence using Redis and Memcached
- Stream and centralize data at scale with Kafka
- Automate with Ansible for repeatable deployments
Whether you’re building a pipeline on a tight budget or deploying an enterprise-scale system, this book shows you how to centralize your security data, support real-time detection, and lay the groundwork for incident response and long-term forensics.
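To give a flavor of the Redis-based enrichment step mentioned above, here is a small sketch using the redis-py client; the key naming scheme and event fields are assumptions, not the book's actual code.

```python
# A sketch of threat-intel enrichment with Redis; the key format and event
# fields are assumed for illustration.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def enrich(event: dict) -> dict:
    """Attach threat-intel context if the source IP is a known indicator."""
    intel = r.get(f"ti:ip:{event.get('src_ip', '')}")
    if intel:
        event["threat_intel"] = json.loads(intel)
    return event


print(enrich({"src_ip": "203.0.113.7", "action": "login_failed"}))
```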
Welcome to the cozy corner of the tech world! Datatopics is your go-to spot for relaxed discussions around tech, news, data, and society. In this episode of Data Topics, we sit down with Nick Schouten — data engineer at dataroots — for a full recap of KubeCon Europe 2025 and a deep dive into the current and future state of Kubernetes. We talk through what’s actually happening in the Kubernetes ecosystem — from platform engineering trends to AI infra challenges — and why some teams are doubling down while others are stepping away.
Here’s what we cover:
- What Kubernetes actually is, and how to explain it beyond the buzzword
- When Kubernetes is the right choice (e.g., hybrid environments, GPU-heavy workloads) — and when it’s overkill
- How teams are trying to host LLMs and AI models on Kubernetes, and the blockers they’re hitting (GPUs, complexity, cost)
- GitOps innovations spotted at KubeCon — like tools that convert UI clicks into Git commits for infrastructure-as-code
- Why observability is still one of Kubernetes’ biggest weaknesses, and how a wave of new startups are trying to solve it
- The push to improve developer experience for ML and data teams (no more YAML overload)
- The debate around abstraction vs control — and how some teams are turning away from Kubernetes entirely in favor of simpler tools
- What “vibe coding” means in an LLM-driven world, and how voice-to-code workflows are changing how we write infrastructure
- Whether the future of Kubernetes is more “visible and accessible,” or further under the hood
If you're a data engineer, MLOps practitioner, platform lead, or simply trying to stay ahead of the curve in infrastructure and AI — this episode is packed with relevant insights from someone who's hands-on with both the tools and the teaching.
Python notebooks are a workhorse of scientific computing. But traditional notebooks have problems — they suffer from a reproducibility crisis; they are difficult to use with interactive widgets; their file format does not play well with Git; and they aren't reusable like regular Python scripts or modules.
This talk presents marimo, an open-source reactive Python notebook that addresses these concerns by modeling notebooks as dataflow graphs and storing them as Python files. We discuss design decisions and their tradeoffs, and show how these decisions make marimo notebooks reproducible in execution and packaging, Git-friendly, executable as scripts, and shareable as apps.
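To show what "notebooks as Python files" means in practice, here is a minimal sketch of what a marimo notebook looks like on disk; the cell contents are illustrative, but the overall shape (an App object, cells as decorated functions that return the names they define, and a script entry point) is what makes the file Git-friendly and runnable as a script.

```python
# notebook.py — a minimal marimo notebook stored as a plain Python file.
import marimo

app = marimo.App()


@app.cell
def _():
    import polars as pl
    return (pl,)


@app.cell
def _(pl):
    # Downstream cell: marimo re-runs it automatically when its inputs change,
    # because cells form a dataflow graph.
    df = pl.DataFrame({"x": [1, 2, 3]})
    df
    return (df,)


if __name__ == "__main__":
    app.run()  # the same file is executable as a script
```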
SQLMesh in Action: Git, Testing, and Deployments for SQL
Maintaining code quality can be challenging, no matter the size of your project or number of contributors. Different team members may have different opinions on code styling and preferences for code structure, while solo contributors might find themselves spending a considerable amount of time making sure the code conforms to accepted conventions. However, manually inspecting and fixing issues in files is both tedious and error-prone. As such, computers are much more suited to this task than humans. Pre-commit hooks are a great way to have a computer handle this for you.
Pre-commit hooks are code checks that run whenever you attempt to commit your changes with Git. They can detect and, in some cases, automatically correct code-quality issues before they make it to your codebase. In this tutorial, you will learn how to install and configure pre-commit hooks for your repository to ensure that only code that passes your checks makes it into your code base. We will also explore how to build custom pre-commit hooks for novel use cases.
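As a taste of the custom-hook part, here is a minimal sketch of a hook script: pre-commit passes the staged file names as command-line arguments and treats a non-zero exit code as a failed check. The specific rule (rejecting leftover breakpoint() calls) is just an example.

```python
#!/usr/bin/env python3
"""Example custom pre-commit hook: reject files containing breakpoint()."""
import sys


def main(filenames: list[str]) -> int:
    exit_code = 0
    for name in filenames:
        with open(name, encoding="utf-8", errors="ignore") as f:
            for lineno, line in enumerate(f, start=1):
                if "breakpoint()" in line:
                    print(f"{name}:{lineno}: remove breakpoint() before committing")
                    exit_code = 1
    return exit_code


if __name__ == "__main__":
    # pre-commit invokes the hook with the staged file paths as arguments.
    sys.exit(main(sys.argv[1:]))
```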
As the demand for data products grows, data engineering teams face mounting pressure to deliver more and even faster, often becoming bottlenecks. Astro IDE changes the game. Astro IDE is an AI-powered code editor built for Apache Airflow. It helps data teams go from idea to production in minutes—generating production-ready DAGs, enabling in-browser testing, and integrating directly with Git. In this session, see how Astro IDE accelerates DAG creation, debugging, and deployment so data engineering teams can deliver more, 10x faster.
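For context, this is roughly the shape of the artifact such a tool produces: a standard Airflow DAG written with the TaskFlow API. The pipeline below is a hypothetical hand-written example, not output from Astro IDE.

```python
# A hypothetical TaskFlow-style Airflow DAG.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["example"])
def orders_pipeline():
    @task
    def extract() -> list[dict]:
        return [{"order_id": 1, "total": 10.0}]

    @task
    def load(rows: list[dict]) -> None:
        print(f"loaded {len(rows)} rows")

    load(extract())


orders_pipeline()
```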
In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration.
The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.
Vayu is a conversational copilot for Apache Airflow, developed at Prevalent AI to help data engineers manage, troubleshoot, and fix pipelines using natural language. Deployments often fail silently due to misconfigurations, missing connections, or runtime issues that are impossible to catch in unit tests. Vayu tackles these with a troubleshooting agent that inspects logs, metrics, configs, and runtime state to find root causes and suggest fixes, saving engineers significant troubleshooting time. It can also apply approved fixes to DAG code and commit them to your version control system.
Key Capabilities:
- Troubleshooting Agent: Inspects logs, configs, variables, and connections to find root causes and suggest fixes.
- Pipeline Mechanic Agent: Suggests code-level fixes (e.g., missing connections or bad imports) and, once approved, commits them to version control.
- DAG Manager Agent: Understands DAG logic, suggests improvements, and can trigger DAGs conversationally.
Architecture: Built with open-source tools, including Google ADK as the orchestration layer and a custom Airflow MCP server based on the FastMCP framework. LLMs never access Airflow directly. The full codebase will be open-sourced.
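To give a feel for what an Airflow MCP server built on FastMCP might look like, here is a minimal, hypothetical sketch; the tool is stubbed out and is not Vayu's actual code.

```python
# Hypothetical MCP server sketch using FastMCP; not Vayu's implementation.
from fastmcp import FastMCP

mcp = FastMCP("airflow-helper")


@mcp.tool()
def get_dag_status(dag_id: str) -> str:
    """Return the latest run state for a DAG (stubbed for illustration)."""
    # A real server would call the Airflow REST API here instead of hard-coding.
    return f"DAG {dag_id}: success"


if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport, so an LLM agent can invoke the tool
```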
Building robust, production-grade data pipelines goes beyond writing transformation logic — it requires rigorous testing, version control, automated CI/CD workflows and a clear separation between development and production. In this talk, we’ll demonstrate how Lakeflow, paired with Databricks Asset Bundles (DABs), enables Git-based workflows, automated deployments and comprehensive testing for data engineering projects. We’ll share best practices for unit testing, CI/CD automation, data quality monitoring and environment-specific configurations. Additionally, we’ll explore observability techniques and performance tuning to ensure your pipelines are scalable, maintainable and production-ready.
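As one example of the unit-testing practice mentioned above, a transformation can be factored into a plain function and exercised with pytest on a local Spark session before any bundle deployment; the function and column names here are hypothetical.

```python
# test_transformations.py — a minimal pytest sketch with hypothetical names.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_order_total(df):
    """Transformation under test: order_total = quantity * unit_price."""
    return df.withColumn("order_total", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_add_order_total(spark):
    df = spark.createDataFrame([(2, 5.0)], ["quantity", "unit_price"])
    result = add_order_total(df).collect()[0]
    assert result["order_total"] == 10.0
```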
This in-depth session explores advanced MLOps practices for implementing production-grade machine learning workflows on Databricks. We'll examine the complete MLOps journey from foundational principles to sophisticated implementation patterns, covering essential tools including MLflow, Unity Catalog, Feature Stores and version control with Git. Dive into Databricks' latest MLOps capabilities including MLflow 3.0, which enhances the entire ML lifecycle from development to deployment with particular focus on generative AI applications.
Key session takeaways include:
- Advanced MLflow 3.0 features for LLM management and deployment
- Enterprise-grade governance with Unity Catalog integration
- Robust promotion patterns across development, staging and production
- CI/CD pipeline automation for continuous deployment
- GenAI application evaluation and streamlined deployment
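A minimal sketch of the registration flow behind several of these takeaways, assuming an MLflow client pointed at a Unity Catalog-backed registry; the catalog, schema, and model names are placeholders.

```python
# Hedged sketch: log a model and register it under a Unity Catalog name.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

mlflow.set_registry_uri("databricks-uc")  # assume a UC-backed model registry

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50).fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",  # artifact path within the run
        registered_model_name="main.ml_demo.iris_classifier",  # catalog.schema.model
    )
```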