DevOps

dbt as a Serverless Service

2022-10-25 · dbt Coalesce 2022 Watch

video

by Kiran Sheshadri (4 Mile Analytics)

Cloud Computing dbt GCP Cloud Run

Are you sold on dbt but unsure how you’ll handle deployment, orchestration, and job scheduling? Are you evaluating dbt and looking for an easy way to spin up a proof of concept while seeking buy-in from stakeholders? Look no further! In this workshop we will show you how to containerize your dbt project and execute jobs using GCP’s serverless computing products: Cloud Run, Cloud Build, and Cloud Scheduler. If you have an interest in dbt orchestration, DevOps, or serverless cloud architecture, this workshop is for you!

Check the slides here: https://docs.google.com/presentation/d/1NiG0MFkOvw5MNpCZFF74VDuX-jHZpO4a8bHUadukoPI/edit?usp=sharing
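As a rough illustration of the containerized approach described above, here is a minimal Python sketch of an HTTP entrypoint that a container could expose so that a scheduler can trigger a dbt run with a POST request. It is not taken from the workshop. The names `build_dbt_command`, `DbtJobHandler`, and `serve`, and the `DBT_COMMAND` and `DBT_TARGET` environment variables, are assumptions made for this example. It also assumes the `dbt` CLI is installed in the image and that the platform supplies the listening port through `PORT`.

```python
import os
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_dbt_command(command="run", target=None, select=None):
    """Build the argument list for one dbt CLI invocation."""
    args = ["dbt", command]
    if target:
        args += ["--target", target]
    if select:
        args += ["--select", select]
    return args


class DbtJobHandler(BaseHTTPRequestHandler):
    """Runs dbt once per POST request, for example one sent on a schedule."""

    def do_POST(self):
        # DBT_COMMAND and DBT_TARGET are hypothetical names used only in this sketch.
        cmd = build_dbt_command(
            command=os.environ.get("DBT_COMMAND", "run"),
            target=os.environ.get("DBT_TARGET"),
        )
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Report failure to the caller so a failed job is visible to the scheduler.
        status = 200 if result.returncode == 0 else 500
        body = (result.stdout + result.stderr).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve():
    """Listen on the port given in the PORT environment variable (default 8080)."""
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), DbtJobHandler).serve_forever()
```

A container entrypoint would call `serve()`. Returning a non-2xx status when dbt exits with an error lets the caller treat the job as failed and retry or alert as configured.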


Practical Database Auditing for Microsoft SQL Server and Azure SQL: Troubleshooting, Regulatory Compliance, and Governance

2022-09-19 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Josephine Bush

AWS Amazon RDS Azure Cloud Computing Microsoft SQL SQL Server data data-engineering microsoft-sql-server relational-databases

Know how to track changes and key events in your SQL Server databases in support of application troubleshooting, regulatory compliance, and governance. This book shows how to use key features in SQL Server, such as SQL Server Audit and Extended Events, to track schema changes, permission changes, and changes to your data. You’ll even learn how to track queries run against specific tables in a database. Not all changes and events can be captured and tracked using SQL Server Audit and Extended Events, and the book goes beyond those features to also show what can be captured using common criteria compliance, change data capture, temporal tables, or querying the SQL Server log. You will learn how to audit just what you need to audit, and how to audit pretty much anything that happens on a SQL Server instance. This book will also help you set up cloud auditing with an emphasis on Azure SQL Database, Azure SQL Managed Instance, and AWS RDS SQL Server. You don’t need expensive, third-party auditing tools to make auditing work for you, and to demonstrate and provide value back to your business. This book will help you set up an auditing solution that works for you and your needs. It shows how to collect the audit data that you need, centralize that data for easy reporting, and generate audit reports using built-in SQL Server functionality for use by your own team, developers, and organization’s auditors.

What You Will Learn
• Understand why auditing is important for troubleshooting, compliance, and governance
• Track changes and key events using SQL Server Audit and Extended Events
• Track SQL Server configuration changes for governance and troubleshooting
• Utilize change data capture and temporal tables to track data changes in SQL Server tables
• Centralize auditing data from all your databases for easy querying and reporting
• Configure auditing on Azure SQL, Azure SQL Managed Instance, and AWS RDS SQL Server

Who This Book Is For
Database administrators who need to know what’s changing on their database servers, and those who are making the changes; database-savvy DevOps engineers and developers who are charged with troubleshooting processes and applications; developers and administrators who are responsible for generating reports in support of regulatory compliance reporting and auditing

Comet for Data Science

2022-08-26 · O'Reilly Data Science Books O'Reilly Amazon

book

by Angelica Lo Duca

AI/ML Data Science GitLab NLP data data-science

Discover how to manage and optimize the life cycle of your data science projects with Comet! By the end of this book, you will master preparing, analyzing, building, and deploying models, as well as integrating Comet into your workflow.

What this Book will help me do
• Master managing data science workflows with Comet.
• Confidently prepare and analyze your data for effective modeling.
• Deploy and monitor machine learning models using Comet tools.
• Integrate Comet with DevOps and GitLab workflows for production readiness.
• Apply Comet to advanced topics like NLP, deep learning, and time series analysis.

Author(s)
Angelica Lo Duca is an experienced author and data scientist with years of expertise in data science workflows and tools. She brings practical insights into integrating platforms like Comet into modern data science tasks.

Who is it for?
If you are a data science practitioner or programmer looking to understand and implement efficient project lifecycles using Comet, this book is tailored for you. A basic background in data science and programming is highly recommended, but prior expertise in Comet is unnecessary.

Microsoft Power Platform Solution Architect's Handbook

2022-07-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Hugo Herrera

Azure Azure DevOps Microsoft business-intelligence data data-science microsoft-power-platform

Microsoft Power Platform Solution Architect's Handbook is your definitive resource for mastering enterprise-grade solution architecture using Microsoft Power Platform. By covering both practical examples and theoretical best practices, this book ensures you are well-prepared to tackle real-world challenges and excel in the PL-600 certification exam.

What this Book will help me do
• Master the essential practices of solution architecture for optimal design.
• Develop secure integrations and data strategies for cutting-edge applications.
• Learn sophisticated lifecycle and compliance management using Azure DevOps.
• Build impactful, compliant, and flexible solutions using Power Platform.
• Prepare effectively for the PL-600 certification exam and excel in your field.

Author(s)
Hugo Herrera is a respected technology expert specializing in solution architecture and enterprise-grade IT solutions, particularly with Microsoft Power Platform. Drawing from years of experience, Hugo emphasizes practical, actionable strategies to elevate professionals. Through this book, Hugo shares his deep expertise and makes complex concepts accessible.

Who is it for?
This book is perfect for solution architects, enterprise architects, IT consultants, and analysts focused on Microsoft Power Platform and related technologies. It provides insight and tools for professionals looking to enhance their competencies, advance their careers, and prepare for the PL-600 exam. The reader should have a solid understanding of Power Platform fundamentals.

A Low-Code Approach to 10x Data Engineering

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Analytics Data Engineering Databricks Git PySpark Spark

Can we take Data Engineering on Spark 10x beyond where it is today?

Yes, we can enable 10x more users on Spark, and make them 10x more productive from day 1. Data engineering can run at scale, and it can still be 10x simpler and faster to develop, deploy, and manage pipelines.

Low code is the key. A modern data engineering platform built on low code will enable all data users, from new graduates to experts, to visually develop high-quality pipelines. With Visual = Code, the visual elements will be stored as PySpark code on Git and deployed using the best software practices taken from DevOps. Search and lineage help data engineers and their customers in analytics understand how each column value was produced, when it was updated, and the associated quality metric.

See how a complete, low-code data engineering platform can reduce complexity and effort, enabling you to rapidly deploy, scale, and use Spark, making data and analytics a strategic asset in your company.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Agile/Scrum AI/ML Analytics AWS Cloud Computing Data Engineering Data Lakehouse Data Quality Databricks Delta Git Spark +1 more

Serving patients in over 100 countries, Amgen is a leading global biotech company focused on developing therapies that have the power to save lives. Delivering on this mission requires our commercial teams to regularly meet with healthcare providers to discuss new treatments that can help patients in need. With the onset of the pandemic, where face-to-face interactions with doctors and other Healthcare Providers (HCPs) were severely impacted, Amgen had to rethink these interactions. With that in mind, the Amgen Commercial Data and Analytics team leveraged a modern data and AI architecture built on the Databricks Lakehouse to help accelerate its digital and data insights capabilities. This foundation enabled Amgen’s teams to develop a comprehensive, customer-centric view to support flexible go-to-market models and provide personalized experiences to our customers.

In this presentation, we will share our recent journey of how we took an agile approach to bringing together over 2.2 petabytes of internally generated and externally sourced vendor data and onboarding it into our AWS Cloud and Databricks environments to enable standardized, scalable, and robust capabilities that meet the business requirements of our fast-changing life sciences environment. We will share use cases of how we harmonized and managed our diverse sets of data to deliver efficiency, simplification, and performance outcomes for the business. We will cover the following aspects of our journey along with best practices we learned over time:
• Our architecture to support Amgen’s Commercial Data & Analytics constant processing around the globe
• Engineering best practices for building large scale Data Lakes and Analytics platforms such as Team organization, Data Ingestion and Data Quality Frameworks, DevOps Toolkit and Maturity Frameworks, and more
• Databricks capabilities adopted such as Delta Lake, Workspace policies, SQL workspace endpoints, and MLflow for model registry and deployment. Also, various tools were built for Databricks workspace administration
• Databricks capabilities being explored for the future, such as Multi-task Orchestration, Container-based Apache Spark Processing, Feature Store, Repos for Git integration, etc.
• The types of commercial analytics use cases we are building on the Databricks Lakehouse platform

Attendees building global and enterprise-scale data engineering solutions to meet diverse sets of business requirements will benefit from learning about our journey. Technologists will learn how we addressed specific business problems via reusable capabilities built to maximize value.


Announcing General Availability of Databricks Terraform Provider

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

CI/CD Data Engineering Data Science Databricks Cyber Security Terraform

We all live in exciting times, amid the hype of Distributed Data Mesh (or just mess). This talk will cover a couple of architectural and organizational approaches to achieving Distributed Data Mesh, which is essentially a combination of mindset, fully automated infrastructure, continuous integration for data pipelines, dedicated team collaborative environments, and security enforcement. As a Data Leader, you’ll learn what you need to pay attention to when starting (or reviving) a modern Data Engineering and Data Science strategy, and how Databricks Unity Catalog may help you automate that. As a DevOps engineer, you’ll learn about the best practices and pitfalls of Continuous Deployment on Databricks with Terraform and Continuous Integration with Databricks Repos. You’ll be excited to see how you can automate Data Security with Unity Catalog and Terraform. As a Data Scientist, you’ll learn how you can get relevant infrastructure into “production” relatively quickly.


Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Dagster Data Management Databricks dbt DWH Git IaC Modern Data Stack

This talk discusses “software-defined assets”, a declarative approach to orchestration and data management that makes it drastically easier to trust and evolve datasets and ML models. Dagster is an open source orchestrator built for maintaining software-defined assets.

In traditional data platforms, code and data are only loosely coupled. As a consequence, deploying changes to data feels dangerous, backfills are error-prone and irreversible, and it’s difficult to trust data, because you don’t know where it comes from or how it’s intended to be maintained. Each time you run a job that mutates a data asset, you add a new variable to account for when debugging problems.

Dagster proposes an alternative approach to data management that tightly couples data assets to code - each table or ML model corresponds to the function that’s responsible for generating it. This results in a “Data as Code” approach that mimics the “Infrastructure as Code” approach that’s central to modern DevOps. Your git repo becomes your source of truth on your data, so pushing data changes feels as safe as pushing code changes. Backfills become easy to reason about. You trust your data assets because you know how they’re computed and can reproduce them at any time. The role of the orchestrator is to ensure that physical assets in the data warehouse match the logical assets that are defined in code, so each job run is a step towards order.

Software-defined assets is a natural approach to orchestration for the modern data stack, in part because dbt models are a type of software-defined asset.

Attendees of this session will learn how to build and maintain lakehouses of software-defined assets with Dagster.
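The coupling of assets to code and the reconciliation role described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Dagster's API: the `asset` decorator, the `ASSETS` registry, the `reconcile` function, and the sample assets are hypothetical names invented for this example, and a dictionary stands in for the data warehouse.

```python
import inspect

# Logical assets: asset name -> the function responsible for generating it.
ASSETS = {}


def asset(fn):
    """Register fn as the definition of the asset named after it (toy decorator)."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 35}]


@asset
def order_totals(raw_orders):
    # The parameter name declares a dependency on the raw_orders asset.
    return {"count": len(raw_orders), "revenue": sum(o["amount"] for o in raw_orders)}


def reconcile(store):
    """Materialize every defined asset that is missing from store, upstream first."""

    def materialize(name):
        if name in store:
            return store[name]
        fn = ASSETS[name]
        deps = inspect.signature(fn).parameters
        store[name] = fn(**{dep: materialize(dep) for dep in deps})
        return store[name]

    for name in ASSETS:
        materialize(name)
    return store


# Starting from an empty "warehouse", reconciliation builds both assets.
warehouse = reconcile({})
```

Because each stored value is produced only by its registered function, deleting an entry from `warehouse` and calling `reconcile` again reproduces it from code. The sketch does no cycle detection or staleness tracking, which a real orchestrator would need.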


MLflow Pipelines: Accelerating MLOps from Development to Production

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Data Science Databricks MLOps

Despite being an emerging topic, MLOps is hard and there are no widely established approaches for MLOps. What makes it even harder is that in many companies the ownership of MLOps usually falls through the cracks between data science teams and production engineering teams. Data scientists are mostly focused on modeling the business problems and reasoning about data, features, and metrics, while the production engineers/ops are mostly focused on traditional DevOps for software development, ignoring ML-specific Ops like ML development cycles, experiment tracking, data/model validation, etc. In this talk, we will introduce MLflow Pipelines, an opinionated approach for MLOps. It provides predefined ML pipeline templates for common ML problems and opinionated development workflows to help data scientists bootstrap ML projects, accelerate model development, and ship production-grade code with little help from production engineers.


Beyond Testing: How to Build Circuit Breakers with Airflow

2022-07-01 · Airflow Summit 2022

session

by Prateek Chawla (Monte Carlo)

Airflow Data Quality DataOps Monte Carlo Python

Testing is an important part of the DataOps life cycle, giving teams confidence in the integrity of their data as it moves downstream to production systems. But what happens when testing doesn’t catch all of your bad data and “unknown unknown” data quality issues fall through the cracks? Fortunately, data engineers can apply a thing or two from DevOps best practices to tackle data quality at scale with circuit breakers, a novel approach to stopping bad data from actually entering your pipelines in the first place. In this talk, Prateek Chawla, Founding Team Member and Technical Lead at Monte Carlo, will discuss what circuit breakers are, how to integrate them with your Airflow DAGs, and what this looks like in practice. Time permitting, Prateek will also walk through how to build and automate Airflow circuit breakers across multiple cascading pipelines with Python and other common tools.
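To make the circuit-breaker idea concrete, here is a minimal plain-Python sketch: a quality check runs before the load step and raises an exception when a threshold is breached, so the bad batch never reaches the destination. It is not the implementation presented in the talk. `CircuitOpenError`, `null_rate`, `check_circuit`, `run_pipeline`, the `customer_id` column, and the 10% threshold are all assumptions made for illustration. In an Airflow DAG, a check like this could run as its own task ahead of the load task, since with the default trigger rule a failed upstream task keeps its downstream tasks from running.

```python
class CircuitOpenError(Exception):
    """Raised when a quality check fails and the pipeline should stop."""


def null_rate(rows, column):
    """Fraction of rows whose value for column is missing; 1.0 for an empty batch."""
    if not rows:
        return 1.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_circuit(rows, column, max_null_rate=0.1):
    """Return the observed null rate, or raise if it exceeds the threshold."""
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        raise CircuitOpenError(
            f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
        )
    return rate


def run_pipeline(rows, load):
    """Load rows only if the circuit breaker stays closed."""
    check_circuit(rows, "customer_id")
    load(rows)


loaded = []
good = [{"customer_id": 1}, {"customer_id": 2}]
bad = [{"customer_id": None}, {"customer_id": 3}]

run_pipeline(good, loaded.extend)  # passes the check and is loaded
try:
    run_pipeline(bad, loaded.extend)  # half the ids are missing, so nothing is loaded
    tripped = False
except CircuitOpenError:
    tripped = True
```

Treating an empty batch as fully null is a deliberate choice in this sketch: an unexpectedly empty extract trips the breaker instead of silently passing.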

Storytime for DataOps - Christopher Bergh

2022-04-22 · DataTalks.Club Listen

podcast_episode

by Christopher Bergh (DataKitchen)

Agile/Scrum Analytics Chef Data Science DataOps HTML MLOps

We talked about:

• Christopher’s background
• The essence of DataOps
• Also known as Agile Analytics Operations or DevOps for Data Science
• Defining processes and automating them (defining “done” and “good”)
• The balance between heroism and fear (avoiding deferred value)
• The Lean approach
• Avoiding silos
• The 7 steps to DataOps
• Wanting to become replaceable
• DataOps is doable
• Testing tools
• DataOps vs MLOps
• The Head Chef at Data Kitchen
• What’s grilling at Data Kitchen?
• The DataOps Cookbook

Links:

• DataOps Manifesto website: https://dataopsmanifesto.org/en/
• DataOps Cookbook: https://dataops.datakitchen.io/pf-cookbook
• Recipes for DataOps Success: https://dataops.datakitchen.io/pf-recipes-for-dataops-success
• DataOps Certification Course: https://info.datakitchen.io/training-certification-dataops-fundamentals
• DataOps Blog: https://datakitchen.io/blog/
• DataOps Maturity Model: https://datakitchen.io/dataops-maturity-model/
• DataOps Webinars: https://datakitchen.io/webinars/

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?

2022-04-16 · Data Engineering Podcast Listen

podcast_episode

by David Aponte, Demetrios Brinkmann (MLOps Community), Tobias Macey

AI/ML CDP Cloud Computing Data Engineering Data Lake Data Management DataOps ETL/ELT Kubernetes Modern Data Stack MLOps SaaS +1 more

Summary Putting machine learning models into production and keeping them there requires investing in well-managed systems to manage the full lifecycle of data cleaning, training, deployment and monitoring. This requires a repeatable and evolvable set of processes to keep it functional. The term MLOps has been coined to encapsulate all of these principles and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.

Announcements

• Hello and welcome to the Data Engineering Podcast, the show about modern data management.
• When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
• This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
• RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
• Your host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer.

Interview

Introduction How did you get involved in the area of data management? Can you describe what MLOps is?

How does it relate to DataOps? DevOps? (is it just another buzzword?)

What is your interest and involvement in the space of MLOps? What are the open and active questions in the MLOps community? Who is responsible for MLOps in an organization?

What is the role of the data engineer in that process?

What are the core capabilities that are necessary to support an "MLOps" workflow? How do the current platform technologies support the adoption of MLOps workflows?

What are the areas that are currently underdeveloped/underserved?

Can you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices? What are some of the common requirements for supporting ML workflows?

What are some of the ways that requirements become bespoke to a given organization or project?

What are the opportunities for standardization or consolidation in the tooling for MLOps?

What are the pieces that are always going to require custom engineering?

What are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen? What are the most interesting, unexpected, or challenging lessons that you

CockroachDB: The Definitive Guide

2022-04-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jesse Seldess, Ben Darnell, Guy Harrison

Cloud Computing Data Modelling SQL cockroachdb data data-engineering relational-databases

Get the lowdown on CockroachDB, the distributed SQL database built to handle the demands of today's data-driven cloud applications. In this hands-on guide, software developers, architects, and DevOps/SRE teams will learn how to use CockroachDB to create applications that scale elastically and provide seamless delivery for end users while remaining indestructible. Teams will also learn how to migrate existing applications to CockroachDB's performant, cloud native data architecture. If you're familiar with distributed systems, you'll quickly discover the benefits of strong data correctness and consistency guarantees as well as optimizations for delivering ultra-low latencies to globally distributed end users.

You'll learn how to:
• Design and build applications for distributed infrastructure, including data modeling and schema design
• Migrate data into CockroachDB
• Read and write data and run ACID transactions across distributed infrastructure
• Plan a CockroachDB deployment for resiliency across single region and multi-region clusters
• Secure, monitor, and optimize your CockroachDB deployment

Data Engineering on Azure

2021-08-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vlad Riscutia

AI/ML Analytics Azure Big Data Cloud Computing Data Engineering Data Governance Data Management Data Modelling Data Quality Microsoft data +1 more

Build a data platform to the industry-leading standards set by Microsoft’s own infrastructure.

In Data Engineering on Azure you will learn how to:
• Pick the right Azure services for different data scenarios
• Manage data inventory
• Implement production quality data modeling, analytics, and machine learning workloads
• Handle data governance
• Use DevOps to increase reliability
• Ingest, store, and distribute data
• Apply best practices for compliance and access control

Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft’s own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning.

About the Technology
Build secure, stable data platforms that can scale to loads of any size. When a project moves from the lab into production, you need confidence that it can stand up to real-world challenges. This book teaches you to design and implement cloud-based data infrastructure that you can easily monitor, scale, and modify.

About the Book
In Data Engineering on Azure you’ll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide includes clear, practical guidance for setting up infrastructure, orchestration, workloads, and governance. As you go, you’ll set up efficient machine learning pipelines, and then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms.

What's Inside
• Data inventory and data governance
• Assure data quality, compliance, and distribution
• Build automated pipelines to increase reliability
• Ingest, store, and distribute data
• Production-quality data modeling, analytics, and machine learning

About the Reader
For data engineers familiar with cloud computing and DevOps.

About the Author
Vlad Riscutia is a software architect at Microsoft.

Quotes
A definitive and complete guide on data engineering, with clear and easy-to-reproduce examples. - Kelum Prabath Senanayake, Echoworx
An all-in-one Azure book, covering all a solutions architect or engineer needs to think about. - Albert Nogués, Danone
A meaningful journey through the Azure ecosystem. You’ll be building pipelines and joining components quickly! - Todd Cook, Appen
A gateway into the world of Azure for machine learning and DevOps engineers. - Krzysztof Kamyczek, Luxoft

Developing Modern Database Applications with PostgreSQL

2021-08-13 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Quan Ha Le, Marcelo Diaz

API Cloud Computing Linux data data-engineering postgresql relational-databases

In "Developing Modern Database Applications with PostgreSQL", you will master the art of building database applications with the highly available and scalable PostgreSQL. Walk through a series of real-world projects that fully explore both the developmental and administrative aspects of PostgreSQL, all tied together through the example of a banking application.

What this Book will help me do
• Set up high-availability PostgreSQL clusters using modern best practices.
• Monitor and tune database performance to handle enterprise-level workloads seamlessly.
• Automate testing and implement test-driven development strategies for robust applications.
• Leverage PostgreSQL along with DevOps pipelines to deploy applications on cloud platforms.
• Develop APIs and geospatial databases using popular tools like PostgREST and PostGIS.

Author(s)
The authors of this book, Quan Ha Le and Marcelo Diaz, are experienced professionals in database technologies and software development. With a passion for PostgreSQL and its applications in modern computing, they bring a wealth of expertise and a practical approach to this book. Their methods focus on real-world applicability, ensuring that readers gain hands-on skills and practical knowledge.

Who is it for?
This book is perfect for database developers, administrators, and architects who want to advance their expertise in PostgreSQL. It is also suitable for software engineers and IT professionals aiming to tackle end-to-end database development projects. A basic knowledge of PostgreSQL and Linux will help you dive into the hands-on projects easily. If you're looking to take your PostgreSQL skills to the next level, this book is for you.

The Definitive Guide to Azure Data Engineering: Modern ELT, DevOps, and Analytics on the Azure Cloud Platform

2021-08-06 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ron C. L'Esteve

Analytics Azure ADF Azure DevOps CI/CD Cloud Computing Cosmos Data Engineering Data Governance Data Lake Databricks ETL/ELT +9 more

Build efficient and scalable batch and real-time data ingestion pipelines, DevOps continuous integration and deployment pipelines, and advanced analytics solutions on the Azure Data Platform. This book teaches you to design and implement robust data engineering solutions using Data Factory, Databricks, Synapse Analytics, Snowflake, Azure SQL database, Stream Analytics, Cosmos database, and Data Lake Storage Gen2. You will learn how to engineer your use of these Azure Data Platform components for optimal performance and scalability. You will also learn to design self-service capabilities to maintain and drive the pipelines and your workloads. The approach in this book is to guide you through a hands-on, scenario-based learning process that will empower you to promote digital innovation best practices while you work through your organization’s projects, challenges, and needs. The clear examples enable you to use this book as a reference and guide for building data engineering solutions in Azure. After reading this book, you will have a far stronger skill set and confidence level in getting hands on with the Azure Data Platform. 
What You Will Learn
• Build dynamic, parameterized ELT data ingestion orchestration pipelines in Azure Data Factory
• Create data ingestion pipelines that integrate control tables for self-service ELT
• Implement a reusable logging framework that can be applied to multiple pipelines
• Integrate Azure Data Factory pipelines with a variety of Azure data sources and tools
• Transform data with Mapping Data Flows in Azure Data Factory
• Apply Azure DevOps continuous integration and deployment practices to your Azure Data Factory pipelines and development SQL databases
• Design and implement real-time streaming and advanced analytics solutions using Databricks, Stream Analytics, and Synapse Analytics
• Get started with a variety of Azure data services through hands-on examples

Who This Book Is For
Data engineers and data architects who are interested in learning architectural and engineering best practices around ELT and ETL on the Azure Data Platform, those who are creating complex Azure data engineering projects and are searching for patterns of success, and aspiring cloud and data professionals involved in data engineering, data governance, continuous integration and deployment of DevOps practices, and advanced analytics who want a full understanding of the many different tools and technologies that Azure Data Platform provides

Build Your Own Data Pipeline - Andreas Kretz

2021-07-02 · DataTalks.Club Listen

podcast_episode

by Andreas Kretz

Cloud Computing Data Engineering Data Science Docker Hadoop HTML

We talked about:

• Andreas’s background
• Why data engineering is becoming more popular
• Who to hire first – a data engineer or a data scientist?
• How can I, as a data scientist, learn to build pipelines?
• Don’t use too many tools
• What is a data pipeline and why do we need it?
• What is ingestion?
• Can just one person build a data pipeline?
• Approaches to building data pipelines for data scientists
• Processing frameworks
• Common setup for data pipelines: car price prediction
• Productionizing the model with the help of a data pipeline
• Scheduling
• Orchestration
• Start simple
• Learning DevOps to implement data pipelines
• How to choose the right tool
• Are Hadoop, Docker, Cloud necessary for a first job/internship?
• Is Hadoop still relevant or necessary?
• Data engineering academy
• How to pick up Cloud skills
• Avoid huge datasets when learning
• Convincing your employer to do data science
• How to find Andreas

Links:

LinkedIn: https://www.linkedin.com/in/andreas-kretz
Data engineering cookbook: https://cookbook.learndataengineering.com/
Course: https://learndataengineering.com/

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Data Pipelines with Apache Airflow

2021-05-09 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Julian de Ruiter (Xebia) , Bas Harenslak (Astronomer)

AI/ML Airflow Cloud Computing Data Management Python Snowflake apache-airflow data data-engineering

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

About the Technology

Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task.

About the Book

Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs.

What's Inside

Build, test, and deploy Airflow pipelines as DAGs
Automate moving and transforming data
Analyze historical datasets using backfilling
Develop custom components
Set up Airflow in production environments

About the Reader

For DevOps, data engineers, machine learning engineers, and sysadmins with intermediate Python skills.

About the Authors

Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies. Bas is also an Airflow committer.

Quotes

An Airflow bible. Useful for all kinds of users, from novice to expert. - Rambabu Posa, Sai Aashika Consultancy
An easy-to-follow exploration of the benefits of orchestrating your data pipeline jobs with Airflow. - Daniel Lamblin, Coupang
The one reference you need to create, author, schedule, and monitor workflows with Apache Airflow. Clear recommendation. - Thorsten Weber, bbv Software Services AG
By far the best resource for Airflow. - Jonathan Wood, LexisNexis

The Grand Vision And Present Reality of DataOps

2021-05-04 · Data Engineering Podcast Listen

podcast_episode

by Kevin Stumpf (Tecton) , Maxime Beauchemin (Preset) , Tobias Macey , Lior Gavish (Monte Carlo)

Airflow BI BigQuery CI/CD Cloud Computing Data Engineering Data Management Data Quality Datafold DataOps dbt DWH +7 more

Summary

The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page, then this conversation is for you.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I’m interviewing Max Beauchemin, Lior Gavish, and Kevin Stumpf about the real world challenges of embracing DataOps practices and systems, and how to keep things secure as you scale

Interview

Introduction
How did you get involved in the area of data management?
Before we get started, can you each give your definition of what "DataOps" means to you?

How does this differ from "business as usual" in the data industry?
What are some of the things that DataOps isn’t (despite what marketers might say)?

What are the biggest difficulties that you have faced in going from concept to production with a workflow or system intended to power self-serve access to other membe

Building Custom Tasks for SQL Server Integration Services: The Power of .NET for ETL for SQL Server 2019 and Beyond

2021-02-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Andy Leonard

Azure ADF Azure DevOps Cloud Computing ETL/ELT Microsoft SQL SSIS data data-engineering etl

Build custom SQL Server Integration Services (SSIS) tasks using Visual Studio Community Edition and C#. Bring all the power of Microsoft .NET to bear on your data integration and ETL processes, and for no added cost over what you’ve already spent on licensing SQL Server. New in this edition is a demonstration deploying a custom SSIS task to the Azure Data Factory (ADF) Azure-SSIS Integration Runtime (IR). All examples in this new edition are implemented in C#. Custom task developers are shown how to implement custom tasks using the widely accepted and default language for .NET development. Why are custom components necessary? Because even though the SSIS catalog of built-in tasks and components is a marvel of engineering, gaps remain in the available functionality. One such gap is a constraint of the built-in SSIS Execute Package Task, which does not allow SSIS developers to select SSIS packages from other projects in the SSIS Catalog. Examples in this book show how to create a custom Execute Catalog Package task that allows SSIS developers to execute tasks from other projects in the SSIS Catalog. Building on the examples and patterns in this book, SSIS developers may create any task to which they aspire, custom tailored to their specific data integration and ETL needs.

What You Will Learn

Configure and execute Visual Studio in the way that best supports SSIS task development
Create a class library as the basis for an SSIS task, and reference the needed SSIS assemblies
Properly sign assemblies that you create in order to invoke them from your task
Implement source code control via Azure DevOps, or your own favorite tool set
Troubleshoot and execute custom tasks as part of your own projects
Create deployment projects (MSIs) for distributing code-complete tasks
Deploy custom tasks to Azure Data Factory Azure-SSIS IRs in the cloud
Create advanced editors for custom task parameters

Who This Book Is For

For database administrators and developers who are involved in ETL projects built around SQL Server Integration Services (SSIS). Readers do not need a background in software development with C#. Most important is a desire to optimize ETL efforts by creating custom-tailored tasks for execution in SSIS packages, on-premises or in ADF Azure-SSIS IRs.

talk-data.com
