talk-data.com

Topic

Cloud Computing

infrastructure saas iaas

4055 tagged

Activity Trend: 471 peak/qtr, 2020-Q1 to 2026-Q1

Activities

4055 activities · Newest first

Summary

Putting machine learning models into production and keeping them there requires investing in well-managed systems that handle the full lifecycle of data cleaning, training, deployment, and monitoring. This requires a repeatable and evolvable set of processes to keep those systems functional. The term MLOps has been coined to encapsulate all of these principles, and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

Your host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what MLOps is?

How does it relate to DataOps? DevOps? (is it just another buzzword?)

What is your interest and involvement in the space of MLOps?
What are the open and active questions in the MLOps community?
Who is responsible for MLOps in an organization?

What is the role of the data engineer in that process?

What are the core capabilities that are necessary to support an "MLOps" workflow?
How do the current platform technologies support the adoption of MLOps workflows?

What are the areas that are currently underdeveloped/underserved?

Can you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices?
What are some of the common requirements for supporting ML workflows?

What are some of the ways that requirements become bespoke to a given organization or project?

What are the opportunities for standardization or consolidation in the tooling for MLOps?

What are the pieces that are always going to require custom engineering?

What are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen?
What are the most interesting, unexpected, or challenging lessons that you

Visualizing Google Cloud

Easy-to-follow visual walkthrough of every important part of the Google Cloud Platform

The Google Cloud Platform incorporates dozens of specialized services that enable organizations to offload technological needs onto the cloud. From routine IT operations like storage to sophisticated new capabilities including artificial intelligence and machine learning, the Google Cloud Platform offers enterprises the opportunity to scale and grow efficiently.

In Visualizing Google Cloud: Illustrated References for Cloud Engineers & Architects, Google Cloud expert Priyanka Vergadia delivers a fully illustrated, visual guide to matching the best Google Cloud Platform services to your own unique use cases. After a brief introduction to the major categories of cloud services offered by Google, the author offers approximately 100 solutions divided into eight categories of services included in Google Cloud Platform:

Compute
Storage
Databases
Data Analytics
Data Science, Machine Learning and Artificial Intelligence
Application Development and Modernization with Containers
Networking
Security

You’ll find richly illustrated flowcharts and decision diagrams with straightforward explanations in each category, making it easy to adopt and adapt Google’s cloud services to your use cases.

With coverage of the major categories of cloud models (including infrastructure-, containers-, platforms-, functions-, and serverless) and discussions of storage types, databases, and machine learning choices, Visualizing Google Cloud: Illustrated References for Cloud Engineers & Architects is perfect for every Google Cloud enthusiast, of course. It is for anyone who is planning a cloud migration or new cloud deployment. It is for anyone preparing for cloud certification, and for anyone looking to make the most of Google Cloud. It is for cloud solutions architects, IT decision-makers, and cloud data and ML engineers. In short, this book is for YOU.

CockroachDB: The Definitive Guide

Get the lowdown on CockroachDB, the distributed SQL database built to handle the demands of today's data-driven cloud applications. In this hands-on guide, software developers, architects, and DevOps/SRE teams will learn how to use CockroachDB to create applications that scale elastically and provide seamless delivery for end users while remaining indestructible. Teams will also learn how to migrate existing applications to CockroachDB's performant, cloud native data architecture. If you're familiar with distributed systems, you'll quickly discover the benefits of strong data correctness and consistency guarantees as well as optimizations for delivering ultra low latencies to globally distributed end users.

You'll learn how to:

Design and build applications for distributed infrastructure, including data modeling and schema design
Migrate data into CockroachDB
Read and write data and run ACID transactions across distributed infrastructure
Plan a CockroachDB deployment for resiliency across single region and multi-region clusters
Secure, monitor, and optimize your CockroachDB deployment

Summary

Any time that you are storing data about people, there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!

Your host is Tobias Macey and today I’m interviewing John Myers about privacy engineering and use cases for synthetic data.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Gretel is and the story behind it?
How do you define "privacy engineering"?

In an organization or data team, who is typically responsible for privacy engineering?

How would you characterize the current state of the art and adoption for privacy engineering?
Who are the target users of Gretel and how does that inform the features and design of the product?
What are the stages of the data lifecycle where Gretel is used?
Can you describe a typical workflow for integrating Gretel into data pipelines for business analytics or ML model training?
How is the Gretel platform implemented?

How have the design and goals of the system changed or evolved since you started working on it?

What are some of the nuances of synthetic data generation or masking that data engineers/data analysts need to be aware of as they start using Gretel?
What are the most interesting, innovative, or unexpected ways that you have seen Gretel used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gretel?
When is Gretel the wrong choice?
What do you have planned for the future of Gretel?

Contact Info

LinkedIn
@jtm_tech on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Gretel
Privacy Engineering
Weights and Biases
Red Team/Blue Team
Generative Adversarial Network
Capture The Flag in application security
CVE == Common Vulnerabilities and Exposures
Machine Learning Cold Start Problem
Faker
Mockaroo
Kaggle
Sentry

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Logging in Action

Make log processing a real asset to your organization with powerful and free open source tools.

In Logging in Action you will learn how to:

Deploy Fluentd and Fluent Bit into traditional on-premises, IoT, hybrid, cloud, and multi-cloud environments, both small and hyperscaled
Configure Fluentd and Fluent Bit to solve common log management problems
Use Fluentd within Kubernetes and Docker services
Connect a custom log source or destination with Fluentd’s extensible plugin framework
Apply logging best practices and avoid common pitfalls

Logging in Action is a guide to optimizing and organizing logging using the CNCF Fluentd and Fluent Bit projects. You’ll use the powerful log management tool Fluentd to solve common log management problems, and learn how proper log management can improve performance and make management of software and infrastructure solutions easier. Through useful examples like sending log-driven events to Slack, you’ll get hands-on experience applying structure to your unstructured data.

About the Technology

Don’t fly blind! An effective logging system can help you see and correct problems before they cripple your software. With the Fluentd log management tool, it’s a snap to monitor the behavior and health of your software and infrastructure in real time. Designed to collect and process log data from multiple sources using the industry-standard JSON format, Fluentd delivers a truly unified logging layer across all your systems.

About the Book

Logging in Action teaches you to record and analyze application and infrastructure data using Fluentd. Using clear, relevant examples, it shows you exactly how to transform raw system data into a unified stream of actionable information. You’ll discover how logging configuration impacts the way your system functions and set up Fluentd to handle data from legacy IT environments, local data centers, and massive Kubernetes-driven distributed systems. You’ll even learn how to implement complex log parsing with RegEx and output events to MongoDB and Slack.

What's Inside

Capture log events from a wide range of systems and software, including Kubernetes and Docker
Connect to custom log sources and destinations
Employ Fluentd’s extensible plugin framework
Create a custom plugin for niche problems

About the Reader

For developers, architects, and operations professionals familiar with the basics of monitoring and logging.

About the Author

Phil Wilkins has spent over 30 years in the software industry. He has worked for organizations ranging from small startups to international brands.

Quotes

I highly recommend using Logging in Action as a getting-started guide, a refresher, or as a way to optimize your logging journey. - From the Foreword by Anurag Gupta, Fluent maintainer and Cofounder, Calyptia

Covers everything you need if you want to implement a logging system using open source technology such as Fluentd and Kubernetes. - Alex Saez, Naranja X

A great exploration of the features and capabilities of Fluentd, along with very useful hands-on exercises. - George Thomas, Manhattan Associates

A practical holistic guide to integrating logging into your enterprise architecture. - Satej Sahu, Honeywell

In this episode of SaaS Scaled, we’re talking to Brian Dreyer, VP of Product Management at SightCall. Brian is here to talk about his experience in SaaS product management, share what he’s learned over the years, and tell us how things have changed. Brian talks about how he would do product management today if he had to start a company from scratch, and why. We talk about how to successfully pivot and restart products and the challenges involved. Brian also mentions how SaaS has changed over the last couple of decades and the new challenges that have arisen. We also dive into how the relationship between product and marketing has changed over the years, and Brian talks about how cloud computing has evolved and where it’s headed. Finally, he shares some recommendations for further reading for anyone interested in SaaS product management.

This episode is brought to you by Qrvey: the tools you need to take action with your data, on a platform built for maximum scalability, security, and cost efficiencies. If you’re ready to reduce complexity and dramatically lower costs, contact us today at qrvey.com. Qrvey, the modern no-code analytics solution for SaaS companies on AWS.

Data Engineering with Google Cloud Platform

In 'Data Engineering with Google Cloud Platform', you'll explore how to construct efficient, scalable data pipelines using GCP services. This hands-on guide covers everything from building data warehouses to deploying machine learning pipelines, helping you master GCP's ecosystem.

What this Book will help me do

Build comprehensive data ingestion and transformation pipelines using BigQuery, Cloud Storage, and Dataflow.
Design end-to-end orchestration flows with Airflow and Cloud Composer for automated data processing.
Leverage Pub/Sub for building real-time event-driven systems and streaming architectures.
Gain skills to design and manage secure data systems with IAM and governance strategies.
Prepare for and pass the Professional Data Engineer certification exam to elevate your career.

Author(s)

Adi Wijaya is a seasoned data engineer with significant experience in Google Cloud Platform products and services. His expertise in building data systems has equipped him with insights into the real-world challenges data engineers face. Adi aims to demystify technical topics and deliver practical knowledge through his writing, helping tech professionals excel.

Who is it for?

This book is tailored for data engineers and data analysts who want to leverage GCP for building efficient and scalable data systems. Readers should have a beginner-level understanding of topics like data science, Python, and Linux to fully benefit from the material. It is also suitable for individuals preparing for the Google Professional Data Engineer exam. The book is a practical companion for enhancing cloud and data engineering skills.

PostgreSQL 14 Administration Cookbook

PostgreSQL 14 Administration Cookbook provides a hands-on guide to mastering the administration of PostgreSQL 14. With over 175 recipes, this book equips you with practical techniques to manage, secure, and optimize your PostgreSQL databases, ensuring they are robust and high-performing.

What this Book will help me do

Master managing PostgreSQL databases both on-premises and in the cloud efficiently.
Implement effective backup and recovery strategies to secure your data.
Leverage the latest features of PostgreSQL 14 to enhance your database workflows.
Understand and apply best practices for maintaining high availability and performance.
Troubleshoot real-world challenges with guided solutions and expert insights.

Author(s)

Simon Riggs and Gianni Ciolli are seasoned database experts with years of experience working with PostgreSQL. Simon is a PostgreSQL core team member, contributing his technical knowledge towards building robust database solutions, while Gianni brings a wealth of expertise in database administration and support. Together, they share a passion for making complex database concepts accessible and actionable.

Who is it for?

This book is for database administrators, data architects, and developers who manage PostgreSQL databases and are looking to deepen their knowledge. It is suitable for professionals with some experience in PostgreSQL who aim to maximize their database's performance and security, as well as for those new to the system seeking a comprehensive start. Readers with an interest in practical, problem-solving approaches to database management will greatly benefit from this cookbook.

Data Science on the Google Cloud Platform, 2nd Edition

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build using Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline with cloud native tools on GCP. Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.

You'll learn how to:

Employ best practices in building highly scalable data and ML pipelines on Google Cloud
Automate and schedule data ingest using Cloud Run
Create and populate a dashboard in Data Studio
Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
Conduct interactive data exploration with BigQuery
Create a Bayesian model with Spark on Cloud Dataproc
Forecast time series and do anomaly detection with BigQuery ML
Aggregate within time windows with Dataflow
Train explainable machine learning models with Vertex AI
Operationalize ML with Vertex AI Pipelines

Summary

Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds stress to an already complex endeavor. Privacera is an enterprise-grade solution for cloud and hybrid data governance built on top of the robust and battle-tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog.

Your host is Tobias Macey and today I’m interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Privacera is and the story behind it?
What is your working definition of "data governance" and how does that influence your product focus and priorities?
What are some of the lessons that you learned from your work on Apache Ranger that helped with your efforts at Privacera?
How would you characterize your position in the market for data governance/data security tools?
What are the unique constraints and challenges that come into play when managing data in cloud platforms?
Can you explain how the Privacera platform is architected?

How have the design and goals of the system changed or evolved since you started working on it?

What is the workflow for an operator integrating Privacera into a data platform?

How do you provide feedback to users about the level of coverage for discovered data assets?

How does Privacera fit into the workflow of the different personas working with data?

What are some of the security and privacy controls that Privacera introduces?

How do you mitigate the potential for anyone to bypass Privacera’s controls by interacting directly with the underlying systems?
What are the most interesting, innovative, or unexpected ways that you have seen Privacera used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacera?
When is Privacera the wrong choice?
What do you have planned for the future of Privacera?

Contact Info

LinkedIn
@Balaji_Blog on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Privacera
Hadoop
Hortonworks
Apache Ranger
Oracle
Teradata
Presto/Trino
Starburst (Podcast Episode)
Ahana (Podcast Episode)

Sponsored By: Acryl

The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at dataengineeringpodcast.com/acryl

Support Data Engineering Podcast

Leading Data Science Teams

Compared to other functions of an organization, data science is highly speculative. Data science teams are often tasked with last-minute must-have deliverables that are well beyond their ability to produce. Data might be missing or have no signal, or the data models themselves might be impractical. This hands-on reference guides team leaders through the types of challenges you might face and the tools you need to work through them. Author Jacqueline Nolis, head of data science at Saturn Cloud, helps team leaders think through the various issues you'll encounter when running a data science team. You'll learn ways to set up your team, manage data scientists to promote their success, and collaborate with external stakeholders. Once you finish this report, you'll be ready to work through the challenges your current team faces or start a new data science team in an organization that needs one.

Determine the scope of work before choosing your team of data scientists and support positions
Successfully manage your relationship with stakeholders by providing your team with clear, achievable goals
Create an environment to help data scientists and other team members succeed
Choose a technical infrastructure for your team, including programming languages, databases, and deployment models

Simplify Big Data Analytics with Amazon EMR

Simplify Big Data Analytics with Amazon EMR is a thorough guide to harnessing Amazon's EMR service for big data processing and analytics. From distributed computation pipelines to real-time streaming analytics, this book provides hands-on knowledge and actionable steps for implementing data solutions efficiently.

What this Book will help me do

Understand the architecture and key components of Amazon EMR and how to deploy it effectively.
Learn to configure and manage distributed data processing pipelines using Amazon EMR.
Implement security and data governance best practices within the Amazon EMR ecosystem.
Master batch ETL and real-time analytics techniques using technologies like Apache Spark.
Apply optimization and cost-saving strategies to scalable data solutions.

Author(s)

Sakti Mishra is a seasoned data professional with extensive expertise in deploying scalable analytics solutions on cloud platforms like AWS. With a background in big data technologies and a passion for teaching, Sakti ensures practical insights accompany every concept. Readers will find his approach thorough, hands-on, and highly informative.

Who is it for?

This book is perfect for data engineers, data scientists, and other professionals looking to leverage Amazon EMR for scalable analytics. If you are familiar with Python, Scala, or Java and have some exposure to Hadoop or AWS ecosystems, this book will empower you to design and implement robust data pipelines efficiently.

Summary

Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Your host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing some of the ways that an "incident" can manifest in a data system?

At a high level, what are the steps and participants required to bring an incident to resolution?

The principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams?
What are the signals that teams should be monitoring to identify and alert on potential incidents?

Alerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue

Reproducible Data Science with Pachyderm

Dive into the world of reproducible data science with Pachyderm, a specialized platform designed for version-controlled data pipelines. By following this book, 'Reproducible Data Science with Pachyderm,' you'll gain the skills to implement robust, scalable machine learning workflows with Pachyderm 2.0, covering setup, integration, and advanced use cases.

What this Book will help me do
Build scalable, version-controlled data pipelines with Pachyderm's unique features.
Understand the principles behind reproducible data science and implement them effectively.
Deploy Pachyderm on AWS, Google Cloud, and Azure while integrating with popular tools.
Create and manage end-to-end machine learning workflows, including hyperparameter tuning.
Leverage advanced integrations, such as Pachyderm Notebooks and language clients like Python and Go.

Author(s)
Svetlana Karslioglu is a seasoned data scientist with extensive experience in constructing scalable machine learning and data processing systems. With years in both practical implementation and educational endeavors, she has a talent for breaking down complex concepts into accessible learning paths. Her approach is hands-on and results-oriented, aimed at empowering professionals to excel in the field of data science.

Who is it for?
This book is intended for data scientists, machine learning engineers, and data engineers who are keen to ensure reproducibility in their workflows. Ideal readers may have familiarity with data science basics and some exposure to Kubernetes and programming languages like Python. By studying the book, learners will establish confidence in implementing Pachyderm for scalable and reliable data pipelines.

Data Lakehouse in Action

"Data Lakehouse in Action" provides a comprehensive exploration of the Data Lakehouse architecture, a modern solution for scalable and effective large-scale analytics. This book guides you through understanding the principles and components of the architecture, and its implementation using cloud platforms like Azure. Learn the practical techniques for designing robust systems tailored to organizational needs and maturity.

What this Book will help me do
Understand the evolution and need for modern data architecture patterns like Data Lakehouse.
Learn how to design systems for data ingestion, storage, processing, and serving in a Data Lakehouse.
Develop best practices for data governance and security in the Data Lakehouse architecture.
Discover various analytics workflows enabled by the Data Lakehouse, including real-time and batch approaches.
Implement practical Data Lakehouse patterns on a cloud platform, and integrate them with macro-patterns such as Data Mesh.

Author(s)
Pradeep Menon is a seasoned data architect and engineer with extensive experience implementing data analytics solutions for leading companies. With a penchant for simplifying complex architectures, Pradeep has authored several technical publications and frequently shares his expertise at industry conferences. His hands-on approach and passion for teaching shine through in his practical guides.

Who is it for?
This book is ideal for data professionals including architects, engineers, and data strategists eager to enhance their knowledge in modern analytics platforms. If you have a basic understanding of data architecture and are curious about implementing systems governed by the Data Lakehouse paradigm, this book is for you. It bridges foundational concepts with advanced practices, making it suitable for learners aiming to contribute effectively to their organization's analytics efforts.

IBM Spectrum Virtualize, IBM FlashSystem, and IBM SAN Volume Controller Security Feature Checklist

IBM Spectrum® Virtualize based storage systems are secure storage platforms that implement various security-related features, in terms of system-level access controls and data-level security features. This document outlines the available security features and options of IBM Spectrum Virtualize based storage systems. It is not intended as a "how to" or best practice document. Instead, it is a checklist of features that can be reviewed by a user security team to aid in the definition of a policy to be followed when implementing IBM FlashSystem®, IBM SAN Volume Controller, and IBM Spectrum Virtualize for Public Cloud. The topics that are discussed in this paper can be broadly split into two categories:

System security: This type of security encompasses the first three lines of defense that prevent unauthorized access to the system, protect the logical configuration of the storage system, and restrict what actions users can perform. It also ensures visibility and reporting of system level events that can be used by a Security Information and Event Management (SIEM) solution, such as IBM QRadar®.

Data security: This type of security encompasses the fourth line of defense. It protects the data that is stored on the system against theft, loss, or attack. These data security features include, but are not limited to, encryption of data at rest (EDAR) or IBM Safeguarded Copy (SGC).

This document is correct as of IBM Spectrum Virtualize version 8.5.0.

Data Analysis with Python and PySpark

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:
Manage your data as it scales across multiple machines
Scale up your data programs with full confidence
Read and write data to and from a variety of sources and formats
Deal with messy data with PySpark’s data manipulation functionality
Discover new data sets and perform exploratory data analysis
Build automated data pipelines that transform, summarize, and get insights from data
Troubleshoot common PySpark errors
Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.

About the Technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.

About the Book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source, whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.

What's Inside
Organizing your PySpark code
Managing your data, no matter the size
Scale up your data programs with full confidence
Troubleshooting common data pipeline problems
Creating reliable long-running jobs

About the Reader
Written for data scientists and data engineers comfortable with Python.

About the Author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Quotes
A clear and in-depth introduction for truly tackling big data with Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine
The perfect way to learn how to analyze and master huge datasets. - Gary Bake, Brambles
Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on. - Philippe Van Bergen, P² Consulting
For beginner to pro, a well-written book to help understand PySpark. - Raushan Kumar Jha, Microsoft

Summary

The modern data stack is a constantly moving target, which makes it difficult to adopt without prior experience. To help organizations of all sizes that want to take advantage of these new and evolving architectures deliver useful insights sooner, Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training so that they can realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices (git, tests and continuous deployment) with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about how he and his team are helping organizations streamline adoption of the modern data stack.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what you are doing at 5X and the story behind it?
How has your focus and operating model shifted since we spoke a year ago?

What are the biggest shifts in the market for data management that you have seen in that time?

What are the main challenges that your customers are facing when they start working with you?
What are the components that you are relying on to build repeatable data platforms for your customers?

What are the sharp edges that you have had to smooth out to scale your implementation of those

Summary

Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
Your host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure.

Interview

Introduction
How did you get involved in the area of data?
Can you describe what Acceldata is and the story behind it?
What does it mean for a data observability platform to be "multidimensional"?
How do the architectural characteristics of the "modern data stack" influence the requirements and implementation of data observability strategies?
The data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports?
Who are your target users and how does that focus influence the way that you have approached feature and design priorities?
What are some of the ways that you are using the Acceldata platform to run Acceldata?
Can you describe how the Acceldata platform is implemented?

How have the design and goals of the system changed or evolved since you started working on it?

How are you man