talk-data.com

Topic

Cloud Computing

infrastructure saas iaas

4055 tagged

Activity Trend: 471 peak/qtr (2020-Q1 to 2026-Q1)

Activities

4055 activities · Newest first

Summary Businesses often need to ingest data from their customers in order to power the services that they provide. Each new source they need to integrate with adds another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations, David Molot and Hassan Syyid built the Hotglue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self service for end users.
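The links for this episode reference the Singer specification, a common foundation for this style of tooling. As a rough, hypothetical sketch of the connector pattern that a platform like this abstracts away, here is a minimal Singer-style tap in Python: the stream name, schema, and records are all invented for illustration, and a real tap would use the singer-python library and handle catalogs, state, and incremental extraction.

```python
import json
import sys

# Hypothetical example records; a real tap would pull these from a source API.
ORDERS = [
    {"id": 1, "customer": "acme", "total": 99.50},
    {"id": 2, "customer": "globex", "total": 12.00},
]

def emit(message):
    # Singer taps communicate by writing JSON messages to stdout,
    # one per line, for a target process to consume.
    sys.stdout.write(json.dumps(message) + "\n")

def main():
    # Describe the stream's shape before sending any records.
    emit({
        "type": "SCHEMA",
        "stream": "orders",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "customer": {"type": "string"},
                "total": {"type": "number"},
            },
        },
        "key_properties": ["id"],
    })
    for order in ORDERS:
        emit({"type": "RECORD", "stream": "orders", "record": order})

if __name__ == "__main__":
    main()
```

Piping this script's stdout into any Singer-compatible target would complete a minimal source-to-destination pipeline; managing many such pipelines per customer is the kind of work described in the episode.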

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
Your host is Tobias Macey and today I’m interviewing David Molot and Hassan Syyid about Hotglue, an embeddable data integration tool for B2B developers built on the Python ecosystem.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Hotglue?

What was your motivation for starting a business to address this particular problem?

Who is the target user of Hotglue and what are their biggest data problems?

What are the types and sources of data that they are likely to be working with?
How are they currently handling solutions for those problems?
How does the introduction of Hotglue simplify or improve their work?

What is involved in getting Hotglue integrated into a given customer’s environment?
How is Hotglue itself implemented?

How has the design or goals of the platform evolved since you first began building it?
What were some of the initial assumptions that you had at the outset and how well have they held up as you progressed?

Once a customer has set up Hotglue, what is their process for building and executing an ETL workflow?

What are their options for working with sources that aren’t supported out of the box?

What are the biggest design and implementation challenges that you are facing given the need for your product to be embedded in customer platforms and exposed to their end users?
What are some of the most interesting, innovative, or unexpected ways that you have seen Hotglue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Hotglue?
When is Hotglue the wrong choice?
What do you have planned for the future of the product?

Contact Info

David

@davidmolot on Twitter
LinkedIn

Hassan

hsyyid on GitHub
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Hotglue
Python

The Python Podcast.init

B2B == Business to Business
Meltano

Podcast Episode

Airbyte
Singer

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

IBM Power Systems H922 and H924 Technical Overview and Introduction

This IBM® Redpaper™ publication is a comprehensive guide that covers the IBM Power System H922 (9223-22S) and IBM Power System H924 (9223-42S) servers that support memory-intensive workloads, such as SAP HANA, and deliver superior price and performance for mission-critical applications in IBM AIX®, IBM i, and Linux® operating systems. The goal of this paper is to provide a hardware architecture analysis and highlight the changes, new technologies, and major features that are being introduced in these systems' 2020 release, such as the following examples:
Availability of new IBM POWER9™ processor configurations for the number of cores per socket.
More performance by using industry-leading IBM Peripheral Component Interconnect® Express (PCIe) Gen4 slots.
Enhanced internal disk configuration options, with up to 14 NVMe adapters (four U.2 NVMe plus up to 10 PCIe add-in cards).
Twice as fast back-end I/O enables seamless maximum speed and throughput between on-premises and multiple public cloud infrastructures with high availability (HA).
This publication is for professionals who want to acquire a better understanding of IBM Power Systems products. The intended audience includes the following roles:
Clients
Sales and marketing professionals
Technical support professionals
IBM Business Partners
Independent software vendors (ISVs)
This paper expands the current set of IBM Power Systems documentation by providing a desktop reference that offers a detailed technical description of the Power H922 and Power H924 systems.

Summary The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.
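As a hedged sketch of the synchronization pattern described in this summary (not Hightouch's actual implementation), the loop below reads modeled customer rows out of a warehouse and upserts them into a third-party tool over HTTP. The endpoint URL, table name, and fields are hypothetical, and an in-memory SQLite table stands in for a real warehouse connection.

```python
import sqlite3
import requests

CRM_UPSERT_URL = "https://crm.example.com/api/contacts/upsert"  # hypothetical endpoint

def sync_contacts(conn, session=None):
    # Pull the modeled customer rows out of the warehouse...
    rows = conn.execute(
        "SELECT email, plan, lifetime_value FROM dim_customers"
    ).fetchall()
    # ...and push each one to the external tool, keyed on email so the
    # sync is idempotent and safe to re-run on a schedule.
    for email, plan, ltv in rows:
        payload = {"email": email, "plan": plan, "lifetime_value": ltv}
        if session is None:
            print("would upsert:", payload)  # dry run for this sketch
        else:
            session.post(CRM_UPSERT_URL, json=payload, timeout=10).raise_for_status()

if __name__ == "__main__":
    # An in-memory SQLite table stands in for the real warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customers (email TEXT, plan TEXT, lifetime_value REAL)")
    conn.execute("INSERT INTO dim_customers VALUES ('ada@example.com', 'pro', 1290.0)")
    sync_contacts(conn)  # pass requests.Session() to actually send
```

The hard parts discussed in the episode, such as coping with each vendor's API quirks, rate limits, and only sending changed rows, would all live inside this loop.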

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
Your host is Tobias Macey and today I’m interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Hightouch and your motivation for creating it?
What are the main points of friction for teams who are trying to make use of customer data?
Where is Hightouch positioned in the ecosystem of customer data tools such as Segment, Mixpanel

Modernizing Applications with IBM CICS

IBM® CICS® is a mixed language application server that runs on IBM Z®. Over the 50 years since CICS was introduced in 1969, enterprises have used the qualities of service (QoSs) that CICS provides to create high throughput and secure transactional applications that have powered their business. As the IT landscape has evolved, so has CICS, allowing these applications to integrate with new platforms and still provide value to the rest of the business. Because of this capability, many businesses still rely on CICS to power their core applications.

This IBM Redpaper publication focuses on modernizing these CICS applications, allowing them to integrate with cloud-native applications. This modernization can be achieved by constructing application programming interfaces (APIs) that allow new cloud-native applications to connect to your existing assets, by rewriting parts of your application in newer languages and hosting them back on CICS, or by using CICS capabilities to extend your applications to provide new capabilities and functions.

The paper takes a traditional example application and shows you how it works. Then, the paper extends the example, rewrites portions of its functions, and enables its APIs. It also explains how CICS applications can use continuous integration (CI) and continuous delivery (CD) to deliver, test, and deploy code into CICS easily and with quality.

Beginning T-SQL: A Step-by-Step Approach

Get a performance-oriented introduction to the T-SQL language underlying the Microsoft SQL Server and Azure SQL database engines. This fourth edition is updated to include SQL Notebooks as well as up-to-date syntax and features for T-SQL on-premises and in the Azure cloud. Exercises and examples now include the WideWorldImporters database, the newest sample database from Microsoft for SQL Server. Also new in this edition is coverage of JSON from T-SQL, news about performance enhancements called Intelligent Query Processing, and an appendix on running SQL Server in a container on macOS or Linux.

Beginning T-SQL starts you on the path to mastering T-SQL with an emphasis on best practices. Using the sound coding techniques taught in this book will lead to excellent performance in the queries that you write in your daily work. Important techniques such as windowing functions are covered to help you write fast-executing queries that solve real business problems. The book begins with an introduction to databases, normalization, and setting up your learning environment. You will learn about the tools you need to use such as SQL Server Management Studio, Azure Data Studio, and SQL Notebooks. Each subsequent chapter teaches an aspect of T-SQL, building on the skills learned in previous chapters. Exercises in most chapters provide an opportunity for the hands-on practice that leads to true learning and distinguishes the competent professional. A stand-out feature in this book is that most chapters end with a Thinking About Performance section. These sections cover aspects of query performance relative to the content just presented, including the new Intelligent Query Processing features that make queries faster without changing code. They will help you avoid beginner mistakes by knowing about and thinking about performance from day 1.

What You Will Learn
Install a sandboxed SQL Server instance for learning
Understand how relational databases are designed
Create objects such as tables and stored procedures
Query a SQL Server table
Filter and order the results of a query
Query and work with specialized data types such as XML and JSON
Apply modern features such as window functions
Choose correct techniques so that your queries perform well

Who This Book Is For
Anyone who wants to learn T-SQL from the beginning or improve their T-SQL skills; those who need T-SQL as an additional skill; and those who write queries such as application developers, database administrators, business intelligence developers, and data scientists. The book is also helpful for anyone who must retrieve data from a SQL Server database.
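To give a concrete taste of the windowing functions the book emphasizes, here is a small self-contained example. It runs standard SQL window queries through SQLite (3.25 or newer) from Python purely so it can execute anywhere; the same OVER clause syntax carries over to T-SQL on SQL Server and Azure SQL. The table and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('ann', 100), ('ann', 250), ('bob', 75), ('bob', 300);
""")

# Rank each sale within its rep's sales, and compute a running total per rep.
# The OVER (PARTITION BY ... ORDER BY ...) syntax is the same in T-SQL.
query = """
    SELECT rep,
           amount,
           RANK() OVER (PARTITION BY rep ORDER BY amount DESC) AS rnk,
           SUM(amount) OVER (PARTITION BY rep ORDER BY amount) AS running_total
    FROM sales
"""
for row in conn.execute(query):
    print(row)
```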

MongoDB Fundamentals

This book, "MongoDB Fundamentals", is the ideal hands-on guide to learning MongoDB. By starting from the basics of NoSQL databases and progressing to cloud integration using MongoDB Atlas, you will gain practical experience managing, querying, and visualizing data effectively for real-world applications. What this Book will help me do Set up and manage a MongoDB database with both local and cloud environments. Master querying and modifying data using the aggregation framework for complex operations. Implement effective database architecture with replication and sharding techniques. Ensure data security and resilience through user management and efficient backup/restore methods. Visualize data insights through dynamic reports and charts using MongoDB Charts. Author(s) Amit Phaltankar, Juned Ahsan, Michael Harrison, and Liviu Nedov are seasoned professionals in the field of database management systems, each bringing extensive experience working with MongoDB and cloud technologies. They excel at translating technical concepts into accessible, actionable insights, and have a passion for enabling IT professionals to create high-performance database solutions. Who is it for? "MongoDB Fundamentals" is tailored for developers, database administrators, system administrators, and cloud architects who are new to MongoDB but are looking to integrate it into their data processing workflows. It's perfect for those who aim to enhance their skills in handling data within cloud computing environments and have some basic programming or database experience.

IBM Storage for Red Hat OpenShift Blueprint

This IBM® Blueprint is intended to facilitate the deployment of IBM Storage for Red Hat OpenShift Container Platform by using detailed hardware specifications to build a system. It describes the associated parameters for configuring persistent storage within a Red Hat OpenShift Container Platform environment. To complete the tasks, you must understand Red Hat OpenShift, IBM Storage, the IBM block storage Container Storage Interface (CSI) driver, and the IBM Spectrum Scale CSI driver. The information in this document is distributed on an "as is" basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Storwize® or IBM FlashSystem® storage devices, Enterprise Storage Server®, and IBM Spectrum® Scale are supported and entitled, and where the issues are not specific to a blueprint implementation. IBM Storage Suite for IBM Cloud® Paks is an offering bundle that includes software-defined storage from IBM and Red Hat. Use this document for more information about how to deploy IBM Storage product licenses that are obtained through Storage Suite for Cloud Paks (IBM Spectrum Virtualize and IBM Spectrum Scale).

Kimball in the context of the modern data warehouse: what's worth keeping, and what's not

Dimensional modeling, as described in the Kimball Toolkit, was in its 3rd edition 15 years ago, yet it is still the latest word in data modeling advice. So much is different in cloud warehouses that many of those best practices are now bad practices. In this video Dave Fowler, the founder of Chartio and author of Cloud Data Management, goes over what no longer applies, and what does.
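For context on what is being debated: a dimensional model pairs a central fact table of measurements with descriptive dimension tables. The illustrative sketch below builds a tiny star schema in SQLite from Python and runs the classic Kimball-style query shape; all table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A narrow fact table of measurements, keyed to descriptive dimensions.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);

    INSERT INTO dim_date VALUES (1, '2021-01'), (2, '2021-02');
    INSERT INTO dim_product VALUES (10, 'books'), (11, 'games');
    INSERT INTO fact_sales VALUES (1, 10, 25.0), (1, 11, 40.0), (2, 10, 30.0);
""")

# The classic dimensional query shape: join facts to dimensions,
# group by dimension attributes, aggregate the measures.
query = """
    SELECT d.month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
"""
for row in conn.execute(query):
    print(row)
```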

Summary As a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData they rely on a variety of alternative data sources to inform investment decisions by hedge funds and businesses. In this episode Andrew Gross, Bobby Muldoon, and Anup Segu describe the self service data platform that they have built to allow data analysts to own the end-to-end delivery of data projects and how that has allowed them to scale their output. They share the journey that they went through to build a scalable and maintainable system for web scraping, how to make it reliable and resilient to errors, and the lessons that they learned in the process. This was a great conversation about real world experiences in building a successful data-oriented business.
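As a hedged sketch of the reliability concerns mentioned here (not YipitData's actual code), the snippet below configures HTTP retries with exponential backoff using the requests and urllib3 libraries, so one flaky response does not fail an entire scrape run. The URL is a placeholder.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # Retry transient failures with exponential backoff instead of
    # letting a single flaky response kill the whole scrape run.
    retry = Retry(
        total=5,
        backoff_factor=1.0,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

if __name__ == "__main__":
    session = make_session()
    # Placeholder URL; a real pipeline would iterate over many pages
    # and persist raw responses for reproducible downstream parsing.
    resp = session.get("https://example.com/", timeout=30)
    resp.raise_for_status()
    print(len(resp.text), "bytes fetched")
```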

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Andrew Gross, Bobby Muldoon, and Anup Segu about how they are building pipelines at YipitData.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what YipitData does?
What kinds of data sources and data assets are you working with?
What is the composition of your data teams and how are they structured?
Given the use of your data products in the financial sector, how do you handle monitoring and alerting around data quality

Beyond the Modern Data Stack: dbt Cloud Metadata in Mode

In this video, President and Founder of Mode, Benn Stancil discusses new ways to align the optimal application boundaries in the modern data stack, providing a set of guidelines for determining how and where to draw the lines between your many tools. He also motivates these boundaries with an example, demonstrating how metadata surfaced in an analytics tool like Mode can increase overall data confidence.

The COVID shock forced enterprises in every market to accelerate and reshape their data analytics strategies, and this trend is likely to continue. “Data Elite” enterprises survived this year through a mix of agility, efficiency, and intelligence. They met these requirements of survival as they accelerated their digital transformations, adopted cloud data platforms, and embraced advanced analytics. As these data leaders continue their momentum in 2021, the data laggards will strive to catch up.

In this episode, Kevin Petrie, VP of Research at Eckerson Group, interviews Sumeet Agrawal, VP of Product Management at Informatica, to discuss the impact of COVID on enterprises. Sumeet talks about the trends of adoption during the onslaught of COVID and how enterprises are navigating in the post-pandemic era.

Pro SQL Server Relational Database Design and Implementation: Best Practices for Scalability and Performance

Learn effective and scalable database design techniques in SQL Server 2019 and other recent SQL Server versions. This book is revised to cover additions to SQL Server that include SQL graph enhancements, in-memory online transaction processing, temporal data storage, row-level security, and other design-related features. This book will help you design OLTP databases that are high-quality, protect the integrity of your data, and perform fast on-premises, in the cloud, or in hybrid configurations. Designing an effective and scalable database using SQL Server is a task requiring skills that have been around for well over 30 years, using technology that is constantly changing. This book covers everything from design logic that business users will understand to the physical implementation of design in a SQL Server database. Grounded in best practices and a solid understanding of the underlying theory, author Louis Davidson shows you how to "get it right" in SQL Server database design and lay a solid groundwork for the future use of valuable business data.

What You Will Learn
Develop conceptual models of client data using interviews and client documentation
Implement designs that work on premises, in the cloud, or in a hybrid approach
Recognize and apply common database design patterns
Normalize data models to enhance integrity and scalability of your databases for the long-term use of valuable data
Translate conceptual models into high-performing SQL Server databases
Secure and protect data integrity as part of meeting regulatory requirements
Create effective indexing to speed query performance
Understand the concepts of concurrency

Who This Book Is For
Programmers and database administrators of all types who want to use SQL Server to store transactional data. The book is especially useful to those wanting to learn the latest database design features in SQL Server 2019 (features that include graph objects, in-memory OLTP, temporal data support, and more). Chapters on fundamental concepts, the language of database modeling, SQL implementation, and the normalization process lay a solid groundwork for readers who are just entering the field of database design. More advanced chapters serve the seasoned veteran by tackling the latest in physical implementation features that SQL Server has to offer. The book has been carefully revised to cover all the design-related features that are new in SQL Server 2019.

Exam Ref PL-900 Microsoft Power Platform Fundamentals

Prepare for Microsoft Exam PL-900: Demonstrate your real-world knowledge of the fundamentals of Microsoft Power Platform, including its business value, core components, and the capabilities and advantages of Power BI, Power Apps, Power Automate, and Power Virtual Agents. Designed for business users, functional consultants, and other professionals, this Exam Ref focuses on the critical thinking and decision-making acumen needed for success at the Microsoft Certified: Power Platform Fundamentals level.

Focus on the expertise measured by these objectives:
Describe the business value of Power Platform
Identify the Core Components of Power Platform
Demonstrate the capabilities of Power BI
Demonstrate the capabilities of Power Apps
Demonstrate the capabilities of Power Automate
Demonstrate the capabilities of Power Virtual Agents

This Microsoft Exam Ref:
Organizes its coverage by exam objectives
Features strategic, what-if scenarios to challenge you
Assumes you are a business user, functional consultant, or other professional who wants to improve productivity by automating business processes, analyzing data, creating simple app experiences, or developing business enhancements to Microsoft cloud solutions.

About the Exam
Exam PL-900 focuses on knowledge needed to describe the value of Power Platform services and of extending solutions; describe Power Platform administration and security; describe Common Data Service, Connectors, and AI Builder; identify common Power BI components; connect to and consume data; build basic dashboards with Power BI; identify common Power Apps components; build basic canvas and model-driven apps; describe Power Apps portals; identify common Power Automate components; build basic flows; describe Power Virtual Agents capabilities; and build and publish basic chatbots.

About Microsoft Certification
Passing this exam fulfills your requirements for the Microsoft Certified: Power Platform Fundamentals certification, demonstrating your understanding of Power Platform's core capabilities: from business value and core product capabilities to building simple apps, connecting data sources, automating basic business processes, creating dashboards, and creating chatbots. With this certification, you can move on to earn specialist certifications covering more advanced aspects of Power Apps and Power BI, including Microsoft Certified: Power Platform App Maker Associate and Power Platform Data Analyst Associate. See full details at: microsoft.com/learn

Summary Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Different organizations have tried a multitude of organizational strategies to improve the success rate of these data teams, with varying levels of success. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products, then this is a conversation that you don’t want to miss.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Jesse Anderson about best practices for organizing and managing data teams.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of how you view the mission and responsibilities of a data team?

What are the critical elements of a successful data team?
Beyond the core pillars of data science, data engineering, and operations, what other specialized roles do you find helpful?

IBM FlashSystem 9100 Architecture, Performance, and Implementation

IBM® FlashSystem 9100 combines the performance of flash and Non-Volatile Memory Express (NVMe) with the reliability and innovation of IBM FlashCore® technology and the rich features of IBM Spectrum™ Virtualize — all in a powerful 2U storage system. Providing intensive data driven multi-cloud storage capacity, FlashSystem 9100 is deeply integrated with the software-defined capabilities of IBM Spectrum Storage™, which allows you to easily add the multi-cloud solutions that best support your business. In this IBM Redbooks® publication, we discuss the product's features and planning steps, architecture, installation, configuration, and hints and tips.

Summary The first stage of every good pipeline is to perform data integration. With the increasing pace of change and the need for up-to-date analytics, the need to integrate that data in near real time is growing. With the improvements and increased variety of options for streaming data engines, and improved tools for change data capture, it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines, it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.
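To make the change data capture vocabulary concrete, here is a toy sketch, independent of Equalum's implementation, that applies an ordered stream of insert, update, and delete events to an in-memory replica keyed by primary key. The event format is invented for illustration, not any particular CDC wire format.

```python
# A toy CDC event stream: each event carries an operation, a primary
# key, and (for inserts and updates) the new row image.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2, "row": None},
]

def apply_events(target, stream):
    # Applying events in order keeps the target consistent with the
    # source; real systems also track offsets so that a crash can
    # resume without skipping or double-applying changes.
    for event in stream:
        if event["op"] in ("insert", "update"):
            target[event["key"]] = event["row"]
        elif event["op"] == "delete":
            target.pop(event["key"], None)
    return target

print(apply_events({}, events))  # {1: {'id': 1, 'status': 'shipped'}}
```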

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

What Is a Data Lake?

A revolution is occurring in data management regarding how data is collected, stored, processed, governed, managed, and provided to decision makers. The data lake is a popular approach that harnesses the power of big data and marries it with the agility of self-service. With this report, IT executives and data architects will focus on the technical aspects of building a data lake for your organization. Alex Gorelik from Facebook explains the requirements for building a successful data lake that business users can easily access whenever they have a need. You'll learn the phases of data lake maturity, common mistakes that lead to data swamps, and the importance of aligning data with your company's business strategy and gaining executive sponsorship.

You'll explore:
The ingredients of modern data lakes, such as the use of different ingestion methods for different data formats, and the importance of the three Vs: volume, variety, and velocity
Building blocks of successful data lakes, including data ingestion, integration, persistence, data governance, and business intelligence and self-service analytics
State-of-the-art data lake architectures offered by Amazon Web Services, Microsoft Azure, and Google Cloud
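As a small sketch of the ingestion building block described above, the snippet below writes incoming records into date-partitioned paths, a layout that most lake query engines can prune on. A local directory stands in for object storage such as Amazon S3, and the records and paths are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical raw events arriving for ingestion.
records = [
    {"event": "click", "ts": "2021-03-01T10:00:00"},
    {"event": "view", "ts": "2021-03-02T11:30:00"},
]

def ingest(records, root="lake/raw/events"):
    # Partitioning by date keeps files organized and lets query engines
    # prune irrelevant partitions instead of scanning the whole lake.
    for rec in records:
        day = rec["ts"][:10]
        part = Path(root) / f"dt={day}"
        part.mkdir(parents=True, exist_ok=True)
        with open(part / "data.jsonl", "a") as f:
            f.write(json.dumps(rec) + "\n")

ingest(records)  # writes lake/raw/events/dt=2021-03-01/data.jsonl, etc.
```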

Summary One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid then this episode is definitely worth a listen.
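As a toy illustration of one strategy in this landscape (not how Bigeye itself works), the check below flags a table whose daily row count drifts more than a few standard deviations from its recent history. The metric, history, and threshold are all invented for illustration.

```python
import statistics

def row_count_anomaly(history, today, z_threshold=3.0):
    """Flag today's row count if it sits more than z_threshold
    standard deviations from the recent mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Fourteen days of healthy counts, then a suspicious drop.
history = [10_000, 10_250, 9_900, 10_100, 10_050, 9_950, 10_200,
           10_020, 10_080, 9_970, 10_130, 10_060, 9_990, 10_110]
print(row_count_anomaly(history, today=10_040))  # False: a normal day
print(row_count_anomaly(history, today=4_200))   # True: probable pipeline break
```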

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Your host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye.

At How Music Charts, we try to showcase those pushing the edge of music and data, and today we talk with Joel T. Jordan, Founder and President at Synchtank. Headquartered in London with offices in New York and Los Angeles, Synchtank offers a range of cloud-based SaaS solutions for managing digital entertainment assets, intellectual property, metadata and royalties. Getting his start in the music industry at the ripe age of 13, Jordan started one of the hardcore punk scene’s seminal labels, Watermark Records, with his brother Jason in 1991. It was there in New Jersey where the visually-oriented Jordan began his career as an Art Director, which eventually led him to co-found the creative design firm Earthprogram in New York City, where he served as Lead Designer and Creative Director from 1996 to 2008. It was then that Jordan founded Synchtank, where, as a 2018 Pop Disciple interview describes, Synchtank “serves over 150 high profile clients including Disney Music, 20th Century Fox, Reservoir Media, Spirit Music Group, Concord Music, BT Sport, Red Bull Media House, Primary Wave, and peermusic.”

Check out music licensing software platform Synchtank here.
Check out Earth Program's homepage here.
Check out Earth Program's QnAnR podcast here.

Connect With Us (@chartmetric)
http://chartmetric.com/
https://blog.chartmetric.com
https://smarturl.it/chartmetric_social