talk-data.com

Topic

Cloud Computing

infrastructure saas iaas


Activity Trend: 471 peak/qtr (2020-Q1 to 2026-Q1)

Activities

4055 activities · Newest first

Summary The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.
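Superset exposes a REST API that data engineers can script against for this kind of self-service tooling. As a rough sketch (assuming a Superset instance at http://localhost:8088 and an admin account; the endpoint paths follow Superset's /api/v1 interface), authenticating and listing dashboards might look like this:

```python
# Minimal sketch: query Superset's REST API to list dashboards.
# Assumes a Superset instance at http://localhost:8088 with an "admin" user;
# the host and credentials here are placeholders.
import requests

BASE = "http://localhost:8088"

# Authenticate against the security endpoint to obtain a JWT access token.
login = requests.post(
    f"{BASE}/api/v1/security/login",
    json={"username": "admin", "password": "admin", "provider": "db", "refresh": True},
)
login.raise_for_status()
token = login.json()["access_token"]

# List the dashboards the authenticated user can see.
resp = requests.get(
    f"{BASE}/api/v1/dashboard/",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for dash in resp.json().get("result", []):
    print(dash["id"], dash["dashboard_title"])
```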

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration, dashboards, and business intelligence

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Superset is? Superset is becoming part of the reference architecture for a modern data stack. What are the factors that have contributed to its popularity over other tools such as Redash, Metabase, Looker, etc.? Where do dashboarding and exploration tools like Superset fit in the responsibilities and workflow of a data engineer? What are some of the challenges that Superset faces in being performant when working with large data sources?

Which data sources have you found to be the most challenging to work with?

What are some anti-patterns that users of Superset might fall into?

We covered:

Barr’s background
Market gaps in data reliability
Observability in engineering
Data downtime
Data quality problems and the five pillars of data observability
Example: job failing because of a schema change
Three pillars of observability (good pipelines and bad data)
Observability vs monitoring
Finding the root cause
Who is accountable for data quality? (the RACI framework)
Service level agreements
Inferring the SLAs from the historical data
Implementing data observability
Data downtime maturity curve
Monte Carlo: data observability solution
Open source tools
Test-driven development for data
Is data observability cloud agnostic?
Centralizing data observability
Detecting downstream and upstream data usage
Getting bad data vs getting unusual data
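To make the SLA and data downtime ideas above concrete, here is a minimal, generic sketch of a freshness check of the kind a team might run against a warehouse table. It is not Monte Carlo's product; the table name, timestamp column, and six-hour threshold are assumptions for the example:

```python
# Illustrative sketch of a data-freshness SLA check, in the spirit of the
# "data downtime" and SLA discussion above. Generic example only; the table
# name, column, and threshold are invented.
from datetime import datetime, timedelta, timezone
import sqlite3

SLA = timedelta(hours=6)  # hypothetical agreement: new rows at least every 6 hours

def check_freshness(conn: sqlite3.Connection, table: str, ts_column: str) -> bool:
    """Return True if the newest row in `table` is within the SLA window."""
    cur = conn.execute(f"SELECT MAX({ts_column}) FROM {table}")
    latest = cur.fetchone()[0]
    if latest is None:
        return False  # an empty table counts as a breach
    latest_ts = datetime.fromisoformat(latest).replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - latest_ts <= SLA

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # placeholder connection
    ok = check_freshness(conn, "orders", "loaded_at")
    print("orders within SLA" if ok else "ALERT: orders table is stale")
```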

Links:

Learn more about Monte Carlo: https://www.montecarlodata.com/
The Data Engineer's Guide to Root Cause Analysis: https://www.montecarlodata.com/the-data-engineers-guide-to-root-cause-analysis/
Why You Need to Set SLAs for Your Data Pipelines: https://www.montecarlodata.com/how-to-make-your-data-pipelines-more-reliable-with-slas/
Data Observability: The Next Frontier of Data Engineering: https://www.montecarlodata.com/data-observability-the-next-frontier-of-data-engineering/
To get in touch with Barr, ping her in the DataTalks.Club group or use [email protected]

Join DataTalks.Club: https://datatalks.club/slack.html

Adding AI Cloud Services to Your On-Prem Data Workflows for NLP & Content Enrichment - Daniel Wrigley

Big Data Europe - Onsite and online on 22-25 November 2022. Learn more about the conference: https://bit.ly/3BlUk9q

Join our next Big Data Europe conference on 22-25 November 2022, where you will be able to learn from global experts giving technical talks and hands-on workshops in the fields of Big Data, High Load, Data Science, Machine Learning and AI. This time, the conference will be held in a hybrid setting allowing you to attend workshops and listen to expert talks on-site or online.

Summary Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.

Interview

Introduction How did you get involved in the area of data management? Started as a physicist and evolved into data science. Can you start by giving a brief recap of what Cherre is and the types of data that you deal with? Cherre is a company that connects data. We’re not a data vendor, in that we don’t sell data, primarily; we help companies connect and make sense of their data. The real estate market is historically closed, gut-led, and behind on tech. What are the biggest challenges that you deal with in your role when working with real estate data? Lack of a standard domain model in real estate. Ontology: what is a property? Each data source thinks about properties in a very different way, therefore yielding similar but completely different data. QUALITY (even if the datasets are talking about the same thing, there are different levels of accuracy and freshness). HIERARCHY: when is one source better than another? What are the teams and systems that rely on address information? Any company that needs to clean or organize (make sense of) their data needs to identify people, companies, and properties. Our clients use address resolution in multiple ways, via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it. Can you give an example of the problems involved in entity resolution? A known entity example: the Empire State Building. To resolve addresses in a way that makes sense for the client you need to capture the real-world entities. Lots, buildings, units.

Identify the type of the object (lot, building, unit)
Tag the object with all the relevant addresses
Relations to other objects (lot, building, unit)

What are some examples of the kinds of edge cases or messiness that you encounter in addresses? The first class is string problems, the second class is component problems, and the third class is geocoding. I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved? What is the need for the service? The main requirement here is connecting an address to a lot, building, and unit with latitude and longitude coordinates.

How were you satisfying this requirement previously? Before we built our model and dedicated service we had a basic, pipeline-only prototype that handled NYC addresses. What were the motivations for designing and implementing this as a service? The need to expand nationwide and to deal with client queries in real time. What are some of the other data sources that you rely on to be able to perform this normalization and resolution? Lot data, building data, unit data, footprints, and address points datasets. What challenges do you face in managing these other sources of information? Accuracy, hierarchy, standardization, a unified solution, persistent IDs and primary keys.

Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it? String cleaning, parse and tokenize, standardize, match. What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion? Our named entity solution, with a connection to the knowledge graph and owner unmasking. What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system? Scaling the NYC geocoder is one example: the NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure. Now that you have this system running in production, if you were to start over today what would you do differently? A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes or completely replace any given part of it without breaking anything client-facing. What are some of the other projects that you are excited to work on going forward? Named entity resolution and the knowledge graph.
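The lifecycle described above (string cleaning, parse and tokenize, standardize, match) can be sketched in a few lines. This is a toy illustration, not Cherre's implementation; the abbreviation table and reference records are invented for the example:

```python
# Toy sketch of the request lifecycle: clean the raw string, tokenize,
# standardize abbreviations, then match against reference records.
# Illustration only; the abbreviation table and reference data are made up.
import re

ABBREVIATIONS = {"st": "street", "ave": "avenue", "blvd": "boulevard", "apt": "unit"}

REFERENCE = {
    ("350", "5th", "avenue"): {"lot": "1013-0001", "building": "ESB", "lat": 40.7484, "lon": -73.9857},
}

def clean(raw: str) -> str:
    # String cleaning: lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", raw.lower())).strip()

def tokenize(cleaned: str) -> list[str]:
    return cleaned.split()

def standardize(tokens: list[str]) -> list[str]:
    # Expand common abbreviations so "ave" and "avenue" compare equal.
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def match(tokens: list[str]) -> dict | None:
    # Naive exact match on (number, street name, street type).
    return REFERENCE.get(tuple(tokens[:3]))

def resolve(raw: str) -> dict | None:
    return match(standardize(tokenize(clean(raw))))

print(resolve("350 5th Ave., New York"))  # -> the Empire State Building lot record
```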

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today? BigQuery is a huge asset, and in particular UDFs, but they don’t support API calls or Python scripts.

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Cherre

Podcast Episode

Photonics
Knowledge Graph
Entity Resolution
BigQuery
NLP == Natural Language Processing
dbt

Podcast Episode

Airflow

Podcast.init Episode

Datadog

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary "Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack.

Interview

Introduction How did you get involved in the area of data management? Can you start by discussing the set of roles that you see in a majority of data teams? What new roles do you see emerging, and what are the motivating factors?

Which of the more established positions are fracturing or merging to create these new responsibilities?

What are the contexts in which you are seeing these role definitions used? (e.g. small teams, large orgs, etc.) How do the increased granularity/specialization of responsibilities across data teams change the ways that data and platform architects need to think about technology investment?

What are the organizational impacts of these new types of data work?

How do these shifts in role definition change the ways that the individuals in those roles approach their work?

Data Science on AWS

With this practical book, AI and machine learning practitioners will learn how to successfully build and deploy data science projects on Amazon Web Services. The Amazon AI and machine learning stack unifies data science, data engineering, and application development to help level up your skills. This guide shows you how to build and run pipelines in the cloud, then integrate the results into applications in minutes instead of days. Throughout the book, authors Chris Fregly and Antje Barth demonstrate how to reduce cost and improve performance.
Apply the Amazon AI and ML stack to real-world use cases for natural language processing, computer vision, fraud detection, conversational devices, and more
Use automated machine learning to implement a specific subset of use cases with SageMaker Autopilot
Dive deep into the complete model development lifecycle for a BERT-based NLP use case including data ingestion, analysis, model training, and deployment
Tie everything together into a repeatable machine learning operations pipeline
Explore real-time ML, anomaly detection, and streaming analytics on data streams with Amazon Kinesis and Managed Streaming for Apache Kafka
Learn security best practices for data science projects and workflows including identity and access management, authentication, authorization, and more

Summary One of the biggest obstacles to success in delivering data products is cross-team collaboration. Part of the problem is the difference in the information that each role requires to do their job and where they expect to find it. This introduces a barrier to communication that is difficult to overcome, particularly in teams that have not reached a significant level of maturity in their data journey. In this episode Prukalpa Sankar shares her experiences across multiple attempts at building a system that brings everyone onto the same page, ultimately bringing her to found Atlan. She explains how the design of the platform is informed by the needs of managing data projects for large and small teams across her previous roles, how it integrates with your existing systems, and how it can work to bring everyone onto the same page.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about Atlan, a modern data workspace that makes collaboration among data stakeholders easier, increasing efficiency and agility in data projects.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what you are building at Atlan and some of the story behind it? Who are the target users of Atlan? What portions of the data workflow is Atlan responsible for?

What components of the data stack might Atlan replace?

How would you characterize Atlan’s position in the current data ecosystem?

What makes Atlan stand out from other systems for data cataloguing, metadata management, or data governance? What types of data assets (e.g. structured vs unstructured, textual vs binary) is Atlan designed to manage?

Azure Data Engineering Cookbook

Dive into the world of data engineering with 'Azure Data Engineering Cookbook' to master building efficient ETL workflows using Microsoft Azure Data services. Whether you're working on batch processing solutions or real-time analytics, this book is your guide to implementing effective, scalable data operations.
What this Book will help me do
Design and implement efficient ETL pipelines for batch and real-time processing on MS Azure.
Understand the use of Azure Blob storage for managing large data sets.
Ingest, process, and analyze data using tools like Azure Synapse and Databricks.
Develop and secure automation pipelines using Azure Data Factory.
Leverage Azure Stream Analytics for real-time data processing workflows.
Author(s) Ahmad Osama and Nagaraj Venkatesan bring years of expertise in cloud solutions and data engineering. Renowned for their practical teaching approach, they have helped countless professionals master the intricacies of Azure. Their focus is on equipping readers with actionable skills for real-world data challenges.
Who is it for? This book is ideal for data engineers and database professionals aiming to hone their expertise in advanced Azure data engineering tasks. Readers should have a working knowledge of Azure fundamentals and basic data engineering concepts. If you're a technical architect or ETL developer seeking to transition or enhance your skills in Azure's ecosystem, you'll find immense value here.

High Performant File System Workloads for AI and HPC on AWS using IBM Spectrum Scale

This IBM® Redpaper® publication is intended to facilitate the deployment and configuration of the IBM Spectrum® Scale based high-performance storage solutions for the scalable data and AI solutions on Amazon Web Services (AWS). Configuration, testing results, and tuning guidelines for running the IBM Spectrum Scale based high-performance storage solutions for data and AI workloads on AWS are the focus areas of the paper. Lab validation was conducted by connecting Red Hat Linux nodes to IBM Spectrum Scale using various Amazon Elastic Compute Cloud (EC2) instances. Simultaneous workloads are simulated across multiple Amazon EC2 nodes running Red Hat Linux to determine the scalability of the IBM Spectrum Scale clustered file system. Solution architecture, configuration details, and performance tuning demonstrate how to maximize data and AI application performance with IBM Spectrum Scale on AWS.

Summary Data quality is on the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable. They explain how they started down the path of building a solution for managing data quality, their philosophy of how to empower data engineers with well engineered open source tools that integrate with the rest of the platform, and how to bring all of the stakeholders onto the same page to make your data great. There are many aspects of data quality management and it’s always a treat to learn from people who are dedicating their time and energy to solving it for everyone.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Maarten Masschelein and Tom Baeyens about the work they are doing at Soda to power data quality management.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what you are building at Soda? What problem are you trying to solve? And how are you solving that problem?

What motivated you to start a business focused on data monitoring and data quality?

The data monitoring and broader data quality space is a segment of the industry that is seeing a huge increase in attention recently. Can you share your perspective on the current state of the ecosystem and how your approach compares to other tools and products? Who have you created Soda for?
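As a rough illustration of the kind of declarative checks this conversation is about, the sketch below runs a couple of SQL assertions against a warehouse table. It is a generic example rather than Soda's actual interface; the checks, table, and thresholds are assumptions:

```python
# Generic illustration of declarative data-quality checks of the kind the
# Soda conversation is about. Not Soda's actual interface; the check
# definitions, table, and thresholds are invented for the example.
import sqlite3

CHECKS = [
    {"name": "orders has rows", "sql": "SELECT COUNT(*) FROM orders", "min": 1},
    {"name": "no null customer_id", "sql": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL", "max": 0},
]

def run_checks(conn: sqlite3.Connection) -> bool:
    all_ok = True
    for check in CHECKS:
        value = conn.execute(check["sql"]).fetchone()[0]
        ok = check.get("min", float("-inf")) <= value <= check.get("max", float("inf"))
        print(f"{'PASS' if ok else 'FAIL'}: {check['name']} (value={value})")
        all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # placeholder connection
    raise SystemExit(0 if run_checks(conn) else 1)
```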

Effortless App Development with Oracle Visual Builder

In "Effortless App Development with Oracle Visual Builder," you will explore how to quickly design, develop, and deploy robust web and mobile applications using Oracle Visual Builder's intuitive drag-and-drop features. This book equips you with the know-how to simplify application development tasks, making it perfect for professionals looking to boost productivity. What this Book will help me do Master the core architecture and features of Oracle Visual Builder to develop real-world applications effectively. Learn to create, manage, and leverage business objects and connect to various SaaS APIs within your applications. Build scalable and secure web and mobile applications using practical examples and clear implementation guidelines. Discover best practices for application lifecycle management, debugging, and troubleshooting VB applications. Extend Oracle and non-Oracle SaaS applications through hands-on knowledge tailored to real-world scenarios. Author(s) None Jain is an experienced developer and technical writer specializing in Oracle Visual Builder and cloud-based application development. With years of hands-on experience building and deploying cloud applications, they bring expertise and a practical approach to education. Their engaging writing style focuses on enabling readers to learn and apply new skills confidently. Who is it for? This book is perfectly suited for developers, UI designers, and IT professionals who want to master Oracle Visual Builder for developing web and mobile applications. If you already have experience with technologies like JavaScript, UI frameworks, and REST APIs, and seek to create intuitive applications using a simplified interface, this book is for you. Whether you're in the early stages of learning VB or looking to refine your skills, this book serves as a valuable guide.

Automating the Modern Data Warehouse

The opportunity to modernize and improve the enterprise data warehouse is one of the best reasons for moving your application to the cloud. A data warehouse can access a greater diversity of use cases and practices than is possible in an existing environment. In this report, researcher and analyst Stephen Swoyer offers a comprehensive overview of the benefits and challenges of implementing a cloud-based data warehouse. Senior IT decision makers, chief data officers, and data professionals will learn about the shifts and new trends in the data management landscape. Explore ways to improve data management, build a data warehouse strategy, and learn how to modernize a data warehouse effectively. Understand how AI, machine learning, self-service data integration, and built-in developer-oriented services have transformed the data warehouse role Use data warehouses to work with cloud-based data lakes for end-to-end data management and data governance Explore how data warehouse platforms as a service (PaaS) pave the way to automation Migrate, manage, and secure a data warehouse in a hybrid or multicloud environment

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Robin Hernandez, VP of Offering Management, Cloud Data for Watson AI Ops. Robin started out in software development, and then worked in Technical Sales, IBM Cloud Garage, and then Product Management.

Show Notes
3:46 – What is AI Ops?
6:31 – What is the ROI?
12:00 – What does it take to set up AI?
17:48 – How does this really work?
21:34 – I am in Services, what do I get?
26:06 – What kind of guarantees does AI offer?
28:12 – Who is monitoring AI?
IBM AI Ops
Connect with the Team: Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.
Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Raghu Murthy about his recent work of making change data capture more accessible and maintainable.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what CDC is and when it is useful? What are the alternatives to CDC?

What are the cases where a more batch-oriented approach would be preferable?

What are the factors that you need to consider when deciding whether to implement a CDC system for a given data integration?

What are the barriers to entry?

What are some of the common mistakes or misconceptions about CDC that you have encountered in your own work and while working with customers? How does CDC fit into a broader data platform, particularly where there are likely to be other data integration pipelines in operation? (e.g. Fivetran/Airbyte/Meltano/custom scripts) What are the moving pieces in a CDC workflow that need to be considered as you are designing the system?

What are some examples of the configuration changes necessary in source systems to provide the change data?
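A toy sketch of the consuming side of CDC may help frame these questions: the essence is applying an ordered stream of insert/update/delete events to a target keyed by primary key. The event format below is invented for illustration; real systems read changes from a database log (for example a replication slot) or a message queue:

```python
# Toy sketch of applying a change data capture stream to a target table
# keyed by primary key. The event format is invented for illustration.
from typing import Iterable

def apply_changes(target: dict[int, dict], events: Iterable[dict]) -> None:
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: the last write for a key wins, preserving source ordering.
            target[key] = event["row"]
        elif op == "delete":
            target.pop(key, None)

if __name__ == "__main__":
    orders: dict[int, dict] = {}
    stream = [
        {"op": "insert", "key": 1, "row": {"id": 1, "status": "placed"}},
        {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
        {"op": "insert", "key": 2, "row": {"id": 2, "status": "placed"}},
        {"op": "delete", "key": 2},
    ]
    apply_changes(orders, stream)
    print(orders)  # {1: {'id': 1, 'status': 'shipped'}}
```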

IBM TS7700 Series DS8000 Object Store User's Guide Version 2.0

The IBM® TS7700 features a functional enhancement that allows for the TS7700 to act as an object store for transparent cloud tiering with IBM DS8000® (DS8K), DFSMShsm (HSM), and native DFSMSdss (DSS). This function can be used to move data sets directly from DS8000 to TS7700. This IBM Redpaper publication describes the client value, and how DFSMS, DS8000, and TS7700 are set up to enable and use the function.

Summary The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving a quick overview of what you do at DoorDash?

What are some of the ways that data is used to power the business?

How has the pandemic affected the scale and volatility of the data that you are working with? Can you describe the type(s) of data that you are working with?

What are the primary sources of data that you collect?

What secondary or third party sources of information do you rely on?

Can you give an overview of the collection process for that data?

In selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating them?

podcast_episode
by Somali Chaterji (Purdue University), Kyle Polich, Karthick Shankar (Carnegie Mellon University)

Karthick Shankar, Masters Student at Carnegie Mellon University, and Somali Chaterji, Assistant Professor at Purdue University, join us today to discuss the paper "JANUS: Benchmarking Commercial and Open-Source Cloud and Edge Platforms for Object and Anomaly Detection Workloads". Works Mentioned: https://ieeexplore.ieee.org/abstract/document/9284314 "JANUS: Benchmarking Commercial and Open-Source Cloud and Edge Platforms for Object and Anomaly Detection Workloads" by Karthick Shankar, Pengcheng Wang, Ran Xu, Ashraf Mahgoub, Somali Chaterji. Social Media: Karthick Shankar https://twitter.com/karthick_sh Somali Chaterji https://twitter.com/somalichaterji?lang=en https://schaterji.io/

Summary A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way. After tasking some of his top engineers to consider the problem in a new light they created the Pilosa engine. In this episode H.O. explains how, using Pilosa as the core, he built the Molecula platform to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing H.O. Maycotte about Molecula, a cloud-based feature store based on the open source Pilosa project.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what you are building at Molecula and the story behind it?

What are the additional capabilities that Molecula offers on top of the open source Pilosa project?

What are the problems/use cases that Molecula solves for? What are some of the technologies or architectural patterns that Molecula might replace in a company’s data platform? One of the use cases that is mentioned on the Molecula site is as a feature store for ML and AI. This is a category that has been seeing a lot of growth recently. Can you provide some context on how Molecula fits in that market and how it compares to options such as Tecton, Iguazio, Feast, etc.?

What are the benefits of using a bitmap index for identifying and computing features?
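A small worked example can illustrate the appeal of bitmap indexes for feature work: each feature value keeps one bit per record, so combining features is a bitwise AND rather than a row scan. The sketch below uses plain Python integers as bitmaps; Pilosa's actual representation is far more sophisticated:

```python
# Toy illustration of why a bitmap index is attractive for feature work:
# each feature value keeps one bit per record id, and combining features
# is a bitwise AND/OR rather than a scan over raw rows. The data is invented.
records = [
    {"id": 0, "city": "austin", "churned": True},
    {"id": 1, "city": "austin", "churned": False},
    {"id": 2, "city": "denver", "churned": True},
    {"id": 3, "city": "austin", "churned": True},
]

def build_bitmap(predicate) -> int:
    bitmap = 0
    for rec in records:
        if predicate(rec):
            bitmap |= 1 << rec["id"]  # set the bit for this record id
    return bitmap

austin = build_bitmap(lambda r: r["city"] == "austin")    # 0b1011
churned = build_bitmap(lambda r: r["churned"])            # 0b1101

both = austin & churned  # intersection without touching the raw rows
ids = [i for i in range(len(records)) if both >> i & 1]
print(ids)        # [0, 3]
print(bin(both))  # 0b1001
```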

Can you describe how the Molecula platform is architected?

How has the design and goal of Molecula changed or evolved since you first began working on it?

For someone who is using Molecula, can you describe the process of integrating it with their existing data sources? Can you describe the internal data model of Pilosa/Molecula?

How should users think about data modeling and architecture as they are loading information into the platform?

Once a user has data in Pilosa, what are the available mechanisms for performing analyses or feature engineering? What are some of the most underutilized or misunderstood capabilities of Molecula? What are some of the most interesting, unexpected, or innovative ways that you have seen the Molecula platform used? What are the most interesting, unexpected, or challenging lessons that you have learned from building and scaling Molecula? When is Molecula the wrong choice? What do you have planned for the future of the platform and business?

Contact Info

LinkedIn @maycotte on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Molecula
Pilosa

Podcast Episode

The Social Dilemma
Feature Store
Cassandra
Elasticsearch

Podcast Episode

Druid
MongoDB
SwimOS

Podcast Episode

Kafka
Kafka Schema Registry

Podcast Episode

Homomorphic Encryption
Lucene
Solr

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Professional Azure SQL Managed Database Administration - Third Edition

Professional Azure SQL Managed Database Administration is a comprehensive guide to mastering data management with Azure's managed database services. Packed with real-world exercises and updated to cover the latest Azure features, this book provides actionable insights into migration, performance tuning, scaling, and securing Azure SQL databases.
What this Book will help me do
Master the configuration and pricing options for Azure SQL databases to make cost-effective choices.
Learn the processes to provision new SQL databases or migrate existing on-premises SQL databases to Azure.
Acquire skills in implementing high availability and disaster recovery for ensuring data resilience.
Understand the strategies for monitoring, tuning, and optimizing the performance of Azure SQL databases.
Discover techniques for scaling using elastic pools and securing databases comprehensively.
Author(s) Ahmad Osama and Shashikant Shakya are experienced professionals in SQL Server and Azure SQL technologies. With decades of combined experience in database administration and cloud computing, they bring a depth of understanding to the content of this book. Their hands-on teaching approach is evident in the practical exercises and real-world scenarios included.
Who is it for? This book is specifically tailored for database administrators, developers, and application developers looking to leverage Azure SQL databases. If you are tasked with migrating applications to the cloud or ensuring top performance and resilience for cloud databases, you will find this book highly valuable. Prior experience with on-premises SQL services will help contextualize the content, making it suitable for professionals with intermediate SQL experience. Readers aiming to deepen their Azure SQL expertise will also greatly benefit.

Enhanced Cyber Resilience Solution by Threat Detection using IBM Cloud Object Storage System and IBM QRadar SIEM

This Solution Redpaper™ publication explains how the features of IBM Cloud® Object Storage System reduce the effect of incidents on business data when combined with the log analysis, deep inspection, and detection of threats that IBM QRadar SIEM provides. This paper also demonstrates how to integrate IBM Cloud Object Storage's access logs with IBM QRadar SIEM. An administrator can monitor, inspect, detect, and derive insights for identifying potential threats to the data that is stored on IBM Cloud Object Storage. Also, IBM QRadar SIEM can proactively trigger a cyber resiliency workflow in IBM Cloud Object Storage remotely to protect the data based on threat detection. This publication is intended for chief technology officers, solution and security architects, and systems administrators.