talk-data.com

Topic: Cloud Computing
Tags: infrastructure, saas, iaas · 4055 tagged

Activity Trend: 471 peak/qtr (2020-Q1 to 2026-Q1)

Activities: 4055 activities · Newest first

Summary The core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or APIs on top of a cleaned and curated data set. Despite the rapid progression of impressive tools and products built to fulfill this mission, it is still an uphill battle to tie everything together into a cohesive and reliable platform. At Isima they decided to reimagine the entire ecosystem from the ground up and built a single unified platform to allow end-to-end self-service workflows from data ingestion through to analysis. In this episode Darshan Rawal, CEO and co-founder of Isima, explains how the biOS platform is architected to enable ease of use, the challenges that were involved in building an entirely new system from scratch, and how it can integrate with the rest of your data platform to allow for incremental adoption. This was an interesting and contrarian take on the current state of the data management industry and is worth a listen to gain some additional perspective.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

At How Music Charts, we try to showcase those pushing the edge of music and data, and today we talk with Joel T. Jordan, Founder and President at Synchtank. Headquartered in London with offices in New York and Los Angeles, Synchtank offers a range of cloud-based SaaS solutions for managing digital entertainment assets, intellectual property, metadata and royalties. Getting his start in the music industry at the ripe age of 13, Jordan started one of the hardcore punk scene’s seminal labels, Watermark Records, with his brother Jason in 1991. It was there in New Jersey where the visually-oriented Jordan began his career as an Art Director, which eventually led him to co-found the creative design firm Earthprogram in New York City, where he served as Lead Designer and Creative Director from 1996 to 2008. It was then that Jordan founded Synchtank, where, as a 2018 Pop Disciple interview describes, Synchtank “serves over 150 high profile clients including Disney Music, 20th Century Fox, Reservoir Media, Spirit Music Group, Concord Music, BT Sport, Red Bull Media House, Primary Wave, and peermusic.” Check out music licensing software platform Synchtank here.

Connect With Us (@chartmetric): http://chartmetric.com/ · https://blog.chartmetric.com · https://smarturl.it/chartmetric_social

Cyber Resilience Solution Across Hybrid Cloud Using IBM Storage Solutions

In today's data-driven world, an organization's information and data are considered its most important assets and can serve as key assets for growth. As organizations collect more data, it is growing at a staggering pace. With this exponential data growth comes an increased need to protect the data from cyberattacks in the form of malware and ransomware that try to steal precious data and information. These cyberattacks can have a catastrophic impact on an organization, resulting in devastating financial losses and affecting the organization's reputation for years. This document is intended to facilitate the deployment of the Hybrid Cloud Cyber Resilience solution, which protects storage system data that is backed up in IBM Spectrum Protect Plus from external cyberattacks or insider attacks by using its integration with IBM Cloud Object Storage. You must understand IBM FlashSystem, IBM Spectrum Protect Plus, and IBM Cloud Object Storage architecture concepts and their configuration across hybrid cloud. The information in this document is distributed on an as-is basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM FlashSystem, IBM Spectrum Protect Plus or IBM Cloud Object Storage are supported and entitled, and where the issues are specific to a solution technical paper implementation.

Summary A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy-to-use and cost-effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems.
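Tree Schema's autodiscovery is proprietary, but the general technique is easy to see in miniature. Below is a hedged sketch of schema autodiscovery using SQLAlchemy's inspector API; the connection string and the shape of the resulting catalog are illustrative assumptions, not Tree Schema's actual implementation.

```python
# Minimal sketch of schema autodiscovery in the style a data catalog
# might use; assumes SQLAlchemy and a reachable database. The DSN is
# hypothetical -- this is NOT Tree Schema's actual code.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@localhost:5432/mydb")
inspector = inspect(engine)

catalog = {}
for schema in inspector.get_schema_names():
    for table in inspector.get_table_names(schema=schema):
        columns = inspector.get_columns(table, schema=schema)
        catalog[f"{schema}.{table}"] = [
            {"name": c["name"], "type": str(c["type"]), "nullable": c["nullable"]}
            for c in columns
        ]

# The harvested metadata would then be pushed into the catalog's store.
for table, cols in catalog.items():
    print(table, "->", [c["name"] for c in cols])
```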

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

Your host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human-friendly data catalog.

Interview

Introduction

How did you get involved in the area of data management? Can you start by giving an overview of what you have built at Tree Schema?

What was your motivation for creating it?

At what stage of maturity should a team or organization

Practical Azure SQL Database for Modern Developers: Building Applications in the Microsoft Cloud

Here is the expert-level, insider guidance you need on using Azure SQL Database as your back-end data store. This book highlights best practices in everything ranging from full-stack projects to mobile applications to critical, back-end APIs. The book provides instruction on accessing your data from any language and platform. And you learn how to push processing-intensive work into the database engine to be near the data and avoid undue networking traffic. Azure SQL is explained from a developer's point of view, helping you master its feature set and create applications that perform well and delight users. Core to the book is showing you how Azure SQL Database provides relational and post-relational support so that any workload can be managed with easy accessibility from any platform and any language. You will learn about features ranging from lock-free tables to columnstore indexes, and about support for data formats ranging from JSON and key-values to the nodes and edges in the graph database paradigm. Reading this book prepares you to deal with almost all data management challenges, allowing you to create lean and specialized solutions having the elasticity and scalability that are needed in the modern world.

What You Will Learn:
- Master Azure SQL Database in your development projects from design to the CI/CD pipeline
- Access your data from any programming language and platform
- Combine key-value, JSON, and relational data in the same database
- Push data-intensive compute work into the database for improved efficiency
- Delight your customers by detecting and improving poorly performing queries
- Enhance performance through features such as columnstore indexes and lock-free tables
- Build confidence in your mastery of Azure SQL Database's feature set

Who This Book Is For: Developers of applications and APIs that benefit from cloud database support, developers who wish to master their tools (including Azure SQL Database), and those who want their applications to be known for speedy performance and the elegance of their code.
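Since the blurb emphasizes combining JSON and relational data in the same database, here is a brief hedged sketch of what that looks like from Python. The table, columns, and credentials are hypothetical; JSON_VALUE is standard T-SQL available in Azure SQL Database.

```python
# Hedged sketch: querying JSON stored alongside relational columns in
# Azure SQL from Python. Server, database, and table are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;Encrypt=yes;"
)
cursor = conn.cursor()

# 'orders' has relational columns plus a 'payload' NVARCHAR(MAX) column
# holding JSON; JSON_VALUE extracts a scalar without client-side parsing.
cursor.execute("""
    SELECT order_id,
           JSON_VALUE(payload, '$.customer.country') AS country
    FROM dbo.orders
    WHERE JSON_VALUE(payload, '$.status') = 'shipped'
""")
for row in cursor.fetchall():
    print(row.order_id, row.country)
```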

Summary Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.
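The mechanics of that branching model are easiest to see from a client's perspective. The sketch below assumes lakeFS's S3-compatible gateway, where objects are addressed as repo/branch/key; the endpoint, repository name, and credentials are hypothetical, and details may differ from an actual LakeFS deployment.

```python
# Hedged sketch of the branch-based workflow LakeFS enables: existing
# S3 clients point at the lakeFS S3 gateway and address objects as
# <repo>/<branch>/<key>. All names and credentials are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS S3 gateway
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

# Write to an isolated experiment branch instead of directly to main;
# the change only reaches main if the branch is later merged.
s3.put_object(
    Bucket="my-repo",
    Key="experiment-branch/events/2020/10/data.parquet",
    Body=open("data.parquet", "rb"),
)

# Readers pinned to main are unaffected until the merge happens.
obj = s3.get_object(Bucket="my-repo", Key="main/events/2020/10/data.parquet")
```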

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

Your host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code.

Interview

Introduction

How did you get involved in the area of data management? Can you start by giving an overview of what LakeFS is and why you built it?

There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.)

What are the primary use cases that LakeFS enables? For someone who wants to use LakeFS, what is involved in getting it set up? How is LakeFS implemented?

How has the design of the system changed or evolved since you began working on it? What assumptions did you have going into it which have since been invalidated or modified?

How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface? How do you handle merge conflicts and resolution?

What

Azure SQL Revealed: A Guide to the Cloud for SQL Server Professionals

Access detailed content and examples on Azure SQL, a set of cloud services that allows SQL Server to be deployed in the cloud. This book teaches the fundamentals of deployment, configuration, security, performance, and availability of Azure SQL from the perspective of these same tasks and capabilities in SQL Server. This distinct approach makes this book an ideal learning platform for readers familiar with SQL Server on-premises who want to migrate their skills toward providing cloud solutions to an enterprise market that is increasingly cloud-focused. If you know SQL Server, you will love this book. You will be able to take your existing knowledge of SQL Server and translate that knowledge into the world of cloud services from the Microsoft Azure platform, and in particular into Azure SQL. This book provides information never seen before about the history and architecture of Azure SQL. Author Bob Ward is a leading expert with access to and support from the Microsoft engineering team that built Azure SQL and related database cloud services. He presents powerful, behind-the-scenes insights into the workings of one of the most popular database cloud services in the industry.

What You Will Learn:
- Know the history of Azure SQL
- Deploy, configure, and connect to Azure SQL
- Choose the correct way to deploy SQL Server in Azure
- Migrate existing SQL Server instances to Azure SQL
- Monitor and tune Azure SQL’s performance to meet your needs
- Ensure your data and application are highly available
- Secure your data from attack and theft

Who This Book Is For: This book is designed to teach SQL Server in the Azure cloud to the SQL Server professional. Anyone who operates, manages, or develops applications for SQL Server will benefit from this book. Readers will be able to translate their current knowledge of SQL Server—especially of SQL Server 2019—directly to Azure. This book is ideal for database professionals looking to remain relevant as their customer base moves into the cloud.

Summary One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own methods of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Hybrid Multicloud Business Continuity for OpenShift Workloads with IBM Spectrum Virtualize in AWS

This publication is intended to facilitate the deployment of the hybrid cloud business continuity solution with Red Hat OpenShift Container Platform and the IBM® block CSI (Container Storage Interface) driver plug-in for IBM Spectrum® Virtualize on Public Cloud AWS (Amazon Web Services). This solution is designed to protect the data by using IBM Storage-based Global Mirror replication. For demonstration purposes, a containerized MySQL database is installed on the on-premises IBM FlashSystem® that is connected to the Red Hat OpenShift Container Platform (OCP) cluster in the vSphere environment through the IBM block CSI driver. The volume (LUN) on the IBM FlashSystem storage system is replicated by using Global Mirror on IBM Spectrum Virtualize for Public Cloud on AWS. The Red Hat OpenShift cluster (OCP cluster) and the IBM block CSI driver plug-in are installed on AWS by using the Installer-Provisioned Infrastructure (IPI) methodology. The information in this document is distributed on an as-is basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Spectrum Virtualize for Public Cloud is supported and entitled, and where the issues are specific to this Blueprint implementation.

Summary In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime.
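Monte Carlo's platform is proprietary, but the category of check it automates can be illustrated in a few lines of Python. This is a hedged, generic sketch of a freshness and volume monitor, not Monte Carlo's implementation; the warehouse URL, table, and thresholds are hypothetical.

```python
# Illustrative sketch (NOT Monte Carlo's code): a basic freshness and
# volume check of the kind a data observability tool automates.
from datetime import datetime, timedelta
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")

with engine.connect() as conn:
    row = conn.execute(text(
        "SELECT MAX(loaded_at) AS last_load, COUNT(*) AS rows_today "
        "FROM events WHERE loaded_at >= CURRENT_DATE"
    )).one()

# Freshness: alert if no data has landed in the last 2 hours.
if row.last_load is None or datetime.utcnow() - row.last_load > timedelta(hours=2):
    print("ALERT: events table is stale (possible data downtime)")

# Volume: alert if today's row count deviates wildly from expectation.
if row.rows_today < 10_000:
    print("ALERT: unexpectedly low row count for events")
```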

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo.

Interview

Introduction

How did you get involved in the area of data management? H

Making Data Smarter with IBM Spectrum Discover: Practical AI Solutions

More than 80% of all data that is collected by organizations is not in a standard relational database. Instead, it is trapped in unstructured documents, social media posts, machine logs, and so on. Many organizations face significant challenges in managing this deluge of unstructured data, such as the following examples:
- Pinpointing and activating relevant data for large-scale analytics
- Lacking the fine-grained visibility that is needed to map data to business priorities
- Removing redundant, obsolete, and trivial (ROT) data
- Identifying and classifying sensitive data

IBM® Spectrum Discover is modern metadata management software that provides data insight for petabyte-scale file and object storage, on-premises and in the cloud. This software enables organizations to make better business decisions and gain and maintain a competitive advantage. IBM Spectrum® Discover provides a rich metadata layer that enables storage administrators, data stewards, and data scientists to efficiently manage, classify, and gain insights from massive amounts of unstructured data. It improves storage economics, helps mitigate risk, and accelerates large-scale analytics to create competitive advantage and speed critical research.

This IBM Redbooks® publication presents several use cases that are focused on artificial intelligence (AI) solutions with IBM Spectrum Discover. This book helps storage administrators and technical specialists plan and implement AI solutions by using IBM Spectrum Discover and several other IBM Storage products.

SQL Server Data Automation Through Frameworks: Building Metadata-Driven Frameworks with T-SQL, SSIS, and Azure Data Factory

Learn to automate SQL Server operations using frameworks built from metadata-driven stored procedures and SQL Server Integration Services (SSIS). Bring all the power of Transact-SQL (T-SQL) and Microsoft .NET to bear on your repetitive data, data integration, and ETL processes. Do this for no added cost over what you’ve already spent on licensing SQL Server. The tools and methods from this book may be applied to on-premises and Azure SQL Server instances. The SSIS framework from this book works in Azure Data Factory (ADF) and provides DevOps personnel the ability to execute child packages outside a project—functionality not natively available in SSIS. Frameworks not only reduce the time required to deliver enterprise functionality, but can also accelerate troubleshooting and problem resolution. You'll learn in this book how frameworks also improve code quality by using metadata to drive processes. Much of the work performed by data professionals can be classified as “drudge work”—tasks that are repetitive and template-based. The frameworks-based approach shown in this book helps you to avoid that drudgery by turning repetitive tasks into "one and done" operations. Frameworks as described in this book also support enterprise DevOps with built-in logging functionality.

What You Will Learn:
- Create a stored procedure framework to automate SQL process execution
- Base your framework on a working system of stored procedures and execution logging
- Create an SSIS framework to reduce the complexity of executing multiple SSIS packages
- Deploy stored procedure and SSIS frameworks to Azure Data Factory environments in the cloud

Who This Book Is For: Database administrators and developers who are involved in enterprise data projects built around stored procedures and SQL Server Integration Services (SSIS). Readers should have a background in programming along with a desire to optimize their data efforts by implementing repeatable processes that support enterprise DevOps.
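The core idea of a metadata-driven framework, executing steps defined as rows in a table rather than hard-coded, can be sketched briefly. The example below is a hedged illustration in Python with pyodbc rather than the book's T-SQL/SSIS implementation; the metadata and log tables are hypothetical.

```python
# Hedged sketch of the metadata-driven pattern: a controller reads an
# ordered list of stored procedures from a metadata table and executes
# them, logging each step. All table and server names are hypothetical.
import pyodbc
from datetime import datetime

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=myserver;"
    "Database=etl_framework;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# The "workflow" is data, not code: rows in a metadata table.
cursor.execute(
    "SELECT step_id, proc_name FROM dbo.workflow_steps "
    "WHERE workflow_name = ? ORDER BY step_order",
    "nightly_load",
)
steps = cursor.fetchall()

for step_id, proc_name in steps:
    started = datetime.utcnow()
    # proc_name comes from our own metadata table, not user input.
    cursor.execute(f"EXEC {proc_name}")
    conn.commit()
    # Built-in logging makes every run auditable after the fact.
    cursor.execute(
        "INSERT INTO dbo.execution_log (step_id, started_at, finished_at) "
        "VALUES (?, ?, ?)",
        step_id, started, datetime.utcnow(),
    )
    conn.commit()
```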

Security and Privacy Issues in IoT Devices and Sensor Networks

Security and Privacy Issues in IoT Devices and Sensor Networks investigates security breach issues in IoT and sensor networks, exploring various solutions. The book follows a two-fold approach, first focusing on the fundamentals and theory surrounding sensor networks and IoT security. It then explores practical solutions that can be implemented to develop security for these elements, providing case studies to enhance understanding. Machine learning techniques are covered, as well as other security paradigms, such as cloud security and cryptocurrency technologies. The book highlights how these techniques can be applied to identify attacks and vulnerabilities, preserve privacy, and enhance data security. This in-depth reference is ideal for industry professionals dealing with WSN and IoT systems who want to enhance the security of these systems. Additionally, researchers, material developers and technology specialists dealing with the multifarious aspects of data privacy and security enhancement will benefit from the book's comprehensive information.

- Provides insights into the latest research trends and theory in the field of sensor networks and IoT security
- Presents machine learning-based solutions for data security enhancement
- Discusses the challenges of implementing various security techniques
- Informs on how analytics can be used in security and privacy

Summary Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options. He explains how he helped to build the platform during his time at Microsoft, and how he continues to support users through his work at Power Pivot Pro. Rob shares some useful insights gained through his consulting work, and why he considers Power BI to be the best option on the market today for business analytics.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

Equalum’s end to end data ingestion platform is relied upon by enterprises across industries to seamlessly stream data to operational, real-time analytics and machine learning environments. Equalum combines streaming Change Data Capture, replication, complex transformations, batch processing and full data management using a no-code UI. Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka and others under the hood. Tool consolidation and linear scalability without the legacy platform price tag. Go to dataengineeringpodcast.com/equalum today to start a free 2 week test run of their platform, and don’t forget to tell them that we sent you.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing Rob Collie about Microsoft’s Power BI platform and his

Summary Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for dat

AI and Machine Learning for Coders

If you're looking to make a career move from programmer to AI specialist, this is the ideal place to start. Based on Laurence Moroney's extremely successful AI courses, this introductory book provides a hands-on, code-first approach to help you build confidence while you learn key topics. You'll understand how to implement the most common scenarios in machine learning, such as computer vision, natural language processing (NLP), and sequence modeling for web, mobile, cloud, and embedded runtimes. Most books on machine learning begin with a daunting amount of advanced math. This guide is built on practical lessons that let you work directly with the code.

You'll learn:
- How to build models with TensorFlow using skills that employers desire
- The basics of machine learning by working with code samples
- How to implement computer vision, including feature detection in images
- How to use NLP to tokenize and sequence words and sentences
- Methods for embedding models in Android and iOS
- How to serve models over the web and in the cloud with TensorFlow Serving
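To give a flavor of the code-first style the book advocates, here is a minimal tf.keras classifier. The dataset and hyperparameters are illustrative choices, not necessarily those used in the book.

```python
# A minimal code-first example: a tiny image classifier on
# Fashion-MNIST using tf.keras. Hyperparameters are illustrative.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
```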

BigQuery for Data Warehousing: Managed Data Analysis in the Google Cloud

Create a data warehouse, complete with reporting and dashboards, using Google’s BigQuery technology. This book takes you from the basic concepts of data warehousing through the design, build, load, and maintenance phases. You will build capabilities to capture data from the operational environment, and then mine and analyze that data for insight into making your business more successful. You will gain practical knowledge about how to use BigQuery to solve data challenges in your organization. BigQuery is a managed cloud platform from Google that provides enterprise data warehousing and reporting capabilities. Part I of this book shows you how to design and provision a data warehouse in the BigQuery platform. Part II teaches you how to load and stream your operational data into the warehouse to make it ready for analysis and reporting. Parts III and IV cover querying and maintenance, helping you keep your information relevant with other Google Cloud Platform services and advanced BigQuery. Part V takes reporting to the next level by showing you how to create dashboards to provide at-a-glance visual representations of your business situation. Part VI provides an introduction to data science with BigQuery, covering machine learning and Jupyter notebooks.

What You Will Learn:
- Design a data warehouse for your project or organization
- Load data from a variety of external and internal sources
- Integrate other Google Cloud Platform services for more complex workflows
- Maintain and scale your data warehouse as your organization grows
- Analyze, report, and create dashboards on the information in the warehouse
- Become familiar with machine learning techniques using BigQuery ML

Who This Book Is For: Developers who want to provide business users with fast, reliable, and insightful analysis from operational data, and data analysts interested in a cloud-based solution that avoids the pain of provisioning their own servers.
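As a taste of working with BigQuery from code, here is a short hedged sketch using the official google-cloud-bigquery client; the project, dataset, and table names are hypothetical.

```python
# Hedged sketch: querying BigQuery from Python with the official
# client library. Project/dataset/table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

query = """
    SELECT country, COUNT(*) AS orders
    FROM `my_project.sales.orders`
    WHERE order_date >= '2020-01-01'
    GROUP BY country
    ORDER BY orders DESC
"""
# query() submits the job; result() blocks until rows are available.
for row in client.query(query).result():
    print(row.country, row.orders)
```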

ETL with Azure Cookbook

ETL with Azure Cookbook is a comprehensive guide to building effective and scalable ETL solutions using the Azure cloud platform. Through hands-on recipes, this book explores the features and capabilities of Azure services for data integration and transformation, guiding you in creating efficient processes for moving and handling data.

What this book will help me do:
- Master the basics and advanced techniques for building ETL processes on Azure.
- Learn practical skills in designing solutions that integrate multiple Azure services.
- Understand how to migrate existing on-premises ETL solutions to Azure successfully.
- Acquire knowledge of SQL Server and Azure Big Data Clusters for data integration.
- Gain experience in automating and optimizing data processes with BIML and Azure Databricks.

Author(s): The authors of ETL with Azure Cookbook are experienced data engineers and Azure specialists with years of expertise in designing and implementing robust data solutions. Their professional journey includes hands-on work with SQL Server, Azure services, and scalable ETL frameworks. They aim to provide practical insights and actionable guidance to help readers achieve success in data engineering projects.

Who is it for? This book is ideal for data architects, ETL developers, and IT professionals seeking to enhance their skills in data integration and transformation, particularly within the Azure ecosystem. It's suitable for individuals with some knowledge of data engineering principles, SQL, and familiarity with ETL processes who aim to adopt modern cloud-based approaches.

Summary Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.
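The "drop-in replacement" claim is concrete: existing Kafka clients work unchanged against a Red Panda broker. The sketch below uses the kafka-python client; the broker address and topic are hypothetical.

```python
# Hedged sketch of the "drop-in" property: Red Panda speaks the Kafka
# wire protocol, so a stock Kafka client works unchanged -- only the
# bootstrap address differs. Broker address and topic are hypothetical.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="redpanda.example.com:9092")
producer.send("events", b'{"user": 42, "action": "click"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="redpanda.example.com:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # just demonstrate one message
```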

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.

If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Empower Decision Makers with SAP Analytics Cloud: Modernize BI with SAP's Single Platform for Analytics

Discover the capabilities and features of SAP Analytics Cloud to draw actionable insights from a variety of data, as well as the functionality that enables you to meet typical business challenges. With this book, you will work with SAC and enable key decision makers within your enterprise to deliver crucial business decisions driven by data and key performance indicators. Along the way you’ll see how SAP has built a strong repertoire of analytics products and how SAC helps you analyze data to derive better business solutions. This book begins by covering the current trends in analytics and how SAP is re-shaping its solutions. Next, you will learn to analyze a typical business scenario and map expectations to the analytics solution, including delivery via a single platform. Further, you will see how SAC as a solution meets each of the user expectations, starting with creation of a platform for sourcing data from multiple sources, enabling self-service for a spectrum of business roles, across time zones and devices. There’s a chapter on advanced capabilities of predictive analytics and custom analytical applications. Later there are chapters explaining the security aspects and their technical features before concluding with a chapter on SAP’s roadmap for SAC. Empower Decision Makers with SAP Analytics Cloud takes a unique approach to learning SAP Analytics Cloud by resolving the typical business challenges of an enterprise. These business expectations are mapped to specific features and capabilities of SAC, while covering its technical architecture block by block.

What You Will Learn:
- Work with the features and capabilities of SAP Analytics Cloud
- Analyze the requirements of a modern decision-support system
- Use the features of SAC that make it a single platform for decision support in a modern enterprise
- See how SAC provides a secure and scalable platform hosted on the cloud

Who This Book Is For: Enterprise architects, SAP BI analytic solution architects, and developers.