talk-data.com

Topic

Cloud Computing

infrastructure saas iaas

4055 tagged

Activity Trend

471 peak/qtr (2020-Q1 to 2026-Q1)

Activities

4055 activities · Newest first

Securing Your Critical Workloads with IBM Hyper Protect Services

Many organizations must protect their mission-critical applications in production, but security threats can also surface during the development and pre-production phases. Also, during deployment and production, insiders who manage the infrastructure that hosts critical applications can pose a threat given their super-user credentials and level of access to secrets or encryption keys. Organizations must incorporate secure design practices in their development operations and embrace DevSecOps to protect their applications from the vulnerabilities and threat vectors that can compromise their data and potentially threaten their business. IBM® Cloud Hyper Protect Services provide built-in data-at-rest and data-in-flight protection to help developers easily build secure cloud applications by using a portfolio of cloud services that are powered by IBM LinuxONE. The LinuxONE platform ensures that client data is always encrypted, whether at rest or in transit. This feature gives customers complete authority over sensitive data and associated workloads (which restricts access, even for cloud admins) and helps them meet regulatory compliance requirements. LinuxONE also allows customers to build mission-critical applications that require quick time to market and dependable rapid expansion.

The purpose of this IBM Redbooks® publication is to:
Introduce the IBM Hyper Protect Services that are running on IBM LinuxONE on the IBM Cloud™ and on-premises
Provide high-level design architectures
Describe deployment best practices
Provide guides to getting started and examples of the use of the Hyper Protect Services

The target audience for this book is IBM Hyper Protect Virtual Services technical specialists, IT architects, and system administrators.

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Rob Thomas. Rob is Senior VP of IBM Cloud and Data Platform, a coach, an author, a blogger, and the writer of the Mentor newsletter. Rob has spent his whole career at IBM: he started in consulting, moved to microelectronics, and then moved to software in 2006.

Show Notes 3:26 – What are your learnings from 2020? 7:42 – What’s your vision and objectives for Cloud and Data platform?  11:44 – Discuss the rules of your organization 14:08 - What’s your view on hybrid cloud? 18:30 – What does IBM Cloud and data platform do that nobody else can do? 21:51 – What should companies be looking at to build their competitiveness? 28:18 – What was your reason for writing about “Just Show Up”? The Mentor Newsletter It Takes What It Takes    Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles.

Interview

Introduction How did you get involved in the area of data science & analytics? Can you start by giving an overview of what Iguazio is and the story of how it got started? How would you characterize your target or typical customer? What are the biggest challenges that you see around building production grade workflows for machine learning?

How does Iguazio help to address those complexities?

For customers who have already invested in the technical and organizational capacity for data science and data engineering, how does Iguazio integrate with their environments? What are the responsibilities of a data engineer throughout the different stages of the lifecycle for a machine learning application? Can you describe how the Iguazio platform is architected?

How has the design of the platform evolved since you first began working on it? How have the industry best practices around bringing machine learning to production changed?

How do you approach testing/validation of machine learning applications and releasing them to production environments? (e.g. CI/CD) Once a model is in

Did you know that there are 3 different types of data scientists? A for analyst, B for builder, and C for consultant - we discuss the key differences between each one and some learning strategies you can use to become A, B, or C.

We talked about:

Inspirations for memes
Danny's background and career journey
The ABCs of data science - the story behind the idea
Data scientist type A - Analyst
Skills, responsibilities, and background for type A
Transitioning from data analytics to type A data scientist (that's the path Danny took)
How can we become more curious?
Data scientist B - Builder
Responsibilities and background for type B
Transitioning from type A to type B
Most important skills for type B
Why you have to learn more about cloud
Data scientist type C - consultant
Skills, responsibilities, and background for type C
Growing into the C type
Ideal data science team
Important business metrics
Getting a job - easier as type A or type B?
Looking for a job without experience
Two approaches for job search: "apply everywhere" and "apply nowhere"
Are bootcamps useful?
Learning path to becoming a data scientist
Danny's data apprenticeship program and "Serious SQL" course
Why SQL is the most important skill
R vs Python
Importance of Masters and PhD

Links:

Danny's profile on LinkedIn: https://linkedin.com/in/datawithdanny Danny's course: https://datawithdanny.com/ Trailer: https://www.linkedin.com/posts/datawithdanny_datascientist-data-activity-6767988552811847680-GzUK/ Technical debt paper: https://proceedings.neurips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

Join DataTalks.Club: https://datatalks.club/slack.html

Snowflake Cookbook

The "Snowflake Cookbook" is your guide to mastering Snowflake's unique cloud-centric architecture. This book provides detailed recipes for building modern data pipelines, configuring efficient virtual warehouses, ensuring robust data protection, and optimizing cost-performance-all while leveraging Snowflake's distinctive features such as data sharing and time travel. What this Book will help me do Set up and configure Snowflake's architecture for optimized performance and cost efficiency. Design and implement robust data pipelines using SQL and Snowflake's specialized features. Secure, manage, and share data efficiently with built-in Snowflake capabilities. Apply performance tuning techniques to enhance your Snowflake implementations. Extend Snowflake's functionality with tools like Spark Connector for advanced workflows. Author(s) Hamid Mahmood Qureshi and Hammad Sharif are both seasoned experts in data warehousing and cloud computing technologies. With extensive experience implementing analytics solutions, they bring a hands-on approach to teaching Snowflake. They are ardent proponents of empowering readers towards creating effective and scalable data solutions. Who is it for? This book is perfect for data warehouse developers, data analysts, cloud architects, and anyone managing cloud data solutions. If you're familiar with basic database concepts or just stepping into Snowflake, you'll find practical guidance here to deepen your understanding and functional expertise in cloud data warehousing.

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Jeff Richardson. Jeff has a background in database and information management, and he is now the Chief Information Officer at Accelerated Enrollment Solutions. Jeff was also at Bentley Systems for 17½ years as Chief Data Officer.

Show Notes 5:41 – What does it mean to be a technology nerd? 6:53 – What technologies as a CDO or CIO are you addressing on a regular basis? 13:04 – How are you going to tackle the culture and the politics? 17:25 – Is it Cloud or Hybrid to drive the new data lake? 24:03 – What is your plan to get to the desired state? 27:04 – Does AI have a role in your new position? 31:44 – Fighting the Infodemic: what made you write this article?  Fighting the Infodemic Jeff’s podcast list Analytics on Fire Dissecting popular IT Nerds The Data Chief Data Crunch

Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Getting Started with SAS Programming

Get up and running with SAS using Ron Cody’s easy-to-follow, step-by-step guide. Aimed at beginners, Getting Started with SAS Programming: Using SAS Studio in the Cloud uses short examples to teach SAS programming from the basics to more advanced topics in the point-and-click interactive environment of SAS Studio. To begin, you will learn how to register for SAS OnDemand for Academics, an online delivery platform for teaching and learning statistical analysis that provides free access to SAS software via the cloud. The first part of the book shows you how to use SAS Studio built-in tasks to produce a report, summarize data, and create charts and graphs. It also describes how you can perform basic statistical tests using the interactive point-and-click environment. The second part of the book uses easy-to-follow examples to show you how to write your own SAS programs and how to use SAS procedures to perform a variety of tasks. This part of the book also explains how to read data from a variety of sources: text files, Excel workbooks, and CSV files. In order to get familiar with the SAS Studio environment, this book also shows you how to access dozens of interesting data sets that are included with the SAS OnDemand for Academics platform.

Summary Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Airbyte is and the story behind it? Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space? How would you characterize your target users?

How have those personas instructed the priorities and design of Airbyte? What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?

What are the complex/challenging elements of data integration that make it such a slippery problem? What was your motivation for creating open source ELT as a business? Can you describe how the Airbyte platform is implemented?

What was your motivation for choosing Java as the primary language?

incidental complexity of forcing all connectors to be packaged as containers
shortcomings of the Singer specification/motivation for creating a backwards incompatible interface
perceived potential for community adoption of Airbyte specification
tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.

information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)
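As a rough illustration of that point (this is not the Airbyte protocol; field names are hypothetical), the Python sketch below shows how a JSON Schema can carry constraints such as valid enums and numeric bounds that plain JSON types alone would lose when a record crosses the wire.

# Sketch: preserving field constraints for records exchanged as JSON by validating
# them against a JSON Schema. Not the Airbyte protocol; names are hypothetical.
import json
from jsonschema import ValidationError, validate

record_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer", "minimum": 1},
        "status": {"type": "string", "enum": ["pending", "shipped", "delivered"]},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "status"],
}

raw = '{"order_id": 42, "status": "shipped", "amount": 19.99}'
record = json.loads(raw)

try:
    # The schema enforces the enum and bounds that bare JSON types cannot express.
    validate(instance=record, schema=record_schema)
    print("record ok:", record)
except ValidationError as err:
    print("constraint violated:", err.message)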

interfaces/extension points for integrating with other tools, e.g. Dagster
abstraction layers for simplifying implementation of new connectors
tradeoffs of storing all connectors in a monorepo with the Airbyte core

impact of community adoption/contributions

What is involved in setting up an Airbyte installation? What are the available axes for scaling an Airbyte deployment? What are the challenges of setting up and maintaining a CI environment for Airbyte? How are you managing governance and long term sustainability of the project? What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used? What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte? When is Airbyte the wrong choice? What do you have planned for the future of the project?

Contact Info

Michel

LinkedIn @MichelTricot on Twitter michel-tricot on GitHub

John

LinkedIn @JeanLafleur on Twitter johnlafleur on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Airbyte Liveramp Fivetran

Podcast Episode

Stitch Data Matillion DataCoral

Podcast Episode

Singer Meltano

Podcast Episode

Airflow

Podcast.init Episode

Kotlin Docker Monorepo Airbyte Specification Great Expectations

Podcast Episode

Dagster

Data Engineering Podcast Episode Podcast.init Episode

Prefect

Podcast Episode

DBT

Podcast Episode

Kubernetes Snowflake

Podcast Episode

Redshift Presto Spark Parquet

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Building Custom Tasks for SQL Server Integration Services: The Power of .NET for ETL for SQL Server 2019 and Beyond

Build custom SQL Server Integration Services (SSIS) tasks using Visual Studio Community Edition and C#. Bring all the power of Microsoft .NET to bear on your data integration and ETL processes, and for no added cost over what you’ve already spent on licensing SQL Server. New in this edition is a demonstration deploying a custom SSIS task to the Azure Data Factory (ADF) Azure-SSIS Integration Runtime (IR). All examples in this new edition are implemented in C#. Custom task developers are shown how to implement custom tasks using the widely accepted and default language for .NET development. Why are custom components necessary? Because even though the SSIS catalog of built-in tasks and components is a marvel of engineering, gaps remain in the available functionality. One such gap is a constraint of the built-in SSIS Execute Package Task, which does not allow SSIS developers to select SSIS packages from other projects in the SSIS Catalog. Examples in this book show how to create a custom Execute Catalog Package task that allows SSIS developers to execute tasks from other projects in the SSIS Catalog. Building on the examples and patterns in this book, SSIS developers may create any task to which they aspire, custom tailored to their specific data integration and ETL needs.

What You Will Learn
Configure and execute Visual Studio in the way that best supports SSIS task development
Create a class library as the basis for an SSIS task, and reference the needed SSIS assemblies
Properly sign assemblies that you create in order to invoke them from your task
Implement source code control via Azure DevOps, or your own favorite tool set
Troubleshoot and execute custom tasks as part of your own projects
Create deployment projects (MSIs) for distributing code-complete tasks
Deploy custom tasks to Azure Data Factory Azure-SSIS IRs in the cloud
Create advanced editors for custom task parameters

Who This Book Is For
For database administrators and developers who are involved in ETL projects built around SQL Server Integration Services (SSIS). Readers do not need a background in software development with C#. Most important is a desire to optimize ETL efforts by creating custom-tailored tasks for execution in SSIS packages, on-premises or in ADF Azure-SSIS IRs.

Summary Every business aims to be data driven, but not all of them succeed in that effort. In order to be able to truly derive insights from the data that an organization collects, there are certain foundational capabilities that they need to have capacity for. In order to help more businesses build those foundations, Tarush Aggarwal created 5xData, offering collaborative workshops to assist in setting up the technical and organizational systems that are necessary to succeed. In this episode he shares his thoughts on the core elements that are necessary for every business to be data driven, how he is helping companies incorporate those capabilities into their structure, and the ongoing support that he is providing through a network of mastermind groups. This is a great conversation about the initial steps that every group should be thinking of as they start down the road to making data informed decisions.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about his mission at 5xData to teach companies how to build solid foundations for their data capabilities

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what you are building at 5xData and the story behind it?
impact of industry on challenges in becoming data driven
profile of companies that you are trying to work with
common mistakes when designing data platform
misconceptions that the business has around how to invest in data
challenges in attracting/interviewing/hiring data talent
What are the core components that you have standardized on for building the foundational layers of t

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Kristen Summers and John Thomas. Kristen is a Distinguished Engineer in Cloud and Cognitive Expert Labs; she has worked in artificial intelligence and data science, holds a PhD in Computer Science, and leads Data Science within our Expert Labs. John is a Distinguished Engineer in Data and Expert Labs, where he leads services that help clients establish the AI factory.

Show Notes 3:24 – What is the AI academy and how does it all fit together? 4:34 – AI Ladder and AI Maturity 8:32 – How does the AI Factory make it easier to accomplish the AI Ladder? 12:00 – Why does your team do it better? 17:03 – How do you know your data is ready? 21:22 – What is the most practical use case? 23:02 – What does it really mean to infuse AI? 25:15 – Definition of AI maturity curve 28:25 – How do you know it’s trustworthy? 29:14 – What is the most important lesson you’ve learned with AI, and what is AI not very good at? In the Dream House  Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Data Pipelines Pocket Reference

Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack. You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:
What a data pipeline is and how it works
How data is moved and processed on modern data infrastructure, including cloud platforms
Common tools and products used by data engineers to build pipelines
How pipelines support analytics and reporting needs
Considerations for pipeline maintenance, testing, and alerting
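To make the batch ingestion idea concrete, here is a minimal sketch of a single batch extract-transform-load step in Python. The source CSV, the transformation, and the SQLite destination are hypothetical placeholders rather than an example from the book; real pipelines usually target a cloud warehouse and run under an orchestrator such as Airflow.

# Minimal sketch of one batch extract-transform-load step.
# File, column, and table names are hypothetical placeholders.
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and reshape the rows so they are useful downstream.
    return [
        {"order_id": int(r["order_id"]), "amount_usd": round(float(r["amount"]), 2)}
        for r in rows
        if r.get("amount")  # drop rows with no amount
    ]

def load(rows, db_path="warehouse.db"):
    # Load: write the cleaned rows into the destination table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount_usd REAL)")
    con.executemany(
        "INSERT INTO orders (order_id, amount_usd) VALUES (:order_id, :amount_usd)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))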

podcast_episode
by Sean Hewitt (Eckerson Group), Joe Hilleary (Eckerson Group), Dave Wells (Eckerson Group), Kevin Petrie (Eckerson Group), and Andrew Sohn (Crawford & Company)

Every December, Eckerson Group fulfills its industry obligation to summon its collective knowledge and insights about data and analytics and speculate about what might happen in the coming year. The diversity of predictions from our research analysts and consultants exemplifies the breadth of their research and consulting experiences and the depth of their thinking. Predictions from Kevin Petrie, Joe Hilleary, Dave Wells, Andrew Sohn, and Sean Hewitt range from data and privacy governance to artificial intelligence with stops along the way for DataOps, data observability, data ethics, cloud platforms, and intelligent robotic automation.

Summary With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how the structured the project to allow for multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring. Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what the Shopify platform is? What kinds of data sources are you working with?

Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage? How have you structured your data teams to be able to deliver those projects?

What are the systems that you have in place, technological or otherwise, to allow you to support the needs of

We covered:

What is MLOps
The difference between MLOps and ML Engineering
Getting into MLOps
Kubeflow and its components, ML Platforms
Learning Kubeflow
DataOps

And other things

Links:

Microsoft MLOps maturity model: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/mlops/mlops-maturity-model Google MLOps maturity levels: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning MLOps roadmap 2020-2025: https://github.com/cdfoundation/sig-mlops/blob/master/roadmap/2020/MLOpsRoadmap2020.md Kubeflow website: https://www.kubeflow.org/ TFX Paper: https://research.google/pubs/pub46484/

Join DataTalks.Club: https://datatalks.club​​

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Suj Perepa and Richard Darden. Suj is a distinguished engineer who specializes in AI and machine learning technology in the financial sector. Suj has also been a lead for security programs and a member of the IBM Academy of Technology leadership team. Richard is a distinguished engineer in digital human evangelism for North America government, a distinguished engineer at IBM in Cloud and Cognitive for the public sector, and a former chief architect for federal government agencies.

Show Notes 6:30 – What is a data framework? 9:16 – What is framework? 11:00 – Governance and people 14:15 – Process and architecture 19:15 – Bringing it all into a playbook   22:56 – Who is the client? 24:36 – How does bias play into it? 27:02 – What products are you using? 27:54 – Explainability 30:44 – Ethics of AI 37:13 Suj and Richard’s most important lessons in AI The Discipline of Technology    Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second, the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds, and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring. Your host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable and customizable monitoring-as-a-service purpose built for cloud-native applications.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you are building at Chronosphere and your motivation for turning it into a business? What are the

MATLAB Recipes: A Problem-Solution Approach

Learn from state-of-the-art examples in robotics, motors, detection filters, chemical processes, aircraft, and spacecraft. With this book you will review contemporary MATLAB coding including the latest MATLAB language features and use MATLAB as a software development environment including code organization, GUI development, and algorithm design and testing. Features now covered include the new graph and digraph classes for charts and networks; interactive documents that combine text, code, and output; a new development environment for building apps; locally defined functions in scripts; automatic expansion of dimensions; tall arrays for big data; the new string type; new functions to encode/decode JSON; handling non-English languages; the new class architecture; the Mocking framework; an engine API for Java; the cloud-based MATLAB desktop; the memoize function; and heatmap charts. MATLAB Recipes: A Problem-Solution Approach, Second Edition provides practical, hands-on code snippets and guidance for using MATLAB to build a body of code you can turn to time and again for solving technical problems in your work. Develop algorithms, test them, visualize the results, and pass the code along to others to create a functional code base for your firm.

What You Will Learn
Get up to date with the latest MATLAB up to and including MATLAB 2020b
Code in MATLAB
Write applications in MATLAB
Build your own toolbox of MATLAB code to increase your efficiency and effectiveness

Who This Book Is For
Engineers, data scientists, and students wanting a book rich in examples using MATLAB.

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Kristen Summers, who is a Distinguished Engineer in Cloud and Cognitive Expert Labs. Kristen has worked in artificial intelligence and data science, holds a PhD in Computer Science, and leads Data Science within our Expert Labs.

Show Notes 2:08 - More time needs to be spent on culture and talent management. 3:55 - What does data driven culture mean? 8:49 – What do you see driving fundamental culture? 11:14 - What common tool do we have? 12:55 – What to communicate about data? 14:42 – How do you know you’re doing it well? 17:29 - How do you define AI talent? 23:18 - Describe a Data Scientist? 27:25 - Common Organizational Structures  31:49 - How do you manage and grow AI talent? IBM Skills Academy     Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

IBM Integrated Synchronization: Incremental Updates Unleashed

The IBM® Db2® Analytics Accelerator (Accelerator) is a logical extension of Db2 for IBM z/OS® that provides a high-speed query engine that efficiently and cost-effectively runs analytics workloads. The Accelerator is an integrated back-end component of Db2 for z/OS. Together, they provide a hybrid workload-optimized database management system that seamlessly manages queries that are found in transactional workloads to Db2 for z/OS and queries that are found in analytics applications to Accelerator. Each query runs in its optimal environment for maximum speed and cost efficiency. The incremental update function of Db2 Analytics Accelerator for z/OS updates Accelerator-shadow tables continually. Changes to the data in original Db2 for z/OS tables are propagated to the corresponding target tables with a high frequency and a brief delay. Query results from Accelerator are always extracted from recent, close-to-real-time data. An incremental update capability that is called IBM InfoSphere® Change Data Capture (InfoSphere CDC) is provided by IBM InfoSphere Data Replication for z/OS up to Db2 Analytics Accelerator V7.5. Since then, a new replication protocol between Db2 for z/OS and Accelerator, called IBM Integrated Synchronization, was introduced. With Db2 Analytics Accelerator V7.5, customers can choose which one to use. IBM Integrated Synchronization is a built-in product feature that you use to set up incremental updates. It does not require InfoSphere CDC, which is bundled with IBM Db2 Analytics Accelerator.

In addition, IBM Integrated Synchronization has more advantages:
Simplified administration, packaging, upgrades, and support. These items are managed as part of the Db2 for z/OS maintenance stream.
Updates are processed quickly.
Reduced CPU consumption on the mainframe due to a streamlined, optimized design where most of the processing is done on the Accelerator. This situation provides reduced latency.
Uses IBM Z® Integrated Information Processor (zIIP) on Db2 for z/OS, which leads to reduced CPU costs on IBM Z and better overall performance data, such as throughput and synchronized rows per second. On z/OS, the workload to capture the table changes was reduced, and the remainder can be handled by zIIPs.

With the introduction of an enterprise-grade Hybrid Transactional Analytics Processing (HTAP) enabler that is also known as the Wait for Data protocol, the integrated low latency protocol is now enabled to support more analytical queries running against the latest committed data. IBM Db2 for z/OS Data Gate simplifies delivering data from IBM Db2 for z/OS to IBM Cloud® Pak® for Data for direct access by new applications. It uses the special-purpose integrated synchronization protocol to maintain data currency with low latency between Db2 for z/OS and dedicated target databases on IBM Cloud Pak for Data.