talk-data.com

Topic

Cloud Computing

infrastructure saas iaas

4055 tagged

Activity Trend: 471 peak/qtr, 2020-Q1 to 2026-Q1

Activities

4055 activities · Newest first

Information Privacy Engineering and Privacy by Design: Understanding Privacy Threats, Technology, and Regulations Based on Standards and Best Practices

The Comprehensive Guide to Engineering and Implementing Privacy Best Practices

As systems grow more complex and cybersecurity attacks more relentless, safeguarding privacy is ever more challenging. Organizations are increasingly responding in two ways, and both are mandated by key standards such as GDPR and ISO/IEC 27701:2019. The first approach, privacy by design, aims to embed privacy throughout the design and architecture of IT systems and business practices. The second, privacy engineering, encompasses the technical capabilities and management processes needed to implement, deploy, and operate privacy features and controls in working systems. In Information Privacy Engineering and Privacy by Design, internationally renowned IT consultant and author William Stallings brings together the comprehensive knowledge privacy executives and engineers need to apply both approaches. Using the techniques he presents, IT leaders and technical professionals can systematically anticipate and respond to a wide spectrum of privacy requirements, threats, and vulnerabilities, addressing regulations, contractual commitments, organizational policies, and the expectations of their key stakeholders.

• Review privacy-related essentials of information security and cryptography
• Understand the concepts of privacy by design and privacy engineering
• Use modern system access controls and security countermeasures to partially satisfy privacy requirements
• Enforce database privacy via anonymization and de-identification
• Prevent data losses and breaches
• Address privacy issues related to cloud computing and IoT
• Establish effective information privacy management, from governance and culture to audits and impact assessment
• Respond to key privacy rules including GDPR, U.S. federal law, and the California Consumer Privacy Act

This guide will be an indispensable resource for anyone with privacy responsibilities in any organization, and for all students studying the privacy aspects of cybersecurity.

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Our guest this week is Scott Hebner, VP and CMO for IBM Data and AI. This episode revolves around the marketing industry and tactics employed by those in the field. Scott walks us through his perspectives and insights.

Connect with Scott LinkedIn Twitter

Show Notes 03:38 - Unsure what the cloud business is all about? Take a look at IBM's complete guide here. 11:30 - Click here to learn about data silos and why they will harm your business. 13:36 - Learn about how big data and AI work together in this article. 28:44 - Find out more about data lakes and data swamps here.

Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Mark Simmonds - LinkedIn. Host Al Martin - LinkedIn and Twitter.

The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?

How does it compare to the other available platforms for data warehousing? How does it differ from traditional data warehouses?

How does the performance and flexibility affect the data modeling requirements?

Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces? Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?

What are some of the current limitations that you are struggling with?

For someone getting started with Snowflake what is involved with loading data into the platform?

What is their workflow for allocating and scaling compute capacity and running analyses?

One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen? What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about? When is SnowflakeDB the wrong choice? What are some of the plans for the future of SnowflakeDB?

Contact Info

LinkedIn Website @KentGraziano on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

SnowflakeDB

Free Trial Stack Overflow

Data Warehouse Oracle DB MPP == Massively Parallel Processing Shared Nothing Architecture Multi-Cluster Shared Data Architecture Google BigQuery AWS Redshift AWS Redshift Spectrum Presto

Podcast Episode

SnowflakeDB Semi-Structured Data Types Hive ACID == Atomicity, Consistency, Isolation, Durability 3rd Normal Form Data Vault Modeling Dimensional Modeling JSON AVRO Parquet SnowflakeDB Virtual Warehouses CRM == Customer Relationship Management Master Data Management

Podcast Episode

FoundationDB

Podcast Episode

Apache Spark

Podcast Episode

SSIS == SQL Server Integration Services Talend Informatica Fivetran

Podcast Episode

Matillion Apache Kafka Snowpipe Snowflake Data Exchange OLTP == Online Transaction Processing GeoJSON Snowflake Documentation SnowAlert Splunk Data Catalog

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract This week on Making Data Simple, our guest is Adam Kocoloski, VP & CTO of Cloud Databases and IBM Fellow. Adam shares his insight into current trends in the cloud and database markets, while host Al Martin gives insight on how to raise 3 daughters! Tune in for a high-level, yet candid discussion.

Connect with Adam LinkedIn Twitter IBM Fellow FoundationDB

Show Notes 05:50 - Check out this article on the importance of good communication. 09:59 - Read PCMag.com's definition of cloud computing here. 18:54 - Check out the different cloud database offerings from IBM. 27:00 - Here are some tips to becoming a more effective learner.

Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Mark Simmonds - LinkedIn. Host Al Martin - LinkedIn and Twitter.

The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

SQL Server Big Data Clusters: Early First Edition Based on Release Candidate 1

Get a head-start on learning one of SQL Server 2019’s latest and most impactful features—Big Data Clusters—that combines large volumes of non-relational data for analysis along with data stored relationally inside a SQL Server database. This book provides a first look at Big Data Clusters based upon SQL Server 2019 Release Candidate 1. Start now and get a jump on your competition in learning this important new feature. Big Data Clusters is a feature set covering data virtualization, distributed computing, and relational databases and provides a complete AI platform across the entire cluster environment. This book shows you how to deploy, manage, and use Big Data Clusters. For example, you will learn how to combine data stored on the HDFS file system together with data stored inside the SQL Server instances that make up the Big Data Cluster. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019 using Release Candidate 1. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL—taking advantage of skills you have honed for years—and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis. 
What You Will Learn
• Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
• Analyze large volumes of data directly from SQL Server and/or Apache Spark
• Manage data stored in HDFS from SQL Server as if it were relational data
• Implement advanced analytics solutions through machine learning and AI
• Expose different data sources as a single logical source using data virtualization

Who This Book Is For
For data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environment.

SAP Landscape Management 3.0 and IBM Power Systems Servers

This IBM® Redpaper publication is part of a series of technical documentation that helps enable SAP on Linux for IBM Power Systems servers and IBM System Storage™ servers. This book describes how clients can use SAP Landscape Management (SAP LaMa) 3.0 software to gain full visibility and control over their SAP and non-SAP systems, including the underlying physical, virtual, and cloud infrastructures. With SAP LaMa, you can automate repetitive tasks to manage critical applications across complex, hybrid IT landscapes. This publication helps you to better control IT costs and increase business agility, for example, by freeing staff to focus on more strategic work rather than manual, error-prone tasks. The target audience of this book is architects, IT specialists, and systems administrators deploying SAP LaMa 3.0, who often spend much time and effort managing and provisioning SAP software systems and landscapes.

Google BigQuery: The Definitive Guide

Work with petabyte-scale datasets while building a collaborative, agile workplace in the process. This practical book is the canonical reference to Google BigQuery, the query engine that lets you conduct interactive analysis of large datasets. BigQuery enables enterprises to efficiently store, query, ingest, and learn from their data in a convenient framework. With this book, you’ll examine how to analyze data at scale to derive insights from large datasets efficiently. Valliappa Lakshmanan, tech lead for Google Cloud Platform, and Jordan Tigani, engineering director for the BigQuery team, provide best practices for modern data warehousing within an autoscaled, serverless public cloud. Whether you want to explore parts of BigQuery you’re not familiar with or prefer to focus on specific tasks, this reference is indispensable.

Summary The scale and complexity of the systems that we build to satisfy business requirements are increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you mean by the term "Data Orchestration"?

How does it compare to the concept of "Data Virtualization"? What are some of the tools and platforms that fit under that umbrella?

What are some of the motivations for organizations to use the cloud for their data oriented workloads?

What are they giving up by using cloud resources in place of on-premises compute?

For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments? What are some of the common patterns for cloud migration projects and what challenges do they present?

Do you have advice on useful metrics to track for determining project completion or success criteria?

How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals? Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?

What are some of the common pain points that organizations encounter when working on hybrid implementations?

What are some of the missing pieces in the data orchestration landscape?

Are there any efforts that you are aware of that are aiming to fill those gaps?

Where is the data orchestration market heading, and what are some industry trends that are driving it?

What projects are you most interested in or excited by?

For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info

LinkedIn @dborkar on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Links

Alluxio

Podcast Episode

UC San Diego Couchbase Presto

Podcast Episode

Spark SQL Data Orchestration Data Virtualization PyTorch

Podcast.init Episode

Rook storage orchestration PySpark MinIO

Podcast Episode

Kubernetes Openstack Hadoop HDFS Parquet Files

Podcast Episode

ORC Files Hive Metastore Iceberg Table Format

Podcast Episode

Data Orchestration Summit Star Schema Snowflake Schema Data Warehouse Data Lake Teradata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract This week on Making Data Simple, our guest is Arin Bhowmick, vice president and Chief Design Officer for IBM Cloud, Data and AI. Host Al Martin strikes up a conversation on what defines good design, why user experience is critical to product development, along with tips for good leadership and company culture. Tune in for a high-value discussion.

Connect with Arin LinkedIn Twitter Medium

Show Notes 04:17 - Check out this Medium article on the importance of design. 09:07 - Click here to look at IBM's page for design thinking. 24:35 - Learn more on human-computer interaction here. 27:14 - Is A.I. scary? Read how Watson answered this question here.

Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Host Al Martin - LinkedIn and Twitter.

The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

IBM z15 Technical Introduction

This IBM® Redbooks® publication introduces the latest member of the IBM Z® platform, the IBM z15™ (machine type 8561). It includes information about the Z environment and how it helps integrate data and transactions more securely. It also provides insight for faster and more accurate business decisions. The z15 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z15 is designed for enhanced modularity in an industry-standard footprint. The z15 system excels at the following tasks:

• Using multicloud integration services
• Securing data with pervasive encryption
• Providing resilience with key to zero downtime
• Transforming a transactional platform into a data powerhouse
• Getting more out of the platform with IT Operational Analytics
• Accelerating digital transformation with agile service delivery
• Revolutionizing business processes
• Blending open source and Z technologies

This book explains how this system uses new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and open source technologies. With the z15 as the base, applications can run in a trusted, reliable, and secure environment that improves operations and lessens business risk.

Summary The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Rockset is and your motivation for creating it?

What are some of the use cases that it enables which would otherwise be impractical or intractable?

How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace? Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers? Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset? How is your storage backend implemented to allow for speed and flexibility in the query layer?

How does it manage distribution, balancing, and durability of the data? What are your strategies for handling node and region failure in the cloud?

You have a whitepaper describing your ar

IBM Storage for Red Hat OpenShift Blueprint Version 1 Release 1

IBM Storage for Red Hat OpenShift is a comprehensive container-ready solution that includes all the hardware & software components necessary to set up and/or expand your Red Hat OpenShift environment. This blueprint includes Red Hat OpenShift Container Platform and uses Container Storage Interface (CSI) standards. IBM Storage brings enterprise data services to containers. In this blueprint, learn how to:

• Combine the benefits of IBM Systems with the performance of IBM Storage solutions so that you can deliver the right services to your clients today!
• Build a 24 by 7 by 365 enterprise class private cloud with Red Hat OpenShift Container Platform utilizing new open source Container Storage Interface (CSI) drivers
• Leverage enterprise class services such as NVMe based flash performance, high data availability, and advanced container security

IBM Storage for Red Hat OpenShift Container Platform is designed for your DevOps environment for on-premises deployment with easy-to-consume components built to perform and scale for your enterprise. Simplify your journey to cloud with pre-tested and validated blueprints engineered to enable rapid deployment and peace of mind as you move to a hybrid multicloud environment. You now have the capabilities.

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract This week, our guest is Debra Jenson, director of digital technical engagement for IBM Hybrid Cloud. Host Al Martin and Debra discuss churn rates, client retention, buying methods and more. How do you acquire and keep clients in such a competitive market? Get all the details in this episode of Making Data Simple.

Connect with Deb LinkedIn Twitter

Show Notes 06:54 - Check out this article on the role of the user experience in product development. 11:51 - Try before you buy is reshaping how we purchase products. 20:38 - Check out IBM's demos here.

Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Host Al Martin - LinkedIn and Twitter.

The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

IBM Spectrum Discover: Metadata Management for Deep Insight of Unstructured Storage

This IBM® Redpaper publication provides a comprehensive overview of the IBM Spectrum® Discover metadata management software platform. We give a detailed explanation of how the product creates, collects, and analyzes metadata. Several in-depth use cases show examples of analytics, governance, and optimization. We also provide step-by-step information to install and set up the IBM Spectrum Discover trial environment. More than 80% of all data that is collected by organizations is not in a standard relational database. Instead, it is trapped in unstructured documents, social media posts, machine logs, and so on. Many organizations face significant challenges in managing this deluge of unstructured data, such as:

• Pinpointing and activating relevant data for large-scale analytics
• Lacking the fine-grained visibility that is needed to map data to business priorities
• Removing redundant, obsolete, and trivial (ROT) data
• Identifying and classifying sensitive data

IBM Spectrum Discover is modern metadata management software that provides data insight for petabyte-scale file and object storage, on premises and in the cloud. This software enables organizations to make better business decisions and gain and maintain a competitive advantage. IBM Spectrum Discover provides a rich metadata layer that enables storage administrators, data stewards, and data scientists to efficiently manage, classify, and gain insights from massive amounts of unstructured data. It improves storage economics, helps mitigate risk, and accelerates large-scale analytics to create competitive advantage and speed critical research.

IBM Storage for Red Hat OpenShift Container Platform V3.11 Blueprint Version 1 Release 1

IBM Storage for Red Hat OpenShift Container Platform is a comprehensive container-ready solution that includes all the hardware & software components necessary to set up and/or expand your Red Hat OpenShift Container Platform V3.11 environment. IBM Storage brings enterprise data services to containers. In this blueprint, learn how to:

• Combine the benefits of IBM Systems with the performance of IBM Storage solutions so that you can deliver the right services to your clients today!
• Build a 24 by 7 by 365 enterprise class private cloud with Red Hat OpenShift Container Platform
• Leverage enterprise class services such as NVMe based flash performance, high data availability, and advanced container security

IBM Storage for Red Hat OpenShift Container Platform is designed for your DevOps environment for on-premises deployment with easy-to-consume components built to perform and scale for your enterprise. Simplify your journey to cloud with pre-tested and validated blueprints engineered to enable rapid deployment and peace of mind as you move to a hybrid multicloud environment. You now have the capabilities.

SAS for R Users

BRIDGES THE GAP BETWEEN SAS AND R, ALLOWING USERS TRAINED IN ONE LANGUAGE TO EASILY LEARN THE OTHER. SAS and R are widely used, very different software environments. Prized for its statistical and graphical tools, R is an open-source programming language that is popular with statisticians and data miners who develop statistical software and analyze data. SAS (Statistical Analysis System) is the leading corporate software in analytics thanks to its faster data handling and smaller learning curve. SAS for R Users enables entry-level data scientists to take advantage of the best aspects of both tools by providing a cross-functional framework for users who already know R but may need to work with SAS. Those with knowledge of both R and SAS are of far greater value to employers, particularly in corporate settings. Using a clear, step-by-step approach, this book presents an analytics workflow that mirrors that of the everyday data scientist. This up-to-date guide is compatible with the latest R packages as well as SAS University Edition. Useful for anyone seeking employment in data science, this book: instructs both practitioners and students fluent in one language seeking to learn the other; provides command-by-command translations of R to SAS and SAS to R; offers examples and applications in both R and SAS; presents step-by-step guidance on workflows, color illustrations, sample code, chapter quizzes, and more; and includes sections on advanced methods and applications. Designed for professionals, researchers, and students, SAS for R Users is a valuable resource for those with some knowledge of coding and basic statistics who wish to enter the realm of data science and business analytics. AJAY OHRI is the founder of analytics startup Decisionstats.com. His research interests include spreading open source analytics, analyzing social media manipulation with mechanism design, simpler interfaces to cloud computing, investigating climate change, and knowledge flows.
He currently advises startups in analytics offshoring, analytics services, and analytics. He is the author of Python for R Users: A Data Science Approach (Wiley), R for Business Analytics, and R for Cloud Computing.

Summary

Object storage is quickly becoming the unifying layer for data-intensive applications and analytics. Modern, cloud-oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de facto API for interacting with this service, so the team at MinIO have built a production-grade, easy-to-manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.

Interview

Introduction

How did you get involved in the area of data management?

Can you explain what MinIO is and its origin story?

What are some of the main use cases that MinIO enables?

How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?

Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)

What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?

What are the constraints and opportunities that are provided by adhering to that API?

Can you describe how MinIO is implemented and the overall system design?

How has that design evolved since you first began working on it?

What assumptions did you have at the outset and how have they been challenged or updated?

What are the axes for scaling that MinIO provides and how does it handle clustering?

Where does it fall on the axes of availability and consistency in the CAP theorem?

One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume?

For someone who is interested in running MinIO, what is involved in deploying and maintain

IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences

This IBM® Redpaper publication provides an update to the original description of IBM Reference Architecture for Genomics. This paper expands the reference architecture to cover all of the major vertical areas of healthcare and life sciences industries, such as genomics, imaging, and clinical and translational research. The architecture was renamed IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences to reflect the fact that it incorporates key building blocks for high-performance computing (HPC) and software-defined storage, and that it supports an expanding infrastructure of leading industry partners, platforms, and frameworks. The reference architecture defines a highly flexible, scalable, and cost-effective platform for accessing, managing, storing, sharing, integrating, and analyzing big data, which can be deployed on-premises, in the cloud, or as a hybrid of the two. IT organizations can use the reference architecture as a high-level guide for overcoming data management challenges and processing bottlenecks that are frequently encountered in personalized healthcare initiatives, and in compute-intensive and data-intensive biomedical workloads. This reference architecture also provides a framework and context for modern healthcare and life sciences institutions to adopt cutting-edge technologies, such as cognitive life sciences solutions, machine learning and deep learning, Spark for analytics, and cloud computing. To illustrate these points, this paper includes case studies describing how clients and IBM Business Partners alike used the reference architecture in the deployments of demanding infrastructures for precision medicine. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing life sciences solutions and support.

Practical Data Science with Python 3: Synthesizing Actionable Insights from Data

Gain insight into essential data science skills in a holistic manner using data engineering and associated scalable computational methods. This book covers the most popular Python 3 frameworks for both local and distributed (on-premises and cloud-based) processing. Along the way, you will be introduced to many popular open-source frameworks, such as SciPy, scikit-learn, Numba, and Apache Spark. The book is structured around examples, so you will grasp core concepts via case studies and Python 3 code. As data science projects get continuously larger and more complex, software engineering knowledge and experience are crucial to produce evolvable solutions. You'll see how to create maintainable software for data science and how to document data engineering practices. This book is a good starting point for people who want to gain practical skills to perform data science. All the code will be available in the form of IPython notebooks and Python 3 programs, which allow you to reproduce all analyses from the book and customize them for your own purposes. You'll also benefit from advanced topics like Machine Learning, Recommender Systems, and Security in Data Science. Practical Data Science with Python will empower you to analyze data, formulate proper questions, and produce actionable insights, three core stages in most data science endeavors. What You'll Learn • Play the role of a data scientist when completing increasingly challenging exercises using Python 3 • Work with proven data science techniques/technologies • Review scalable software engineering practices to ramp up data analysis abilities in the realm of Big Data • Apply theory of probability, statistical inference, and algebra to understand the data science practices Who This Book Is For Anyone who would like to enter the realm of data science using Python 3.

Simplify Management of IT Security and Compliance with IBM PowerSC in Cloud and Virtualized Environments

This IBM® Redbooks® publication provides a security and compliance solution that is optimized for virtualized environments on IBM Power Systems™ servers, running IBM PowerVM® and IBM AIX®. Security control and compliance are some of the key components that are needed to defend the virtualized data center and cloud infrastructure against ever-evolving new threats. The IBM business-driven approach to enterprise security that is used with solutions, such as IBM PowerSC™, makes IBM the premier security vendor in the market today. The book explores, tests, and documents scenarios using IBM PowerSC that leverage the IBM Power Systems server architecture and software solutions from IBM to help defend the virtualized data center and cloud infrastructure against ever-evolving new threats. This publication helps IT and Security managers, architects, and consultants to strengthen their security and compliance posture in a virtualized environment running IBM PowerVM.