talk-data.com

Topic

Big Data

data_processing analytics large_datasets

1217 activities tagged

Activity Trend

Peak of 28 activities per quarter, 2020-Q1 through 2026-Q1

Activities

1217 activities · Newest first

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University. He builds tools that make data science easier and faster, including the famous tidyverse packages for the R programming language. He was named a Fellow by the American Statistical Association for "pivotal contributions to statistical practice through innovative and pioneering research in statistical graphics and computing".

Show Notes
2:39 – Hadley talks about his journey
5:22 – Hadley talks about his American Statistical Association Fellowship for "pivotal contributions to statistical practice"
8:00 – The tidy data concept
9:02 – How Hadley became interested in big data and R
10:12 – Python and R
12:30 – What Hadley is doing now
13:47 – Top 3 packages that help data scientists
17:47 – Hadley discusses his book
22:48 – Writing a book vs. writing code
29:40 – What language is going to take over
31:01 – What’s next for data
31:54 – What’s cool for Hadley
36:26 – Hadley’s role model

Hadley Wickham’s books: ggplot2, R for Data Science, Advanced R, R Packages

Summary The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy-to-use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Meltano is and the story behind it?
Who is the target audience?

How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?

What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?

What are the most painful transitions in that lifecycle and how does that pain manifest?

How and why has the focus of the project shifted from its original vision?
With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?

What are the main elements of

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Ross Mauri, General Manager of IBM Z & LinuxONE, IBM Systems. Ross has expertise in strategy, technology, engineering, marketing, and sales. This week we discuss his career at IBM, the people and CEOs of IBM, Mainframes, Databases, Cloud, and Datastores, the myth of Z, and IBM’s z15.

Show Notes
3:43 – Ross Mauri’s career
8:22 – People of IBM and CEOs
12:11 – Working with CEOs
14:34 – Ross talks legacy
17:20 – What does the mainframe do that no one else can?
18:28 – The Z myth
21:22 – LinuxONE
24:29 – Red Hat
25:15 – What does Z not do well?
26:00 – z15

Ross Mauri - LinkedIn

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Summary There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Lenses is and the story behind it?
What is your working definition for what constitutes DataOps?

How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?

What are the typical barriers to collaboration, and how does Lenses help with that?

Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
What are the main challenges that you see engineers facing when working with s

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Jim Ruston, Managing Director of Armeta Analytics. Al and Jim discuss monetizing data, prescriptive and descriptive approaches, and sealing the deal.

Show Notes
4:45 – How do you monetize data?
6:20 – Common actions
8:50 – Prescriptive approach
11:15 – Gap in the data warehouse
13:05 – Cleanup
17:40 – Overhead costs
19:22 – Prescriptive and descriptive approaches
20:56 – Preferred technology
23:07 – Who are the decision makers?
24:56 – Sealing the deal
27:10 – Why do I need Armeta?

Armeta | Armeta LinkedIn | Guaranteed Analytics | The Challenger Sale

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users and more than 100 million PS4 console sales, along with thousands of game development partners across the globe, big data problems are inevitable. This presentation talks about how we scaled Airflow horizontally, which has helped us build a stable, scalable, and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. To meet the demand for processing large volumes of data, and the organization’s growing data analytics and usage demands, the data team at PlayStation took the initiative to build an open source big data processing infrastructure with Apache Spark in Python as the core ETL engine. Apache Airflow is the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance to support a parallelism of 16 with 1 scheduler and 1 worker, and eventually scaled it to a bigger scheduler along with 4 workers to support a parallelism of 96, a DAG concurrency of 96, and a worker task concurrency of 24. Containerizing all the services on AWS ECS gave us the ability to scale Airflow horizontally.

Summary We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges they face in the collection and labelling of high-quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high-quality data from collection to analysis.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Audio Analytic?

What was your motivation for building an AI platform for sound recognition?

What are some of the ways that your platform is being used?
What are the unique challenges that you have faced in working with arbitrary sound data?
How do you handle the collection and labelling of the source data that you rely on for building your models?

Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?

Challenges of building an embeddable AI model
Update cycle
Difficulty of identifying relevant audio and dealing with literal noise in the input data
Rights and ownership challenges in the collection of source data
What was your design process for constructing a pipeline for the audio data that you need to process?
Can you describe how your overall data management system is

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Peter Wang, Co-Founder and CEO of Anaconda, and Shadi Copty, VP of Offering Management. Al, Peter, and Shadi discuss data science and IBM’s partnership with Anaconda.

Show Notes
6:11 – Corporate Mission
8:00 – Use Case
9:20 – IBM and Anaconda partnership
14:04 – Cloud Pak for Data: what is it?
15:43 – Python vs. R
17:15 – Anaconda’s Future
23:25 – Shadi takes over from Al
25:05 – Data Science Community
33:40 – Center for Humane Technology

Anaconda - https://www.linkedin.com/company/anacondainc/

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Andy Youniss, President and CEO of Rocket Software, which he co-founded in 1990, with products across the Ladder to AI and Hybrid Cloud, and businesses in technology partnerships. This week we discuss Covid and its personal and professional effects on business.

Show Notes
3:00 – Rocket’s core values
7:17 – Life and times
10:25 – Portfolio
11:26 – Vision
15:07 – Legacy tagline
20:26 – Rocket responds to Covid
27:17 – Secure file transfer business
28:53 – Where did the name Rocket come from?
30:05 – What does leadership mean to you?

IBM Assistance - https://www.ibm.com/watson/covid-response
Rocket Software - https://www.rocketsoftware.com/
LinkedIn - https://www.linkedin.com/company/rocket-software
Twitter - https://twitter.com/rocket
Facebook - https://www.facebook.com/RocketSoftwareInc
Brene Brown - https://brenebrown.com/
Arvind Krishna - https://www.linkedin.com/in/arvindkrishna/
Spillover - https://www.amazon.com/Spillover-Animal-Infections-Human-Pandemic/dp/0393066800#reader_0393066800
Bruce Springsteen - https://www.goodreads.com/book/show/29072594-born-to-run

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud

Analyze vast amounts of data in record time using Apache Spark with Databricks in the Cloud. Learn the fundamentals, and more, of running analytics on large clusters in Azure and AWS, using Apache Spark with Databricks on top. Discover how to squeeze the most value out of your data at a mere fraction of what classical analytics solutions cost, while at the same time getting the results you need, incrementally faster. This book explains how the confluence of these pivotal technologies gives you enormous power, and cheaply, when it comes to huge datasets. You will begin by learning how cloud infrastructure makes it possible to scale your code to large amounts of processing units, without having to pay for the machinery in advance. From there you will learn how Apache Spark, an open source framework, can enable all those CPUs for data analytics use. Finally, you will see how services such as Databricks provide the power of Apache Spark, without you having to know anything about configuring hardware or software. By removing the need for expensive experts and hardware, your resources can instead be allocated to actually finding business value in the data. This book guides you through some advanced topics such as analytics in the cloud, data lakes, data ingestion, architecture, machine learning, and tools, including Apache Spark, Apache Hadoop, Apache Hive, Python, and SQL. Valuable exercises help reinforce what you have learned.

What You Will Learn:
Discover the value of big data analytics that leverage the power of the cloud
Get started with Databricks using SQL and Python in either Microsoft Azure or AWS
Understand the underlying technology, and how the cloud and Apache Spark fit into the bigger picture
See how these tools are used in the real world
Run basic analytics, including machine learning, on billions of rows at a fraction of the cost, or for free

Who This Book Is For: Data engineers, data scientists, and cloud architects who want or need to run advanced analytics in the cloud. It is assumed that the reader has data experience, but perhaps minimal exposure to Apache Spark and Azure Databricks. The book is also recommended for people who want to get started in the analytics field, as it provides a strong foundation.

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Rishal Hurbans, author of Grokking AI Algorithms, Solution Architect at Intellect, and founder of Prolific Idea, which he started in 2015. Al and Rishal talk about Rishal’s book Grokking AI Algorithms, ants, and understanding algorithms.

Show Notes
4:04 – Bio Rhythmic Using Ants
8:04 – Knapsack Problem vs. Algorithms
9:02 – Mother and Father DNA
10:29 – Silver Bullet
12:42 – Real World Examples of Algorithms
17:20 – Genetic Algorithms
28:30 – Rishal’s Best Podcasts:
Ted Radio Hour - https://www.npr.org/programs/ted-radio-hour/
Hidden Brain - https://www.npr.org/series/423302056/hidden-brain
Masters of Scale - https://mastersofscale.com/
Tim Urban’s blog Wait But Why - https://waitbutwhy.com/

Connect with Rishal Hurbans:
Rishal Hurbans - https://rhurbans.com/
Grokking AI Algorithms - https://www.manning.com/

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Spark in Action, Second Edition

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

About the Technology: Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.

About the Book: Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

What’s Inside:
Writing Spark applications in Java
Spark application architecture
Ingestion through files, databases, streaming, and Elasticsearch
Querying distributed datasets with Spark SQL

About the Reader: This book does not assume previous experience with Spark, Scala, or Hadoop.

About the Author: Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years.

Quotes:
"This book reveals the tools and secrets you need to drive innovation in your company or community." - Rob Thomas, IBM
"An indispensable, well-paced, and in-depth guide. A must-have for anyone into big data and real-time stream processing." - Anupam Sengupta, GuardHat Inc.
"This book will help spark a love affair with distributed processing." - Conor Redmond, InComm Product Control
"Currently the best book on the subject!" - Markus Breuer, Materna IPS

Thinking in Pandas: How to Use the Python Data Analysis Library the Right Way

Understand and implement big data analysis solutions in pandas with an emphasis on performance. This book strengthens your intuition for working with pandas, the Python data analysis library, by exploring its underlying implementation and data structures. Thinking in Pandas introduces the topic of big data and demonstrates concepts by looking at exciting and impactful projects that pandas helped to solve. From there, you will learn to assess your own projects by size and type to see if pandas is the appropriate library for your needs. Author Hannah Stepanek explains how to load and normalize data in pandas efficiently, and reviews some of the most commonly used loaders and several of their most powerful options. You will then learn how to access and transform data efficiently, what methods to avoid, and when to employ more advanced performance techniques. You will also go over basic data access and munging in pandas and the intuitive dictionary syntax. Choosing the right DataFrame format, working with multi-level DataFrames, and how pandas might be improved upon in the future are also covered. By the end of the book, you will have a solid understanding of how the pandas library works under the hood. Get ready to make confident decisions in your own projects by utilizing pandas the right way.

What You Will Learn:
Understand the underlying data structure of pandas and why it performs the way it does under certain circumstances
Discover how to use pandas to extract, transform, and load data correctly with an emphasis on performance
Choose the right DataFrame format so that data analysis is simple and efficient
Improve performance of pandas operations with other Python libraries

Who This Book Is For: Software engineers with basic programming skills in Python keen on using pandas for a big data analysis project. Python software developers interested in big data.

Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

Abstract This week on Making Data Simple, we have Simon Lightstone, Co-Managing Partner at Levion Partners. Al and Simon discuss the strategy of acquiring companies using AI, search funds, and the special sauce.

Show Notes
3:25 – The Beginning
5:23 – Still Looking
6:55 – Search Fund
7:29 – Two Stages
11:28 – Creating a Search Fund
15:13 – Top Three
16:18 – Special Sauce
16:55 – The Process of the Secret Sauce
20:10 – Advice on First AI Project

Grilled cheese blog - https://www.lifehack.org/349275/why-grilled-cheese-lovers-are-better-life
Levion Partners - https://www.levionpartners.com/

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Priya Srinivasan, Director, IBM Data and AI Expert Labs SWAT. In this week’s podcast we talk about Data (the world’s new oil), AI (the world’s refinery), Cloud (the pipeline), Unified Governance and DataOps, Security, Analytics, and Services.

Show Notes
4:50 – Expert Labs and what it means
5:52 – Al talks about how SWAT was for him
6:10 – Priya discusses SWAT now
7:18 – Priya gives examples of SWAT
9:04 – Deliverables from Expert Labs
13:37 – Al talks about Services
17:30 – Priya talks about solving long-term problems
20:04 – Priya discusses GROW (Guidance, Resources, and Outreach for Women)
21:25 – Al asks Priya what excites her

Linkedin - https://www.linkedin.com/in/sripriya-srinivasan-385a0812/ Twitter - https://twitter.com/Priyavikram2

GROW - https://w3-connections.ibm.com/wikis/home?lang=en-us#!/wiki/W7e7074647e13_420c_9abf_875dd706e4b4/page/Welcome%20to%20GROW%20in%20Hybrid%20Cloud%20-%20Guidance,%20Resources,%20Outreach%20for%20Women%20in%20Hybrid%20Cloud

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.

SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform

Use this guide to one of SQL Server 2019’s most impactful features: Big Data Clusters. You will learn about data virtualization and data lakes for this complete artificial intelligence (AI) and machine learning (ML) platform within the SQL Server database engine. You will know how to use Big Data Clusters to combine large volumes of streaming data for analysis along with data stored in a traditional database. For example, you can stream large volumes of data from Apache Spark in real time while executing Transact-SQL queries to bring in relevant additional data from your corporate SQL Server database. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL, taking advantage of skills you have honed for years, and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis.

What You Will Learn:
Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
Analyze large volumes of data directly from SQL Server and/or Apache Spark
Manage data stored in HDFS from SQL Server as if it were relational data
Implement advanced analytics solutions through machine learning and AI
Expose different data sources as a single logical source using data virtualization

Who This Book Is For: Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environments.

Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

Abstract This week on Making Data Simple, we have Deborah Leff, Global Leader and Industry CTO, Data Science and AI Elite Team. Deborah is an industry specialist for consumer and travel. In this week’s podcast we talk about demystifying AI and supporting customers around the world with Data Science and AI solutions.

Show Notes
1:10 – Deborah explains the mission
2:46 – Deborah talks about transformational technology
14:36 – American Airlines reference on YouTube - https://www.youtube.com/watch?v=t1PgNr8VMLc
13:24 – Deborah’s Medium paper "AI Demands a New Perspective" - https://medium.com/@deborah.leff
22:39 – Deborah’s Instagram account - https://www.instagram.com/deborah.leff
23:50 and 26:04 – Deborah’s LinkedIn page - https://www.linkedin.com/in/deborahleff/

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Meighann Helene - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

Abstract

This week on Making Data Simple, we have returning guest Dr. Kyu Rhee, VP & Chief Health Officer, IBM and IBM Watson Health, discussing the Covid-19 pandemic: how we prepare and react individually and as a country, what we can do for ourselves, how the pandemic affects the economy, and when we will see a light at the end of the tunnel.

Show Notes
1. https://www.ibm.com/blogs/watson-health/author/kyurhee/
2. https://www.ibm.com/impact/covid-19/

Connect with the Team: Producer Kate Brown - LinkedIn. Producer Michael Sestak - LinkedIn. Producer Meighann Helene - LinkedIn.

Host Al Martin - LinkedIn and Twitter.

Additional resources:
IBM Watson Health COVID-19 Resources: https://www.ibm.com/watson-health/covid-19
IBM Watson Health: Micromedex with Watson: https://www.ibm.com/products/dynamed-and-micromedex-with-watson
How governments are rising to the challenge of COVID-19: https://www.ibm.com/blogs/watson-health/governments-agencies-rising-challenge-of-covid-19/

Practical Statistics for Data Scientists, 2nd Edition

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not. Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:
Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher-quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that "learn" from data
Unsupervised learning methods for extracting meaning from unlabeled data

Summary Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview, Amar Arsikere, co-founder and CTO of Infoworks, explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Free yourself from maintaining brittle data pipelines that require excessive coding and don’t operationally scale. With the Ascend Unified Data Engineering Platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment, enabling 10x faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30-day trial. You’ll partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production.

Your host is Tobias Macey and today I’m interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what you have built at Infoworks and the story of how it got started?
What are the fundamental challenges that often plague organizations dealing with "big data"?

How do those challenges change or compound in the context of an enterprise organization?
What are some of the unique needs that enterprise organizations have of their data?

What are the design or technical limitations of existing big data technologies that contribute to the overall difficulty of using or integrating them effectively?
What are some of the tools or platforms that Infoworks replaces in the overall data lifecycle?

How do you identify and prioritize the integrations that you build?

How is Infoworks itself architected and how has it evolved since you first built it?
Discoverability and reuse of data is one of the biggest challenges facing organizations of all sizes. How do you address that in your platform?
What are the roles that use Infoworks in their day-to-day?

What does the workflow look like for each of those roles?

Can you talk through the overall lifecycle of a unit of data in Infoworks and the different subsystems that it interacts with at each stage?
What are some of the design challenges that you face in building a UI-oriented workflow while providing the necessary level of control for these systems?

How do you handle versioning of pipelines and validation of new iterations prior to production release?
What are the cases where the no-code, graphical paradigm for data orchestration breaks down