talk-data.com

Topic: Data Science

Tags: machine_learning, statistics, analytics

1516 tagged activities

Activity Trend: peak of 68 activities/quarter, 2020-Q1 to 2026-Q1

Activities

1516 activities · Newest first

Summary One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud-native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna, and in this episode he explains the unique capabilities of Fauna, compares its consensus and transaction algorithm to those used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One aspect of Fauna that is worth drawing attention to is the first-class support for temporality, which simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple-to-manage data layer that will scale with your business.
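To make the temporality point concrete: FQL exposes an At() function that evaluates a query against a snapshot of the data at a given timestamp. Below is a minimal sketch using the pre-v10 faunadb Python driver; the secret, the "orders" collection, and the document ID are placeholder assumptions, not values from the episode.

    # Read a document as it existed at a point in the past, using FQL's At()
    # to pin the read to a historical snapshot. All identifiers are placeholders.
    from faunadb import query as q
    from faunadb.client import FaunaClient

    client = FaunaClient(secret="YOUR_FAUNA_SECRET")  # placeholder secret

    historical = client.query(
        q.at(
            q.time("2019-01-01T00:00:00Z"),  # snapshot timestamp
            q.get(q.ref(q.collection("orders"), "1234"))  # hypothetical doc
        )
    )
    print(historical["data"])  # the document's fields as of that timestamp

The same Get() without the At() wrapper would return the current version; pinning the timestamp is all it takes to query history.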

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.
Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what FaunaDB is and how it got started?
What are some of the main use cases that FaunaDB is targeting?

How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?

Can you describe the architecture of FaunaDB and how it has evolved?
The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?

What are some of the edge cases that users should be aware of? How are conflicts managed in Fauna?

What is the underlying storage layer?

How is the query layer designed to allow for different query patterns and model representations?

How does data modeling in Fauna compare to that of relational or document databases?

Can you describe the query format? What are some of the common difficulties or points of confusion around interacting with data in Fauna?

What are some application design patterns that are enabled by using Fauna as the storage layer?
Given the ability to replicate globally, how do you mitigate latency when interacting with the database?
What are some of the most interesting or unexpected ways that you have seen Fauna used?
When is it the wrong choice?
What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?
What do you have in store for the future of Fauna?

Contact Info

@evan on Twitter
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Fauna
Ruby on Rails
CNET
GitHub
Twitter
NoSQL
Cassandra
InnoDB
Redis
Memcached
Timeseries
Spanner Paper
DynamoDB Paper
Percolator
ACID
Calvin Protocol
Daniel Abadi
LINQ
LSM Tree (Log-structured Merge-tree)
Scala
Change Data Capture
GraphQL

Podcast.init Interview About Graphene

Fauna Query Language (FQL)
CQL == Cassandra Query Language
Object-Relational Databases
LDAP == Lightweight Directory Access Protocol
Auth0
OLAP == Online Analytical Processing
Jepsen distributed systems safety research

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Highlights: Digging into the music charts Gender Gap with data scientist Joshua Hayes
Mission: Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump, where we usually upload charts, artists, and playlists into your brain so you can stay up on the latest in the music data world.
Date: This is your Data Dump for Friday, April 19th, 2019.
Gender Gap article: Today, we’re going to do something a little more in-depth. A few days ago, Chartmetric released a blog article about the gender gap in the music charts, including the Billboard, Spotify, and Apple Music charts, and also across country boundaries and genres. Check it out; the article link is in the show notes. But first, let me introduce the author of the piece, data scientist and newest team member of Chartmetric, Josh Hayes. Josh is a Data Science Doctoral Fellow with a Sociology PhD from UC Davis and additional degrees from Stanford and Michigan State, and is neck deep in data science projects.
Gist of the article:
The converging of Spotify, Apple Music, and Billboard
The Chartmetric Score and its connection to revenue
The super high/low female track, baseline male track dynamic
The post-Grande era
Genres: Pop, Latin, Hip-Hop, Country, All Genres
Outro: That’s it for your Daily Data Dump for Friday, April 19th, 2019. This is Jason from Chartmetric. Free accounts are at app.chartmetric.com/signup, and article links and show notes are at a new website: podcast.chartmetric.com. Happy Friday, have a lovely weekend!

Learn RStudio IDE: Quick, Effective, and Productive Data Science

Discover how to use the popular RStudio IDE as a professional tool that includes code refactoring support, debugging, and Git version control integration. This book gives you a tour of RStudio and shows you how it helps you do exploratory data analysis; build data visualizations with ggplot; and create custom R packages and web-based interactive visualizations with Shiny. In addition, you will cover common data analysis tasks including importing data from diverse sources such as SAS files, CSV files, and JSON. You will map out the features in RStudio so that you will be able to customize RStudio to fit your own style of coding. Finally, you will see how to save a ton of time by adopting best practices and using packages to extend RStudio. Learn RStudio IDE is a quick, no-nonsense tutorial of RStudio that will give you a head start to develop the insights you need in your data science projects.
What You Will Learn:
Quickly, effectively, and productively use the RStudio IDE for building data science applications
Install RStudio and program your first Hello World application
Adopt the RStudio workflow
Make your code reusable using RStudio
Use RStudio and Shiny for data visualization projects
Debug your code with RStudio
Import CSV, SPSS, SAS, JSON, and other data
Who This Book Is For:
Programmers who want to start doing data science, but don’t know what tools to focus on to get up to speed quickly.

Summary Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.
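As a toy illustration of why a bitmap index enables that kind of high-speed aggregate analysis (this shows the general technique, not Pilosa's actual API or storage format; Pilosa distributes compressed roaring bitmaps across a cluster): each attribute value gets a bitmap over row IDs, and filters and counts reduce to bitwise operations.

    # Minimal bitmap-index sketch: one bitmap per attribute value, with rows
    # represented as set bits. All data here is hypothetical.
    def bitmap(rows):
        """Pack a set of row IDs into a single Python int used as a bitmap."""
        bits = 0
        for r in rows:
            bits |= 1 << r
        return bits

    is_active   = bitmap({0, 1, 3, 5, 6})   # rows where active == true
    in_region_a = bitmap({1, 2, 3, 7})      # rows where region == "A"

    # "How many active rows are in region A?" is one AND plus a popcount,
    # with no per-row scan of the underlying data.
    matches = is_active & in_region_a
    count = bin(matches).count("1")
    print(count)  # 2 (rows 1 and 3)

The same shape of operation (AND/OR/NOT plus population count) covers segmentation, filtering, and top-N style aggregates, which is why the engine can answer them quickly at scale.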

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index.

Interview

Introduction
How did you get involved in the area of data management?

Data Science for Business and Decision Making

Data Science for Business and Decision Making covers both statistics and operations research, while most competing textbooks focus on one or the other. As a result, the book more clearly defines the principles of business analytics for those who want to apply quantitative methods in their work. Its emphasis reflects the importance of regression, optimization, and simulation for practitioners of business analytics. Each chapter uses a didactic format that is followed by exercises and answers. Freely accessible datasets enable students and professionals to work with Excel, Stata Statistical Software®, and IBM SPSS Statistics Software®.
Combines statistics and operations research modeling to teach the principles of business analytics
Written for students who want to apply statistics, optimization, and multivariate modeling to gain competitive advantages in business
Shows how powerful software packages, such as SPSS and Stata, can create graphical and numerical outputs

podcast_episode
by Vijay Bommireddipalli (CODAIT, the Center for Open-Source Data & AI Technologies), Gabriela de Queiroz (IBM), and Al Martin (IBM)

This week on the Making Data Simple podcast, Al Martin welcomes two guests from within IBM: Vijay Bommireddipalli is head of development at CODAIT, the Center for Open-Source Data & AI Technologies, and Gabriela de Queiroz is a senior engineering and data science manager. Together, the three discuss the importance of open-source software solutions, along with the tools they use in everyday development. Gear up for a technical conversation that will challenge your perspective on what it means for something to be open-source — and why it matters.

Check us out on: YouTube, Apple Podcasts, Google Play Music, Spotify, TuneIn, Stitcher

Show Notes 00:10 - Connect with Producer Steve Moore on LinkedIn and Twitter.  00:15 - Connect with Producer Liam Seston on LinkedIn and Twitter.  00:20 - Connect with Producer Rachit Sharma on LinkedIn.  00:25 - Connect with Host Al Martin on LinkedIn and Twitter.  01:51 - Connect with Vijay on LinkedIn. 02:40 - Connect with Gabriela on LinkedIn. 03:12 - Not sure what open-source is? Find out here.  04:13 - Learn more about CODAIT here. 13:02 - What value does open-source software give? Find out here. 21:59 - Learn here how the democratization of A.I. is reshaping the industry. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Are you a data scientist? I mean, are you really a data scientist? What does that even mean... other than a healthy salary increase? On this episode of the show, Ian Thomas, Chief Data Officer for Publicis Spine, sat down with the three co-citizen-data-scientists who regularly host the show to delve into the subject! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Data Science Using Python and R

Learn data science by doing data science! Data Science Using Python and R will get you plugged into the world’s two most widespread open-source platforms for data science: Python and R. Data science is hot. Bloomberg called data scientist “the hottest job in America.” Python and R are the top two open-source data science tools in the world. In Data Science Using Python and R, you will learn step-by-step how to produce hands-on solutions to real-world business problems, using state-of-the-art techniques. Data Science Using Python and R is written for the general reader with no previous analytics or programming experience. An entire chapter is dedicated to learning the basics of Python and R. Then, each chapter presents step-by-step instructions and walkthroughs for solving data science problems using Python and R. Those with analytics experience will appreciate having a one-stop shop for learning how to do data science using Python and R. Topics covered include data preparation, exploratory data analysis, preparing to model the data, decision trees, model evaluation, misclassification costs, naïve Bayes classification, neural networks, clustering, regression modeling, dimension reduction, and association rules mining. Further, exciting new topics such as random forests and general linear models are also included. The book emphasizes data-driven error costs to enhance profitability, which avoids the common pitfalls that may cost a company millions of dollars. Data Science Using Python and R provides exercises at the end of every chapter, totaling over 500 exercises in the book. Readers will therefore have plenty of opportunity to test their newfound data science skills and expertise. In the Hands-on Analysis exercises, readers are challenged to solve interesting business problems using real-world data sets.
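To illustrate the cost-weighted evaluation the book emphasizes (a sketch of the general idea, not the book's own code; the labels, predictions, and cost figures below are invented): weight each cell of the confusion matrix by a business cost so that model comparison reflects money lost rather than raw accuracy.

    # Cost-sensitive model evaluation: score a classifier by expected
    # misclassification cost instead of accuracy. Data is hypothetical.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # hypothetical labels
    y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])   # hypothetical predictions

    # Hypothetical cost matrix: costs[i][j] = cost of predicting j when the
    # truth is i. Here a missed positive (false negative) costs 10x a false alarm.
    costs = np.array([[0.0, 1.0],
                      [10.0, 0.0]])

    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    total_cost = (cm * costs).sum()   # 2 false alarms + 1 miss -> 12.0
    print(cm)
    print(f"total misclassification cost: {total_cost}")

Two models with identical accuracy can differ sharply on this metric, which is the pitfall the book's "data-driven error costs" framing is meant to avoid.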

Summary How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Your host is Tobias Macey and today I’m interviewing Raghu Murthy about DataCoral, a platform that offers a fully managed and secure stack in your own cloud that delivers data to where you need it.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what DataCoral is and your motivation for founding it?
How does the data-centric approa

Welcome to your favorite data science podcast! This week we talk about what it is like to work in data journalism at one of the biggest newspapers in the country: Estadão. Have you ever wondered what a Data Journalist does? Want to know what cool projects happen there? Check that out, and much more, in today's conversation.

In today's episode, we invited Data Hacker Rodrigo Menegat, a data journalist at Estadão, to chat about the day-to-day of a Data Journalist, the challenges he faces, and how you can become one!

Visit our Medium post for the links to everything we mention in the episode: https://bit.ly/2YUuGGG

podcast_episode
by Sean Law (TD Ameritrade), Hugo (DataCamp)

This week, Hugo speaks with Sean Law about data science research and development at TD Ameritrade. Sean’s work on the Exploration team uses cutting-edge theories and tools to build proofs of concept. At TD Ameritrade they think about a wide array of questions, from conversational agents that can help customers quickly get to the information they need, to going beyond chatbots. They use modern time series analysis, and more advanced techniques like recurrent neural networks, to predict the next time a customer might call and what they might be calling about, as well as to help investors leverage alternative data sets and make more informed decisions.

What does this proof-of-concept work on the edge of data science look like at TD Ameritrade, and how does it differ from building prototypes and products? And how does exploration differ from production? Stick around to find out.
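As a rough, self-contained sketch of one way the "when will the customer call next?" framing can work (a toy illustration, not TD Ameritrade's approach: the episode mentions recurrent neural networks, while this uses simple lag features and linear regression, and the timestamps are invented):

    # Frame next-call-time prediction as supervised learning: predict the
    # next inter-call gap from the previous few gaps. Data is hypothetical.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    calls = np.array([0.0, 3.1, 5.9, 9.2, 12.0, 15.2, 17.9, 21.1])  # days
    gaps = np.diff(calls)  # time between consecutive calls

    # Lag features: use the last k gaps to predict the next one.
    k = 3
    X = np.array([gaps[i:i + k] for i in range(len(gaps) - k)])
    y = gaps[k:]

    model = LinearRegression().fit(X, y)
    next_gap = model.predict(gaps[-k:].reshape(1, -1))[0]
    print(f"expected next call roughly {calls[-1] + next_gap:.1f} days in")

An RNN-based system would replace the fixed-length lag window with a learned sequence representation, but the supervised framing of the problem is the same.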

LINKS FROM THE SHOW

DATAFRAMED GUEST SUGGESTIONS

DataFramed Guest Suggestions (who do you want to hear on DataFramed?)

FROM THE INTERVIEW

Sean on Twitter
Sean's Website
TD Ameritrade Careers Page
PyData Ann Arbor Meetup
PyData Ann Arbor YouTube Channel (Videos)
TDA GitHub Account (Time Series Pattern Matching repo to be open sourced in the coming months)
Aura Shows Human Fingerprint on Global Air Quality

FROM THE SEGMENTS

Guidelines for A/B Testing (with Emily Robinson ~19:20)

Guidelines for A/B Testing (By Emily Robinson)
10 Guidelines for A/B Testing Slides (By Emily Robinson)

Data Science Best Practices (with Ben Skrainka ~34:50)

Debugging (By David J. Agans)
Basic Debugging With GDB (By Ben Skrainka)
Sneaky Bugs and How to Find Them (with git bisect) (By Wiktor Czajkowski)
Good logging practice in Python (By Victor Lin)

Original music and sounds by The Sticks.

Summary Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure, and not all of them are under our control. However, many of them are, and as data engineers we can help to keep our projects on the path to success. Eugene Khazin is the CEO of PrimeTSR, where he is tasked with rescuing floundering analytics efforts and ensuring that they provide value to the business. In this episode he reflects on the ways that data projects can be structured to provide a higher probability of success and utility, how data engineers can contribute throughout the project lifecycle, and how to salvage a failed project so that some value can be gained from the effort.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I’m interviewing Eugene Khazin about the leading causes for failure in analytics projects.

Interview

Introduction
How did you get involved in the area of data management?
The term "analytics" has grown to mean many different things to different people, so can you start by sharing your definition of what is in scope for an "analytics project" for the purposes of this discussion?


Data Science for Marketing Analytics

Data Science for Marketing Analytics introduces you to leveraging state-of-the-art data science techniques to optimize marketing outcomes. You'll learn how to manipulate and analyze data using Python, create customer segments, and apply machine learning algorithms to predict customer behavior. This book provides a comprehensive, hands-on approach to marketing analytics.
What this Book will help me do:
Learn to use Python libraries like pandas and Matplotlib for data analysis.
Understand clustering techniques to create meaningful customer segments.
Implement linear regression for predicting customer lifetime value.
Explore classification algorithms to model customer preferences.
Develop skills to build interactive dashboards for marketing reports.
Author(s): Blanchard, Behera, and Bhatnagar are experienced professionals in data science and marketing analytics, with extensive backgrounds in applying machine learning to real-world business applications. They bring a wealth of knowledge and an approachable teaching style to this book, focusing on practical, industry-relevant applications for learners.
Who is it for? This book is for developers and marketing professionals looking to advance their analytics skills. It is ideal for individuals with a basic understanding of Python and mathematics who want to explore predictive modeling and segmentation strategies. Readers should have a curiosity for data-driven problem-solving in marketing contexts to benefit most from the content.
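As a minimal sketch of the segmentation workflow this kind of book walks through (not the book's own code; the customer features below are invented): standardize behavioral features, then cluster with k-means and treat each cluster as a segment.

    # Customer segmentation sketch: scale features, cluster with k-means.
    # The feature matrix is hypothetical.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Hypothetical customers: [annual spend, orders/year, days since last order]
    customers = np.array([
        [1200, 24, 5],
        [1100, 20, 9],
        [150, 2, 200],
        [90, 1, 310],
        [640, 10, 40],
        [700, 12, 30],
    ])

    # Scaling matters: without it, the spend column would dominate distances.
    X = StandardScaler().fit_transform(customers)
    segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(segments)  # one segment label per customer, e.g. [0 0 1 1 2 2]

In practice the segment labels would then be profiled (average spend, recency, and so on per cluster) to give each segment a marketing-meaningful name.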

Hands-On Data Science for Marketing

The book "Hands-On Data Science for Marketing" equips readers with the tools and insights to optimize their marketing campaigns using data science and machine learning techniques. Using practical examples in Python and R, you will learn how to analyze data, predict customer behavior, and implement effective strategies for better customer engagement and retention. What this Book will help me do Understand marketing KPIs and learn to compute and visualize them in Python and R. Develop the ability to analyze customer behavior and predict potential high-value customers. Master machine learning concepts for customer segmentation and personalized marketing strategies. Improve your skills to forecast customer engagement and lifetime value for more effective planning. Learn the techniques of A/B testing and their application in refining marketing decisions. Author(s) Yoon Hyup Hwang is a seasoned data scientist with a deep interest in the intersection of marketing and technology. With years of expertise in implementing machine learning algorithms in marketing analytics, Yoon brings a unique perspective by blending technical insights with business strategy. As an educator and practitioner, Yoon's approachable style and clear explanations make complex topics accessible for all learners. Who is it for? This book is tailored for marketing professionals looking to enhance their strategies using data science, data enthusiasts eager to apply their skills in marketing, and students or engineers seeking to expand their knowledge in this domain. A basic understanding of Python or R is beneficial, but the book is structured to welcome beginners by covering foundational to advanced concepts in a practical way.

Machine Learning with R Quick Start Guide

Machine Learning with R Quick Start Guide takes you through the foundations of machine learning using the R programming language. Starting with the basics, this book introduces key algorithms and methodologies, offering hands-on examples and applicable machine learning solutions that allow you to extract insights and create predictive models.
What this Book will help me do:
Understand the basics of machine learning and apply them using R 3.5.
Learn to clean, prepare, and visualize data with R to ensure robust data analysis.
Develop and work with predictive models using various machine learning techniques.
Discover advanced topics like Natural Language Processing and neural network training.
Implement end-to-end pipeline solutions, from data collection to predictive analytics, in R.
Author(s): Sanz, the author of Machine Learning with R Quick Start Guide, is an expert in data science with years of experience in the field of machine learning and R programming. Known for their accessible and detailed teaching style, the author focuses on providing practical knowledge to empower readers in the real world.
Who is it for? This book is ideal for graduate students and professionals, including aspiring data scientists and data analysts, looking to start their journey in machine learning. Readers are expected to have some familiarity with the R programming language, but no prior machine learning experience is necessary. With this book, the audience will gain the ability to confidently navigate machine learning concepts and practices.

Summary Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grow. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to build a business aimed at building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric.

Interview

Introduction

How did you get involved in the area of data management?

AI and Big Data on IBM Power Systems Servers

Abstract As big data becomes more ubiquitous, businesses are wondering how they can best leverage it to gain insight into their most important business questions. Using machine learning (ML) and deep learning (DL) in big data environments can identify historical patterns and build artificial intelligence (AI) models that can help businesses to improve customer experience, add services and offerings, identify new revenue streams or lines of business (LOBs), and optimize business or manufacturing operations. The power of AI for predictive analytics is being harnessed across all industries, so it is important that businesses familiarize themselves with all of the tools and techniques that are available for integration with their data lake environments. In this IBM® Redbooks® publication, we cover the best practices for deploying and integrating some of the best AI solutions on the market, including:
IBM Watson Machine Learning Accelerator (see note for product naming)
IBM Watson Studio Local
IBM Power Systems™
IBM Spectrum™ Scale
IBM Data Science Experience (IBM DSX)
IBM Elastic Storage™ Server
Hortonworks Data Platform (HDP)
Hortonworks DataFlow (HDF)
H2O Driverless AI
We map out all the integrations that are possible with our different AI solutions and how they can integrate with your existing or new data lake. We also walk you through some of our client use cases and show you how some of the industry leaders are using Hortonworks, IBM PowerAI, and IBM Watson Studio Local to drive decision making. We also advise you on your deployment options, when to use a GPU, and why you should use the IBM Elastic Storage Server (IBM ESS) to improve storage management. Lastly, we describe how to integrate IBM Watson Machine Learning Accelerator and Hortonworks with or without IBM Watson Studio Local, how to access real-time data, and security.
Note: IBM Watson Machine Learning Accelerator is the new product name for IBM PowerAI Enterprise.
Note: Hortonworks merged with Cloudera in January 2019. The new company is called Cloudera. References to Hortonworks as a business entity in this publication are now referring to the merged company. Product names beginning with Hortonworks continue to be marketed and sold under their original names.

Summary Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
"There aren’t enough data conferences out there that focus on the community, so that’s why these folks built a better one": Data Council is the premier community powered data platforms & engineering event for software engineers, data engineers, machine learning experts, deep learning researchers & artificial intelligence buffs who want to discover tools & insights to build new products. This year they will host over 50 speakers and 500 attendees (yeah that’s one of the best "Attendee:Speaker" ratios out there) in San Francisco on April 17-18th and are offering a $200 discount to listeners of the Data Engineering Podcast. Use code: DEP-200 at checkout.
Your host is Tobias Macey and today I’m interviewing Chris Bergh about the current state of DataOps and why it’s more than just DevOps for data.

Interview

Introduction
How did you get involved in the area of data management?
We talked last year about what DataOps is, but can you give a quick overview of how the industry has changed or updated the definition since then?

It is easy to draw parallels between DataOps and DevOps, so can you provide some clarity as to how they are different?

How has the conversat

Rob Thomas, leader of IBM’s Data and AI division, talks with host Al Martin about the need to demystify AI. In particular, Rob recommends a "fail fast" approach to data science: run wide-ranging but short-term experiments — and expect disappointment on the way to insight. This episode offers a host of such suggestions, plus thoughtful leadership advice and tips for motivation. Part 1 of 2.


Show Notes 00:00 - Check us out on YouTube and SoundCloud.  00:10 - Connect with Producer Steve Moore on LinkedIn and Twitter.  00:15 - Connect with Producer Liam Seston on LinkedIn and Twitter.  00:20 - Connect with Producer Rachit Sharma on LinkedIn.  00:25 - Connect with Host Al Martin on LinkedIn and Twitter.  00:55 - Connect with Rob Thomas on LinkedIn and Twitter. 04:01 - Discover what big data and A.I. have in store.  06:22 - Learn more about IBM's Think conference here. 06:48 - Read more on Watson anywhere here. 10:24 - There is no A.I. without I.A. 23:21 - Check out Al's talk at Think 2019 with Jeff Jonas here. 23:51 - Check out this interview with Rob Thomas at Think 2019 here. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

The Enterprise Big Data Lake

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book. Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries.
Get a succinct introduction to data warehousing, big data, and data science
Learn various paths enterprises take to build a data lake
Explore how to build a self-service model and best practices for providing analysts access to the data
Use different methods for architecting your data lake
Discover ways to implement a data lake from experts in different industries