talk-data.com talk-data.com

Topic

Data Science

machine_learning statistics analytics

1516

tagged

Activity Trend

68 peak/qtr
2020-Q1 2026-Q1

Activities

1516 activities · Newest first

Send us a text  According to the Economist, 70 percent of business executives rated analytics as “very” or “extremely important,” but just 2 percent say they have achieved “broad positive results.” In this episode of Stories from the Field, host Wennie Allen talks to Carlo Appugliese of the IBM Data Science and AI Elite Team about his insights from more than 100 client engagements to improve business outcomes quickly. They discuss top challenges and barriers to AI adoption, as well as expert advice on how to overcome them. You will hear how companies are succeeding in achieving ROI and business objectives. Finally, Carlo shares his perspective on how to get started with data science and AI. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! And to grow your professional network and find opportunities with the startups that are changing the world then Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management

Interview

Introduction How did you get involved in the area of data management? Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?

What are some of the organizational and industry trends that tend to lead to this solution?

You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?

In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?

What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake? One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?

A corollary to the issue of discovery is that of access

Send us a text What are your major concerns before flying on a trip? Would you ever give up your seat due to overbooking? How do airlines predict weather patterns and take proactive action to minimize delays? In this episode of Making Data Simple, Yianni Gamvros, Global Data Science Enablement Leader for IBM Watson and Cloud Platform, talks about how to use data science to better manage passenger flight experiences.   Show Notes 00.30 Connect with Al Martin on Twitter (@amartin_v) and LinkedIn (linkedin.com/in/al-martin-ku) 00.40 Connect with Yianni Gamvros on Twitter (@YGamvros), LinkedIn (linkedin.com/in/gamvros), or at datascience.ibm.com 00.50  Copyright and all rights reserved to Yanni. Song from Yanni Concert 2006 can be found here http://bit.ly/2iSlJJG  1:00 Learn more about Yanni the composer at http://www.yanni.com/welcome 08.25 Check the latest in weather and storm reports at https://weather.com/en-CA/ 23.50 Discover what Watson is doing in aviation at https://ibm.co/2yHDm5g or check out this video to see how data science and Watson can be used to better your flight experience: http://bit.ly/2j9kdGN 28.45 Find Negotiation Genius: How to Overcome Obstacles and Achieve Brilliant Results at the Bargaining Table and Beyond by Deepak Malhotra & Max Bazerman here: http://amzn.to/2zox6Dy 29.10 Find The Goal: A Process of Ongoing Improvement  by Eliyahu M Goldratt here: http://amzn.to/2iETTkd 29.40 Learn more about the IBM Watson Data Platform at https://ibm.co/2reqLmN or http://bit.ly/2iEUbYl   30.40 Copyright YanniVEVO, all rights reserved to Yanni. Song Name: The Rain Must Fall. Listen here, http://bit.ly/2iU3uDU Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Data Science with Python and Dask

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work! About the Technology An efficient data pipeline means everything for the success of a data science project. Dask is a flexible library for parallel computing in Python that makes it easy to build intuitive workflows for ingesting and analyzing large, distributed datasets. Dask provides dynamic task scheduling and parallel collections that extend the functionality of NumPy, Pandas, and Scikit-learn, enabling users to scale their code from a single laptop to a cluster of hundreds of machines with ease. About the Book Data Science with Python and Dask teaches you to build scalable projects that can handle massive datasets. After meeting the Dask framework, you’ll analyze data in the NYC Parking Ticket database and use DataFrames to streamline your process. Then, you’ll create machine learning models using Dask-ML, build interactive visualizations, and build clusters using AWS and Docker. What's Inside Working with large, structured and unstructured datasets Visualization with Seaborn and Datashader Implementing your own algorithms Building distributed apps with Dask Distributed Packaging and deploying Dask apps About the Reader For data scientists and developers with experience using Python and the PyData stack. About the Author Jesse Daniel is an experienced Python developer. He taught Python for Data Science at the University of Denver and leads a team of data scientists at a Denver-based media technology company. We interviewed Jesse as a part of our Six Questions series. Check it out here. Quotes The most comprehensive coverage of Dask to date, with real-world examples that made a difference in my daily work. - Al Krinker, United States Patent and Trademark Office An excellent alternative to PySpark for those who are not on a cloud platform. The author introduces Dask in a way that speaks directly to an analyst. - Jeremy Loscheider, Panera Bread A greatly paced introduction to Dask with real-world datasets. - George Thomas, R&D Architecture Manhattan Associates The ultimate resource to quickly get up and running with Dask and parallel processing in Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine

Summary Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Mark Sears about Cloud Factory, masters of the art and science of labeling data for Machine Learning and more

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what CloudFactory is and the story behind it? What are some of the common requirements

Quer saber um pouco sobre como fazemos para aprender técnicas e habilidades novas? Que tal algumas dicas de aplicativos e produtividade e, de quebra, entender como Data Science irá influenciar no futuro da educação no Brasil e no mundo? É isso e muito mais que iremos discutir hoje!

Nesse episódio, nós tivemos a participação ilustre do Data Hacker Guilherme Silveira — Co-fundador e Head de Educação na Caelum e Alura — para bater um papo sobre como sua forma de aprendizado deu origem a uma das maiores plataformas de ensino no Brasil, e como será o futuro da educação com a ajuda de Data Science.

Link para post do Medium com o que falamos no episódio:

https://medium.com/data-hackers/aprendizado-e-o-futuro-da-educa%C3%A7%C3%A3o-data-hackers-podcast-12-e29cfeb450f4

Data Science Strategy For Dummies

All the answers to your data science questions Over half of all businesses are using data science to generate insights and value from big data. How are they doing it? Data Science Strategy For Dummies answers all your questions about how to build a data science capability from scratch, starting with the “what” and the “why” of data science and covering what it takes to lead and nurture a top-notch team of data scientists. With this book, you’ll learn how to incorporate data science as a strategic function into any business, large or small. Find solutions to your real-life challenges as you uncover the stories and value hidden within data. Learn exactly what data science is and why it’s important Adopt a data-driven mindset as the foundation to success Understand the processes and common roadblocks behind data science Keep your data science program focused on generating business value Nurture a top-quality data science team In non-technical language, Data Science Strategy For Dummies outlines new perspectives and strategies to effectively lead analytics and data science functions to create real value.

Send us a text In this week's episode of Making Data Simple, we are joined by guest John DeNero, who is a Professor at UC Berkeley. John specializes in teaching artificial intelligence - winning a distinguished teaching award in 2018 as a result. Host Al Martin and John discuss methods of teaching AI, the state of the education industry, and the company he co-founded, Lilt. Listen in for a diverse, high level, and engaging conversation.

Show Notes 00:10 - Connect with Producer Steve Moore on LinkedIn and Twitter. 00:15 - Connect with Producer Liam Seston on LinkedIn and Twitter. 00:20 - Connect with Producer Rachit Sharma on LinkedIn. 00:25 - Connect with Host Al Martin on LinkedIn and Twitter.

00:30 - Connect with Producer Lana Cosic on LinkedIn and Twitter.  00:35 - Connect with John DeNero on LinkedIn and be sure to check out Lilt! 02:31 - This Wall Street Journal article backs it up - UC Berkeley's fastest-growing class is Data Science 101. 09:06 - Check out this interesting article on statistics anxiety.  19:26 - Where do data scientists come from? What field of study is more related to the role? 35:42 - Why the tech world needs philosophers. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Data science has made immense progress, but companies are still stuck with the question: how do you use data science to deliver real value to the business? They hire dozens of data scientists and invest in state-of-the-art technology, but only a few have delivered ROI and business impact. In this episode, Wayne Eckerson and Alex Vayner discuss what organizations need to do for data science success.

Alex Vayner is a Partner and Americas Data & AI Practice Leader for PA Consulting Group, an innovation and transformation consultancy. Alex has spent his entire career in data & analytics, with his last five roles focused on building and running high-performance data science teams and capabilities in consulting and corporate environments. Before joining PA Consulting, Alex ran the NA Data Science & AI practice at Capgemini. He joined Capgemini from Equifax, where he served as VP, Global Data Innovation Leader, building a team responsible for pioneering disruptive data & analytics solutions for clients across all industries.

Summary The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Clickhouse is and how you each got involved with it?

What are the primary use cases that Clickhouse is targeting? Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?

Can you describe how Clickhouse is architected? Can you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine an

Summary Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?

What are some example cases where anomaly detection is useful or necessary?

Once you had established the requirements in terms of functionality and data volume, what was your approach for dete

Big Data Simplified
"Big Data Simplified blends technology with strategy and delves into applications of big data in specialized areas, such as recommendation engines, data science and Internet of Things (IoT) and enables a practitioner to make the right technology choice. The steps to strategize a big data implementation are also discussed in detail. This book presents a holistic approach to the topic, covering a wide landscape of big

data technologies like Hadoop 2.0 and package implementations, such as Cloudera. In-depth discussion of associated technologies, such as MapReduce, Hive, Pig, Oozie, ApacheZookeeper, Flume, Kafka, Spark, Python and NoSQL databases like Cassandra, MongoDB, GraphDB, etc., is also included.

Summary Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Prefect is and your motivation for creating it? What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect? In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you? How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration? How do you manage passing data between stages in a pipeline when they are running across distributed nodes? What was your decision making process when deciding to use Dask as your supported execution engine?

For tasks that require specific resources or dependencies how do you approach the idea of task affinity?

Does Prefect support managing tasks that bridge network boundaries? What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often? What are the limitations of the open source core as compared to the cloud offering that you are building? What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users? What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used? When is Prefect the wrong choice? In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?

What are some best practices and industry trends that you are most excited by?

What do you have planned for the future of the Prefect project and company?

Contact Info

LinkedIn @jlowin on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Prefect Airflow Dask

Podcast Episode

Prefect Blog PyData Presentation Tensorflow Workflow Engine

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

The Care and Feeding of Data Scientists

As a discipline, data science is relatively young, but the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. This report examines the steps necessary to build, manage, sustain, and retain a growing data science team. You’ll learn how data science management is similar to but distinct from other management types. Michelangelo D’Agostino, VP of Data Science and Engineering at ShopRunner, and Katie Malone, Director of Data Science at Civis Analytics, provide concrete tips for balancing and structuring a data science team. The authors provide tips for balancing and structuring a data science team, recruiting and interviewing the best candidates, and keeping them productive and happy once they're in place. In this report, you'll: Explore data scientist archetypes, such as operations and research, that fit your organization Devise a plan to recruit, interview, and hire members for your data science team Retain your hires by providing challenging work and learning opportunities Explore Agile and OKR methodology to determine how your team will work together Provide your team with a career ladder through guidance and mentorship

Send us a text  This week host Al Martin invites Shadi Copty, who is the Executive Director for AI Tools & Runtimes at IBM. In this episode he talks about automation, tools and multiple IBM AI technologies.

Check us out on: - YouTube - Apple Podcasts - Google Play Music - Spotify - TuneIn - Stitcher Show notes:  00:00 - Check out "Making Data Simple" on YouTube and SoundCloud. 00:10 - Connect with Producer Steve Moore on LinkedIn and Twitter. 00:15 - Connect with Producer Liam Seston on LinkedIn and Twitter. 00:20 - Connect with Producer Rachit Sharma on LinkedIn. 00:25 - Connect with Host Al Martin on LinkedIn and Twitter. 00:40 - Connect with Shadi Copty on LinkedIn. 04:20 What is Data Science? 08:00 What is Watson Studio? 21:45 What is OpenScale? Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

podcast_episode
by Val Kroll , Julie Hoyer , Tim Wilson (Analytics Power Hour - Columbus (OH) , Maryam Jahanshahi (TapRecruit) , Moe Kiss (Canva) , Michael Helbling (Search Discovery)

What's in a job title? That which we call a senior data scientist by any other job title would model as predictively...  This, dear listener, is why the hosts of this podcast crunch data rather than dabble in iambic pentameter. With sincere apologies to William Shakespeare, we sat down with Maryam Jahanshahi to discuss job titles, job descriptions, and the research, experiments, and analysis that she has conducted as a research scientist at TapRecruit, specifically relating to data science and analytics roles. The discussion was intriguing and enlightening! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Summary Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at DataBricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Interview

Introduction How did you get involved in the area of data m

Deep Learning for Search

Deep Learning for Search teaches you how to improve the effectiveness of your search by implementing neural network-based techniques. By the time you're finished with the book, you'll be ready to build amazing search engines that deliver the results your users need and that get better as time goes on! About the Technology Deep learning handles the toughest search challenges, including imprecise search terms, badly indexed data, and retrieving images with minimal metadata. And with modern tools like DL4J and TensorFlow, you can apply powerful DL techniques without a deep background in data science or natural language processing (NLP). This book will show you how. About the Book Deep Learning for Search teaches you to improve your search results with neural networks. You’ll review how DL relates to search basics like indexing and ranking. Then, you’ll walk through in-depth examples to upgrade your search with DL techniques using Apache Lucene and Deeplearning4j. As the book progresses, you’ll explore advanced topics like searching through images, translating user queries, and designing search engines that improve as they learn! What's Inside Accurate and relevant rankings Searching across languages Content-based image search Search with recommendations About the Reader For developers comfortable with Java or a similar language and search basics. No experience with deep learning or NLP needed. About the Author Tommaso Teofili is a software engineer with a passion for open source and machine learning. As a member of the Apache Software Foundation, he contributes to a number of open source projects, ranging from topics like information retrieval (such as Lucene and Solr) to natural language processing and machine translation (including OpenNLP, Joshua, and UIMA). He currently works at Adobe, developing search and indexing infrastructure components, and researching the areas of natural language processing, information retrieval, and deep learning. He has presented search and machine learning talks at conferences including BerlinBuzzwords, International Conference on Computational Science, ApacheCon, EclipseCon, and others. You can find him on Twitter at @tteofili. Quotes A practical approach that shows you the state of the art in using neural networks, AI, and deep learning in the development of search engines. - From the Foreword by Chris Mattmann, NASA JPL A thorough and thoughtful synthesis of traditional search and the latest advancements in deep learning. - Greg Zanotti, Marquette Partners A well-laid-out deep dive into the latest technologies that will take your search engine to the next level. - Andrew Wyllie, Thynk Health Hands-on exercises teach you how to master deep learning for search-based products. - Antonio Magnaghi, System1

Send us a text  This week's guest is Jorge Castanon, a Sr. Data Scientist for Watson Studio at IBM. Host Al Martin and Jorge discuss some typical data problems currently plaguing the industry, and how Watson Studio makes dealing with those problems that much easier. Get ready for an in-depth, technical conversation with two industry experts.

Show Note 00:10 - Connect with Producer Steve Moore on LinkedIn and Twitter.  00:15 - Connect with Producer Liam Seston on LinkedIn and Twitter.  00:20 - Connect with Producer Rachit Sharma on LinkedIn. 00:25 - Connect with Host Al Martin on LinkedIn and Twitter.  00:41 - Connect with Jorge Castanon on LinkedIn and Twitter 05:42 - Check out the machine learning hub here. 09:53 - Unsure what customer churn is? Find out in this article. 20:00 - AI is not magic. Read an article discussion the topic here. 24:34 - Learn about SPSS Modeler here. 35:46 - Check out coursera here. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

HighlightsSpecial interview episode today: Does data science scare you? Does it keep you up at night when you hear or read about it at a panel or on some podcast, and you think to yourself, “I have no idea what they are talking about.”Rest easy and let Chartmetric’s Resident Data Scientist assuage your fears.How do you measure artist success across multiple streaming, social and other Internets platforms? We might have something for you.Mission   Good morning, it’s Jason and Josh here at Chartmetric usually with your 3-minute Data Dump where we upload charts, artists, and playlists into your brain so you can stay up on the latest in the music data world.DateThis is your Data Dump for Wednesday, June 12th, 2019.Interview OutlineWhat is Cross-Platform Performance scoring and ranking on Chartmetric?Josh’s blog article / CPP explanationCPP measurementsStage: This is the amount of “reach” or “exposure” that an artist has over audiences. The bigger the stage, the more people actively listening, watching, or consuming what the artist is creating.Followers: This is the size of an artist’s “fanbase” or an artist’s “stickiness” with audiences. Followers have opted into tracking an artist and therefore are more likely to re-engage with the artist’s products in the future. Followers are not actively engaging with an artist all the time, but artists have an easier job of connecting with followers than non-followers.Cool CPP video to visualize the data science (made by Graphic & Motion Design Artist Anastasiya Bulavkina)Philosophical debate: what is “best” nowadays?Is there a way for people to reach out to you on the Interwebs, Josh?Josh’s LinkedIn profilehi (at) chartmetric (dot) comOutroThat’s it for your Daily Data Dump for Wednesday, June 12th, 2019. This is Josh and Jason from Chartmetric.Free accounts are at app.chartmetric.com/signupAnd article links and show notes are at: podcast.chartmetric.comHappy Wednesday, see you tomorrow!