Everybody knows our yellow vans, trucks and planes around the world. But do you know how data drives our business and how we leverage algorithms and technology in our core operations? We will share some “behind the scenes” insights into Deutsche Post DHL Group’s journey towards becoming a data-driven company.
• Large-scale use cases: challenging, high-impact use cases in all major areas of logistics, including Computer Vision and NLP
• Fancy algorithms: deep neural networks, TSP solvers and the standard toolkit of a data scientist
• Modern tooling: cloud platforms, Kubernetes, Kubeflow, AutoML
• No rusty working mode: small, self-organized, agile project teams combining state-of-the-art machine learning with MLOps best practices
• A young, motivated and international team – German skills are only “nice to have”
But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life, and share our approach to a time series forecasting library – combining data science, software engineering and technology for efficient and easy-to-maintain machine learning projects.
Debugging is hard. Distributed debugging is hell.
Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.
However, when things go wrong, life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.
In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.
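As a taste of that tooling, here is a minimal sketch of two of Dask's built-in observability hooks: the `performance_report` context manager and structured cluster events. The workload, event topic, and file name below are placeholders for illustration, not examples from the talk.

```python
from dask.distributed import Client, performance_report
import dask.array as da

client = Client()  # local cluster; prints a link to the live dashboard

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

# Capture a static HTML report of the task stream, worker profiles,
# and transfer bandwidth for everything computed inside the block.
with performance_report(filename="dask-report.html"):
    x.T.dot(x).mean().compute()

# Structured, cluster-wide event logging for custom monitoring.
client.log_event("analysis", {"step": "matmul", "status": "done"})
print(client.get_events("analysis"))
```

The HTML report is shareable after the fact, which makes it useful for root-causing slowness that only shows up on someone else's cluster.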
This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.
Even though every piece of data science work is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.
The title “Data Scientist” has been in use for 15 years now. We have been attending PyData conferences for over 10 years as well. The hype around data science and AI seems higher than ever before. But how are we managing?
When handling large amounts of data, memory profiling the data science workflow becomes more important. It gives you insight into which parts of your workflow consume the most memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.
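For context, here is a minimal sketch of Memray's Python API; the tracked workload and capture file name are placeholder assumptions, and the Jupyter plugin mentioned in the talk layers a cell magic on top of the same machinery.

```python
import memray

# Record every allocation made inside the block into a capture file.
with memray.Tracker("workflow.bin"):
    data = [list(range(10_000)) for _ in range(1_000)]  # memory-hungry step

# Render the capture as an interactive flame graph afterwards:
#   memray flamegraph workflow.bin
```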
Building machine learning systems with high predictive accuracy is inherently hard, and embedding these systems into great product experiences is doubly so. To build truly great machine learning products that reach millions of users, organizations need to marry great data science expertise with strong attention to user experience, design thinking, and a deep consideration for the impact of predictions on users and stakeholders. So how do you do that? Today’s guest is Sam Stone, Director of Product Management, Pricing & Data at Opendoor, a real-estate technology company that leverages machine learning to streamline the home buying and selling process. Sam played an integral part in developing AI/ML products related to home pricing, including the Opendoor Valuation Model (OVM), market liquidity forecasting, portfolio optimization, and resale decision tooling. Prior to Opendoor, he was a co-founder and product manager at Ansaro, a SaaS startup using data science and machine learning to help companies improve hiring decisions. Sam holds degrees in Math and International Relations from Stanford and an MBA from Harvard. Throughout the episode, we spoke about his principles for great ML product design, how to think about data collection for these types of products, how to package outputs from a model within a slick user interface, what interpretability means in the eyes of customers, how to be proactive about monitoring failure points, and much more.
Snowflake as a data platform is the core data repository of many large organizations.
With the introduction of Snowflake's Snowpark for Python, Python developers can now collaborate and build on one platform with a secure Python sandbox, providing developers with dynamic scalability & elasticity as well as security and compliance.
In this talk, I'll explain the core concepts of Snowpark for Python and how they can be used for large-scale feature engineering and data science.
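As a hedged illustration of those concepts, a Snowpark feature-engineering job might look roughly like this; the table, column names, and connection parameters are placeholders, not examples from the talk.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Placeholder credentials; fill in your own account details.
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Feature engineering is expressed in Python but executed inside
# Snowflake, so the data never leaves the platform.
features = (
    session.table("ORDERS")
    .group_by("CUSTOMER_ID")
    .agg(avg(col("ORDER_TOTAL")).alias("AVG_ORDER_TOTAL"))
)
features.write.save_as_table("CUSTOMER_FEATURES", mode="overwrite")
```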
The nightmare before data science production: You found a working prototype for your problem using a Jupyter notebook and now it's time to build a production grade solution from that notebook. Unfortunately, your notebook looks anything but production grade. The good news is, there's finally a cure!
The open-source Python package LineaPy aims to automate data science workflow generation, expediting the process of going from data science development to production. It transforms messy notebooks into data pipelines for orchestrators like Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it!
In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
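As a rough sketch of that workflow, the snippet below marks an artifact and emits a DVC pipeline; the artifact name, data file, and pipeline arguments are invented for illustration and may differ between LineaPy versions.

```python
# Runs inside a LineaPy-instrumented notebook or session.
import lineapy
import pandas as pd

df = pd.read_csv("train.csv")   # messy exploratory code lives around here
clean = df.dropna()

# Mark the value you care about; LineaPy slices the notebook down to
# exactly the code needed to reproduce it.
lineapy.save(clean, "clean_training_data")

# Emit a runnable pipeline for your orchestrator of choice (here: DVC).
lineapy.to_pipeline(
    artifacts=["clean_training_data"],
    framework="DVC",
    pipeline_name="prep_pipeline",
    output_dir="./pipeline",
)
```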
Ofcom is the government-approved regulatory and competition authority for the broadcasting, telecommunications and postal industries of the United Kingdom. It plays a vital role in ensuring TV, radio and telecoms work as they should. With vast swathes of information from a wide range of sources, data plays a huge role in the way Ofcom operates - in this episode, we learn the key drivers of Ofcom’s data strategy. Richard Davis is the Chief Data Officer at Ofcom, responsible for enabling data and analytics capabilities across the organisation. Prior to Ofcom, Richard worked as a Quantitative Analyst as well as being the former Head of Analytics and Innovation at Lloyds Bank, giving him a wealth of experience across a variety of data roles. Richard describes his experience of joining Ofcom in 2022, his ambition to bring in new processes, and how he leverages the community of data professionals. Richard also shares his advice for a new data leader, which includes understanding the pain points of the team, making insights more efficient, and keeping data teams aligned with the business's needs. He also elaborates on the key components of the data strategy at Ofcom, including aligning to good data, good people, and good decisions.
Also discussed is the importance of cultural change in an organization and how to upskill data experts and train non-data specialists in data literacy, the difference between technical experts and people managers, and how organizations can enable people to grow to become technical leaders. Finally, Richard emphasizes the importance of evidence-based regulation, and how data literacy supports effective output. Richard provides excellent insight into the world of regulatory data, the challenges faced by Ofcom, and the solutions they can implement to overcome them.
Today I’m chatting with Josh Noble, Principal User Researcher at TruEra. TruEra is working to improve AI quality by developing products that help data scientists and machine learning engineers improve their AI/ML models by combatting things like bias and improving explainability. Throughout our conversation, Josh—who also used to work as a Design Lead at IDEO.org—explains the unique challenges and importance of doing design and user research, even for technical users such as data scientists. He also shares tangible insights on what informs his product design strategy, the importance of measuring product success accurately, and the importance of understanding the current state of a solution when trying to improve it.
Highlights / Skip to:
Josh introduces himself and explains why it’s important to do design and user research work for technical tools used by data scientists (00:43)
The work that TruEra does to mitigate bias in AI as well as their broader focus on AI quality management (05:10)
Josh describes how user roles informed the design of TruEra’s upcoming monitoring product, and the emphasis he places on iterating with users (10:24)
How Josh approaches striking a balance between displaying extraneous information in the tools he designs vs. removing explainability (14:28)
Josh explains how TruEra measures product success now and how they envision that changing in the future (17:59)
The difference Josh sees between explainability and interpretability (26:56)
How Josh decided to go from being a designer to getting a data science degree (31:08)
Josh gives his take on what skills are most valuable as a designer and how to develop them (36:12)
Quotes from Today’s Episode
“We want to make machine learning better by testing it, helping people analyze it, helping people monitor models. Bias and fairness is an important part of that, as is accuracy, as is explainability, and as is more broadly AI quality.” — Josh Noble (05:13)
“These two groups, the data scientists and the machine-learning engineer, they think quite differently about the problems that they need to solve. And they have very different toolsets. … Looking at how we can think about making a product and building tools that make sense to both of those different groups is a really important part of user experience.” – Josh Noble (09:04)
“I’m a big advocate for iterating with users. To the degree possible, get things in front of people so they can tell you whether it works for them or not, whether it fits their expectations or not.” – Josh Noble (12:15)
“Our goal is to get people to think about AI quality differently, not to necessarily change. We don’t want to change their performance metrics. We don’t want to make them change how they calculate something or change a workflow that works for them. We just want to get them to a place where they can bring together our four pillars and build better models and build better AI.” – Josh Noble (17:38)
“I’ve always wanted to know what was going on underneath the design. I think it’s an important part of designing anything to understand how the thing that you are making is actually built.” – Josh Noble (31:56)
“There’s an empathy-building exercise that comes from using these tools and understanding where they come from. I do understand the argument that some designers make. If you want to find a better way to do something, spending a ton of time in the trenches of the current way that it’s done is not always the solution, right?” – Josh Noble (36:12)
“There’s a real empathy that you build and understanding that you build from seeing how your designs are actually implemented that makes you a better teammate. It makes you a better collaborator and ultimately, I think, makes you a better designer because of that.” – Josh Noble (36:46)
“I would say to the non-designers who work with designers, measuring designs is not invalidating the designer. It doesn’t invalidate the craft of design. It shouldn’t be something that designers are hesitant to do. I think it’s really important to understand in a qualitative way what your design is doing and understand in a quantitative way what your design is doing.” – Josh Noble (38:18)
Links
Truera: https://truera.com/
Medium: https://medium.com/@fctry2
The concept of literate programming, or the idea of programming in a document, was first introduced in 1984 by Donald Knuth. And as of today, notebooks are now the de facto tool for doing data science work. So as the data tooling space continues to evolve at breakneck speed, what are the possible directions the data science notebook can take? In this episode of DataFramed, we talk with Dr. Jodie Burchell, Data Science Developer Advocate at JetBrains, to find out how data science notebooks evolved into what they are today, what her predictions are for the future of notebooks and data science, and how generative AI will impact data teams going forward. Jodie completed a Ph.D. in clinical psychology and a postdoc in biostatistics before transitioning into data science. She has since worked for 7 years as a data scientist, developing products ranging from recommendation systems to audience profiling. She is also a prolific content creator in the data science community. Throughout the episode, Jodie discusses the evolution of data science notebooks over the last few years, noting how the move to remote-based notebooks has allowed for the seamless development of more complex models straight from the notebook environment. Jodie and Adel’s conversation also covers tooling challenges that have led to modern IDEs and notebooks, with Jodie highlighting the importance of good database tooling and visibility. She shares how data science notebooks have evolved to help democratize data for the wider organization, the tradeoffs between engineering-led approaches to tooling compared to data science approaches, what generative AI means for the data profession, her predictions for data science, and more. Tune in to this episode to learn more about the evolution of data science notebooks and the challenges and opportunities facing the data science community today.
Links mentioned in the show:
DataCamp Workspace: An in-Browser Notebook IDE
JetBrains' Datalore
Nick Cave on ChatGPT song lyrics imitating his style
GitHub Copilot
More on the topic:
The Past, Present, And Future of The Data Science Notebook
How to Use Jupyter Notebooks: The Ultimate Guide
We talked about:
Shir’s background
Debrief culture
The responsibilities of a group manager
Defining the success of a DS manager
The three pillars of data science management
Managing up
Managing down
Managing across
Managing data science teams vs business teams
Scrum teams, brainstorming, and sprints
The most important skills and strategies for DS and ML managers
Making sure proofs of concept get into production
Links:
The secret sauce of data science management: https://www.youtube.com/watch?v=tbBfVHIh-38
Lessons learned leading AI teams: https://blogs.intuit.com/2020/06/23/lessons-learned-leading-ai-teams/
How to avoid conflicts and delays in the AI development process (Part I): https://blogs.intuit.com/2020/12/08/how-to-avoid-conflicts-and-delays-in-the-ai-development-process-part-i/
How to avoid conflicts and delays in the AI development process (Part II): https://blogs.intuit.com/2021/01/06/how-to-avoid-conflicts-and-delays-in-the-ai-development-process-part-ii/
Leading AI teams deck: https://drive.google.com/drive/folders/1_CnqjugtsEbkIyOUKFHe48BeRttX0uJG
Leading AI teams video: https://www.youtube.com/watch?app=desktop&v=tbBfVHIh-38
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html
Discover how to effectively forecast time series data using Prophet, the versatile open-source tool developed by Meta. Whether you're a business analyst or a machine learning expert, this book provides comprehensive insights into creating, diagnosing, and refining forecasting models. By mastering Prophet, you'll be equipped to make accurate predictions that drive decisions.
What this Book will help me do
Master the core principles of using Prophet for time series forecasting.
Ensure your forecasts are accurate and robust for better decision-making.
Gain experience in handling real-world forecasting challenges, like seasonality and outliers.
Learn how to fine-tune and optimize models using additional regressors.
Understand productionalization of forecasting models to apply solutions at scale.
Author(s)
Greg Rafferty is a seasoned data scientist specializing in time series analysis and machine learning. With years of practical experience building forecasting models in industries ranging from finance to e-commerce, Greg is dedicated to teaching accessible and actionable approaches to data science. Through clear explanations and practical examples, he empowers readers to solve challenging forecasting problems with confidence.
Who is it for?
Ideal for data scientists, business analysts, machine learning engineers, and software developers seeking to enhance their forecasting skills with Prophet. Whether you're familiar with time series concepts or just starting to explore forecasting methods, this book helps you advance from fundamental understanding to practical application of state-of-the-art techniques for impactful results.
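As a flavor of what the book covers, a minimal Prophet forecast looks roughly like this; the CSV file and 90-day horizon are placeholder assumptions, and Prophet expects a dataframe with `ds` (date) and `y` (value) columns.

```python
import pandas as pd
from prophet import Prophet

df = pd.read_csv("sales.csv")  # placeholder data with 'ds' and 'y' columns

m = Prophet(yearly_seasonality=True)
m.fit(df)

# Extend the timeline 90 days past the training data and predict,
# including uncertainty intervals around each point forecast.
future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```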
In 2023, businesses are relying more heavily on data science and analytics teams than ever before. However, simply having a team of talented individuals is not enough to guarantee success. In the last of our RADAR 2023 sessions, Vijay Yadav and Vanessa Gonzalez will outline the keys to building high-impact data teams in 2023. They will discuss the hallmarks of a high-performing data team, the diversity of backgrounds and skill sets needed to build impactful data teams, setting up career pathways for data scientists, and more. Vijay Yadav is a highly respected data and analytics thought leader with over 20 years of experience in data product development, data engineering, and advanced analytics. As Director of Quantitative Sciences - Digital, Data, and Analytics at Merck, he leads data & analytics teams in creating AI/ML-driven data products to drive digital transformation. Vijay has held numerous leadership positions at various companies and is known for his ability to lead global teams to achieve high-impact results. Vanessa Gonzalez is the Sr. Director of Data Science and Innovation at Businessolver, where she leads the Computational Linguistics, Machine Learning Engineering, Data Science, BI Analytics, and BI Engineering teams. She is experienced in leading data transformations and performing analytical and management functions that contribute to the goals and growth objectives of organizations and divisions. Listen in as Vanessa and Vijay share how to enable data teams to flourish in an ever-evolving data landscape.
Summary
The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today
Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Grainite is and the story behind it?
What are the personas that you are focused on addressing with Grainite?
What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite?
How does Grainite work to reduce that complexity?
What are some of the commonalities that you see in the teams/organizations that find their way to Grainite?
What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a self-managed streaming architecture?
Can you describe how Grainite is architected?
How have the design and goals of the platform changed/evolved since you first started working on it?
When it comes to simulation, we're all really asking the same question: are we living in one? Alas! We did not tackle that on this episode. Instead, with Julie Hoyer as a guest co-host while Moe is on leave, we were joined by Frances Sneddon, the CTO of Simul8, to dig into some of the nuts and bolts of simulation as a tool for improving processes. It turns out that effectively putting simulations to use means focusing on some of the same foundational aspects of effectively using analytics, data science, or experimentation: clearly defining the problem, tapping into the domain experts to actually understand the process or scenario of focus, and applying some level of "art" to complement the science of the work! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
Data leaders play a critical role in driving innovation and growth in various industries, and this is particularly true in highly regulated industries such as aviation. In such industries, data leaders face unique challenges and opportunities, working to balance the need for innovation with strict regulatory requirements. This week’s guest is Derek Cedillo, who has 27 years of experience working in Data and Analytics at GE Aerospace. Derek currently works as a Senior Manager for GE Aerospace’s Remote Monitoring and Diagnostics division, having previously worked as the Senior Director for Data Science and Analytics. In the episode, Derek shares the key components to successfully managing a Data Science program within a large and highly regulated organization. He also shares his insights on how to standardize data science planning across various projects and how to get Data Scientists to think and work in an agile manner. We hear about ideal data team structures, how to approach hiring, and what skills to look for in new hires. The conversation also touches on what responsibility Data Leaders have within organizations, championing data-driven decisions and strategy, as well as the complexity Data Leaders face in highly regulated industries. When it comes to solving problems that provide value for the business, engagement and transparency are key aspects. Derek shares how to ensure that expectations are met through clear and frank conversations with executives that try to align expectations between management and Data Science teams.
Finally, you'll learn about validation frameworks, best practices for teams in less regulated industries, what trends to look out for in 2023 and how ChatGPT is changing how executives define their expectations from Data Science teams.
Links mentioned in the show: The Checklist Manifesto by Atul Gawande Team of Teams by General Stanley McChrystal The Harvard Data Science Review Podcast
Relevant Links from DataCamp:
Article: Storytelling for More Impactful Data Science
Course: Data Communication Concepts
Course: Data-Driven Decision-Making for Business
Summary
As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today
RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization
Interview
Introduction
How did you get involved in the area of data management?
Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?
How has the scope and complexity of implementing security controls on data systems changed in recent years?
In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?
What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?
How much of the problem is technical vs. procedural/organizational?
As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)
What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)
What are some of the ways that data security and organizational productivity are at odds with each other?
What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?
What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?
How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use?
How can education about the motivations for different security practices improve compliance and user experience?
What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology?
What are the areas of data security that still need improvements?
Contact Info
Yoav Cohen
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows.
Podcast.init covers the Python language, its community, and the innovative ways it is being used.
The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Satori
Podcast Episode
Data Masking
RBAC == Role Based Access Control
ABAC == Attribute Based Access Control
Gartner Data Security Platform Report
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Oftentimes, Kaggle competitions are looked at as an excellent way for data scientists to sharpen their machine learning skills and become technically excellent. This raises the question: what are the hallmarks of high-performing Kaggle competitors? What makes a Kaggle Grandmaster? Today’s guest, Jean-Francois Puget, PhD, distinguished engineer at NVIDIA, has achieved this impressive feat three times. Throughout the episode, Richie and Jean-Francois discuss his background and how he became a Kaggle Grandmaster. He shares his scientific approach to machine learning and how he uses it to consistently achieve high results in Kaggle competitions. Jean-Francois also discusses how NVIDIA employs nine Kaggle Grandmasters and how they use Kaggle experiments to breed innovation in solving their machine learning challenges. He expands on the toolkit he employs in solving Kaggle competitions, and how he has achieved 50x efficiency improvements using tools like RAPIDS. Richie and Jean-Francois also delve into the difference between competitive data science on Kaggle and machine learning work in a real-world setting. They deep dive into the challenges of real-world machine learning, and how to resolve the ambiguities of using machine learning in production that data scientists don’t encounter in Kaggle competitions.
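For a sense of how a library like RAPIDS can deliver that kind of speed-up, here is a minimal, hypothetical cuDF sketch (requires an NVIDIA GPU; the file and column names are placeholders): cuDF keeps the familiar pandas API while executing on the GPU.

```python
import cudf  # RAPIDS GPU dataframe library

# Load and aggregate entirely on the GPU using a pandas-like API,
# so typical tabular feature engineering needs few code changes.
df = cudf.read_csv("train.csv")
features = df.groupby("user_id").agg({"amount": ["mean", "sum"]})
print(features.head())
```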
Summary
With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Your host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what NetSpring is and the story behind it?
What are the activities that constitute "product analytics" and what are the roles/teams involved in those activities?
When teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ?
Can you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries?
How does a warehouse-native approach simplify that effort?
There are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem?
How is the NetSpring platform implemented to be able to best take advantage of modern warehouse technologies and the associated data stacks?
What are the pre-requisites for an organization's infrastructure/data maturity for being able to benefit from NetSpring?
How have the goals and implementation of the NetSpring platform evolved from when you first started working on it?
Can you describe the steps involved in integrating NetSpring with an organization's existing warehouse?
What are the signals that NetSpring uses to understand the customer journeys of different organizations?
How do you manage the variance of the data models in the warehouse while providing a consistent experience for your users?
Given that you are a product organization, how are you using NetSpring to power NetSpring?
What are the most interesting, innovative, or unexpected ways that you have seen NetSpring used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on NetSpring?
When is NetSpring the wrong choice?
What do you have planned for the future of NetSpring?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows.
Podcast.init covers the Python language, its community, and the innovative ways it is being used.
The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
NetSpring
ThoughtSpot
Product Analytics
Amplitude
Mixpanel
Customer Data Platform
GDPR
CCPA
Segment
Podcast Episode
Rudderstack
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA