talk-data.com

Topic

Scikit-learn

machine_learning data_science data_analysis

63 tagged

Activity Trend: peak of 6 per quarter, 2020-Q1 to 2026-Q2

Activities

63 activities · Newest first

Building Data Science Applications with FastAPI

This comprehensive guide to FastAPI walks readers through developing modern web backends optimized for data science applications. By mastering key concepts like dependency injection and asynchronous programming, you will create high-performing REST APIs and machine-learning-powered systems.

What this Book will help me do

Master asynchronous programming and type hinting in Python for efficient coding.
Design comprehensive RESTful APIs for machine learning with FastAPI.
Build, test, and maintain scalable data science applications.
Integrate Python libraries like NumPy and scikit-learn into web backends.
Deploy modular and efficient FastAPI-backed systems to production.

Author(s)

Voron is a seasoned software developer specializing in web frameworks and data science applications. With a strong background in building scalable systems, they bring invaluable insights on utilizing FastAPI. Voron emphasizes clarity and hands-on learning, sharing their expertise to help developers master the technology efficiently.

Who is it for?

This book is ideal for data scientists and Python developers interested in creating efficient data science backends. If you have foundational knowledge of machine learning concepts and Python programming, this book will enhance your ability to deploy and manage APIs for data-driven applications.

Data Science Projects with Python - Second Edition

Data Science Projects with Python offers a hands-on, project-based approach to learning data science using real-world data sets and tools. You will explore data using Python libraries like pandas and Matplotlib, build machine learning models with scikit-learn, and apply advanced techniques like XGBoost and SHAP values. This book equips you to confidently extract insights, evaluate models, and deliver results with clarity.

What this Book will help me do

Learn to load, clean, and preprocess data using Python and pandas.
Build and evaluate predictive models, including logistic regression and random forests.
Visualize data effectively using Python libraries like Matplotlib.
Master advanced techniques like XGBoost and algorithmic fairness.
Communicate data-driven insights to aid decision making in practical scenarios.

Author(s)

Stephen Klosterman is an experienced data scientist with a strong focus on practical applications of machine learning in business. Combining a rich academic background with hands-on industry experience, he excels at explaining complex concepts in an approachable way. As the author of 'Data Science Projects with Python,' his goal is to provide learners with the skills needed for real-world data science challenges.

Who is it for?

This book is ideal for beginners in data science and machine learning who have some basic programming knowledge in Python. Aspiring data scientists will benefit from its practical, end-to-end examples. Professionals seeking to expand their skillset in predictive modeling and delivering business insights will find this book invaluable. Some foundation in statistics and programming is recommended.

Machine Learning and Data Science Blueprints for Finance

Over the next few decades, machine learning and data science will transform the finance industry. With this practical book, analysts, traders, researchers, and developers will learn how to build machine learning algorithms crucial to the industry. You'll examine ML concepts and over 20 case studies in supervised, unsupervised, and reinforcement learning, along with natural language processing (NLP). Ideal for professionals working at hedge funds, investment and retail banks, and fintech firms, this book also delves deep into portfolio management, algorithmic trading, derivative pricing, fraud detection, asset price prediction, sentiment analysis, and chatbot development. You'll explore real-life problems faced by practitioners and learn scientifically sound solutions supported by code and examples.

This book covers:

Supervised learning regression-based models for trading strategies, derivative pricing, and portfolio management
Supervised learning classification-based models for credit default risk prediction, fraud detection, and trading strategies
Dimensionality reduction techniques with case studies in portfolio management, trading strategy, and yield curve construction
Algorithms and clustering techniques for finding similar objects, with case studies in trading strategies and portfolio management
Reinforcement learning models and techniques used for building trading strategies, derivatives hedging, and portfolio management
NLP techniques using Python libraries such as NLTK and scikit-learn for transforming text into meaningful representations

The Data Science Workshop - Second Edition

The Data Science Workshop provides a comprehensive introduction to building real-world data science projects. Through a hands-on approach, you will learn how to analyze data, build machine learning models, and deploy them effectively in various scenarios. This book is designed to equip you with the skills to confidently tackle data science challenges.

What this Book will help me do

Understand the differences between supervised and unsupervised learning to select the appropriate technique.
Master data manipulation and analysis using popular Python libraries like pandas and scikit-learn.
Develop skills in regression, classification, and clustering to solve diverse data science problems.
Learn advanced methods to improve model accuracy, including hyperparameter tuning and feature engineering.
Implement and deploy machine learning models efficiently in production workflows.

Author(s)

The authors of The Data Science Workshop are experienced professionals and educators in the field of data science and machine learning. They have extensive expertise in using practical methods to solve data challenges and have a passion for teaching others through engaging and clear instructional material.

Who is it for?

This book is ideal for aspiring data analysts, data scientists, and business analysts who wish to build foundational skills in data science. It caters to those new to the field and professionals transitioning to a data-centric role, providing practical knowledge without requiring an advanced mathematical background. Familiarity with Python is recommended.

The Data Science Workshop

The Data Science Workshop is designed for beginners looking to step into the rigorous yet rewarding world of data science. Using a hands-on approach, this book demystifies key concepts and guides you gently into creating practical machine learning models with Python.

What this Book will help me do

Understand supervised and unsupervised learning and their applications.
Gain hands-on experience with Python libraries like scikit-learn and pandas for data manipulation.
Learn practical use cases of machine learning techniques such as regression and clustering.
Discover techniques to ensure robustness in machine learning with hyperparameter tuning and ensembling.
Develop efficiency in feature engineering with automated tools to accelerate workflows.

Author(s)

Anthony So, Thomas Joseph, Robert Thas John, and Andrew Worsley are seasoned experts in data science and Python programming. Along with Dr. Samuel Asare, they bring decades of experience and practical knowledge to this book, delivering an engaging and approachable learning experience.

Who is it for?

This book is targeted toward beginners in data science who are eager to acquire foundational knowledge and practical skills. It appeals to those who prefer a structured, hands-on approach to learning and who may have some prior programming experience or interest in Python. Professionals aspiring to pivot into data-oriented roles or students aiming to strengthen their understanding of data science concepts will find this book particularly valuable. If you're looking to gain confidence in implementing data science projects and solving real-world problems, this text is for you.

Learn Python by Building Data Science Applications

Learn Python by Building Data Science Applications takes a hands-on approach to teaching Python programming by guiding you through building engaging real-world data science projects. This book introduces Python's rich ecosystem and equips you with the skills to analyze data, train models, and deploy them as efficient applications.

What this Book will help me do

Get proficient in Python programming by learning core topics like data structures, loops, and functions.
Explore data science libraries such as NumPy, Pandas, and scikit-learn to analyze and process data.
Learn to create visualizations with Matplotlib and Altair, simplifying data communication.
Build and deploy machine learning models using Python and share them as web services.
Understand development practices such as testing, packaging, and continuous integration for professional workflows.

Author(s)

Kats and Katz are seasoned Python developers with years of experience in teaching programming and deploying data science applications. Their expertise lets them give learners practical knowledge and versatile skills. They combine clear explanations with engaging projects to ensure a rewarding learning experience.

Who is it for?

This book is ideal for individuals new to programming or data science who want to learn Python through practical projects. Researchers, analysts, and ambitious students with minimal coding background but a keen interest in data analysis and application development will find this book beneficial. It's a perfect choice for anyone eager to explore and leverage Python for real-world solutions.

Hands-On Data Analysis with Pandas

Hands-On Data Analysis with Pandas provides an intensive dive into mastering the pandas library for data science and analysis using Python. Through a combination of conceptual explanations and practical demonstrations, readers will learn how to manipulate, visualize, and analyze data efficiently.

What this Book will help me do

Understand and apply the pandas library for efficient data manipulation.
Learn to perform data wrangling tasks such as cleaning and reshaping datasets.
Create effective visualizations using pandas and libraries like matplotlib and seaborn.
Grasp the basics of machine learning and implement solutions with scikit-learn.
Develop reusable data analysis scripts and modules in Python.

Author(s)

Stefanie Molin is a seasoned data scientist and software engineer with extensive experience in Python and data analytics. She specializes in leveraging the latest data science techniques to solve real-world problems. Her engaging and detailed writing draws from her practical expertise, aiming to make complex concepts accessible to all.

Who is it for?

This book is ideal for data analysts and aspiring data scientists who are at the beginning stages of their careers or looking to enhance their toolset with pandas and Python. It caters to Python developers eager to delve into data analysis workflows. Readers should have some programming knowledge to fully benefit from the examples and exercises.

Data Science with Python and Dask

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work!

About the Technology

An efficient data pipeline means everything for the success of a data science project. Dask is a flexible library for parallel computing in Python that makes it easy to build intuitive workflows for ingesting and analyzing large, distributed datasets. Dask provides dynamic task scheduling and parallel collections that extend the functionality of NumPy, Pandas, and Scikit-learn, enabling users to scale their code from a single laptop to a cluster of hundreds of machines with ease.

About the Book

Data Science with Python and Dask teaches you to build scalable projects that can handle massive datasets. After meeting the Dask framework, you’ll analyze data in the NYC Parking Ticket database and use DataFrames to streamline your process. Then, you’ll create machine learning models using Dask-ML, build interactive visualizations, and build clusters using AWS and Docker.

What's Inside

Working with large, structured and unstructured datasets
Visualization with Seaborn and Datashader
Implementing your own algorithms
Building distributed apps with Dask Distributed
Packaging and deploying Dask apps

About the Reader

For data scientists and developers with experience using Python and the PyData stack.

About the Author

Jesse Daniel is an experienced Python developer. He taught Python for Data Science at the University of Denver and leads a team of data scientists at a Denver-based media technology company. We interviewed Jesse as a part of our Six Questions series.

Quotes

The most comprehensive coverage of Dask to date, with real-world examples that made a difference in my daily work. - Al Krinker, United States Patent and Trademark Office
An excellent alternative to PySpark for those who are not on a cloud platform. The author introduces Dask in a way that speaks directly to an analyst. - Jeremy Loscheider, Panera Bread
A greatly paced introduction to Dask with real-world datasets. - George Thomas, R&D Architecture Manhattan Associates
The ultimate resource to quickly get up and running with Dask and parallel processing in Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine

Machine Learning for Finance

Dive deep into how machine learning is transforming the financial industry with 'Machine Learning for Finance'. This comprehensive guide explores cutting-edge concepts in machine learning while providing practical insights and Python code examples to help readers apply these techniques to real-world financial scenarios. Whether tackling fraud detection, financial forecasting, or sentiment analysis, this book equips you with the understanding and tools needed to excel.

What this Book will help me do

Understand and implement machine learning techniques for structured data, natural language, images, and text.
Learn Python-based tools and libraries such as scikit-learn, Keras, and TensorFlow for financial data analysis.
Apply machine learning for tasks like predicting financial trends, detecting fraud, and customer sentiment analysis.
Explore advanced topics such as neural networks, generative adversarial networks (GANs), and reinforcement learning.
Gain hands-on experience with machine learning debugging, product launch preparation, and addressing bias in data.

Author(s)

James Le and Jannes Klaas are experts in machine learning applications in financial technology. Jannes has extensive experience training financial professionals on implementing machine learning strategies in their work and pairs this with a deep academic understanding of the topic. Their dedication to empowering readers to confidently integrate AI and machine learning into financial applications shines through in this user-focused, richly detailed book.

Who is it for?

This book is tailored for financial professionals, data scientists, and enthusiasts aiming to harness machine learning's potential in finance. Readers should have a foundational understanding of mathematics, statistics, and Python programming. If you work in financial services and are curious about applications ranging from fraud detection to trend forecasting, this resource is for you. It's designed for those looking to advance their skills and make impactful contributions in financial technology.

Data Science Projects with Python

Data Science Projects with Python introduces you to data science and machine learning using Python through practical examples. In this book, you'll learn to analyze, visualize, and model data, applying techniques like logistic regression and random forests. With a case-study method, you'll build confidence implementing insights in real-world scenarios.

What this Book will help me do

Set up a data science environment with necessary Python libraries such as pandas and scikit-learn.
Effectively visualize data insights through Matplotlib and summary statistics.
Apply machine learning models including logistic regression and random forests to solve data problems.
Identify optimal models through evaluation techniques like k-fold cross-validation.
Develop confidence in data preparation and modeling techniques for real-world data challenges.

Author(s)

Stephen Klosterman is a seasoned data scientist with a keen interest in practical applications of machine learning. He combines a strong academic foundation with real-world experience to craft relatable content. Stephen excels in breaking down complex topics into approachable lessons, helping learners grow their data science expertise step by step.

Who is it for?

This book is ideal for data analysts, scientists, and business professionals looking to enhance their skills in Python and data science. If you have some experience in Python and a foundational understanding of algebra and statistics, you'll find this book approachable. It offers an excellent gateway to mastering advanced data analysis techniques. Whether you're seeking to explore machine learning or apply data insights, this book supports your growth.
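The blurb above names k-fold cross-validation as a way to compare models. As a rough illustration of the idea only (not taken from the book), the sketch below splits sample indices into k contiguous folds using plain Python; the function name `k_fold_indices` and the sample counts are hypothetical. In practice scikit-learn provides this through `sklearn.model_selection.KFold` and `cross_val_score`.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        # Everything outside the validation slice is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Hypothetical example: 10 samples split into 3 folds.
folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Each sample lands in exactly one validation fold, so a model trained and scored once per fold is evaluated on every sample without ever being scored on data it was trained on in that round.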

Numerical Python: Scientific Computing and Data Science Applications with Numpy, SciPy and Matplotlib

Leverage the numerical and mathematical modules in Python and its standard library as well as popular open source numerical Python packages like NumPy, SciPy, FiPy, matplotlib and more. This fully revised edition, updated with the latest details of each package and changes to Jupyter projects, demonstrates how to numerically compute solutions and mathematically model applications in big data, cloud computing, financial engineering, business management and more.

Numerical Python, Second Edition, presents many brand-new case study examples of applications in data science and statistics using Python, along with extensions to many previous examples. Each of these demonstrates the power of Python for rapid development and exploratory computing due to its simple and high-level syntax and multiple options for data analysis.

After reading this book, readers will be familiar with many computing techniques including array-based and symbolic computing, visualization and numerical file I/O, equation solving, optimization, interpolation and integration, and domain-specific computational problems, such as differential equation solving, data analysis, statistical modeling and machine learning.

What You'll Learn

Work with vectors and matrices using NumPy
Plot and visualize data with Matplotlib
Perform data analysis tasks with Pandas and SciPy
Review statistical modeling and machine learning with statsmodels and scikit-learn
Optimize Python code using Numba and Cython

Who This Book Is For

Developers who want to understand how to use Python and its related ecosystem for numerical computing.

Python Data Science Essentials - Third Edition

Learn the essentials of data science with Python through this comprehensive guide. By the end of this book, you'll have an in-depth understanding of core data science workflows, tools, and techniques.

What this Book will help me do

Understand and apply data manipulation techniques with pandas and NumPy.
Build and optimize machine learning models with scikit-learn.
Analyze and visualize complex datasets for derived insights.
Implement exploratory data analysis to uncover trends in data.
Leverage advanced techniques like graph analysis and deep learning for sophisticated projects.

Author(s)

Alberto Boschetti and Luca Massaron combine their extensive expertise in data science and Python programming to guide readers effectively. With hands-on knowledge and a passion for teaching, they provide practical insights across the data science lifecycle.

Who is it for?

This book is ideal for aspiring data scientists, data analysts, and software developers aiming to enhance their data analysis skills. Suited for beginners familiar with Python and basic statistics, this guide bridges the gap to real-world applications. Advance your career by unlocking crucial data science expertise.

Python Data Analytics: With Pandas, NumPy, and Matplotlib

Explore the latest Python tools and techniques to help you tackle the world of data acquisition and analysis. You'll review scientific computing with NumPy, visualization with matplotlib, and machine learning with scikit-learn. This revision is fully updated with new content on social media data analysis, image analysis with OpenCV, and deep learning libraries. Each chapter includes multiple examples demonstrating how to work with each library. At its heart lies the coverage of pandas, for high-performance, easy-to-use data structures and tools for data manipulation.

Author Fabio Nelli expertly demonstrates using Python for data processing, management, and information retrieval. Later chapters apply what you've learned to handwriting recognition and extending graphical capabilities with the JavaScript D3 library. Whether you are dealing with sales data, investment data, medical data, web page usage, or other data sets, Python Data Analytics, Second Edition is an invaluable reference with its examples of storing, accessing, and analyzing data.

What You'll Learn

Understand the core concepts of data analysis and the Python ecosystem
Go in depth with pandas for reading, writing, and processing data
Use tools and techniques for data visualization and image analysis
Examine popular deep learning libraries Keras, Theano, TensorFlow, and PyTorch

Who This Book Is For

Experienced Python developers who need to learn about Pythonic tools for data analysis

Summary

The responsibilities of a data scientist and a data engineer often overlap and occasionally work at cross purposes. Despite these challenges, it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure.
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers.

A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%.
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%.
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists.

Interview

Introduction
How did you get involved in the area of data management?
The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms?
What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?
Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?
What are the benefits of splitting the responsibilities of data engineering and data science?
What are the disadvantages?
What are some strategies to ensure successful interaction between data engineers and data scientists?
How do you view these roles evolving as they become more prevalent across companies and industries?

Contact Info

Website
wdm0006 on GitHub
@willmcginniser on Twitter
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Blog Post: Tendencies of Data Engineers and Data Scientists
Predikto
Categorical Encoders
DevOps
SciKit-Learn

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Pandas for Everyone: Python Data Analysis, First Edition

The Hands-On, Example-Rich Introduction to Pandas Data Analysis in Python

Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple datasets.

Pandas for Everyone brings together practical knowledge and insight for solving real problems with Pandas, even if you’re new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world problems.

Chen gives you a jumpstart on using Pandas with a realistic dataset and covers combining datasets, handling missing data, and structuring datasets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes. Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability, and introduces you to the wider Python data analysis ecosystem.

Work with DataFrames and Series, and import or export data
Create plots with matplotlib, seaborn, and pandas
Combine datasets and handle missing data
Reshape, tidy, and clean datasets so they’re easier to work with
Convert data types and manipulate text strings
Apply functions to scale data manipulations
Aggregate, transform, and filter large datasets with groupby
Leverage Pandas’ advanced date and time capabilities
Fit linear models using statsmodels and scikit-learn libraries
Use generalized linear modeling to fit models with different response variables
Compare multiple models to select the “best”
Regularize to overcome overfitting and improve performance
Use clustering in unsupervised machine learning

Register your product at informit.com/register for convenient access to downloads, updates, and/or corrections as they become available.

Agile Data Science 2.0

Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools. Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, ElasticSearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.

Build value from your data in a series of agile sprints, using the data-value pyramid
Extract features for statistical models from a single dataset
Visualize data with charts, and expose different aspects through interactive reports
Use historical data to predict the future via classification and regression
Translate predictions into actions
Get feedback from users after each sprint to keep your project on track

Python: Data Analytics and Visualization

Understand, evaluate, and visualize data

About This Book

Learn basic steps of data analysis and how to use Python and its packages
A step-by-step guide to predictive modeling including tips, tricks, and best practices
Effectively visualize a broad set of analyzed data and generate effective results

Who This Book Is For

This book is for Python Developers who are keen to get into data analysis and wish to visualize their analyzed data in a more efficient and insightful manner.

What You Will Learn

Get acquainted with NumPy and use arrays and array-oriented computing in data analysis
Process and analyze data using the time-series capabilities of Pandas
Understand the statistical and mathematical concepts behind predictive analytics algorithms
Data visualization with Matplotlib
Interactive plotting with NumPy, SciPy, and MKL functions
Build financial models using Monte-Carlo simulations
Create directed graphs and multi-graphs
Advanced visualization with D3

In Detail

You will start the course with an introduction to the principles of data analysis and supported libraries, along with NumPy basics for statistics and data processing. Next, you will overview the Pandas package and use its powerful features to solve data-processing problems. Moving on, you will get a brief overview of the Matplotlib API. Next, you will learn to manipulate time and data structures, and load and store data in a file or database using Python packages. You will learn how to apply powerful packages in Python to process raw data into pure and helpful data using examples. You will also get a brief overview of machine learning algorithms, that is, applying data analysis results to make decisions or building helpful products such as recommendations and predictions using Scikit-learn.

After this, you will move on to a data analytics specialization: predictive analytics. Social media and IoT have resulted in an avalanche of data. You will get started with predictive analytics using Python. You will see how to create predictive models from data. You will get balanced information on statistical and mathematical concepts, and implement them in Python using libraries such as Pandas, scikit-learn, and NumPy. You'll learn more about the best predictive modeling algorithms such as Linear Regression, Decision Tree, and Logistic Regression. Finally, you will master best practices in predictive modeling.

After this, you will get all the practical guidance you need to help you on the journey to effective data visualization. Starting with a chapter on data frameworks, which explains the transformation of data into information and eventually knowledge, this path subsequently covers the complete visualization process using the most popular Python libraries with working examples.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

Getting Started with Python Data Analysis, Phuong Vo.T.H & Martin Czygan
Learning Predictive Analytics with Python, Ashish Kumar
Mastering Python Data Visualization, Kirthi Raman

Style and approach

The course acts as a step-by-step guide to get you familiar with data analysis and the libraries supported by Python with the help of real-world examples and datasets. It also helps you gain practical insights into predictive modeling by implementing predictive-analytics algorithms on public datasets with Python. The course offers a wealth of practical guidance to help you on this journey to data visualization.

Summary

Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. Its container-based task graph also lets you run your analysis in whatever languages you want. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Daniel Whitenack about Pachyderm, a modern container-based system for building and analyzing a versioned data lake.

Interview with Daniel Whitenack

Introduction How did you get started in the data engineering space? What is Pachyderm and what problem were you trying to solve when the project was started? Where does the name come from? What are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options? Because the analysis code and the data that it acts on are all versioned together, it allows for tracking the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics? What does Pachyderm use for the distribution and scaling mechanism of the file system? Given that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data? For a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow? Given that the state of the data is calculated by applying the diffs in sequence, what impact does that have on processing speed and what are some of the ways of mitigating that? Another compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?

How did you implement this feature so that it would be maintainable and easy to implement for end users?

Given that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. What are some things that users should be aware of to help mitigate this? The data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow? I see that the docs mention using the map reduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill? What are some of the most interesting deployments and uses of Pachyderm that you have seen? What are some of the areas that you are looking for help from the community and are there any particular issues that the listeners can check out to get started with the project?

Keep in touch

Daniel

Twitter – @dwhitena

Pachyderm

Website

Free Weekend Project

GopherNotes

Links

AirBnB, RethinkDB, Flocker, Infinite Project, Git LFS, Luigi, Airflow, Kafka, Kubernetes, Rkt, SciKit Learn, Docker, Minikube, General Fusion

The intro and outro music is from The Hug by The Freak Fandango Or

Python Data Science Handbook

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all—IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python. With this handbook, you’ll learn how to use: IPython and Jupyter: provide computational environments for data scientists using Python NumPy: includes the ndarray for efficient storage and manipulation of dense data arrays in Python Pandas: features the DataFrame for efficient storage and manipulation of labeled/columnar data in Python Matplotlib: includes capabilities for a flexible range of data visualizations in Python Scikit-Learn: for efficient and clean Python implementations of the most important and established machine learning algorithms

Introduction to Machine Learning with Python

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination. You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book. With this book, you’ll learn: Fundamental concepts and applications of machine learning Advantages and shortcomings of widely used machine learning algorithms How to represent data processed by machine learning, including which data aspects to focus on Advanced methods for model evaluation and parameter tuning The concept of pipelines for chaining models and encapsulating your workflow Methods for working with text data, including text-specific processing techniques Suggestions for improving your machine learning and data science skills