AI can type faster than you. However, it has been trained on lots of naive or poor code (and a little decent code). Let's explore how you can take advantage of software engineering (and Python) best practices to help tame the bias of the AIs.
talk-data.com
Speaker
Matt Harrison
3
talks
Matt Harrison teaches Python and Data Science. He holds a computer science degree from Stanford University and is a best-selling author on Python and data topics. His books include Effective Pandas, Illustrated Guide to Learning Python 3, Intermediate Python, Learning the Pandas Library, and Effective PyCharm, with recent titles Machine Learning Pocket Reference and Pandas Cookbook (Second Edition). He has taught courses at major companies such as Netflix and NASA, at universities including Stanford, University of Utah, and BYU, and has trained thousands through live online and in-person sessions since 2000.
Bio from: PyBerlin 56 - December event
Filter by Event / Source
Talks & appearances
3 activities · Newest first
Matt Harrison is the author of many of the most successful Python books, including Effective Pandas, Effective XGBoost, The Machine Learning Pocket Reference, and many more. I consider him the top author of Python books and content on the planet.
He stopped by my house to chat about self-publishing technical books, the pros and cons of using a publisher, book piracy, and much more. We both talk about our experiences as best-selling technical authors, and don't hold back in this wide ranging and very candid conversation. Enjoy!
Note - my audio got a bit clippy in spots. Sorry if I blew up your speaker.
Summary Pandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy. The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses. Your host is Tobias Macey and today I’m interviewing Matt Harrison about useful tips for using Pandas for data engineering projects
Interview
Introduction How did you get involved in the area of data management? What are the main tasks that you have seen Pandas used for in a data engineering context? What are some of the common mistakes that can lead to poor performance when scaling to large data sets? What are some of the utility features that you have found most helpful for data processing? One of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas? Pandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows? Pandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maint