talk-data.com

Topic: Pandas

Tags: data_manipulation, data_analysis, python

187 tagged activities

Activity Trend

Peak of 17 activities per quarter, spanning 2020-Q1 through 2026-Q2

Activities

187 activities · Newest first

Python for Data Analysis, 3rd Edition

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, and Jupyter in the process. Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

Use the Jupyter notebook and IPython shell for exploratory computing
Learn basic and advanced features in NumPy
Get started with data analysis tools in the pandas library
Use flexible tools to load, clean, transform, merge, and reshape data
Create informative visualizations with matplotlib
Apply the pandas groupby facility to slice, dice, and summarize datasets
Analyze and manipulate regular and irregular time series data
Learn how to solve real-world data analysis problems with thorough, detailed examples
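As a taste of the groupby facility the blurb mentions, here is a minimal pandas sketch; the column names and data are invented for illustration:

```python
import pandas as pd

# Toy sales data; the columns are purely illustrative.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "product": ["a", "a", "b", "b"],
    "revenue": [100, 150, 200, 50],
})

# Slice, dice, and summarize: total and mean revenue per region.
summary = df.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```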

Python for Data Science

Python is an ideal choice for accessing, manipulating, and gaining insights from data of all kinds. Python for Data Science introduces you to the Pythonic world of data analysis with a learn-by-doing approach rooted in practical examples and hands-on activities. You'll learn how to write Python code to obtain, transform, and analyze data, practicing state-of-the-art data processing techniques for use cases in business management, marketing, and decision support. You will discover Python's rich set of built-in data structures for basic operations, as well as its robust ecosystem of open-source libraries for data science, including NumPy, pandas, scikit-learn, matplotlib, and more. Examples show how to load data in various formats, how to streamline, group, and aggregate data sets, and how to create charts, maps, and other visualizations. Later chapters go in-depth with demonstrations of real-world data applications, including using location data to power a taxi service, market basket analysis to identify items commonly purchased together, and machine learning to predict stock prices.

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Microservices is an increasingly popular architecture, much loved by application teams because it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where data from different services comes together to be joined and aggregated. The data platform can serve as a single source of company facts, enabling near-real-time analytics and secure sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC is a scalable solution well suited to stable platforms, but it poses several challenges for evolving services: frequent schema changes, complex or unsupported DDL during migration, and automated deployments, to name a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps verify that events are published according to the agreed contracts between data producers and consumers. It also separates internal service logic from the data consumed downstream. Services write their events to Kafka using the registered schemas, with a specific topic per event type.
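As a rough sketch of that producer-side pattern, the snippet below uses the confluent-kafka Python client; the Avro schema, topic name, and broker/registry addresses are all placeholders, not details from the talk:

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Placeholder Avro schema representing the agreed event contract.
schema_str = """
{"type": "record", "name": "OrderCreated",
 "fields": [{"name": "order_id", "type": "string"},
            {"name": "amount", "type": "double"}]}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = SerializingProducer({
    "bootstrap.servers": "kafka:9092",
    "value.serializer": serializer,
})

# Each event type gets its own topic, serialized with the registered schema.
producer.produce(topic="orders.order-created",
                 value={"order_id": "o-1", "amount": 42.0})
producer.flush()
```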

Data teams can use Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from the schema registry is used to validate events against the expected schema version. A merge operation is sometimes applied to translate events into the final state of each record, per business requirements.
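A minimal sketch of that ingestion step is below, assuming an active SparkSession named spark with Delta Lake configured (as on Databricks); the topic, table, key column, and checkpoint path are illustrative:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Stream raw events from Kafka into a DataFrame of string key/value pairs.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.order-created")
    .load()
    .select(F.col("key").cast("string"),
            F.col("value").cast("string")))

def upsert_to_bronze(batch_df, batch_id):
    # Merge each micro-batch so later events collapse into the
    # final state of each record.
    bronze = DeltaTable.forName(spark, "bronze.orders")
    (bronze.alias("t")
        .merge(batch_df.alias("s"), "t.key = s.key")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = (events.writeStream
    .foreachBatch(upsert_to_bronze)
    .option("checkpointLocation", "/chk/bronze_orders")
    .start())
```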

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to enforce data quality and business rules. The pipeline lets Engineering and Analytics collaborate by mixing Python and SQL. The refined data sets are then fed into AutoML for discovery and baseline modeling.
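The Delta Live Tables piece might look like the sketch below; it only runs inside a Databricks DLT pipeline, and the table names and expectation rule are invented for illustration:

```python
import dlt
from pyspark.sql import functions as F

# Silver table refined from the Bronze stream, with a data-quality
# expectation that drops rows violating the business rule.
@dlt.table(name="silver_orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return (dlt.read_stream("bronze_orders")
            .withColumn("ingested_at", F.current_timestamp()))
```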

To expose Gold tables to more consumers, especially non-Spark users across clouds, data teams can implement Delta Sharing. Recipients can access Silver tables from a different cloud and build their own analytics data sets. Analytics teams can also access Gold tables via the pandas Delta Sharing client and BI tools.
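On the recipient side, the pandas Delta Sharing client is a one-liner; in this sketch the profile file and the share/schema/table path are placeholders supplied by the data provider:

```python
import delta_sharing

# Profile file issued by the provider; table path format is
# "<profile>#<share>.<schema>.<table>".
url = "config.share#analytics_share.gold.daily_metrics"

# Load a shared Gold table straight into a pandas DataFrame;
# no Spark is required on the recipient side.
df = delta_sharing.load_as_pandas(url)
print(df.head())
```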


Auto Encoder Decoder-Based Anomaly Detection with the Lakehouse Paradigm

Auto-encoder-decoder is a type of deep learning neural network architecture with an hourglass shape: high-dimensional inputs are compressed to a latent space by the encoder, and the decoder mirrors the encoder architecture to reconstruct the input data from the latent space. Auto-encoder-decoder models are commonly used for anomaly detection: after training, the reconstruction error of normal data is minimized, so an anomaly can be detected when its reconstruction error rises above the "normal threshold". This presentation will demonstrate an auto-encoder-decoder anomaly detection solution built with the Lakehouse paradigm, from data management to post-deployment monitoring, covering the entire model life cycle. It will also highlight the flexibility and scalability that MLflow custom models and Pandas UDFs bring when a large number of individual models need to be trained, deployed, and monitored in parallel.
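To make the reconstruction-error idea concrete, here is a toy sketch of the general technique using Keras with synthetic data; the layer sizes and the mean-plus-three-sigma threshold are illustrative choices, not the presenters' actual model:

```python
import numpy as np
from tensorflow import keras

n_features = 30
model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(16, activation="relu"),    # encoder
    keras.layers.Dense(4, activation="relu"),     # latent space
    keras.layers.Dense(16, activation="relu"),    # decoder mirrors encoder
    keras.layers.Dense(n_features, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

x_train = np.random.rand(1000, n_features)  # stand-in for normal data
model.fit(x_train, x_train, epochs=10, batch_size=64, verbose=0)

# Reconstruction error on normal data defines the "normal threshold".
errors = np.mean((x_train - model.predict(x_train)) ** 2, axis=1)
threshold = errors.mean() + 3 * errors.std()

def is_anomaly(batch):
    recon = model.predict(batch)
    return np.mean((batch - recon) ** 2, axis=1) > threshold
```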


PySpark in Apache Spark 3.3 and Beyond

PySpark has rapidly evolved with the momentum of Project Zen, introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, implemented visualization, and more. Most importantly, the Pandas API on Spark was introduced in Apache Spark 3.2, exposing the pandas API running on top of Apache Spark, and it has gained a lot of popularity.

In Apache Spark 3.3, the Project Zen effort continued, and PySpark gained many notable changes: more API coverage and a faster default index in the Pandas API on Spark, datetime.timedelta support, a new PyArrow batch interface, better autocompletion, a profiler for Python and Pandas UDFs, and new error classification.

In this talk, we will introduce what is new in PySpark in Apache Spark 3.3, and what comes next beyond it, covering the current effort and roadmap of PySpark.
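For readers who have not tried it, the Pandas API on Spark mentioned above looks like ordinary pandas code; this is a minimal sketch with made-up data:

```python
import pyspark.pandas as ps

# pandas-style syntax, executed on the Spark cluster.
psdf = ps.DataFrame({
    "region": ["east", "west", "east"],
    "revenue": [100, 150, 200],
})

total = (psdf.groupby("region")["revenue"]
             .sum()
             .sort_values(ascending=False))
print(total.head())
```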


Deep Dive into the New Features of Apache Spark 3.2 and 3.3

Apache Spark has become the most widely used engine for data engineering, data science, and machine learning on single-node machines or clusters. Its monthly Maven downloads have rapidly climbed to 20 million.

We will talk about the higher-level features and improvements in Spark 3.2 and 3.3. The talk also dives deeper into the following features:

+ Introducing the pandas API on Apache Spark to unify the small-data and big-data APIs
+ Completing the ANSI SQL compatibility mode to simplify migration of SQL workloads
+ Productionizing adaptive query execution to speed up Spark SQL at runtime
+ Introducing the RocksDB state store to make state processing more scalable
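As a rough illustration of enabling some of these features, the session below sets the relevant configuration flags; the flag names come from the Spark configuration docs, and defaults vary by version (adaptive query execution, for instance, is already on by default in recent releases):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("spark-33-features")
    # ANSI SQL compatibility mode
    .config("spark.sql.ansi.enabled", "true")
    # Adaptive query execution
    .config("spark.sql.adaptive.enabled", "true")
    # RocksDB-backed state store for streaming state
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state."
            "RocksDBStateStoreProvider")
    .getOrCreate())
```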


FugueSQL—The Enhanced SQL Interface for Pandas and Spark DataFrames

SQL users working with Pandas and Spark quickly realize that SQL is a second-class interface, invoked in fragments between predominantly Python code.

We will introduce FugueSQL, an enhanced SQL interface that allows SQL lovers to express end-to-end workflows predominantly in SQL. With a Jupyter notebook extension, SQL commands can be used in Databricks notebooks for interactive handling of in-memory datasets. This allows heavy SQL users to fully leverage Spark in their preferred grammar.
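A minimal FugueSQL sketch, assuming the fsql entry point from the fugue_sql module (API details vary across Fugue versions) and a toy DataFrame:

```python
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"user": ["a", "a", "b"], "amount": [1.0, 2.0, 3.0]})

# The workflow is expressed almost entirely in SQL; run() with no
# arguments executes on pandas, while run("spark") would execute
# the same query on Spark.
fsql("""
SELECT user, SUM(amount) AS total
  FROM df
 GROUP BY user
 PRINT
""").run()
```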


Today we are joined by Murilo Cunha, who talks us through the good, the bad, and the ugly of the Pandas Python library and its many alternatives. Pandas is widely used for data manipulation and analysis; in particular, it offers data structures and operations for manipulating numerical tables and time series.

Tour de Tools is brought to you by Dataroots. Music from Uppbeat (free for creators!). Thumbnail image generated by Craiyon.

In-Memory Analytics with Apache Arrow

Discover the power of in-memory data analytics with "In-Memory Analytics with Apache Arrow." This book delves into Apache Arrow's unique capabilities, enabling you to handle vast amounts of data efficiently and effectively. Learn how Arrow improves performance, offers seamless integration, and simplifies data analysis in diverse computing environments.

What this Book will help me do: Gain proficiency with the datastore facilities and data types defined by Apache Arrow. Master the Arrow Flight APIs to efficiently transfer data between systems. Learn to leverage the in-memory processing advantages offered by Arrow for state-of-the-art analytics. Understand how Arrow interoperates with popular tools like Pandas, Parquet, and Spark. Develop and deploy high-performance data analysis pipelines with Apache Arrow.

Author(s): Matthew Topol is an experienced practitioner in data analytics and Apache Arrow technology. Having contributed to the development and implementation of Arrow-powered systems, he brings a wealth of knowledge to readers. His ability to delve deep into technical concepts while keeping explanations practical makes this book an excellent guide for learners of the subject.

Who is it for? This book is ideal for professionals in the data domain including developers, data analysts, and data scientists aiming to enhance their data manipulation capabilities. Beginners with some familiarity with data analysis concepts will find it beneficial, as well as engineers designing analytics utilities. Programming examples accommodate users of C, Go, and Python, making it broadly accessible.

The Pandas Workshop

The Pandas Workshop offers a detailed journey into the world of data analysis using Python and the pandas library. Throughout the book, you'll build skills in accessing, transforming, visualizing, and modeling data, all while focusing on real-world data science challenges. You will gain the knowledge and confidence needed to dissect and derive insights from complex datasets.

What this Book will help me do: Understand how to access and load data from various formats including databases and web-based sources. Manipulate and transform data for analysis using efficient pandas techniques. Create insightful visualizations using Matplotlib integrated with pandas for clearer data presentation. Build predictive and descriptive data models and glean data-driven insights. Handle and analyze time-series data to uncover trends and seasonal effects in data patterns.

Author(s): Blaine Bateman, Saikat Basak, Thomas Joseph, and William So collectively bring diverse expertise in data analysis, programming, and teaching. Their goal is to make cutting-edge data science techniques accessible through clear explanations and practical exercises, helping learners from varied backgrounds master the pandas library.

Who is it for? This book is best suited for novice to intermediate programmers and data enthusiasts who are already familiar with Python but are new to the pandas library. Ideal readers are those interested in honing their skills in data analysis and visualization, as well as leveraging data for informed decision-making. Whether you're an analyst, aspiring data scientist, or business professional seeking to strengthen your analytical toolkit, this book provides beneficial insights and techniques.

Building Data Science Solutions with Anaconda

Explore the comprehensive world of data science with "Building Data Science Solutions with Anaconda." This book covers essential topics like managing environments with Anaconda, detecting and overcoming bias, and ensuring model interpretability. Delve into practical tools and solutions, all explained in an approachable way to help you become proficient in data science workflows.

What this Book will help me do: Master environment management for data science projects using Anaconda and conda. Detect and mitigate dataset biases to ensure fair and ethical machine learning models. Learn advanced data science techniques with tools like NumPy, pandas, and Jupyter Notebooks. Understand and explain your machine learning models using LIME and SHAP. Grow your expertise in selecting and fine-tuning AI/ML algorithms for diverse applications.

Author(s): The author combines extensive expertise in data science with a thorough understanding of Anaconda tools and open-source software. With a background in engineering and AI model management, they provide an insightful perspective on the field. Their practical and analogy-driven approach makes technical concepts accessible to learners of any level.

Who is it for? This book is ideal for data analysts, aspiring machine learning engineers, and data science professionals who wish to deepen their knowledge and make the most of Anaconda's capabilities. A prior understanding of Python and basic data science principles is assumed. If you're looking to optimize your data science workflows and gain hands-on practice, this book is for you.

Python for ArcGIS Pro

Python for ArcGIS Pro is your guide to automating geospatial tasks and maximizing your productivity using Python. Inside, you'll learn how to integrate Python scripting into ArcGIS workflows to streamline map production, data analysis, and data management.

What this Book will help me do: Automate map production and streamline repetitive cartography tasks. Conduct geospatial data analysis using Python libraries like pandas and NumPy. Integrate ArcPy and the ArcGIS API for Python to manage geospatial data more effectively. Create script tools to improve repeatability and manage datasets. Publish and manage geospatial data to ArcGIS Online seamlessly.

Author(s): Toms and Parker are both experienced GIS professionals and Python developers. With years of hands-on experience using Esri technology in real-world scenarios, they bring practical insights into the application's nuances. Their collaborative approach allows them to demystify technical concepts, making their teachings accessible to audiences of all skill levels.

Who is it for? This book is for ArcGIS users looking to integrate Python into their workflows, whether you're a GIS specialist, technician, or analyst. It's also suitable for those transitioning to roles requiring programming skills. A basic understanding of ArcGIS helps, but the book starts from the fundamentals.

Data Analysis with Python and PySpark

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines. In Data Analysis with Python and PySpark you will learn how to:

Manage your data as it scales across multiple machines
Scale up your data programs with full confidence
Read and write data to and from a variety of sources and formats
Deal with messy data with PySpark's data manipulation functionality
Discover new data sets and perform exploratory data analysis
Build automated data pipelines that transform, summarize, and get insights from data
Troubleshoot common PySpark errors
Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you've learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.

About the Technology: The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark's core engine with a Python-based API. It helps simplify Spark's steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.

About the Book: Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You'll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that's Hadoop clusters, cloud data storage, or local data files. Once you've covered the fundamentals, you'll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.

What's Inside:

Organizing your PySpark code
Managing your data, no matter the size
Scaling up your data programs with full confidence
Troubleshooting common data pipeline problems
Creating reliable long-running jobs

About the Reader: Written for data scientists and data engineers comfortable with Python.

About the Author: As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Quotes:

"A clear and in-depth introduction for truly tackling big data with Python." - Gustavo Patino, Oakland University William Beaumont School of Medicine
"The perfect way to learn how to analyze and master huge datasets." - Gary Bake, Brambles
"Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on." - Philippe Van Bergen, P² Consulting
"For beginner to pro, a well-written book to help understand PySpark." - Raushan Kumar Jha, Microsoft

Summary

Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge, the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
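As a flavor of the write-once-run-anywhere idea described above, here is a minimal sketch using Fugue's transform function; the DataFrame and function are invented for illustration:

```python
import pandas as pd
from fugue import transform

# Plain pandas logic, written once with no engine-specific code.
def add_fee(df: pd.DataFrame) -> pd.DataFrame:
    df["total"] = df["amount"] * 1.05
    return df

df = pd.DataFrame({"amount": [10.0, 20.0]})

# Runs locally on pandas...
local = transform(df, add_fee, schema="*,total:double")

# ...and the same call can run distributed by naming an engine:
# spark_df = transform(df, add_fee, schema="*,total:double", engine="spark")
print(local)
```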

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Fugue is and the story behind it? What are the core goals of the Fugue project? Who are the target users for Fugue and how does that influence the feature priorities and API design? How does Fugue compare to projects such as Modin, etc. for abstracting over the underlying execution engine?

Summary

There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses, it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise put at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Will Thompson about managing data privacy concerns for data sets used in analytics and machine learning.

Interview

Introduction How did you get involved in the area of data management? Data privacy is a multi-faceted problem domain. Can you start by enumerating the different categories of privacy concern that are involved in analytical use cases? Can you describe what Privacy Dynamics is and the story behind it?

Which categor(y|ies) are you focused on addressing?

What are some of the best practices in the definition, protection, and enforcement of data privacy policies?

Is there a data security/privacy equivalent to the OWASP top 10?

What are some of the techniques that are available for anonymizing data while maintaining statistical utility/significance?
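One concrete family of techniques behind this question is differential privacy; below is a toy sketch of the Laplace mechanism in pandas, purely for illustration and not Privacy Dynamics' implementation:

```python
import numpy as np
import pandas as pd

def laplace_mechanism(series: pd.Series, sensitivity: float,
                      epsilon: float) -> pd.Series:
    """Add Laplace noise scaled to sensitivity/epsilon."""
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=len(series))
    return series + noise

ages = pd.Series([34, 45, 29, 61])
# Smaller epsilon means stronger privacy but noisier values.
noisy_ages = laplace_mechanism(ages, sensitivity=1.0, epsilon=0.5)
```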

What are some of the engineering/systems capabilities that are required for data (platform) engineers to incorporate these practices in their platforms?

What are the tradeoffs of encryption vs. obfuscation when anonymizing data? What are some of the types of PII that are non-obvious? What are the risks associated with data re-identification, and what are some of the vectors that might be exploited to achieve that?

How can privacy risk mitigation be maintained as new data sources are introduced that might contribute to these re-identification vectors?

Can you describe how Privacy Dynamics is implemented?

What are the most challenging engineering problems that you are dealing with?

How do you approach validation of a data set’s privacy? What have you found to be useful heuristics for identifying private data?

What are the risks of false positives vs. false negatives?

Can you describe what is involved in integrating the Privacy Dynamics system into an existing data platform/warehouse?

What would be required to integrate with systems such as Presto, Clickhouse, Druid, etc.?

What are the most interesting, innovative, or unexpected ways that you have seen Privacy Dynamics used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacy Dynamics? When is Privacy Dynamics the wrong choice? What do you have planned for the future of Privacy Dynamics?

Contact Info

LinkedIn
@willseth on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Privacy Dynamics
Pandas

Podcast Episode – Pandas For Data Engineering

Homomorphic Encryption
Differential Privacy
Immuta

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary

Pandas is a powerful tool for cleaning, transforming, manipulating, and enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.
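In the spirit of the patterns discussed in the episode (this sketch is an illustration, not an excerpt from the book), method chaining plus categorical dtypes is a common way to keep pandas code both fast and readable:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["nyc", "nyc", "sf", "sf"] * 25_000,
    "sales": range(100_000),
})

# Categoricals shrink repeated strings; chaining keeps the whole
# transformation readable and avoids throwaway intermediate variables.
result = (df
    .assign(city=lambda d: d["city"].astype("category"))
    .query("sales > 50000")
    .groupby("city", observed=True)["sales"]
    .mean())
print(result)
```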

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Matt Harrison about useful tips for using Pandas for data engineering projects.

Interview

Introduction How did you get involved in the area of data management? What are the main tasks that you have seen Pandas used for in a data engineering context? What are some of the common mistakes that can lead to poor performance when scaling to large data sets? What are some of the utility features that you have found most helpful for data processing? One of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas? Pandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows? Pandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maintainable?

Hands-on Matplotlib: Learn Plotting and Visualizations with Python 3

Learn the core aspects of NumPy, Matplotlib, and Pandas, and use them to write programs with Python 3. This book focuses heavily on various data visualization techniques and will help you acquire expert-level knowledge of working with Matplotlib, a MATLAB-style plotting library for the Python programming language that provides an object-oriented API for embedding plots into applications. You'll begin with an introduction to Python 3 and the scientific Python ecosystem. Next, you'll explore NumPy and ndarray data structures, creation routines, and data visualization. You'll examine useful concepts related to style sheets, legends, and layouts, followed by line, bar, and scatter plots. Chapters then cover recipes for histograms, contours, streamplots, and heatmaps, and how to visualize images and audio with pie and polar charts. Moving forward, you'll learn how to visualize with pcolor, pcolormesh, and colorbar, how to visualize in 3D in Matplotlib, create simple animations, and embed Matplotlib in different frameworks. The concluding chapters cover how to visualize data with Pandas and Matplotlib, Seaborn, and how to work with real-life data and visualize it. After reading Hands-on Matplotlib you'll be proficient with Matplotlib and able to comfortably work with ndarrays in NumPy and data frames in Pandas.

What You'll Learn: Understand data visualization and Python using Matplotlib. Review the fundamental data structures in NumPy and Pandas. Work with 3D plotting, visualizations, and animations. Visualize images and audio data.

Who This Book Is For: Data scientists, machine learning engineers, and software professionals with basic programming skills.
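As a quick taste of the pandas-plus-Matplotlib workflow the book teaches, here is a minimal sketch using the object-oriented API with synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.linspace(0, 2 * np.pi, 200)
df = pd.DataFrame({"sin": np.sin(x), "cos": np.cos(x)}, index=x)

# Create the Figure and Axes explicitly, then let pandas draw onto them.
fig, ax = plt.subplots(figsize=(6, 3))
df.plot(ax=ax, title="pandas + Matplotlib")
ax.set_xlabel("radians")
ax.legend(loc="upper right")
fig.tight_layout()
plt.show()
```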

Practical Data Science with Python

Practical Data Science with Python guides you through the entire process of leveraging Python tools to analyze and gain insights from data. You'll start with foundational concepts and coding essentials, progressing through statistical analysis, machine learning techniques, and ethical considerations.

What this Book will help me do: Clean, prepare, and explore data using pandas and NumPy. Understand and implement machine learning models such as random forests and support vector machines. Perform statistical tests and analyze distributions to enhance data insights. Utilize SQL with Python for efficient data interaction. Generate automated reports and dashboards for data storytelling.

Author(s): Nathan George has extensive professional experience as a data scientist and Python developer. He specializes in the application of machine learning and statistical methods to solve real-world problems. His writing combines technical depth with an approachable style, aiming to provide readers with actionable knowledge and skills.

Who is it for? This book is perfect for data science beginners who have a basic understanding of Python and want to build practical data analysis skills. Students in analytics programs or professionals looking to transition into a data science role will find value in its approachable yet comprehensive coverage. Aspiring data analysts and career changers will gain firsthand exposure to Python-based data science best practices. If you're eager to develop practical, hands-on experience in the data science field, this is the guide for you.

Pandas in Action

Take the next steps in your data science career! This friendly and hands-on guide shows you how to start mastering Pandas with skills you already know from spreadsheet software. In Pandas in Action you will learn how to:

Import datasets, identify issues with their data structures, and optimize them for efficiency
Sort, filter, pivot, and draw conclusions from a dataset and its subsets
Identify trends from text-based and time-based data
Organize, group, merge, and join separate datasets
Use a GroupBy object to store multiple DataFrames

Pandas has rapidly become one of Python's most popular data analysis libraries. In Pandas in Action, a friendly and example-rich introduction, author Boris Paskhaver shows you how to master this versatile tool and take the next steps in your data science career. You'll learn how easy Pandas makes it to efficiently sort, analyze, filter and munge almost any type of data.

About the Technology: Data analysis with Python doesn't have to be hard. If you can use a spreadsheet, you can learn pandas! While its grid-style layouts may remind you of Excel, pandas is far more flexible and powerful. This Python library quickly performs operations on millions of rows, and it interfaces easily with other tools in the Python data ecosystem. It's a perfect way to up your data game.

About the Book: Pandas in Action introduces Python-based data analysis using the amazing pandas library. You'll learn to automate repetitive operations and gain deeper insights into your data that would be impractical—or impossible—in Excel. Each chapter is a self-contained tutorial. Realistic downloadable datasets help you learn from the kind of messy data you'll find in the real world.

What's Inside:

Organize, group, merge, split, and join datasets
Find trends in text-based and time-based data
Sort, filter, pivot, optimize, and draw conclusions
Apply aggregate operations

About the Reader: For readers experienced with spreadsheets and basic Python programming.

About the Author: Boris Paskhaver is a software engineer, Agile consultant, and online educator. His programming courses have been taken by 300,000 students across 190 countries.

Quotes:

"Of all the introductory pandas books I've read—and I did read a few—this is the best, by a mile." - Erico Lendzian, idibu.com
"This approachable guide will get you up and running quickly with all the basics you need to analyze your data." - Jonathan Sharley, SiriusXM Media
"Understanding and putting in practice the concepts of this book will help you increase productivity and make you look like a pro." - Jose Apablaza, Steadfast Networks
"Teaches both novice and expert Python users the essential concepts required for data analysis and data science." - Ben McNamara, DataGeek

Data Science for Marketing Analytics - Second Edition

In 'Data Science for Marketing Analytics', you'll embark on a journey that integrates the power of data analytics with strategic marketing. With a focus on practical application, this guide walks you through using Python to analyze datasets, implement machine learning models, and derive data-driven insights.

What this Book will help me do: Gain expertise in cleaning, exploring, and visualizing marketing data using Python. Build machine learning models to predict customer behavior and sales outcomes. Leverage unsupervised learning techniques for effective customer segmentation. Compare and optimize predictive models using advanced evaluation methods. Master Python libraries like pandas and Matplotlib for data manipulation and visualization.

Author(s): Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali combine their extensive expertise in data analytics and marketing to bring you this comprehensive guide. Drawing from years of applying analytics in real-world marketing scenarios, they provide a hands-on approach to learning data science tools and techniques.

Who is it for? This book is perfect for marketing professionals and analysts eager to harness the capabilities of Python to enhance their data-driven strategies. It is also ideal for data scientists looking to apply their skills in marketing across various roles. While a basic understanding of data analysis and Python will help, all key concepts are introduced comprehensively for beginners.