talk-data.com talk-data.com

Topic

Pandas

data_manipulation data_analysis python

187

tagged

Activity Trend

17 peak/qtr
2020-Q1 2026-Q1

Activities

187 activities · Newest first

Statistics for Data Science and Analytics

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations. A range of statistical techniques are presented with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, validation of statistical models by applying them to holdout data, and probability and inference via the easy-to-understand method of resampling and the bootstrap instead of using a myriad of “kitchen sink” formulas. Regression is taught both as a tool for explanation and for prediction. This book is informed by the authors’ experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves. Statistics for Data Science and Analytics includes information on sample topics such as: Int, float, and string data types, numerical operations, manipulating strings, converting data types, and advanced data structures like lists, dictionaries, and sets Experiment design via randomizing, blinding, and before-after pairing, as well as proportions and percents when handling binary data Specialized Python packages like numpy, scipy, pandas, scikit-learn and statsmodels—the workhorses of data science—and how to get the most value from them Statistical versus practical significance, random number generators, functions for code reuse, and binomial and normal probability distributions Written by and for data science instructors, Statistics for Data Science and Analytics is an excellent learning resource for data science instructors prescribing a required intro stats course for their programs, as well as other students and professionals seeking to transition to the data science field.

Polars Cookbook

Dive into the world of data analysis with the Polars Cookbook. This book, ideal for data professionals, covers practical recipes to manipulate, transform, and analyze data using the Python Polars library. You'll learn both the fundamentals and advanced techniques to build efficient and scalable data workflows. What this Book will help me do Master the basics of Python Polars including installation and setup. Perform complex data manipulation like pivoting, grouping, and joining. Handle large-scale time series data for accurate analysis. Understand data integration with libraries like pandas and numpy. Optimize workflows for both on-premise and cloud environments. Author(s) Yuki Kakegawa is an experienced data analytics consultant who has collaborated with companies such as Microsoft and Stanford Health Care. His passion for data led him to create this detailed guide on Polars. His expertise ensures you gain real-world, actionable insights from every chapter. Who is it for? This book is perfect for data analysts, engineers, and scientists eager to enhance their efficiency with Python Polars. If you are familiar with Python and tools like pandas but are new to Polars, this book will upskill you. Whether handling big data or optimizing code for performance, the Polars Cookbook has the guidance you need to succeed.

DuckDB in Action

Dive into DuckDB and start processing gigabytes of data with ease—all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you’ll learn everything you need to know to get the most out of this awesome tool, keep your data secure on prem, and save you hundreds on your cloud bill. From data ingestion to advanced data pipelines, you’ll learn everything you need to get the most out of DuckDB—all through hands-on examples. Open up DuckDB in Action and learn how to: Read and process data from CSV, JSON and Parquet sources both locally and remote Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables Use DuckDB from Python, both with SQL and its "Relational"-API, interacting with databases but also data frames Prepare, ingest and query large datasets Build cloud data pipelines Extend DuckDB with custom functionality Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won’t need to read through pages of documentation—you’ll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines. About the Technology DuckDB makes data analytics fast and fun! You don’t need to set up a Spark or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite and Postgres. About the Book DuckDB in Action guides you example-by-example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You’ll explore DuckDB’s handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action. What's Inside Prepare, ingest and query large datasets Build cloud data pipelines Extend DuckDB with custom functionality Fast-paced SQL recap: From simple queries to advanced analytics About the Reader For data pros comfortable with Python and CLI tools. About the Authors Mark Needham is a blogger and video creator at @‌LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and Engineer at Neo4j. Quotes I use DuckDB every day, and I still learned a lot about how DuckDB makes things that are hard in most databases easy! - Jordan Tigani, Founder, MotherDuck An excellent resource! Unlocks possibilities for storing, processing, analyzing, and summarizing data at the edge using DuckDB. - Pramod Sadalage, Director, Thoughtworks Clear and accessible. A comprehensive resource for harnessing the power of DuckDB for both novices and experienced professionals. - Qiusheng Wu, Associate Professor, University of Tennessee Excellent! The book all we ducklings have been waiting for! - Gunnar Morling, Decodable

Getting Started with DuckDB

Unlock the full potential of DuckDB with 'Getting Started with DuckDB,' your guide to mastering data analysis efficiently. By reading this book, you'll discover how to load, transform, and query data using DuckDB, leveraging its unique capabilities for processing large datasets. Gain hands-on experience with SQL, Python, and R to enhance your data science and engineering workflows. What this Book will help me do Effectively load and manage various types of data in DuckDB for seamless processing. Gain hands-on experience writing and optimizing SQL queries tailored for analytical tasks. Integrate DuckDB capabilities into Python and R workflows for streamlined data analysis. Understand DuckDB's optimizations and extensions for specialized data applications. Explore the broader ecosystem of data tools that complement DuckDB's capabilities. Author(s) Simon Aubury and Ned Letcher are seasoned experts in the field of data analytics and engineering. With extensive experience in using both SQL and programming languages like Python and R, they bring practical insights into the innovative uses of DuckDB. They have designed this book to provide a hands-on and approachable way to learn DuckDB, making complex concepts accessible. Who is it for? This book is well-suited for data analysts aiming to accelerate their data analysis workflows, data engineers looking for effective tools for data processing, and data scientists searching for a versatile library for scalable data manipulation. Prior exposure to SQL and programming in Python or R will be beneficial for readers to maximize their learning.

Modern Graph Theory Algorithms with Python

Dive into the fascinating world of graph theory and its applications with 'Modern Graph Theory Algorithms with Python.' Through Python programming and real-world case studies, this book equips you with the tools to transform data into graph structures, apply algorithms, and uncover insights, enabling effective solutions in diverse domains such as finance, epidemiology, and social networks. What this Book will help me do Understand how to wrangle a variety of data types into network formats suitable for analysis. Learn to use graph theory algorithms and toolkits such as NetworkX and igraph in Python. Apply network theory to predict and analyze trends, from epidemics to stock market dynamics. Explore the intersection of machine learning and graph theory through advanced neural network techniques. Gain expertise in database solutions with graph database querying and applications. Author(s) Colleen M. Farrelly, an experienced data scientist, and Franck Kalala Mutombo, a seasoned software engineer, bring years of expertise in network science and Python programming to every page of this book. Their professional experience includes working on cutting-edge problems in data analytics, graph theory, and scalable solutions for real-world issues. Combining their practical know-how, they deliver a resource aimed at both learning and applying techniques effectively. Who is it for? This book is tailored for data scientists, researchers, and analysts with an interest in using graph-based approaches for solving complex data problems. Ideal for those with a basic Python knowledge and familiarity with libraries like pandas and NumPy, the content bridges the gap between theory and application. It also provides insights into broad fields where network science can be impactful, contributing value to both students and professionals.

Pandas Workout

Practice makes perfect pandas! Work out your pandas skills against dozens of real-world challenges, each carefully designed to build an intuitive knowledge of essential pandas tasks. In Pandas Workout you’ll learn how to: Clean your data for accurate analysis Work with rows and columns for retrieving and assigning data Handle indexes, including hierarchical indexes Read and write data with a number of common formats, such as CSV and JSON Process and manipulate textual data from within pandas Work with dates and times in pandas Perform aggregate calculations on selected subsets of data Produce attractive and useful visualizations that make your data come alive Pandas Workout hones your pandas skills to a professional-level through two hundred exercises, each designed to strengthen your pandas skills. You’ll test your abilities against common pandas challenges such as importing and exporting, data cleaning, visualization, and performance optimization. Each exercise utilizes a real-world scenario based on real-world data, from tracking the parking tickets in New York City, to working out which country makes the best wines. You’ll soon find your pandas skills becoming second nature—no more trips to StackOverflow for what is now a natural part of your skillset. About the Technology Python’s pandas library can massively reduce the time you spend analyzing, cleaning, exploring, and manipulating data. And the only path to pandas mastery is practice, practice, and, you guessed it, more practice. In this book, Python guru Reuven Lerner is your personal trainer and guide through over 200 exercises guaranteed to boost your pandas skills. About the Book Pandas Workout is a thoughtful collection of practice problems, challenges, and mini-projects designed to build your data analysis skills using Python and pandas. The workouts use realistic data from many sources: the New York taxi fleet, Olympic athletes, SAT scores, oil prices, and more. Each can be completed in ten minutes or less. You’ll explore pandas’ rich functionality for string and date/time handling, complex indexing, and visualization, along with practical tips for every stage of a data analysis project. What's Inside Clean data with less manual labor Retrieving and assigning data Process and manipulate text Calculations on selected data subsets About the Reader For Python programmers and data analysts. About the Author Reuven M. Lerner teaches Python and data science around the world and publishes the “Bamboo Weekly” newsletter. He is the author of Manning’s Python Workout (2020). Quotes A carefully crafted tour through the pandas library, jam-packed with wisdom that will help you become a better pandas user and a better data scientist. - Kevin Markham, Founder of Data School, Creator of pandas in 30 days Will help you apply pandas to real problems and push you to the next level. - Michael Driscoll, RFA Engineering, creator of Teach Me Python The explanations, paired with Reuven’s storytelling and personal tone, make the concepts simple. I’ll never get them wrong again! - Rodrigo Girão Serrão, Python developer and educator The definitive source! - Kiran Anantha, Amazon

Python 3 Data Visualization Using Google Gemini

This book offers a comprehensive guide to leveraging Python-based data visualization techniques with the innovative capabilities of Google Gemini. Tailored for individuals proficient in Python seeking to enhance their visualization skills, it explores essential libraries like Pandas, Matplotlib, and Seaborn, along with insights into the innovative Gemini platform. With a focus on practicality and efficiency, it delivers a rapid yet thorough exploration of data visualization methodologies, supported by Gemini-generated code samples. Companion files with source code and figures are available for downloading. FEATURES: Covers Python-based data visualization libraries and techniques Includes practical examples and Gemini-generated code samples for efficient learning Integrates Google Gemini for advanced data visualization capabilities Sets up a conducive development environment for a seamless coding experience Includes companion files for downloading with source code and figures

Software Engineering for Data Scientists

Data science happens in code. The ability to write reproducible, robust, scaleable code is key to a data science project's success—and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, and clearly explains how to apply the best practices from software engineering to data science. Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to: Understand data structures and object-oriented programming Clearly and skillfully document your code Package and share your code Integrate data science code with a larger code base Learn how to write APIs Create secure code Apply best practices to common tasks such as testing, error handling, and logging Work more effectively with software engineers Write more efficient, maintainable, and robust code in Python Put your data science projects into production And more

Python's dominance in data science streamlines workflows, but large-scale data processing challenges persist. Discover how BigQuery DataFrames, a Pandas and scikit-learn-like abstraction over the BigQuery engine, revolutionizes this process.

Join this session to learn about BigQuery DataFrames and witness how you can: - Effortlessly transform terabytes of data - Build efficient ML applications on massive datasets by leveraging large language models - Use your familiar Python environment

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.

Atelier en ligne présentant les bases de la programmation en Python et l’analyse de données avec Jupyter Notebook et Pandas, en utilisant les jeux de données Airbnb. L’atelier couvre les bases de Python, les bases de Pandas, la visualisation de données et comment sourcer et analyser les jeux de données Airbnb. Accès à la plateforme d'apprentissage du Wagon à l’issue du cours.

Architecting a Modern Data Warehouse for Large Enterprises: Build Multi-cloud Modern Distributed Data Warehouses with Azure and AWS

Design and architect new generation cloud-based data warehouses using Azure and AWS. This book provides an in-depth understanding of how to build modern cloud-native data warehouses, as well as their history and evolution. The book starts by covering foundational data warehouse concepts, and introduces modern features such as distributed processing, big data storage, data streaming, and processing data on the cloud. You will gain an understanding of the synergy, relevance, and usage data warehousing standard practices in the modern world of distributed data processing. The authors walk you through the essential concepts of Data Mesh, Data Lake, Lakehouse, and Delta Lake. And they demonstrate the services and offerings available on Azure and AWS that deal with data orchestration, data democratization, data governance, data security, and business intelligence. After completing this book, you will be ready to design and architect enterprise-grade, cloud-based modern data warehouses using industry best practices and guidelines. What You Will Learn Understand the core concepts underlying modern data warehouses Design and build cloud-native data warehousesGain a practical approach to architecting and building data warehouses on Azure and AWS Implement modern data warehousing components such as Data Mesh, Data Lake, Delta Lake, and Lakehouse Process data through pandas and evaluate your model’s performance using metrics such as F1-score, precision, and recall Apply deep learning to supervised, semi-supervised, and unsupervised anomaly detection tasks for tabular datasets and time series applications Who This Book Is For Experienced developers, cloud architects, and technology enthusiasts looking to build cloud-based modern data warehouses using Azure and AWS

Matt Harrison is the author of many of the most successful Python books, including Effective Pandas, Effective XGBoost, The Machine Learning Pocket Reference, and many more. I consider him the top author of Python books and content on the planet.

He stopped by my house to chat about self-publishing technical books, the pros and cons of using a publisher, book piracy, and much more. We both talk about our experiences as best-selling technical authors, and don't hold back in this wide ranging and very candid conversation. Enjoy!

Note - my audio got a bit clippy in spots. Sorry if I blew up your speaker.

Jay Alexander Clifford: A 101 in Time Series Analytics with Apache Arrow, Pandas and Parquet

Join Jay Alexander Clifford in a deep dive into Time Series Analytics with Apache Arrow, Pandas, and Parquet. 📈🐍 Explore the power of columnar databases, and learn how to build efficient and scalable analytics applications for time series data using open-source tools. 🚀 #TimeSeries #analytics

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Distributed Machine Learning with PySpark: Migrating Effortlessly from Pandas and Scikit-Learn

Migrate from pandas and scikit-learn to PySpark to handle vast amounts of data and achieve faster data processing time. This book will show you how to make this transition by adapting your skills and leveraging the similarities in syntax, functionality, and interoperability between these tools. Distributed Machine Learning with PySpark offers a roadmap to data scientists considering transitioning from small data libraries (pandas/scikit-learn) to big data processing and machine learning with PySpark. You will learn to translate Python code from pandas/scikit-learn to PySpark to preprocess large volumes of data and build, train, test, and evaluate popular machine learning algorithms such as linear and logistic regression, decision trees, random forests, support vector machines, Naïve Bayes, and neural networks. After completing this book, you will understand the foundational concepts of data preparation and machine learning and will have the skills necessary toapply these methods using PySpark, the industry standard for building scalable ML data pipelines. What You Will Learn Master the fundamentals of supervised learning, unsupervised learning, NLP, and recommender systems Understand the differences between PySpark, scikit-learn, and pandas Perform linear regression, logistic regression, and decision tree regression with pandas, scikit-learn, and PySpark Distinguish between the pipelines of PySpark and scikit-learn Who This Book Is For Data scientists, data engineers, and machine learning practitioners who have some familiarity with Python, but who are new to distributed machine learning and the PySpark framework.

Hands-On Web Scraping with Python - Second Edition

In "Hands-On Web Scraping with Python," you'll learn how to harness the power of Python libraries to extract, process, and analyze data from the web. This book provides a practical, step-by-step guide for beginners and data enthusiasts alike. What this Book will help me do Master the use of Python libraries like requests, lxml, Scrapy, and Beautiful Soup for web scraping. Develop advanced techniques for secure browsing and data extraction using APIs and Selenium. Understand the principles behind regex and PDF data parsing for comprehensive scraping. Analyze and visualize data using data science tools such as Pandas and Plotly. Build a portfolio of real-world scraping projects to demonstrate your capabilities. Author(s) Anish Chapagain, the author of "Hands-On Web Scraping with Python," is an experienced programmer and instructor who specializes in Python and data-related technologies. With his vast experience in teaching individuals from diverse backgrounds, Anish approaches complex concepts with clarity and a hands-on methodology. Who is it for? This book is perfect for aspiring data scientists, Python beginners, and anyone who wants to delve into web scraping. Readers should have a basic understanding of how websites work but no prior coding experience is required. If you aim to develop scraping skills and understand data analysis, this book is the ideal starting point.

Streamlit for Data Science - Second Edition

Streamlit for Data Science is your complete guide to mastering the creation of powerful, interactive data-driven applications using Python and Streamlit. With this comprehensive resource, you'll learn everything from foundational Streamlit skills to advanced techniques like integrating machine learning models and deploying apps to cloud platforms, enabling you to significantly enhance your data science toolkit. What this Book will help me do Master building interactive applications using Streamlit, including techniques for user interfaces and integrations. Develop visually appealing and functional data visualizations using Python libraries in Streamlit. Learn to integrate Streamlit applications with machine learning frameworks and tools like Hugging Face and OpenAI. Understand and apply best practices to deploy Streamlit apps to cloud platforms such as Streamlit Community Cloud and Heroku. Improve practical Python skills through implementing end-to-end data applications and prototyping data workflows. Author(s) Tyler Richards, the author of Streamlit for Data Science, is a senior data scientist with in-depth practical experience in building data-driven applications. With a passion for Python and data visualization, Tyler leverages his knowledge to help data professionals craft effective and compelling tools. His teaching approach combines clarity, hands-on exercises, and practical relevance. Who is it for? This book is written for data scientists, engineers, and enthusiasts who use Python and want to create dynamic data-driven applications. With a focus on those who have some familiarity with Python and libraries like Pandas or NumPy, it assists readers in building on their knowledge by offering tailored guidance. Perfect for those looking to prototype data projects or enhance their programming toolkit.

Learning Data Science

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data. Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas. Refine a question of interest to one that can be studied with data Pursue data collection that may involve text processing, web scraping, etc. Glean valuable insights about data through data cleaning, exploration, and visualization Learn how to use modeling to describe the data Generalize findings beyond the data

Python Data Analytics: With Pandas, NumPy, and Matplotlib

Explore the latest Python tools and techniques to help you tackle the world of data acquisition and analysis. You'll review scientific computing with NumPy, visualization with matplotlib, and machine learning with scikit-learn. This third edition is fully updated for the latest version of Python and its related libraries, and includes coverage of social media data analysis, image analysis with OpenCV, and deep learning libraries. Each chapter includes multiple examples demonstrating how to work with each library. At its heart lies the coverage of pandas, for high-performance, easy-to-use data structures and tools for data manipulation Author Fabio Nelli expertly demonstrates using Python for data processing, management, and information retrieval. Later chapters apply what you've learned to handwriting recognition and extending graphical capabilities with the JavaScript D3 library. Whether you are dealing with sales data, investment data, medical data, web page usage, or other data sets, Python Data Analytics, Third Edition is an invaluable reference with its examples of storing, accessing, and analyzing data. What You'll Learn Understand the core concepts of data analysis and the Python ecosystem Go in depth with pandas for reading, writing, and processing data Use tools and techniques for data visualization and image analysis Examine popular deep learning libraries Keras, Theano,TensorFlow, and PyTorch Who This Book Is For Experienced Python developers who need to learn about Pythonic tools for data analysis

If a Duck Quacks in the Forest and Everyone Hears, Should You Care?

YES! "Duck posting" has become an internet meme for praising DuckDB on Twitter. Nearly every quack using DuckDB has done it once or twice. But, why all the fuss? With advances in CPUs, memory, SSDs, and the software that enables it all, our personal machines are powerful beasts relegated to handling a few Chrome tabs and sitting 90% idle. As data engineers and data analysts, this seems like a waste that's not only expensive, but also impacting the environment.

In this session, you will see how DuckDB brings SQL analytics capabilities to a 2MB standalone executable on your laptop that only recently required a large cluster. This session will explain the architecture of DuckDB that enables high performance analytics on a laptop: great query optimization, vectorized execution, continuous improvements in compression and more. We will show its capabilities using live demos, from the pandas library to WASM, to the command-line. We'll demonstrate performance on large datasets, and talk about how we're exploring using the laptop to augment cloud analytics workloads.

Talk by: Ryan Boyd

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc