talk-data.com

Topic

Data Science

machine_learning statistics analytics

84 tagged

Activity Trend: 68 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books

Practical Statistics for Data Scientists, 3rd Edition

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. And many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you're familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

High Performance Spark, 2nd Edition

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau, Rachel Warren, and Anya Bida walk you through the secrets of the Spark code base, and demonstrate performance optimizations that will help your data pipelines run faster, scale to larger datasets, and avoid costly antipatterns. Ideal for data engineers, software engineers, data scientists, and system administrators, the second edition of High Performance Spark presents new use cases, code examples, and best practices for Spark 3.x and beyond. This book gives you a fresh perspective on this continually evolving framework and shows you how to work around bumps on your Spark and PySpark journey. With this book, you'll learn how to:
- Accelerate your ML workflows with integrations including PyTorch
- Handle key skew and take advantage of Spark's new dynamic partitioning
- Make your code reliable with scalable testing and validation techniques
- Make Spark high performance
- Deploy Spark on Kubernetes and similar environments
- Take advantage of GPU acceleration with RAPIDS and resource profiles
- Get your Spark jobs to run faster
- Use Spark to productionize exploratory data science projects
- Handle even larger datasets with Spark
- Gain faster insights by reducing pipeline running times

Microsoft 365 Access For Dummies, 2nd Edition

Join the millions of people already using Microsoft Access and become a database power user in no time! In the newly revised edition of Microsoft Access For Dummies, professional database developer and Access extraordinaire Laurie Ulrich-Fuller walks you through the ins and outs of one of the world's most popular database platforms. This is the perfect beginner's guide to Microsoft Access, showing you how to create databases, extract data, create reports, and more. The author demonstrates a ton of tips, tricks, and best practices you can use immediately to create, maintain, and improve your databases. You'll also find:
- Updates outlining Edge browser controls in forms
- Step-by-step guides explaining how to import, export, and edit data
- Easy-to-follow query-writing tutorials to help you find the exact data you're looking for when you need it
Whether you're a database novice or a data science whiz, Microsoft Access For Dummies has the info you need to supercharge your database skills. It's the perfect how-to guide to get you up to speed on everything you need to know to get started with Microsoft's world-famous database app.

Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle

This comprehensive guide, featuring hand-picked examples of daily use cases, will walk you through the end-to-end predictive model-building cycle using the latest techniques and industry tricks. In Chapters 1, 2, and 3, we will begin by setting up the environment and covering the basics of PySpark, focusing on data manipulation. Chapter 4 delves into the art of variable selection, demonstrating various techniques available in PySpark. In Chapters 5, 6, and 7, we explore machine learning algorithms, their implementations, and fine-tuning techniques. Chapters 8 and 9 will guide you through machine learning pipelines and various methods to operationalize and serve models using Docker/API. Chapter 10 will demonstrate how to unlock the power of predictive models to create a meaningful impact on your business. Chapter 11 introduces some of the most widely used and powerful modeling frameworks to unlock real value from data. In this new edition, you will learn predictive modeling frameworks that can quantify customer lifetime values and estimate the return on your predictive modeling investments. This edition also includes methods to measure engagement and identify actionable populations for effective churn treatments. Additionally, a dedicated chapter on experimentation design has been added, covering steps to efficiently design, conduct, test, and measure the results of your models. All code examples have been updated to reflect the latest stable version of Spark. You will:
- Gain an overview of end-to-end predictive model building
- Understand multiple variable selection techniques and their implementations
- Learn how to operationalize models
- Perform data science experiments and learn useful tips

Databricks Data Intelligence Platform: Unlocking the GenAI Revolution

This book is your comprehensive guide to building robust Generative AI solutions using the Databricks Data Intelligence Platform. Databricks is the fastest-growing data platform offering unified analytics and AI capabilities within a single governance framework, enabling organizations to streamline their data processing workflows, from ingestion to visualization. Additionally, Databricks provides features to train a high-quality large language model (LLM), whether you are looking for Retrieval-Augmented Generation (RAG) or fine-tuning. Databricks offers a scalable and efficient solution for processing large volumes of both structured and unstructured data, facilitating advanced analytics, machine learning, and real-time processing. In today's GenAI world, Databricks plays a crucial role in empowering organizations to extract value from their data effectively, driving innovation and gaining a competitive edge in the digital age. This book will not only help you master the Data Intelligence Platform but also help power your enterprise to the next level with a bespoke LLM unique to your organization. Beginning with foundational principles, the book starts with a platform overview and explores features and best practices for ingestion, transformation, and storage with Delta Lake. Advanced topics include leveraging Databricks SQL for querying and visualizing large datasets, ensuring data governance and security with Unity Catalog, and deploying machine learning and LLMs using Databricks MLflow for GenAI. Through practical examples, insights, and best practices, this book equips solution architects and data engineers with the knowledge to design and implement scalable data solutions, making it an indispensable resource for modern enterprises. Whether you are new to Databricks and trying to learn a new platform, a seasoned practitioner building data pipelines, data science models, or GenAI applications, or even an executive who wants to communicate the value of Databricks to customers, this book is for you. With its extensive feature and best practice deep dives, it also serves as an excellent reference guide if you are preparing for Databricks certification exams.
What You Will Learn
- Foundational principles of Lakehouse architecture
- Key features including Unity Catalog, Databricks SQL (DBSQL), and Delta Live Tables
- Databricks Intelligence Platform and key functionalities
- Building and deploying GenAI Applications from data ingestion to model serving
- Databricks pricing, platform security, DBRX, and many more topics
Who This Book Is For
Solution architects, data engineers, data scientists, Databricks practitioners, and anyone who wants to deploy their Gen AI solutions with the Data Intelligence Platform. This is also a handbook for senior execs who need to communicate the value of Databricks to customers. People who are new to the Databricks Platform and want comprehensive insights will find the book accessible.

Data Security Blueprints

Once you decide to implement a data security strategy, it can be difficult to know where to start. With so many potential threats and challenges to resolve, teams often try to fix everything at once. But this boil-the-ocean approach is difficult to manage efficiently and ultimately leads to frustration, confusion, and halted progress. There's a better way to go. In this report, data science and AI leader Federico Castanedo shows you what to look for in a data security platform that will deliver the speed, scale, and agility you need to be successful in today's fast-paced, distributed data ecosystems. Unlike other resources that focus solely on data security concepts, this guide provides a road map for putting those concepts into practice. This report reveals:
- The most common data security use cases and their potential challenges
- What to look for in a data security solution that's built for speed and scale
- Why increasingly decentralized data architectures require centralized, dynamic data security mechanisms
- How to implement the steps required to put common use cases into production
- Methods for assessing risks, and the controls necessary to mitigate those risks
- How to facilitate cross-functional collaboration to put data security into practice in a scalable, efficient way
You'll examine the most common data security use cases that global enterprises across every industry aim to achieve, including the specific steps needed for implementation as well as the potential obstacles these use cases present. Federico Castanedo is a data science and AI leader with extensive experience in academia, industry, and startups. Having held leadership positions at DataRobot and Vodafone, he has a successful track record of leading high-performing data science teams and developing data science and AI products with business impact.

Databricks Certified Associate Developer for Apache Spark Using Python

This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Dive deep into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python.
What this Book will help me do
- Deeply understand Apache Spark's core architecture for building big data applications.
- Write optimized SQL queries and leverage Spark DataFrame API for efficient data manipulation.
- Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks.
- Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions.
- Get hands-on preparation for the certification exam with mock tests and practice questions.
Author(s)
Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications.
Who is it for?
This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.

The Ultimate Guide to Snowpark

The Ultimate Guide to Snowpark serves as a comprehensive resource to help you master the Snowflake Snowpark framework using Python. You'll learn how to manage data engineering, data science, and data applications in Snowpark, coupled with practical implementations and examples. By following this guide, you'll gain the skills needed to efficiently process and analyze data in the Snowflake Data Cloud.
What this Book will help me do
- Master Snowpark with Python for data engineering, data science, and data application workloads.
- Develop and deploy robust data pipelines using Snowpark in Python.
- Design, implement, and produce machine learning models using Snowpark.
- Learn to monetize and operationalize Snowflake-native applications.
- Effectively adopt Snowpark in production for scalable, efficient data solutions.
Author(s)
Shankar Narayanan SGS and Vivekanandan SS are experienced professionals in data engineering and Snowflake technologies. Shankar has extensive experience in utilizing Snowflake Snowpark to manage and enhance data solutions. Vivekanandan brings expertise in the intersection of Python programming and cloud-based data processing. Together, their combined knowledge and approachable writing style make this book an invaluable resource to readers.
Who is it for?
This book is designed for data engineers, data scientists, developers, and seasoned data practitioners. Ideal candidates are those looking to expand their skills in implementing Snowpark solutions using Python. A prior understanding of SQL, Python programming, and familiarity with Snowflake is beneficial for readers to fully leverage the techniques presented.

Databricks ML in Action

Dive into the Databricks Data Intelligence Platform and learn how to harness its full potential for creating, deploying, and maintaining machine learning solutions. This book covers everything from setting up your workspace to integrating state-of-the-art tools such as AutoML and VectorSearch, imparting practical skills through detailed examples and code.
What this Book will help me do
- Set up and manage a Databricks workspace tailored for effective data science workflows.
- Implement monitoring to ensure data quality and detect drift efficiently.
- Build, fine-tune, and deploy machine learning models seamlessly using Databricks tools.
- Operationalize AI projects including feature engineering, data pipelines, and workflows on the Databricks Lakehouse architecture.
- Leverage integrations with popular tools like OpenAI's ChatGPT to expand your AI project capabilities.
Author(s)
This book is authored by Stephanie Rivera, Anastasia Prokaieva, Amanda Baker, and Hayley Horn, seasoned experts in data science and machine learning from Databricks. Their collective years of expertise in big data and AI technologies ensure a rich and insightful perspective. Through their work, they strive to make complex concepts accessible and actionable.
Who is it for?
This book serves as an ideal guide for machine learning engineers, data scientists, and technically inclined managers. It's well-suited for those transitioning to the Databricks environment or seeking to deepen their Databricks-based machine learning implementation skills. Whether you're an ambitious beginner or an experienced professional, this book provides clear pathways to success.

Software Engineering for Data Scientists

Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, and clearly explains how to apply the best practices from software engineering to data science. Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to:
- Understand data structures and object-oriented programming
- Clearly and skillfully document your code
- Package and share your code
- Integrate data science code with a larger code base
- Learn how to write APIs
- Create secure code
- Apply best practices to common tasks such as testing, error handling, and logging
- Work more effectively with software engineers
- Write more efficient, maintainable, and robust code in Python
- Put your data science projects into production
- And more

Data Engineering and Data Science

Written and edited by one of the most prolific and well-known experts in the field and his team, this exciting new volume is the “one-stop shop” for the concepts and applications of data science and engineering for data scientists across many industries. The field of data science is incredibly broad, encompassing everything from cleaning data to deploying predictive models. However, it is rare for any single data scientist to be working across the spectrum day to day. Data scientists usually focus on a few areas and are complemented by a team of other scientists and analysts. Data engineering is also a broad field, but any individual data engineer doesn’t need to know the whole spectrum of skills. Data engineering is the aspect of data science that focuses on practical applications of data collection and analysis. For all the work that data scientists do to answer questions using large sets of information, there have to be mechanisms for collecting and validating that information. In this exciting new volume, the team of editors and contributors sketch the broad outlines of data engineering, then walk through more specific descriptions that illustrate specific data engineering roles. Data-driven discovery is revolutionizing the modeling, prediction, and control of complex systems. This book brings together machine learning, engineering mathematics, and mathematical physics to integrate modeling and control of dynamical systems with modern methods in data science. It highlights many of the recent advances in scientific computing that enable data-driven methods to be applied to a diverse range of complex systems, such as turbulence, the brain, climate, epidemiology, finance, robotics, and autonomy. Whether for the veteran engineer or scientist working in the field or laboratory, or the student or academic, this is a must-have for any library.

The Unrealized Opportunities with Real-Time Data

The amount of data generated from various processes and platforms has increased exponentially in the past decade, and the challenge of filtering useful data out of streams of raw data has become even greater. Meanwhile, drawing useful insights from that data has become even more important. In this incisive report, Federico Castanedo examines the challenges companies face when acting on data at rest as well as the benefits you unlock when acting on data as it's generated. Data engineers, enterprise architects, CTOs, and CIOs will explore the tools, processes, and mindset your company needs to process streaming data in real time. Learn how to make quick data-driven decisions to gain an edge on competitors. This report helps you:
- Explore gaps in today's real-time data architectures, including the limitations of real-time analytics to act on data immediately
- Examine use cases that can't be served efficiently with real-time analytics
- Understand how stream processing engines work with real-time data
- Learn how distributed data processing architectures, stream processing, streaming analytics, and event-based architectures relate to real-time data
- Understand how to transition from traditional batch processing environments to stream processing
Federico Castanedo is an academic director and adjunct professor at IE University in Spain. A data science and AI leader, he has extensive experience in academia, industry, and startups.

Geospatial Data Analytics on AWS

In "Geospatial Data Analytics on AWS," you will learn how to store, manage, and analyze geospatial data effectively using various AWS services. This book provides insight into building geospatial data lakes, leveraging AWS databases, and applying best practices to derive insights from spatial data in the cloud.
What this Book will help me do
- Design and manage geospatial data lakes on AWS leveraging S3 and other storage solutions.
- Analyze geospatial data using AWS services such as Athena and Redshift.
- Utilize machine learning models for geospatial data processing and analytics using SageMaker.
- Visualize geospatial data through services like Amazon QuickSight and OpenStreetMap integration.
- Avoid common pitfalls when managing geospatial data in the cloud.
Author(s)
Scott Bateman, Janahan Gnanachandran, and Jeff DeMuth bring their extensive experience in cloud computing and geospatial analytics to this book. With backgrounds in cloud architecture, data science, and geospatial applications, they aim to make complex topics accessible. Their collaborative approach ensures readers can practically apply concepts to real-world challenges.
Who is it for?
This book is ideal for GIS and data professionals, including developers, analysts, and scientists. It suits readers with a basic understanding of geographical concepts but no prior AWS experience. If you're aiming to enhance your cloud-based geospatial data management and analytics skills, this is the guide for you.

Data for All

Do you know what happens to your personal data when you are browsing, buying, or using apps? Discover how your data is harvested and exploited, and what you can do to access, delete, and monetize it. Data for All empowers everyone, from tech experts to the general public, to control how third parties use personal data. Read this eye-opening book to learn:
- The types of data you generate with every action, every day
- Where your data is stored, who controls it, and how much money they make from it
- How you can manage access and monetization of your own data
- Restricting data access to only companies and organizations you want to support
- The history of how we think about data, and why that is changing
- The new data ecosystem being built right now for your benefit
The data you generate every day is the lifeblood of many large companies, and they make billions of dollars using it. In Data for All, bestselling author John K. Thompson outlines how this one-sided data economy is about to undergo a dramatic change. Thompson pulls back the curtain to reveal the true nature of data ownership, and how you can turn your data from a revenue stream for companies into a financial asset for your benefit.
About the Technology
Do you know what happens to your personal data when you’re browsing and buying? New global laws are turning the tide on companies who make billions from your clicks, searches, and likes. This eye-opening book provides an inspiring vision of how you can take back control of the data you generate every day.
About the Book
Data for All gives you a step-by-step plan to transform your relationship with data and start earning a “data dividend”: hundreds or thousands of dollars paid out simply for your online activities. You’ll learn how to oversee who accesses your data, how much different types of data are worth, and how to keep private details private.
What's Inside
- The types of data you generate with every action, every day
- How you can manage access and monetization of your own data
- The history of how we think about data, and why that is changing
- The new data ecosystem being built right now for your benefit
About the Reader
For anyone who is curious or concerned about how their data is used. No technical knowledge required.
About the Author
John K. Thompson is an international technology executive with over 37 years of experience in the fields of data, advanced analytics, and artificial intelligence.
Quotes
An honest, direct, pull-no-punches source on one of the most important personal issues of our time... I changed some of my own behaviors after reading the book, and I suggest you do so as well. You have more to lose than you may think. - From the Foreword by Thomas H. Davenport, author of Competing on Analytics and The AI Advantage
A must-read for anyone interested in the future of data. It helped me understand the reasons behind the current data ecosystem and the laws that are shaping its future. A great resource for both professionals and individuals. I highly recommend it. - Ravit Jain, Founder & Host of The Ravit Show, Data Science Evangelist

Data Science for Civil Engineering

This book explains the use of data science-based techniques for modelling complex problems in civil engineering and finding optimal solutions to them. It deals with the basics of data science and essential mathematics and covers pertinent applications in structural and environmental engineering, construction management, and transportation.

Practical Data Privacy

Between major privacy regulations like the GDPR and CCPA and expensive and notorious data breaches, there has never been so much pressure to ensure data privacy. Unfortunately, integrating privacy into data systems is still complicated. This essential guide will give you a fundamental understanding of modern privacy building blocks, like differential privacy, federated learning, and encrypted computation. Based on hard-won lessons, this book provides solid advice and best practices for integrating breakthrough privacy-enhancing technologies into production systems. Practical Data Privacy answers important questions such as:
- What do privacy regulations like GDPR and CCPA mean for my data workflows and data science use cases?
- What does "anonymized data" really mean?
- How do I actually anonymize data?
- How do federated learning and analysis work?
- Homomorphic encryption sounds great, but is it ready for use?
- How do I compare and choose the best privacy-preserving technologies and methods? Are there open-source libraries that can help?
- How do I ensure that my data science projects are secure by default and private by design?
- How do I work with governance and infosec teams to implement internal policies appropriately?

Graph Data Science with Neo4j

"Graph Data Science with Neo4j" teaches you how to utilize Neo4j 5 and its Graph Data Science Library 2.0 for analyzing and making predictions with graph data. By integrating graph algorithms into actionable machine learning pipelines using Python, you'll harness the power of graph-based data models.
What this Book will help me do
- Query and manipulate graph data using Cypher in Neo4j.
- Design and implement graph datasets using your data and public sources.
- Utilize graph-specific algorithms for tasks such as link prediction.
- Integrate graph data science pipelines into machine learning projects.
- Understand and apply predictive modeling using the GDS Library.
Author(s)
Scifo, the author of "Graph Data Science with Neo4j," is an experienced data scientist with expertise in graph databases and advanced machine learning techniques. Their technical approach combines practical implementation with clear, step-by-step guidance to provide readers the skills they need to excel.
Who is it for?
This book is ideal for data scientists and analysts familiar with basic Neo4j concepts and Python-based data science workflows who wish to deepen their skills in graph algorithms and machine learning integration. It is particularly suited for professionals aiming to advance their expertise in graph data science for practical applications.

Numerical Methods Using Kotlin: For Data Science, Analysis, and Engineering

This in-depth guide covers a wide range of topics, including chapters on linear algebra, root finding, curve fitting, differentiation and integration, solving differential equations, random numbers and simulation, a whole suite of unconstrained and constrained optimization algorithms, statistics, regression and time series analysis. The mathematical concepts behind the algorithms are clearly explained, with plenty of code examples and illustrations to help even beginners get started. In this book, you'll implement numerical algorithms in Kotlin using NM Dev, an object-oriented and high-performance programming library for applied and industrial mathematics. Discover how Kotlin has many advantages over Java in its speed, and in some cases, ease of use. In this book, you’ll see how it can help you easily create solutions for your complex engineering and data science problems. After reading this book, you'll come away with the knowledge to create your own numerical models and algorithms using the Kotlin programming language.
What You Will Learn
- Program in Kotlin using a high-performance numerical library
- Learn the mathematics necessary for a wide range of numerical computing algorithms
- Convert ideas and equations into code
- Put together algorithms and classes to build your own engineering solutions
- Build solvers for industrial optimization problems
- Perform data analysis using basic and advanced statistics
Who This Book Is For
Programmers, data scientists, and analysts with prior experience programming in any language, especially Kotlin or Java.

Python for Data Analysis, 3rd Edition

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, and Jupyter in the process. Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.
- Use the Jupyter notebook and IPython shell for exploratory computing
- Learn basic and advanced features in NumPy
- Get started with data analysis tools in the pandas library
- Use flexible tools to load, clean, transform, merge, and reshape data
- Create informative visualizations with matplotlib
- Apply the pandas groupby facility to slice, dice, and summarize datasets
- Analyze and manipulate regular and irregular time series data
- Learn how to solve real-world data analysis problems with thorough, detailed examples