talk-data.com

Topic: Python

Tags: programming_language, data_science, web_development

Tagged: 151

Activity Trend: 185 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books
Cyber Resilient Infrastructure: Detect, Protect, and Mitigate Threats Against Brocade SAN FOS with IBM QRadar

Enterprise networks are large and rely on numerous connected endpoints to ensure smooth operational efficiency. However, they also present a challenge from a security perspective. The focus of this Blueprint is to demonstrate early threat detection against a Brocade-powered network fabric by using IBM® QRadar®, and to protect that fabric if a cyberattack or an internal threat from a rogue user within the organization occurs. The publication also describes how to configure syslog forwarding on Brocade SAN FOS. Finally, it explains how the forwarded audit events are used to detect the threat and how a custom action is run to mitigate it. The focus of this publication is to proactively start a cyber resilience workflow from IBM QRadar to block an IP address when multiple failed logins on a Brocade switch are detected. As part of early threat detection, a sample rule used by IBM QRadar is shown. A Python script that is used as a response to block the user's IP address on the switch is also provided. Customers are encouraged to create control path or data path use cases, customized IBM QRadar rules, and custom response scripts that are best suited to their environment. The use cases, QRadar rules, and Python script that are presented here are templates only and cannot be used as-is in an environment.
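The Blueprint ships its own template script; purely to illustrate the response pattern it describes, here is a minimal, hypothetical Python sketch that receives an offending IP address from a custom action and pushes a block request to a switch management endpoint. The endpoint URL, credentials, and payload format are assumptions for illustration, not the Brocade or QRadar API.

    import os
    import sys

    import requests  # third-party; pip install requests

    # Hypothetical switch management endpoint and credentials -- placeholders only.
    SWITCH_API = os.environ.get("SWITCH_API", "https://switch.example.com/api/block")
    SWITCH_USER = os.environ.get("SWITCH_USER", "admin")
    SWITCH_PASS = os.environ.get("SWITCH_PASS", "")

    def block_ip(offending_ip: str) -> None:
        """Ask the (hypothetical) switch management API to block one source IP."""
        response = requests.post(
            SWITCH_API,
            json={"action": "deny", "source_ip": offending_ip},
            auth=(SWITCH_USER, SWITCH_PASS),
            timeout=10,
        )
        response.raise_for_status()

    if __name__ == "__main__":
        # A custom action would pass the offending IP as a command-line argument.
        block_ip(sys.argv[1])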

Essential PySpark for Scalable Data Analytics

Dive into the world of scalable data processing with 'Essential PySpark for Scalable Data Analytics'. This book is a comprehensive guide that helps beginners understand and utilize PySpark to process, analyze, and draw insights from large datasets effectively. With hands-on tutorials and clear explanations, you will gain the confidence to tackle big data analytics challenges. What this Book will help me do Understand and apply the distributed computing paradigm for big data. Learn to perform scalable data ingestion, cleansing, and preparation using PySpark. Create and utilize data lakes and the Lakehouse paradigm for efficient data storage and access. Develop and deploy machine learning models with scalability in mind. Master real-time analytics pipelines and create impactful data visualizations. Author(s) Nudurupati is an experienced data engineer and educator, specializing in distributed systems and big data technologies. With years of practical experience in the field, the author brings a clear and approachable teaching style to technical topics. Passionate about empowering readers, the author has designed this book to be both practical and inspirational for aspiring data practitioners. Who is it for? This book is ideal for data professionals including data scientists, engineers, and analysts looking to scale their data analytics processes. It assumes familiarity with basic data science concepts and Python, as well as some experience with SQL-like data analysis. It is particularly suitable for individuals aiming to expand their knowledge in distributed computing and PySpark to handle big data challenges. Achieving scalable and efficient data solutions is at the core of this guide.
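As a flavor of the kind of scalable ingestion and cleansing workflow the book covers, the following is a minimal PySpark sketch; the file paths and column names are placeholders, not examples from the book.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ingest-and-clean").getOrCreate()

    # Ingest raw CSV data (placeholder path), letting Spark infer the schema.
    raw = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

    # Basic cleansing: drop duplicates, remove rows missing a key column,
    # and normalize a string column.
    clean = (
        raw.dropDuplicates()
           .dropna(subset=["event_id"])
           .withColumn("country", F.upper(F.col("country")))
    )

    # Persist the prepared data in a columnar format for downstream analytics.
    clean.write.mode("overwrite").parquet("/data/curated/events")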

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Data Engineering with Apache Spark, Delta Lake, and Lakehouse is a comprehensive guide packed with practical knowledge for building robust and scalable data pipelines. Throughout this book, you will explore the core concepts and applications of Apache Spark and Delta Lake, and learn how to design and implement efficient data engineering workflows using real-world examples. What this Book will help me do Master the core concepts and components of Apache Spark and Delta Lake. Create scalable and secure data pipelines for efficient data processing. Learn best practices and patterns for building enterprise-grade data lakes. Discover how to operationalize data models into production-ready pipelines. Gain insights into deploying and monitoring data pipelines effectively. Author(s) Kukreja is a seasoned data engineer with over a decade of experience working with big data platforms. He specializes in implementing efficient and scalable data solutions to meet the demands of modern analytics and data science. Writing with clarity and a practical approach, he aims to provide actionable insights that professionals can apply to their projects. Who is it for? This book is tailored for aspiring data engineers and data analysts who wish to delve deeper into building scalable data platforms. It is suitable for those with basic knowledge of Python, Spark, and SQL who are seeking to learn Delta Lake and advanced data engineering concepts. Readers should be eager to develop practical skills for tackling real-world data engineering challenges.
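To give a sense of the Delta Lake layer discussed here, a minimal PySpark sketch follows; it assumes a Spark session configured with the open source delta-spark package, and the table path is a placeholder rather than anything from the book.

    from pyspark.sql import SparkSession

    # Assumes the delta-spark package is available and the session is
    # configured with the Delta Lake SQL extensions.
    spark = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    df = spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "layer"])

    # Write the DataFrame as a Delta table (placeholder path), then read it back.
    df.write.format("delta").mode("overwrite").save("/data/lakehouse/layers")
    spark.read.format("delta").load("/data/lakehouse/layers").show()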

Advanced Analytics with Transact-SQL: Exploring Hidden Patterns and Rules in Your Data

Learn about business intelligence (BI) features in T-SQL and how they can help you with data science and analytics efforts without the need to bring in other languages such as R and Python. This book shows you how to compute statistical measures using your existing skills in T-SQL. You will learn how to calculate descriptive statistics, including centers, spreads, skewness, and kurtosis of distributions. You will also learn to find associations between pairs of variables, including calculating linear regression formulas and confidence levels with definite integration. No analysis is good without data quality. Advanced Analytics with Transact-SQL introduces data quality issues and shows you how to check for completeness and accuracy, and measure improvements in data quality over time. The book also explains how to optimize queries involving temporal data, such as when you search for overlapping intervals. More advanced time-oriented information in the book includes hazard and survival analysis. Forecasting with exponential moving averages and autoregression is covered as well. Every web/retail shop wants to know the products customers tend to buy together. Trying to predict the target discrete or continuous variable with few input variables is important for practically every type of business. This book helps you understand data science and the advanced algorithms used to analyze data, and terms such as data mining, machine learning, and text mining. Key to many of the solutions in this book are T-SQL window functions. Author Dejan Sarka demonstrates efficient statistical queries that are based on window functions and optimized through algorithms built using mathematical knowledge and creativity. The formulas and usage of those statistical procedures are explained so you can understand and modify the techniques presented. T-SQL is supported in SQL Server, Azure SQL Database, and in Azure Synapse Analytics. There are so many BI features in T-SQL that it might become your primary analytic database language. If you want to learn how to get information from your data with the T-SQL language that you already are familiar with, then this is the book for you. What You Will Learn Describe distribution of variables with statistical measures Find associations between pairs of variables Evaluate the quality of the data you are analyzing Perform time-series analysis on your data Forecast values of a continuous variable Perform market-basket analysis to predict customer purchasing patterns Predict target variable outcomes from one or more input variables Categorize passages of text by extracting and analyzing keywords Who This Book Is For Database developers and database administrators who want to translate their T-SQL skills into the world of business intelligence (BI) and data science. For readers who want to analyze large amounts of data efficiently by using their existing knowledge of T-SQL and Microsoft’s various database platforms such as SQL Server and Azure SQL Database. Also for readers who want to improve their querying by learning new and original optimization techniques.

Distributed Data Systems with Azure Databricks

In 'Distributed Data Systems with Azure Databricks', you will explore the capabilities of Microsoft Azure Databricks as a platform for building and managing big data pipelines. Learn how to process, transform, and analyze data at scale while developing expertise in training distributed machine learning models and integrating them into enterprise workflows. What this Book will help me do Design and implement Extract, Transform, Load (ETL) pipelines using Azure Databricks. Conduct distributed training of machine learning models using TensorFlow and Horovod. Integrate Azure Databricks with Azure Data Factory for optimized data pipeline orchestration. Utilize Delta Engine for efficient querying and analysis of data within Delta Lake. Employ Databricks Structured Streaming to manage real-time production-grade data flows. Author(s) Palacio is an experienced data engineer and cloud computing specialist, with extensive knowledge of the Microsoft Azure platform. With years of practical application of Databricks in enterprise settings, Palacio provides clear, actionable insights through relatable examples. They bring a passion for innovative solutions to the field of big data automation. Who is it for? This book is ideal for data engineers, machine learning engineers, and software developers looking to master Azure Databricks for large-scale data processing and analysis. Readers should have basic familiarity with cloud platforms, understanding of data pipelines, and a foundational grasp of Python and machine learning concepts. It is perfect for those wanting to create scalable and manageable data workflows.

Data Pipelines with Apache Airflow

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack. About the Technology Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task. About the Book Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs. What's Inside Build, test, and deploy Airflow pipelines as DAGs Automate moving and transforming data Analyze historical datasets using backfilling Develop custom components Set up Airflow in production environments About the Reader For DevOps, data engineers, machine learning engineers, and sysadmins with intermediate Python skills. About the Authors Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies. Bas is also an Airflow committer. Quotes An Airflow bible. Useful for all kinds of users, from novice to expert. - Rambabu Posa, Sai Aashika Consultancy An easy-to-follow exploration of the benefits of orchestrating your data pipeline jobs with Airflow. - Daniel Lamblin, Coupang The one reference you need to create, author, schedule, and monitor workflows with Apache Airflow. Clear recommendation. - Thorsten Weber, bbv Software Services AG By far the best resource for Airflow. - Jonathan Wood, LexisNexis
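As a taste of the DAG-as-Python-code style the book centers on, below is a minimal sketch of an Airflow DAG with two dependent tasks; the task bodies and schedule are placeholders, and the import paths follow Airflow 2.x.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")

    def load():
        print("write the transformed data to the warehouse")

    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",   # Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task  # load runs only after extract succeeds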

Introducing .NET for Apache Spark: Distributed Processing for Massive Datasets

Get started using Apache Spark via C# or F# and the .NET for Apache Spark bindings. This book is an introduction to both Apache Spark and the .NET bindings. Readers new to Apache Spark will get up to speed quickly using Spark for data processing tasks performed against large and very large datasets. You will learn how to combine your knowledge of .NET with Apache Spark to bring massive computing power to bear by distributed processing of extremely large datasets across multiple servers. This book covers how to get a local instance of Apache Spark running on your developer machine and shows you how to create your first .NET program that uses the Microsoft .NET bindings for Apache Spark. Techniques shown in the book allow you to use Apache Spark to distribute your data processing tasks over multiple compute nodes. You will learn to process data using both batch mode and streaming mode so you can make the right choice depending on whether you are processing an existing dataset or are working against new records in micro-batches as they arrive. The goal of the book is to leave you comfortable in bringing the power of Apache Spark to your favorite .NET language. What You Will Learn Install and configure Spark .NET on Windows, Linux, and macOS Write Apache Spark programs in C# and F# using the .NET bindings Access and invoke the Apache Spark APIs from .NET with the same high performance as Python, Scala, and R Encapsulate functionality in user-defined functions Transform and aggregate large datasets Execute SQL queries against files through Apache Hive Distribute processing of large datasets across multiple servers Create your own batch, streaming, and machine learning programs Who This Book Is For .NET developers who want to perform big data processing without having to migrate to Python, Scala, or R; and Apache Spark developers who want to run natively on .NET and take advantage of the C# and F# ecosystems

Data Engineering with Python

Discover the inner workings of data pipelines with 'Data Engineering with Python', a practical guide to mastering the art of data engineering. Through hands-on examples, you'll explore the process of designing data models, implementing data pipelines, and automating data flows, all within the context of Python. What this Book will help me do Understand the fundamentals of designing data architectures and capturing data requirements. Extract, clean, and transform data from various sources, refining it for precise applications. Implement end-to-end data pipelines, including staging, validation, and production deployment. Leverage Python to connect with databases, perform data manipulations, and build analytics workflows. Monitor and log data pipelines to ensure smooth, real-time operations and high quality. Author(s) Paul Crickard is a seasoned expert in data engineering and analytics, bringing years of practical experience to this technical guide. His unique ability to make complex technical concepts accessible makes this book invaluable for learners and professionals alike. A lifelong technologist, Paul focuses on actionable skills and building confidence to work with data pipelines and models. Who is it for? This book is ideal for aspiring data engineers, data analysts aiming to elevate their technical skillsets, or IT professionals transitioning into data-driven roles. Whether you're just stepping into the field or enhancing your Python-based data capabilities, this book is tailored to provide solid grounding and practical expertise. Beginners in data engineering will find it accessible and easy to get started, while those refreshing their knowledge will benefit from its focused projects.
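For readers wondering what a small Python-only pipeline step looks like, here is a minimal extract-clean-load sketch using pandas and SQLite; the file, table, and column names are illustrative placeholders, not code from the book.

    import sqlite3

    import pandas as pd

    # Extract: read a raw CSV export (placeholder file name).
    raw = pd.read_csv("sensor_readings.csv")

    # Transform: drop incomplete rows, parse timestamps, and deduplicate.
    clean = (
        raw.dropna(subset=["sensor_id", "reading"])
           .assign(recorded_at=lambda df: pd.to_datetime(df["recorded_at"]))
           .drop_duplicates(subset=["sensor_id", "recorded_at"])
    )

    # Load: stage the cleaned data in a local SQLite table.
    with sqlite3.connect("staging.db") as conn:
        clean.to_sql("sensor_readings", conn, if_exists="replace", index=False)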

Learn MongoDB 4.x

Explore the capabilities of MongoDB 4.x with this comprehensive guide designed for developers and administrators working with NoSQL databases. Dive into topics such as database design, advanced query handling, and security configuration, and gain hands-on experience through practical examples and insights. What this Book will help me do Learn to configure and install MongoDB 4.x for development and administration. Understand the principles of NoSQL schema design for optimal performance. Perform complex queries and operations to manage your MongoDB databases. Secure your MongoDB setup with role-based access control and encryption techniques. Monitor and optimize database performance for production environments. Author(s) Bierer, the author of 'Learn MongoDB 4.x,' is a seasoned database expert with extensive experience in NoSQL technologies. With a focus on practicality and clear explanations, the author brings deep insights into MongoDB's development and administration. Who is it for? This book is ideal for early-career developers, system administrators, and database enthusiasts eager to break into NoSQL technologies. If you are familiar with Python and basic database concepts, this book will guide you through mastering MongoDB. It's perfect for those building dynamic backend systems.
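Since the blurb assumes familiarity with Python, here is a minimal sketch of working with MongoDB from Python using the pymongo driver; the connection string, database, and field names are placeholders.

    from pymongo import ASCENDING, MongoClient

    # Connect to a local MongoDB instance (placeholder connection string).
    client = MongoClient("mongodb://localhost:27017")
    orders = client["bookstore"]["orders"]

    # Insert a document and index the field used for lookups.
    orders.insert_one({"customer": "ada", "total": 42.50, "status": "paid"})
    orders.create_index([("customer", ASCENDING)])

    # Query: all paid orders over a threshold, newest first by _id.
    for doc in orders.find({"status": "paid", "total": {"$gt": 10}}).sort("_id", -1):
        print(doc["customer"], doc["total"])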

Learning ArcGIS Pro 2 - Second Edition

Learning ArcGIS Pro 2 is your comprehensive guide to mastering the capabilities of ArcGIS Pro for geospatial analysis and cartography. You'll learn to create both 2D and 3D maps, edit and visualize geospatial data, and automate workflows using Python and ModelBuilder. This book provides the foundational skills you need to effectively work with GIS data and projects. What this Book will help me do Navigate the ArcGIS Pro interface to create, analyze, and share GIS projects efficiently. Visualize and interpret geographic data using 2D and 3D mapping techniques. Use Arcade language to customize labels and symbology for better map clarity. Automate GIS workflows through Python scripts and ModelBuilder for increased efficiency. Create and share professional-quality map layouts and series with ease. Author(s) Tripp Corbin, GISP, is a GIS Professional with extensive experience in geographic data analysis and ArcGIS software. As a seasoned instructor and author, Tripp aims to make GIS accessible by breaking down complex topics into manageable concepts. His hands-on teaching approach is reflected throughout this book, providing clear guidance and practical knowledge. Who is it for? This book is ideal for beginner GIS enthusiasts or professionals looking to transition to ArcGIS Pro. It is well-suited for those with minimal exposure to GIS or no prior experience with ArcGIS software. Whether you aim to explore geospatial concepts or acquire skills for professional applications, this book provides a solid foundation.

Learning Spark, 2nd Edition

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to: Learn Python, SQL, Scala, or Java high-level Structured APIs Understand Spark operations and SQL Engine Inspect, tune, and debug Spark operations with Spark configurations and Spark UI Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka Perform analytics on batch and streaming data using Structured Streaming Build reliable data pipelines with open source Delta Lake and Spark Develop machine learning pipelines with MLlib and productionize models using MLflow
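As an illustration of the structured APIs and SQL engine the book emphasizes, a minimal PySpark sketch follows, expressing the same query through the DataFrame API and through SQL; the JSON path and column names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("structured-apis").getOrCreate()

    flights = spark.read.json("/data/flights.json")  # placeholder path

    # DataFrame API: average delay per origin airport, worst first.
    by_origin = (
        flights.groupBy("origin")
               .agg(F.avg("delay").alias("avg_delay"))
               .orderBy(F.desc("avg_delay"))
    )
    by_origin.show(5)

    # The same query through the SQL engine.
    flights.createOrReplaceTempView("flights")
    spark.sql("""
        SELECT origin, AVG(delay) AS avg_delay
        FROM flights
        GROUP BY origin
        ORDER BY avg_delay DESC
        LIMIT 5
    """).show()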

Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud

Analyze vast amounts of data in record time using Apache Spark with Databricks in the Cloud. Learn the fundamentals, and more, of running analytics on large clusters in Azure and AWS, using Apache Spark with Databricks on top. Discover how to squeeze the most value out of your data at a mere fraction of what classical analytics solutions cost, while at the same time getting the results you need, incrementally faster. This book explains how the confluence of these pivotal technologies gives you enormous power, and cheaply, when it comes to huge datasets. You will begin by learning how cloud infrastructure makes it possible to scale your code to large amounts of processing units, without having to pay for the machinery in advance. From there you will learn how Apache Spark, an open source framework, can enable all those CPUs for data analytics use. Finally, you will see how services such as Databricks provide the power of Apache Spark, without you having to know anything about configuring hardware or software. By removing the need for expensive experts and hardware, your resources can instead be allocated to actually finding business value in the data. This book guides you through some advanced topics such as analytics in the cloud, data lakes, data ingestion, architecture, machine learning, and tools, including Apache Spark, Apache Hadoop, Apache Hive, Python, and SQL. Valuable exercises help reinforce what you have learned. What You Will Learn Discover the value of big data analytics that leverage the power of the cloud Get started with Databricks using SQL and Python in either Microsoft Azure or AWS Understand the underlying technology, and how the cloud and Apache Spark fit into the bigger picture See how these tools are used in the real world Run basic analytics, including machine learning, on billions of rows at a fraction of the cost, or for free Who This Book Is For Data engineers, data scientists, and cloud architects who want or need to run advanced analytics in the cloud. It is assumed that the reader has data experience, but perhaps minimal exposure to Apache Spark and Azure Databricks. The book is also recommended for people who want to get started in the analytics field, as it provides a strong foundation.

Spark in Action, Second Edition

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop. About the Technology Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem. About the Book Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms. What's Inside Writing Spark applications in Java Spark application architecture Ingestion through files, databases, streaming, and Elasticsearch Querying distributed datasets with Spark SQL About the Reader This book does not assume previous experience with Spark, Scala, or Hadoop. About the Author Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years. Quotes This book reveals the tools and secrets you need to drive innovation in your company or community. - Rob Thomas, IBM An indispensable, well-paced, and in-depth guide. A must-have for anyone into big data and real-time stream processing. - Anupam Sengupta, GuardHat Inc. This book will help spark a love affair with distributed processing. - Conor Redmond, InComm Product Control Currently the best book on the subject! - Markus Breuer, Materna IPS

Cassandra: The Definitive Guide, 3rd Edition

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you’ll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This third edition—updated for Cassandra 4.0—provides the technical details and practical examples you need to put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra’s nonrelational design, with special attention to data modeling. If you’re a developer, DBA, or application architect looking to solve a database scaling issue or future-proof your application, this guide helps you harness Cassandra’s speed and flexibility. Understand Cassandra’s distributed and decentralized structure Use the Cassandra Query Language (CQL) and cqlsh—the CQL shell Create a working data model and compare it with an equivalent relational model Develop sample applications using client drivers for languages including Java, Python, and Node.js Explore cluster topology and learn how nodes exchange data
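To show what the client-driver material looks like from Python, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, table, and columns are placeholders, not schema from the book.

    from cassandra.cluster import Cluster

    # Connect to a local node (placeholder contact point) and pick a keyspace.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("hotel")

    # Prepared statements are parsed once and reused with different values.
    insert = session.prepare(
        "INSERT INTO reservations (confirm_number, guest_name) VALUES (?, ?)"
    )
    session.execute(insert, ("RS-1001", "Ada Lovelace"))

    rows = session.execute(
        "SELECT confirm_number, guest_name FROM reservations LIMIT 10"
    )
    for row in rows:
        print(row.confirm_number, row.guest_name)

    cluster.shutdown()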

Mastering Large Datasets with Python

Modern data science solutions need to be clean, easy to read, and scalable. In Mastering Large Datasets with Python, author J.T. Wolohan teaches you how to take a small project and scale it up using a functionally influenced approach to Python coding. You’ll explore methods and built-in Python tools that lend themselves to clarity and scalability, like the high-performing parallelism method, as well as distributed technologies that allow for high data throughput. The abundant hands-on exercises in this practical tutorial will lock in these essential skills for any large-scale data science project. About the Technology Programming techniques that work well on laptop-sized data can slow to a crawl—or fail altogether—when applied to massive files or distributed datasets. By mastering the powerful map and reduce paradigm, along with the Python-based tools that support it, you can write data-centric applications that scale efficiently without requiring codebase rewrites as your requirements change. About the Book Mastering Large Datasets with Python teaches you to write code that can handle datasets of any size. You’ll start with laptop-sized datasets that teach you to parallelize data analysis by breaking large tasks into smaller ones that can run simultaneously. You’ll then scale those same programs to industrial-sized datasets on a cluster of cloud servers. With the map and reduce paradigm firmly in place, you’ll explore tools like Hadoop and PySpark to efficiently process massive distributed datasets, speed up decision-making with machine learning, and simplify your data storage with AWS S3. What's Inside An introduction to the map and reduce paradigm Parallelization with the multiprocessing module and pathos framework Hadoop and Spark for distributed computing Running AWS jobs to process large datasets About the Reader For Python programmers who need to work faster with more data. About the Author J. T. Wolohan is a lead data scientist at Booz Allen Hamilton, and a PhD researcher at Indiana University, Bloomington. Quotes A clear and efficient path to mastery of the map and reduce paradigm for developers of all levels. - Justin Fister, GrammarBot An amazing book for anybody looking to add parallel processing and the map/reduce pattern to their toolkit. - Gary Bake, Radius Payment Solutions Learn fundamentals of MapReduce and other core concepts and save money on expensive hardware! - Al Krinker, USPTO A comprehensive guide to the fundamentals of efficient Python data processing. - Craig Pfeifer, MITRE Corporation
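The map and reduce pattern at the heart of the book can be sketched in a few lines of standard-library Python; the word-count task and file list below are illustrative placeholders rather than the book's own exercises.

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    def count_words(path):
        """Map step: turn one file into a Counter of word frequencies."""
        with open(path, encoding="utf-8") as handle:
            return Counter(handle.read().split())

    def merge(left, right):
        """Reduce step: combine two partial results."""
        return left + right

    if __name__ == "__main__":
        files = ["part-0001.txt", "part-0002.txt", "part-0003.txt"]  # placeholders

        # Map in parallel across worker processes, then reduce the partial counts.
        with Pool() as pool:
            partial_counts = pool.map(count_words, files)
        totals = reduce(merge, partial_counts, Counter())

        print(totals.most_common(10))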

Hands On Google Cloud SQL and Cloud Spanner: Deployment, Administration and Use Cases with Python

Discover the methodologies and best practices for getting started with Google Cloud Platform relational services – CloudSQL and CloudSpanner. The book begins with the basics of working with the Google Cloud Platform along with an introduction to the database technologies available for developers from Google Cloud. You'll then take an in-depth, hands-on journey into Google CloudSQL and CloudSpanner, including choosing the right platform for your application needs, planning, provisioning, designing, and developing your application. Sample applications are given that use Python to connect to CloudSQL and CloudSpanner, along with helpful features provided by the engines. You'll also implement practical best practices in the last chapter. Hands On Google Cloud SQL and Cloud Spanner is a great starting point to apply GCP data offerings in your technology stack, and the code used allows you to try out the examples and extend them in interesting ways. What You'll Learn Get started with Big Data technologies on the Google Cloud Platform Review CloudSQL and Cloud Spanner from basics to administration Apply best practices and use Google’s CloudSQL and CloudSpanner offering Work with code in Python notebooks and scripts Who This Book Is For Application architects, database architects, software developers, data engineers, cloud architects.

Learn PySpark: Build Python-based Machine Learning and Deep Learning Models

Leverage machine and deep learning models to build applications on real-time data using PySpark. This book is perfect for those who want to learn to use this language to perform exploratory data analysis and solve an array of business challenges. You'll start by reviewing PySpark fundamentals, such as Spark’s core architecture, and see how to use PySpark for big data processing, including data ingestion, cleaning, and transformation techniques. This is followed by building workflows for analyzing streaming data using PySpark and a comparison of various streaming platforms. You'll then see how to schedule different Spark jobs using Airflow with PySpark, and the book examines tuning machine and deep learning models for real-time predictions. The book concludes with a discussion of graph frames and performing network analysis using graph algorithms in PySpark. All the code presented in the book is available in Python scripts on GitHub. What You'll Learn Develop pipelines for streaming data processing using PySpark Build machine learning and deep learning models using PySpark's latest offerings Perform graph analytics using PySpark Create sequence embeddings from text data Who This Book is For Data scientists and machine learning and deep learning engineers who want to learn and use PySpark for real-time analysis of streaming data.
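For a feel of the streaming workflows covered here, below is a minimal PySpark Structured Streaming sketch that counts words arriving on a socket; the host, port, and output sink are placeholders chosen for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

    # Read a stream of text lines from a socket source (placeholder host/port).
    lines = (
        spark.readStream.format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
    )

    # Split lines into words and keep a running count per word.
    counts = (
        lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
             .groupBy("word")
             .count()
    )

    # Write the continuously updated counts to the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()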

Introducing MySQL Shell: Administration Made Easy with Python

Use MySQL Shell, the first modern and advanced client for connecting to and interacting with MySQL. It supports SQL, Python, and JavaScript. That’s right! You can write Python scripts and execute them within the shell interactively, or in batch mode. The level of automation available from Python combined with batch mode is especially helpful to those practicing DevOps methods in their database environments. Introducing MySQL Shell covers everything you need to know about MySQL Shell. You will learn how to use the shell for SQL, as well as the new application programming interfaces for working with a document store and even automating your management of MySQL servers using Python. The book includes a look at the supporting technologies and concepts such as JSON, schema-less documents, NoSQL, MySQL Replication, Group Replication, InnoDB Cluster, and more. MySQL Shell is the client that developers and database administrators have been waiting for. Far more powerful than the legacy client, MySQL Shell enables levels of automation that are useful not only for MySQL, but in the broader context of your career as well. Automate your work and build skills in one of the most in-demand languages. With MySQL Shell, you can do both! What You'll Learn Use MySQL Shell with the newest features in MySQL 8 Discover what a Document Store is and how to manage it with MySQL Shell Configure Group Replication and InnoDB Cluster from MySQL Shell Understand the new MySQL Python application programming interfaces Write Python scripts for managing your data and the MySQL high availability features Who This Book Is For Developers and database professionals who want to automate their work and remain on the cutting edge of what MySQL has to offer. Anyone not happy with the limited automation capabilities of the legacy command-line client will find much to like in this book on the MySQL Shell that supports powerful automation through the Python scripting language.
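As a rough illustration of the Python-mode automation the book describes, here is a minimal sketch intended to run inside MySQL Shell 8.0 with a connected global session; the connection URI and query are placeholders, and exact method names can vary between shell releases.

    # Run inside MySQL Shell in Python mode (\py), for example:
    #   mysqlsh root@localhost --py -f check_schemas.py
    # Assumes the global `session` object provided by the shell when connected.

    result = session.run_sql(
        "SELECT table_schema, COUNT(*) AS tables "
        "FROM information_schema.tables "
        "GROUP BY table_schema"
    )

    for row in result.fetch_all():
        # Row values are accessible by position.
        print("{0}: {1} tables".format(row[0], row[1]))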

Mastering SQL Server 2017

Leverage the power of SQL Server 2017 Integration Services to build data integration solutions with ease Key Features Work with temporal tables to access information stored in a table at any time Get familiar with the latest features in SQL Server 2017 Integration Services Program and extend your packages to enhance their functionality Book Description Microsoft SQL Server 2017 uses the power of R and Python for machine learning and containerization-based deployment on Windows and Linux. By learning how to use the features of SQL Server 2017 effectively, you can build scalable apps and easily perform data integration and transformation. You'll start by brushing up on the features of SQL Server 2017. This Learning Path will then demonstrate how you can use Query Store, columnstore indexes, and In-Memory OLTP in your apps. You'll also learn to integrate Python code in SQL Server and graph database implementations for development and testing. Next, you'll get up to speed with designing and building SQL Server Integration Services (SSIS) data warehouse packages using SQL Server Data Tools. Toward the concluding chapters, you'll discover how to develop SSIS packages designed to maintain a data warehouse using the data flow and other control flow tasks. By the end of this Learning Path, you'll be equipped with the skills you need to design efficient, high-performance database applications with confidence. This Learning Path includes content from the following Packt books: SQL Server 2017 Developer's Guide by Milos Radivojevic, Dejan Sarka, et al. SQL Server 2017 Integration Services Cookbook by Christian Cote, Dejan Sarka, et al. What you will learn Use columnstore indexes to make storage and performance improvements Extend database design solutions using temporal tables Exchange JSON data between applications and SQL Server Migrate historical data to Microsoft Azure by using Stretch Database Design the architecture of a modern Extract, Transform, and Load (ETL) solution Implement ETL solutions using Integration Services for both on-premises and Azure data Who this book is for This Learning Path is for database developers and solution architects looking to develop ETL solutions with SSIS, and explore the new features in SSIS 2017. Advanced analysis practitioners, business intelligence developers, and database consultants dealing with performance tuning will also find this book useful. Basic understanding of database concepts and T-SQL is required to get the best out of this Learning Path.

Big Data Simplified
"Big Data Simplified blends technology with strategy and delves into applications of big data in specialized areas, such as recommendation engines, data science and Internet of Things (IoT) and enables a practitioner to make the right technology choice. The steps to strategize a big data implementation are also discussed in detail. This book presents a holistic approach to the topic, covering a wide landscape of big

data technologies like Hadoop 2.0 and package implementations, such as Cloudera. In-depth discussion of associated technologies, such as MapReduce, Hive, Pig, Oozie, ApacheZookeeper, Flume, Kafka, Spark, Python and NoSQL databases like Cassandra, MongoDB, GraphDB, etc., is also included.