Spark

Hadoop: The Definitive Guide, 4th Edition

2015-04-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Tom White

Avro Hadoop Apache HBase HDFS Hive Parquet Data Streaming data data-engineering

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, youâ??ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. Youâ??ll learn about recent changes to Hadoop, and explore new case studies on Hadoopâ??s role in healthcare systems and genomics data processing. Learn fundamental components such as MapReduce, HDFS, and YARN Explore MapReduce in depth, including steps for developing applications with it Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN Learn two data formats: Avro for data serialization and Parquet for nested data Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer) Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop Learn the HBase distributed database and the ZooKeeper distributed configuration service

Data Science For Dummies

2015-03-09 · O'Reilly Data Science Books O'Reilly Amazon

book

by Lillian Pierson

AI/ML Big Data Data Science DataViz Hadoop RDBMS data data-science

Discover how data science can help you gain in-depth insight into your business - the easy way! Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the perfect starting point for IT professionals and students interested in making sense of their organization's massive data sets and applying their findings to real-world business scenarios. From uncovering rich data sources to managing large amounts of data within hardware and software limitations, ensuring consistency in reporting, merging various data sources, and beyond, you'll develop the know-how you need to effectively interpret data and tell a story that can be understood by anyone in your organization. Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis Details different data visualization techniques that can be used to showcase and summarize your data Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques Includes coverage of big data processing tools like MapReduce, Hadoop, Dremel, Storm, and Spark It's a big, big data world out there - let Data Science For Dummies help you harness its power and gain a competitive edge for your organization.

Field Guide to Hadoop

2015-03-02 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Marshall Presser , Kevin Sitto

Avro Big Data Cassandra Chef Cloud Computing Data Management Docker Hadoop Apache HBase HDFS Hive JSON +5 more

If your organization is about to enter the world of big data, you not only need to decide whether Apache Hadoop is the right platform to use, but also which of its many components are best suited to your task. This field guide makes the exercise manageable by breaking down the Hadoop ecosystem into short, digestible sections. You’ll quickly understand how Hadoop’s projects, subprojects, and related technologies work together. Each chapter introduces a different topic—such as core technologies or data transfer—and explains why certain components may or may not be useful for particular needs. When it comes to data, Hadoop is a whole new ballgame, but with this handy reference, you’ll have a good grasp of the playing field. Topics include: Core technologies—Hadoop Distributed File System (HDFS), MapReduce, YARN, and Spark Database and data management—Cassandra, HBase, MongoDB, and Hive Serialization—Avro, JSON, and Parquet Management and monitoring—Puppet, Chef, Zookeeper, and Oozie Analytic helpers—Pig, Mahout, and MLLib Data transfer—Scoop, Flume, distcp, and Storm Security, access control, auditing—Sentry, Kerberos, and Knox Cloud computing and virtualization—Serengeti, Docker, and Whirr

Learning Spark

2015-02-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Andy Konwinski (Databricks) , Holden Karau (Fight Health Insurance) , Matei Zaharia (Databricks) , Patrick Wendell (Databricks)

Analytics API Data Analytics Java Python Scala SQL Data Streaming apache-spark data data-engineering

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Learning Hadoop 2

2015-02-13 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by GABRIELE MODENA

Big Data Cloud Computing Hadoop Java Linux Unix data data-engineering

Delve into the world of big data with 'Learning Hadoop 2', a comprehensive guide to leveraging the capabilities of Hadoop 2 for data processing and analysis. In this book, you will explore the tools and frameworks that integrate with Hadoop, discovering the best ways to design and deploy effective workflows for managing and analyzing large datasets. What this Book will help me do Understand the fundamentals of the MapReduce framework and its applications. Utilize advanced tools such as Samza and Spark for real-time and iterative data processing. Manage large datasets with data mining techniques tailored for Hadoop environments. Deploy Hadoop applications across various infrastructures, including local clusters and cloud services. Create and orchestrate sophisticated data workflows and pipelines with Apache Pig and Oozie. Author(s) Gabriele Modena is an experienced developer and trained data specialist with a keen focus on distributed data processing frameworks. Having worked extensively with big data platforms, Gabriele brings practical insights and a hands-on perspective to technical subjects. His writing is concise and engaging, aiming to render complex concepts accessible. Who is it for? This book is ideal for system and application developers eager to learn practical implementations of the Hadoop framework. Readers should be familiar with the Unix/Linux command-line interface and Java programming. Prior experience with Hadoop will be advantageous, but not necessary.

Big Data Now: 2014 Edition

2014-12-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by O'Reilly Media, Inc.

AI/ML Analytics API Big Data Hadoop data data-engineering

In the four years that O'Reilly Media, Inc. has produced its annual Big Data Now report, the data field has grown from infancy into young adulthood. Data is now a leader in some fields and a driver of innovation in others, and companies that use data and analytics to drive decision-making are outperforming their peers. And while access to big data tools and techniques once required significant expertise, today many tools have improved and communities have formed to share best practices. Companies have also started to emphasize the importance of processes, culture, and people. The topics in represent the major forces currently shaping the data world: Big Data Now: 2014 Edition Cognitive augmentation: predictive APIs, graph analytics, and Network Science dashboards Intelligence matters: defining AI, modeling intelligence, deep learning, and "summoning the demon" Cheap sensors, fast networks, and distributed computing: stream processing, hardware data flows, and computing at the edge Data (science) pipelines: broadening the coverage of analytic pipelines with specialized tools Evolving marketplace of big data components: SSDs, Hadoop 2, Spark; and why datacenters need operating systems Design and social science: human-centered design, wearables and real-time communications, and wearable etiquette Building a data culture: moving from prediction to real-time adaptation; and why you need to become a data skeptic Perils of big data: data redlining, intrusive data analysis, and the state of big data ethics

Hadoop in Practice, Second Edition

2014-09-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alex Holmes

AI/ML Analytics Big Data Hadoop Java Kafka SQL data data-engineering

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data, using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere About the Technology About the Book It's always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, machine learning, managing large-scale clusters, and taming big data using Hadoop. This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You'll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available. Readers need to know a programming language like Java and have basic familiarity with Hadoop. What's Inside Thoroughly updated for Hadoop 2 How to write YARN applications Integrate real-time technologies like Storm, Impala, and Spark Predictive analytics using Mahout and RR About the Reader About the Author Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects. Quotes Very insightful. A deep dive into the Hadoop world. - Andrea Tarocchi, Red Hat, Inc. The most complete material on Hadoop and its ecosystem known to mankind! - Arthur Zubarev, Vital Insights Clear and concise, full of insights and highly applicable information. - Edward de Oliveira Ribeiro, DataStax, Inc. Comprehensive up-to-date coverage of Hadoop 2. - Muthusamy Manigandan, OzoneMedia

Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives

2014-05-07 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vijay Srinivas Agneeswaran Ph.D

AI/ML Analytics Big Data Data Analytics Data Governance Hadoop Cyber Security data data-engineering

Master alternative Big Data technologies that can do what Hadoop can't: real-time analytics and iterative machine learning. When most technical professionals think of Big Data analytics today, they think of Hadoop. But there are many cutting-edge applications that Hadoop isn't well suited for, especially real-time analytics and contexts requiring the use of iterative machine learning algorithms. Fortunately, several powerful new technologies have been developed specifically for use cases such as these. Big Data Analytics Beyond Hadoop is the first guide specifically designed to help you take the next steps beyond Hadoop. Dr. Vijay Srinivas Agneeswaran introduces the breakthrough Berkeley Data Analysis Stack (BDAS) in detail, including its motivation, design, architecture, Mesos cluster management, performance, and more. He presents realistic use cases and up-to-date example code for: Spark, the next generation in-memory computing technology from UC Berkeley Storm, the parallel real-time Big Data analytics technology from Twitter GraphLab, the next-generation graph processing paradigm from CMU and the University of Washington (with comparisons to alternatives such as Pregel and Piccolo) Halo also offers architectural and design guidance and code sketches for scaling machine learning algorithms to Big Data, and then realizing them in real-time. He concludes by previewing emerging trends, including real-time video analytics, SDNs, and even Big Data governance, security, and privacy issues. He identifies intriguing startups and new research possibilities, including BDAS extensions and cutting-edge model-driven analytics. Big Data Analytics Beyond Hadoop is an indispensable resource for everyone who wants to reach the cutting edge of Big Data analytics, and stay there: practitioners, architects, programmers, data scientists, researchers, startup entrepreneurs, and advanced students.

AB Testing Workshop Mastering Experimentation and Statistical Analysis

· AB Testing Workshop Mastering Experimentation and Statistical Analysis

talk

by Dr. Yasin Ceran (KAIST)

Python SQL

A/B testing is a powerful technique used in data-driven decision-making to evaluate the impact of changes or interventions. In this comprehensive workshop, you will gain a deep understanding of A/B testing principles, learn how to design effective experiments, and analyze the results using statistical methods.

Workshop Curriculum:

Module 1: Introduction to A/B Testing Module 2: Formulating Hypotheses and Experiment Design Module 3: Statistical Analysis Techniques Module 4: Interpreting and Presenting Results Module 5: A/B Testing in Practice with Python

Prerequisites:

Basic understanding of statistics and hypothesis testing
Familiarity with basic concepts of data analysis
Familiarity with Python

Agenda:

7:45 pm - 7:50 pm Arrival, socializing, and Opening 7:50 pm - 9:45 pm Dr. Yasin Ceran, "AB Testing Workshop Mastering Experimentation and Statistical Analysis" 9:45 pm - 9:50 pm Q&A

About Dr. Yasin Ceran:

Yasin Ceran is passionate about all things data and holds a vast experience in data analysis, mathematical modeling and Apache Spark, and in SQL, Python and R. He is currently an associate professor at KAIST, South Korea, as well as teaching at San Jose State University at the heart of Silicon Valley. Yasin has worked rigorously on an array of data-related projects encompassing data mining, statistics, modeling, and is dedicated to sharing his experience and expertise with learners.

Powering Azure Fabric and AI with Eon’s Data Lake

· Microsoft Ignite 2025 Watch

talk

AI/ML Analytics Azure Cloud Computing Data Lake Iceberg LLM Microsoft Fabric SQL

With Eon on Azure, backups don’t just sit idle—they become a first-class data source. Eon transforms cloud backups into Iceberg tables in Blob Storage, instantly queryable through Microsoft Fabric and OneLake. Learn how backup data flows into Fabric engines like SQL, Spark, and KQL, and how it fuels AI innovation with Azure OpenAI. See how organizations can collaborate more effectively by unifying protection, analytics, and AI on Eon’s data lake.

Powering Up Your Power BI Development Workflow with Microsoft Fabric Notebooks

· April Meetup - Double session with Gigi Chiang and Prathy Kamasani

talk

Python copilot microsoft fabric notebooks power bi sempy

Session about Microsoft Fabric Notebooks, basics of Python and Spark, using Notebooks to write and run code, leveraging CoPilot for code suggestions and corrections, and using SemPy to enrich Power BI development with advanced analytics.

Scaling Vector Search Up & Out with Spark

· PyTorch Meetup #21

talk

by Ash Vardanian (Unum Cloud)

jvm-native lucene usearch vector search

Apache Lucene was never built for AI-scale vector search. In this talk, I’ll show how USearch bridges the JVM–native gap, pairing Spark’s horizontal scale with native vertical speed.