talk-data.com talk-data.com

Topic

Scala

programming_language functional_programming jvm

34

tagged

Activity Trend

12 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books ×
Spark in Action

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0. About the Technology Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades. About the Book Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code. What's Inside Updated for Spark 2.0 Real-life case studies Spark DevOps with Docker Examples in Scala, and online in Java and Python About the Reader Written for experienced programmers with some background in big data or machine learning. About the Authors Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community. Quotes Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide. - Jonathan Sharley, Pandora Media Must-have! Speed up your learning of Spark as a distributed computing framework. - Robert Ormandi, Yahoo! An easy-to-follow, step-by-step guide. - Gaurav Bhardwaj, 3Pillar Global An ambitiously comprehensive overview of Spark and its diverse ecosystem. - Jonathan Miller, Optensity

Fast Data Processing with Spark 2 - Third Edition

Fast Data Processing with Spark 2 takes you through the essentials of leveraging Spark for big data analysis. You will learn how to install and set up Spark, handle data using its APIs, and apply advanced functionality like machine learning and graph processing. By the end of the book, you will be well-equipped to use Spark in real-world data processing tasks. What this Book will help me do Install and configure Apache Spark for optimal performance. Interact with distributed datasets using the resilient distributed dataset (RDD) API. Leverage the flexibility of DataFrame API for efficient big data analytics. Apply machine learning models using Spark MLlib to solve complex problems. Perform graph analysis using GraphX to uncover structural insights in data. Author(s) Krishna Sankar is an experienced data scientist and thought leader in big data technologies. With a deep understanding of machine learning, distributed systems, and Apache Spark, Krishna has guided numerous projects in data engineering and big data processing. Matei Zaharia, the co-author, is also widely recognized in the field of distributed systems and cloud computing, contributing to Apache Spark development. Who is it for? This book is catered to software developers and data engineers with a foundational understanding of Scala or Java programming. Beginner to medium-level understanding of big data processing concepts is recommended for readers. If you are aspiring to solve big data problems using scalable distributed computing frameworks, this book is perfect for you. By the end, you will be confident in building Spark-powered applications and analyzing data efficiently.

Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka

Learn how to integrate full-stack open source big data architecture and to choose the correct technology—Scala/Spark, Mesos, Akka, Cassandra, and Kafka—in every layer. Big data architecture is becoming a requirement for many different enterprises. So far, however, the focus has largely been on collecting, aggregating, and crunching large data sets in a timely manner. In many cases now, organizations need more than one paradigm to perform efficient analyses. Big Data SMACK explains each of the full-stack technologies and, more importantly, how to best integrate them. It provides detailed coverage of the practical benefits of these technologies and incorporates real-world examples in every situation. This book focuses on the problems and scenarios solved by the architecture, as well as the solutions provided by every technology. It covers the six main concepts of big data architecture and how integrate, replace, and reinforce every layer: What You'll Learn The language: Scala The engine: Spark (SQL, MLib, Streaming, GraphX) The container: Mesos, Docker The view: Akka The storage: Cassandra The message broker: Kafka What You Will Learn: Make big data architecture without using complex Greek letter architectures Build a cheap but effective cluster infrastructure Make queries, reports, and graphs that business demands Manage and exploit unstructured and No-SQL data sources Use tools to monitor the performance of your architecture Integrate all technologies and decide which ones replace and which ones reinforce Who This Book Is For Developers, data architects, and data scientists looking to integrate the most successful big data open stack architecture and to choose the correct technology in every layer

Big Data Analytics

Dive into the world of big data with "Big Data Analytics: Real Time Analytics Using Apache Spark and Hadoop." This comprehensive guide introduces readers to the fundamentals and practical applications of Apache Spark and Hadoop, covering essential topics like Spark SQL, DataFrames, structured streaming, and more. Learn how to harness the power of real-time analytics and big data tools effectively. What this Book will help me do Master the key components of Apache Spark and Hadoop ecosystems, including Spark SQL and MapReduce. Gain an understanding of DataFrames, DataSets, and structured streaming for seamless data handling. Develop skills in real-time analytics using Spark Streaming and technologies like Kafka and HBase. Learn to implement machine learning models using Spark's MLlib and ML Pipelines. Explore graph analytics with GraphX and leverage data visualization tools like Jupyter and Zeppelin. Author(s) Venkat Ankam, an expert in big data technologies, has years of experience working with Apache Hadoop and Spark. As an educator and technical consultant, Venkat has enabled numerous professionals to gain critical insights into big data ecosystems. With a pragmatic approach, his writings aim to guide readers through complex systems in a structured and easy-to-follow manner. Who is it for? This book is perfect for data analysts, data scientists, software architects, and programmers aiming to expand their knowledge of big data analytics. Readers should ideally have a basic programming background in languages like Python, Scala, R, or SQL. Prior hands-on experience with big data environments is not necessary but is an added advantage. This guide is created to cater to a range of skill levels, from beginners to intermediate learners.

Sams Teach Yourself Apache Spark™ in 24 Hours

Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark’s amazing speed, scalability, simplicity, and versatility. This book’s straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark–now, and for years to come. You’ll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success. Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data. Learn how to • Discover what Apache Spark does and how it fits into the Big Data landscape • Deploy and run Spark locally or in the cloud • Interact with Spark from the shell • Make the most of the Spark Cluster Architecture • Develop Spark applications with Scala and functional Python • Program with the Spark API, including transformations and actions • Apply practical data engineering/analysis approaches designed for Spark • Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output • Optimize Spark solution performance • Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra) • Leverage cutting-edge functional programming techniques • Extend Spark with streaming, R, and Sparkling Water • Start building Spark-based machine learning and graph-processing applications • Explore advanced messaging technologies, including Kafka • Preview and prepare for Spark’s next generation of innovations Instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.

Interactive Spark using PySpark

Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark. Why is it important? PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc. What you'll learn—and how you can apply it Compare the different components provided by Spark, and what use cases they fit. Learn how to use RDDs (resilient distributed datasets) with PySpark. Write Spark applications in Python and submit them to the cluster as Spark jobs. Get an introduction to the Spark computing framework. Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year. This lesson is for you because… You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first Prerequisites Familiarity with writing Python applications Some familiarity with bash command-line operations Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc. Materials or downloads needed in advance Apache Spark This lesson is taken from by Jenny Kim and Benjamin Bengfort. Data Analytics with Hadoop

Spark GraphX in Action

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data. About the Technology GraphX is a powerful graph processing API for the Apache Spark analytics engine that lets you draw insights from large datasets. GraphX gives you unprecedented speed and capacity for running massively parallel and machine learning algorithms. About the Book Spark GraphX in Action begins with the big picture of what graphs can be used for. This example-based tutorial teaches you how to use GraphX interactively. You'll start with a crystal-clear introduction to building big data graphs from regular data, and then explore the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data. What's Inside Understanding graph technology Using the GraphX API Developing algorithms for big graphs Machine learning with graphs Graph visualization About the Reader Readers should be comfortable writing code. Experience with Apache Spark and Scala is not required. About the Authors Michael Malak has worked on Spark applications for Fortune 500 companies since early 2013. Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay. Quotes Learn complex graph processing from two experienced authors…A comprehensive guide. - Gaurav Bhardwaj, 3Pillar Global The best resource to go from GraphX novice to expert in the least amount of time. - Justin Fister, PaperRater A must-read for anyone serious about large-scale graph data mining! - Antonio Magnaghi, OpenMail Reveals the awesome and elegant capabilities of working with linked data for large-scale datasets. - Sumit Pal, Independent consultant

Spark

Production-targeted Spark guidance with real-world use cases Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well-known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, ML Lib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, db connectors, streaming, security, and much more. Spark has become the tool of choice for many Big Data problems, with more active contributors than any other Apache Software project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings. Review Spark hardware requirements and estimate cluster size Gain insight from real-world production use cases Tighten security, schedule resources, and fine-tune performance Overcome common problems encountered using Spark in production Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.

Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

This book is a step-by-step guide for learning how to use Spark for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, MLlib, and Spark ML. Big Data Analytics with Spark shows you how to use Spark and leverage its easy-to-use features to increase your productivity. You learn to perform fast data analysis using its in-memory caching and advanced execution engine, employ in-memory computing capabilities for building high-performance machine learning and low-latency interactive analytics applications, and much more. Moreover, the book shows you how to use Spark as a single integrated platform for a variety of data processing tasks, including ETL pipelines, BI, live data stream processing, graph analytics, and machine learning. The book also includes a chapter on Scala, the hottest functional programming language, and the language that underlies Spark. You’ll learn the basics of functional programming in Scala, so that you can write Spark applications in it. What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, such as HDFS, Avro, Parquet, Kafka, Cassandra, HBase, Mesos, and so on. It also provides an introduction to machine learning and graph concepts. So the book is self-sufficient; all the technologies that you need to know to use Spark are covered. The only thing that you are expected to have is some programming knowledge in any language.

Apache Spark Graph Processing

Dive into the world of large-scale graph data processing with Apache Spark's GraphX API. This book introduces you to the core concepts of graph analytics and teaches you how to leverage Spark for handling and analyzing massive graphs. From building to analyzing, you'll acquire a comprehensive skillset to work with graph data efficiently. What this Book will help me do Learn to utilize Apache Spark GraphX API to process and analyze graph data. Master transforming raw datasets into sophisticated graph structures. Explore visualization and analysis techniques for understanding graphs. Understand and build custom graph operations tailored to your needs. Implement advanced graph algorithms like clustering and iterative processing. Author(s) Rindra Ramamonjison is a seasoned data engineer with vast experience in big data technologies and graph processing. With a passion for explaining complex concepts in simple terms, Rindra builds on his professional expertise to guide readers in mastering cutting-edge Spark tools. Who is it for? This book is tailored for data scientists and software developers looking to delve into graph data processing at scale. Ideal for those with basic knowledge of Scala and Apache Spark, it equips readers with the tools and techniques to derive insights from complex network datasets. Whether you're diving deeper into big data or exploring graph-specific analytics, this book is your guide.

Neo4j Cookbook

Dive into Neo4j and uncover how to harness its powerful capabilities in graph data analysis with the Neo4j Cookbook. Across 75 well-structured recipes, you'll learn to apply practical techniques in modeling, querying, and visualizing graph databases, enabling you to address real-world challenges efficiently. What this Book will help me do Access Neo4j from popular programming languages such as Java, Python, and Scala, enabling easier integration into your projects. Migrate data seamlessly from various data stores, including SQL and NoSQL, into Neo4j, maintaining data consistency. Use best practices for data modeling with Neo4j to optimize performance and scalability for your applications. Analyze social data from sources like Facebook and Twitter, revealing valuable insights from connections and relationships. Integrate geospatial data to enable location-based queries and nearest-point searches, opening up advanced application features. Author(s) Ankur Goel, the author of Neo4j Cookbook, is an experienced technologist with an extensive background in handling database solutions and applications. Passionate about simplifying complex systems, Ankur excels in teaching essential database concepts through clear and actionable recipes. His writing is rooted in practical insights, reflecting his hands-on experience in the industry. Who is it for? This book is ideal for developers and data engineers who currently use or plan to integrate Neo4j into their workflows. If you are migrating from a traditional database system or delving into graph databases for the first time, this book offers structured guidance. Readers should have a fundamental understanding of programming and familiarity with database concepts for the best experience. It caters to individuals aiming to build or enhance data-driven applications using Neo4j's robust graph modeling.

Advanced Analytics with Spark

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Learning Spark

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Camel in Action

NEWER EDITION AVAILABLE Camel in Action, Second Edition is now available. An eBook of this older edition is included at no additional cost when you buy the revised edition! A limited number of pBook copies of this edition are still available. Please contact Manning Support to inquire about purchasing previous edition copies. Camel in Action is a Camel tutorial full of small examples showing how to work with the integration patterns. It starts with core concepts like sending, receiving, routing, and transforming data. It then shows you the entire lifecycle and goes in depth on how to test, deal with errors, scale, deploy, and even monitor your app—details you can find only in the Camel code itself. Written by the developers of Camel, this book distills their experience and practical insights so that you can tackle integration tasks like a pro. About the Technology Apache Camel is a Java framework that lets you implement the standard enterprise integration patterns in a few lines of code. With a concise but sophisticated DSL you snap integration logic into your app, Lego-style, using Java, XML, or Scala. Camel supports over 80 common transports such as HTTP, REST, JMS, and Web Services. About the Book What's Inside Valuable examples in Java and XML Explanations of complex patterns Error handling, testing, deploying, managing, and running Camel Accessible to beginners, useful to experts About the Reader About the Authors Claus Ibsen is a principal engineer working for FuseSource specializing in the enterprise integration space. He has worked on Apache Camel for the last three years where he is a a PMC member, a key contributor, and heads the development and roadmap. Claus lives in Sweden near Malmo with his wife and dog. Jonathan Anstey is a software engineer with varied experience in manufacturing control systems, build infrastructure, and enterprise integration. Lately, Jon has been working on Apache Camel as a PMC member and an active committer while at FuseSource. When he is not hacking on Camel he likes to spend time with his wife and daughter in St. John's, Newfoundland. Quotes I highly recommend this book. It kicks ass! - James Strachan, Cofounder of Apache Camel Strikes the right balance between core concepts and running code. - Gregor Hohpe, Coauthor of Enterprise Integration Patterns Comprehensive guide to enterprise integration with Camel. - Gordon Dickens, Chariot Solutions A deep book... with great examples. - Jeroen Benckhuijsen, Atos Origin Great content from the source developers. - Domingo Suarez Torres, SynergyJ A must-have. - Tijs Rademakers, Atos Origin