
Topic: Data Streaming

Tags: realtime, event_processing, data_flow

114 tagged activities

Activity Trend: peak of 70 activities per quarter, 2020-Q1 through 2026-Q1

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books
Hadoop: The Definitive Guide, 4th Edition

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare systems and genomics data processing.

- Learn fundamental components such as MapReduce, HDFS, and YARN
- Explore MapReduce in depth, including steps for developing applications with it
- Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
- Learn two data formats: Avro for data serialization and Parquet for nested data
- Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
- Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
- Learn the HBase distributed database and the ZooKeeper distributed configuration service
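The cluster workflow the book describes can be sketched at the command line. The paths and data below are illustrative, assuming a Hadoop 2 installation with the bundled examples jar:

```
# Copy local data into HDFS (paths are hypothetical)
hdfs dfs -mkdir -p /user/me/input
hdfs dfs -put access.log /user/me/input/

# Run the bundled word-count MapReduce job on YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/me/input /user/me/output

# Inspect the result partitions
hdfs dfs -cat /user/me/output/part-r-00000
```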

Apache Flume: Distributed Log Collection for Hadoop - Second Edition

"Apache Flume: Distributed Log Collection for Hadoop - Second Edition" is your hands-on guide to learning how to use Apache Flume to reliably collect and move logs and data streams into your Hadoop ecosystem. Through practical examples and real-world scenarios, this book will help you master the setup, configuration, and optimization of Flume for various data ingestion use cases. What this Book will help me do Understand the key concepts and architecture behind Apache Flume to build reliable and scalable data ingestion systems. Set up Flume agents to collect and transfer data into the Hadoop File System (HDFS) or other storage solutions effectively. Learn stream data processing techniques, such as filtering, transforming, and enriching data during transit to improve data usability. Integrate Flume with other tools like Elasticsearch and Solr to enhance analytics and search capabilities. Implement monitoring and troubleshooting workflows to maintain healthy and optimized Flume data pipelines. Author(s) Steven Hoffman, a seasoned software developer and data engineer, brings years of practical experience working with big data technologies to this book. He has a strong background in distributed systems and big data solutions, having implemented enterprise-scale analytics projects. Through clear and approachable writing, he aims to empower readers to successfully deploy reliable data pipelines using Apache Flume. Who is it for? This book is written for Hadoop developers, data engineers, and IT professionals who seek to build robust pipelines for streaming data into Hadoop environments. It is ideal for readers who have a basic understanding of Hadoop and HDFS but are new to Apache Flume. If you are looking to enhance your analytics capabilities by efficiently ingesting, routing, and processing streaming data, this book is for you. Beginners as well as experienced engineers looking to dive deeper into Flume will find it insightful.

Learning Spark

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.
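The simple-API claim is easy to illustrate: a minimal PySpark word count from that era fits in a dozen lines (the file path and app name are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/me/input.txt")   # load lines
            .flatMap(lambda line: line.split())      # split into words
            .map(lambda word: (word, 1))             # pair each word with 1
            .reduceByKey(lambda a, b: a + b))        # sum counts per word

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```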

Using Flume

How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you’ll learn Flume’s rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elasticsearch, and other systems. Using Flume shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases. You’ll learn about Flume’s design and implementation, as well as various features that make it highly scalable, flexible, and reliable. Code examples and exercises are available on GitHub.

- Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers
- Dive into key Flume components, including sources that accept data and sinks that write and deliver it
- Write custom plugins to customize the way Flume receives, modifies, formats, and writes data
- Explore APIs for sending data to Flume agents from your own applications
- Plan and deploy Flume in a scalable and flexible way, and monitor your cluster once it’s running

Google BigQuery Analytics

How to effectively use BigQuery, avoid common mistakes, and execute sophisticated queries against large datasets. Google BigQuery Analytics is the perfect guide for business and data analysts who want the latest tips on running complex queries and writing code to communicate with the BigQuery API. The book uses real-world examples to demonstrate current best practices and techniques, and also explains and demonstrates streaming ingestion, transformation via Hadoop in Google Compute Engine, App Engine Datastore integration, and using GViz with Tableau to generate charts of query results. In addition to the mechanics of BigQuery, the book also covers the architecture of the underlying Dremel query engine, providing a thorough understanding that leads to better query results.

- Features a companion website that includes all code and data sets from the book
- Uses real-world examples to explain everything analysts need to know to effectively use BigQuery
- Includes web application examples coded in Python
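The book predates today's google-cloud-bigquery Python client, but the query workflow it teaches looks like this in the modern library; a sketch against a public sample dataset, assuming credentials are configured in the environment:

```python
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials come from the environment

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

# query() starts the job; result() waits for completion and streams rows
for row in client.query(query).result():
    print(row.word, row.total)
```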

IBM InfoSphere Streams: Accelerating Deployments with Analytic Accelerators

This IBM® Redbooks® publication describes visual development, visualization, adapters, analytics, and accelerators for IBM InfoSphere® Streams (V3), a key component of the IBM Big Data platform. Streams was designed to analyze data in motion, and can perform analysis on incredibly high volumes with high velocity, using a wide variety of analytic functions and data types. The Visual Development environment extends Streams Studio with drag-and-drop development, provides round-tripping with existing text editors, and is ideal for rapid prototyping. Adapters facilitate getting data in and out of Streams, and V3 supports WebSphere MQ, the Apache Hadoop Distributed File System, and IBM InfoSphere DataStage. Significant analytics include the native Streams Processing Language, SPSS Modeler analytics, Complex Event Processing, the TimeSeries Toolkit for machine learning and predictive analytics, the Geospatial Toolkit for location-based applications, and Annotation Query Language for natural language processing applications. The Accelerators for Social Media Analysis and Telecommunications Event Data Analysis include sample programs that can be modified to build production-level applications. Want to learn how to analyze high volumes of streaming data or implement systems requiring high performance across nodes in a cluster? Then this book is for you. Please note that the additional material referenced in the text is not available from IBM.

Joe Celko’s Complete Guide to NoSQL

Joe Celko's Complete Guide to NoSQL provides a complete overview of non-relational technologies so that you can become more nimble to meet the needs of your organization. As data continues to explode and grow more complex, SQL is becoming less useful for querying data and extracting meaning. In this new world of bigger and faster data, you will need to leverage non-relational technologies to get the most out of the information you have. Learn where, when, and why the benefits of NoSQL outweigh those of SQL with Joe Celko's Complete Guide to NoSQL. This book covers the three areas that make today's new data different from the data of the past: velocity, volume, and variety. When information is changing faster than you can collect and query it, it simply cannot be treated the same as static data. Celko will help you understand velocity, to equip you with the tools to drink from a fire hose. Old storage and access models do not work for big data. Celko will help you understand volume, as well as different ways to store and access data at petabyte and exabyte scale. Not all data can fit into a relational model, including genetic data, semantic data, and data generated by social networks. Celko will help you understand variety, as well as the alternative storage, query, and management frameworks needed by certain kinds of data.

- Gain a complete understanding of the situations in which SQL has more drawbacks than benefits so that you can better determine when to utilize NoSQL technologies for maximum benefit
- Recognize the pros and cons of columnar, streaming, and graph databases
- Make the transition to NoSQL with the expert guidance of best-selling SQL expert Joe Celko

Instant PostgreSQL Backup and Restore How-to

Are you tasked with managing and protecting your PostgreSQL databases? "Instant PostgreSQL Backup and Restore How-to" provides practical, step-by-step guidance for backing up and restoring both simple and complex PostgreSQL databases safely and efficiently. You'll learn essential skills to ensure your critical data is always secure and available.

What this book will help me do:
- Master the process of backing up and restoring PostgreSQL databases effectively.
- Learn to target specific data for backup with partial dumps for higher flexibility.
- Utilize advanced compression techniques to optimize backup time and storage.
- Implement streaming replication for up-to-date standby servers.
- Apply file system snapshot techniques to ensure consistent online binary backups.

Author(s): The authors of this book are experienced database administrators and PostgreSQL experts. They bring years of hands-on expertise in safeguarding and managing enterprise-level databases. Known for their engaging teaching style, they focus on delivering clear instructions and actionable insights to enable all database professionals to succeed with PostgreSQL.

Who is it for? This book is designed for database administrators and IT professionals responsible for the durability, reliability, and recovery of data housed in PostgreSQL systems. It is well-suited for professionals ranging from beginners looking to understand PostgreSQL backup basics to experienced admins seeking to refine advanced restoration techniques. Readers should possess a basic familiarity with database concepts but do not need prior experience with PostgreSQL backup procedures.
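For orientation, the core backup-and-restore cycle the book covers uses standard pg_dump/pg_restore options; the database and table names below are illustrative:

```
# Compressed, custom-format dump of one database
pg_dump -Fc mydb > mydb.dump

# Partial dump: only the orders table
pg_dump -Fc -t orders mydb > orders.dump

# Restore into another database, using 4 parallel jobs
pg_restore -d mydb_restore -j 4 mydb.dump
```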

Getting Started with Storm

Even as big data is turning the world upside down, the next phase of the revolution is already taking shape: real-time data analysis. This hands-on guide introduces you to Storm, a distributed, JVM-based system for processing streaming data. Through simple tutorials, sample Java code, and a complete real-world scenario, you’ll learn how to build fast, fault-tolerant solutions that process results as soon as the data arrives. Discover how easy it is to set up Storm clusters for solving various problems, including continuous data computation, distributed remote procedure calls, and data stream processing.

- Learn how to program Storm components: spouts for data input and bolts for data transformation
- Discover how data is exchanged between spouts and bolts in a Storm topology
- Make spouts fault-tolerant with several commonly used design strategies
- Explore bolts: their life cycle, strategies for design, and ways to implement them
- Scale your solution by defining each component’s level of parallelism
- Study a real-time web analytics system built with Node.js, a Redis server, and a Storm topology
- Write spouts and bolts in non-JVM languages such as Python, Ruby, and JavaScript
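As a taste of the non-JVM support mentioned above, here is a minimal sketch of a word-splitting bolt using the storm.py multilang adapter that ships with Storm; the topology would declare this script through a ShellBolt, and the tuple layout is assumed:

```python
import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        # each incoming tuple carries one sentence in its first field
        for word in tup.values[0].split(" "):
            storm.emit([word])  # emit one tuple per word downstream

SplitSentenceBolt().run()  # hand control to Storm's multilang protocol loop
```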

Programming Microsoft® SQL Server® 2012

Your essential guide to key programming features in Microsoft SQL Server 2012. Take your database programming skills to a new level and build customized applications using the developer tools introduced with SQL Server 2012. This hands-on reference shows you how to design, test, and deploy SQL Server databases through tutorials, practical examples, and code samples. If you’re an experienced SQL Server developer, this book is a must-read for learning how to design and build effective SQL Server 2012 applications.

Discover how to:
- Build and deploy databases using the SQL Server Data Tools IDE
- Query and manipulate complex data with powerful Transact-SQL enhancements
- Integrate non-relational features, including native file streaming and geospatial data types
- Consume data with Microsoft ADO.NET, LINQ, and Entity Framework
- Deliver data using Windows Communication Foundation (WCF) Data Services and WCF RIA Services
- Move your database to the cloud with Windows Azure SQL Database
- Develop Windows Phone cloud applications using SQL Data Sync
- Use SQL Server BI components, including xVelocity in-memory technologies
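Two of the Transact-SQL enhancements the book covers can be sketched briefly; the object names here are illustrative:

```sql
-- SEQUENCE objects, new in SQL Server 2012
CREATE SEQUENCE dbo.OrderNumbers START WITH 1000 INCREMENT BY 1;
SELECT NEXT VALUE FOR dbo.OrderNumbers AS NextOrderNumber;

-- Paging with OFFSET/FETCH, also new in 2012
SELECT OrderId, CustomerId
FROM dbo.Orders
ORDER BY OrderId
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;
```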

IBM InfoSphere Streams: Assembling Continuous Insight in the Information Revolution

In this IBM® Redbooks® publication, we discuss and describe the positioning, functions, capabilities, and advanced programming techniques for IBM InfoSphere™ Streams (V2), a new paradigm and key component of the IBM Big Data platform. Data has traditionally been stored in files or databases, and then analyzed by queries and applications. With stream computing, analysis is performed moment by moment as the data is in motion. In fact, the data might never be stored (perhaps only the analytic results). The ability to analyze data in motion is called real-time analytic processing (RTAP). IBM InfoSphere Streams takes a fundamentally different approach to Big Data analytics and differentiates itself with its distributed runtime platform, programming model, and tools for developing and debugging analytic applications that have a high volume and variety of data types. Using in-memory techniques and analyzing record by record enables high velocity. Volume, variety, and velocity are the key attributes of Big Data. The data streams that are consumable by IBM InfoSphere Streams can originate from sensors, cameras, news feeds, stock tickers, and a variety of other sources, including traditional databases. It provides an execution platform and services for applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams. This book is intended for professionals who require an understanding of how to process high volumes of streaming data or need information about how to implement systems to satisfy those requirements. See http://www.redbooks.ibm.com/abstracts/sg247865.html for the IBM InfoSphere Streams (V1) release.

Oracle 10g Developing Media Rich Applications

Oracle 10g Developing Media Rich Applications is focused squarely on database administrators and programmers building multimedia database applications. With the release of Oracle8 Database in 1997, Oracle became the first commercial database with integrated multimedia technology for application developers. Since that time, Oracle has enhanced and extended these features to include native support for image, audio, video, and streaming media storage; indexing, retrieval, and processing in the Oracle Database and Application Server; and development tools. Databases are no longer just words and numbers for accountants; they should also support a full range of media to satisfy customer needs, from race car engineering to manufacturing processes to security. The full range of audio, video, and media integration into databases is mission critical to these applications. This book details the most recent features in Oracle’s multimedia technology, including those of the Oracle10gR2 Database and the Oracle9i Application Server. The technology covered includes object-relational media storage and services within the database, middle-tier application development interfaces, wireless delivery mechanisms, and Java-based tools.

- Gives broad coverage to integration of multimedia features such as audio and instrumentation video (from race cars, to analyze performance), voice and picture recognition for security databases, and full multimedia for presentations
- Includes field-tested examples in enterprise environments
- Provides coverage in a thorough and clear fashion, developed in a London University professional course

Hadoop in Action

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming.

About the Technology
Big data can be difficult to handle using traditional databases. Apache Hadoop is a NoSQL applications framework that runs on distributed clusters. This lets it scale to huge datasets. If you need analytic information from your data, Hadoop's the way to go.

What's Inside
- Introduction to MapReduce
- Examples illustrating ideas in practice
- Hadoop's Streaming API
- Other related tools, like Pig and Hive

About the Reader
This book requires basic Java skills. Knowing basic statistical concepts can help with the more advanced examples.

About the Author
Chuck Lam is a Senior Engineer at RockYou! He has a PhD in pattern recognition from Stanford University.

Quotes
"A guide for beginners, a source of insight for advanced users." - Philipp K. Janert, Principal Value, LLC
"A nice mix of the what, why, and how of Hadoop." - Paul Stusiak, Falcon Technologies Corp.
"Demystifies Hadoop. A great resource!" - Rick Wagner, Acxiom Corp.
"Covers it all! Plus, gives you sweet extras no one else does." - John S. Griffin, Overstock.com
"An excellent introduction to Hadoop and MapReduce." - Kenneth DeLong, BabyCenter, LLC
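The Streaming API listed above is what lets non-Java programs act as mapper and reducer over stdin/stdout. A minimal word-count sketch in Python, run via the hadoop-streaming jar (file names are illustrative):

```python
# mapper.py: emit "word<TAB>1" for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: input arrives sorted by key, so counts can be summed per run
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```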

Programming Microsoft® SQL Server™ 2008

Extend your programming skills with a comprehensive study of the key features of SQL Server 2008. Delve into the new core capabilities, get practical guidance from expert developers, and put their code samples to work. This is a must-read for Microsoft .NET and SQL Server developers who work with data access, whether at the database, business logic, or presentation level.

Discover how to:
- Query complex data with powerful Transact-SQL enhancements
- Use new, non-relational features: hierarchical tables, native file streaming, and geospatial capabilities
- Exploit XML inside the database to design XML-aware applications
- Consume and deliver your data using Microsoft LINQ, Entity Framework, and data binding
- Implement database-level encryption and server auditing
- Build and maintain data warehouses
- Use Microsoft Excel to build front ends for OLAP cubes, and MDX to query them
- Integrate data mining into applications quickly and effectively

Code samples are available on the Web.
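The geospatial capability called out above shipped as the geography type in SQL Server 2008; a small sketch (the coordinates are illustrative):

```sql
-- Two points in WGS 84 (SRID 4326)
DECLARE @seattle  geography = geography::Point(47.6062, -122.3321, 4326);
DECLARE @portland geography = geography::Point(45.5152, -122.6784, 4326);

-- STDistance returns meters for geography instances
SELECT @seattle.STDistance(@portland) AS DistanceInMeters;
```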