
Topic: Java

Tags: programming_language, object_oriented, enterprise

392 tagged activities

Activity Trend (chart): peak of 25 activities per quarter, 2020-Q1 to 2026-Q1

Activities

392 activities · Newest first

Kafka in Action

Master the wicked-fast Apache Kafka streaming platform through hands-on examples and real-world projects. In Kafka in Action you will learn: Understanding Apache Kafka concepts Setting up and executing basic ETL tasks using Kafka Connect Using Kafka as part of a large data project team Performing administrative tasks Producing and consuming event streams Working with Kafka from Java applications Implementing Kafka as a message queue Kafka in Action is a fast-paced introduction to every aspect of working with Apache Kafka. Starting with an overview of Kafka's core concepts, you'll immediately learn how to set up and execute basic data movement tasks and how to produce and consume streams of events. Advancing quickly, you'll soon be ready to use Kafka in your day-to-day workflow, and start digging into even more advanced Kafka topics. About the Technology Think of Apache Kafka as a high-performance software bus that facilitates event streaming, logging, analytics, and other data pipeline tasks. With Kafka, you can easily build features like operational data monitoring and large-scale event processing into both large- and small-scale applications. About the Book Kafka in Action introduces the core features of Kafka, along with relevant examples of how to use it in real applications. In it, you'll explore the most common use cases such as logging and managing streaming data. When you're done, you'll be ready to handle both basic developer- and admin-based tasks in a Kafka-focused team. What's Inside Kafka as an event streaming platform Kafka producers and consumers from Java applications Kafka as part of a large data project About the Reader For intermediate Java developers or data engineers. No prior knowledge of Kafka required. About the Authors Dylan Scott is a software developer in the insurance industry. Viktor Gamov is a Kafka-focused developer advocate. At Confluent, Dave Klein helps developers, teams, and enterprises harness the power of event streaming with Apache Kafka. Quotes The authors have had many years of real-world experience using Kafka, and this book's on-the-ground feel really sets it apart. - From the foreword by Jun Rao, Confluent Cofounder A surprisingly accessible introduction to a very complex technology. Developers will want to keep a copy close by. - Conor Redmond, InComm Payments A comprehensive and practical guide to Kafka and the ecosystem. - Sumant Tambe, LinkedIn It quickly gave me insight into how Kafka works, and how to design and protect distributed message applications. - Gregor Rayman, Cloudfarms
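
As a concrete illustration of the "producing and consuming event streams from Java applications" theme, here is a minimal sketch using the standard Kafka Java client. The broker address, topic name, and consumer group are placeholder assumptions, not examples taken from the book.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaQuickstart {
    public static void main(String[] args) {
        // Producer: send one event to a hypothetical "orders" topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "{\"amount\": 42}"));
        }

        // Consumer: read events back from the same topic from the beginning.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "orders-reader");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```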

Building Big Data Pipelines with Apache Beam

Building Big Data Pipelines with Apache Beam is the essential guide for mastering data processing using Apache Beam. This book covers both the basics and advanced concepts, from implementing pipelines to extending functionalities with custom I/O connectors. By the end, you'll be equipped to build scalable and reusable big data solutions. What this Book will help me do Understand the core principles of Apache Beam and its architecture. Learn how to create efficient data processing pipelines for diverse scenarios. Master the use of stateful processing for real-time data handling. Gain skills in using Beam's portability features for various languages. Explore advanced functionalities like creating custom I/O connectors. Author(s) Jan Lukavský is a seasoned data engineer with extensive experience in big data technologies and Apache Beam. Having worked on innovative data solutions across industries, he brings hands-on insights and practical expertise to this book. His approach to teaching ensures readers can directly apply concepts to real-world scenarios. Who is it for? This book is designed for professionals involved in big data, such as data engineers, analysts, and scientists. It is particularly suited for those with an intermediate level of understanding of Java, aiming to expand their skill set to include advanced data pipeline construction. Whether you're stepping into Apache Beam for the first time or looking to deepen your expertise, this book offers valuable, actionable insights.
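
For a sense of what a Beam pipeline looks like from Java, here is a hedged word-count sketch using the Beam SDK's core transforms; the input and output paths are placeholders and the snippet is not taken from the book.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Input path is a placeholder; swap in any text source.
            .apply("ReadLines", TextIO.read().from("input.txt"))
            .apply("SplitWords", FlatMapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("\\W+"))))
            .apply("CountWords", Count.perElement())
            .apply("Format", MapElements
                .into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
            .apply("WriteCounts", TextIO.write().to("word-counts"));

        pipeline.run().waitUntilFinish();
    }
}
```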

Numerical Methods Using Java: For Data Science, Analysis, and Engineering

Implement numerical algorithms in Java using NM Dev, an object-oriented and high-performance programming library for mathematics. You'll see how it can help you easily create a solution for your complex engineering problem by quickly putting together classes. Numerical Methods Using Java covers a wide range of topics, including chapters on linear algebra, root finding, curve fitting, differentiation and integration, solving differential equations, random numbers and simulation, a whole suite of unconstrained and constrained optimization algorithms, statistics, regression and time series analysis. The mathematical concepts behind the algorithms are clearly explained, with plenty of code examples and illustrations to help even beginners get started. What You Will Learn Program in Java using a high-performance numerical library Learn the mathematics for a wide range of numerical computing algorithms Convert ideas and equations into code Put together algorithms and classes to build your own engineering solution Build solvers for industrial optimization problems Do data analysis using basic and advanced statistics Who This Book Is For Programmers, data scientists, and analysts with prior experience with programming in any language, especially Java.
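
The book builds on the NM Dev library; purely as a language-level illustration (deliberately not the NM Dev API, whose class names are not reproduced here), this is a plain-Java bisection root finder of the kind covered in the root-finding chapter.

```java
import java.util.function.DoubleUnaryOperator;

public class Bisection {
    /**
     * Finds a root of f on [lo, hi] by repeated interval halving.
     * Assumes f(lo) and f(hi) have opposite signs.
     */
    static double findRoot(DoubleUnaryOperator f, double lo, double hi, double tol) {
        if (f.applyAsDouble(lo) * f.applyAsDouble(hi) > 0) {
            throw new IllegalArgumentException("f(lo) and f(hi) must bracket a root");
        }
        while (hi - lo > tol) {
            double mid = 0.5 * (lo + hi);
            // Keep the half-interval whose endpoints still bracket the root.
            if (f.applyAsDouble(lo) * f.applyAsDouble(mid) <= 0) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        // Root of x^2 - 2 on [0, 2] is sqrt(2) ~= 1.41421356.
        double root = findRoot(x -> x * x - 2, 0, 2, 1e-10);
        System.out.println(root);
    }
}
```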

Apache Pulsar in Action

Deliver lightning-fast and reliable messaging for your distributed applications with the flexible and resilient Apache Pulsar platform. In Apache Pulsar in Action you will learn how to: Publish from Apache Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Perform interactive SQL queries against data stored in Apache Pulsar Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar. You'll learn to use this mature and battle-tested platform to deliver extreme levels of speed and durability to your messaging. Apache Pulsar committer David Kjerrumgaard teaches you to apply Pulsar's seamless scalability through hands-on case studies, including IoT analytics applications and a microservices app based on Pulsar functions. About the Technology Reliable server-to-server messaging is the heart of a distributed application. Apache Pulsar is a flexible real-time messaging platform built to run on Kubernetes and deliver the scalability and resilience required for cloud-based systems. Pulsar supports both streaming and message queuing, and unlike other solutions, it can communicate over multiple protocols including MQTT, AMQP, and Kafka's binary protocol. About the Book Apache Pulsar in Action teaches you to build scalable streaming messaging systems using Pulsar. You'll start with a rapid introduction to enterprise messaging and discover the unique benefits of Pulsar. Following crystal-clear explanations and engaging examples, you'll use the Pulsar Functions framework to develop a microservices-based application. Real-world case studies illustrate how to implement the most important messaging design patterns. What's Inside Publish from Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Create an event-driven food delivery application About the Reader Written for experienced Java developers. No prior knowledge of Pulsar required. About the Author David Kjerrumgaard is a committer on the Apache Pulsar project. He currently serves as a Developer Advocate for StreamNative, where he develops Pulsar best practices and solutions. Quotes Apache Pulsar in Action is able to seamlessly mix the theory and abstract concepts with the clarity of practical step-by-step examples. I'd recommend to anyone! - Matteo Merli, co-creator of Apache Pulsar Gives readers insights into how the 'magic' works… Definitely recommended. - Henry Saputra, Splunk A complete, practical, fun-filled book. - Satej Kumar Sahu, Honeywell A definitive guide that will help you scale your applications. - Alessandro Campeis, Vimar The best book to start working with Pulsar. - Emanuele Piccinelli, Empirix
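
To show the flavor of the Pulsar Java client API the book works with, here is a minimal subscribe-produce-consume sketch against a local standalone broker; the service URL, topic, and subscription name are illustrative assumptions.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class PulsarQuickstart {
    public static void main(String[] args) throws Exception {
        // Service URL, topic, and subscription are placeholders for a local standalone broker.
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build()) {

            // Subscribe first so the message published below is delivered to this subscription.
            Consumer<String> consumer = client.newConsumer(Schema.STRING)
                    .topic("sensor-readings")
                    .subscriptionName("readings-subscription")
                    .subscribe();

            Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("sensor-readings")
                    .create();
            producer.send("temperature=21.5");

            Message<String> msg = consumer.receive();
            System.out.println("Received: " + msg.getValue());
            consumer.acknowledge(msg);

            producer.close();
            consumer.close();
        }
    }
}
```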

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

Apply different enterprise integration and processing strategies available with Pulsar, Apache's multi-tenant, high-performance, cloud-native messaging and streaming platform. This book is a comprehensive guide that examines using Pulsar Java libraries to build distributed applications with message-driven architecture. You'll begin with an introduction to Apache Pulsar architecture. The first few chapters build a foundation of message-driven architecture. Next, you'll set up all the required Pulsar components. The book also covers working with the Apache Pulsar client library to build producers and consumers for the discussed patterns. You'll then explore the transformation, filter, resiliency, and tracing capabilities available with Pulsar. Moving forward, the book will discuss best practices when building message schemas and demonstrate integration patterns using microservices. Security is an important aspect of any application; the book will cover authentication and authorization in Apache Pulsar such as Transport Layer Security (TLS), OAuth 2.0, and JSON Web Token (JWT). The final chapters will cover Apache Pulsar deployment in Kubernetes. You'll build microservices and serverless components such as AWS Lambda integrated with Apache Pulsar on Kubernetes. After completing the book, you'll be able to comfortably work with the large set of out-of-the-box integration options offered by Apache Pulsar. What You'll Learn Examine the important Apache Pulsar components Build applications using Apache Pulsar client libraries Use Apache Pulsar effectively with microservices Deploy Apache Pulsar to the cloud Who This Book Is For Cloud architects and software developers who build systems with cloud-native technologies.
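
As a small sketch of the authentication material mentioned above, this client is configured for a TLS endpoint with JWT token authentication; the URL, token, and topic are placeholders, and a real deployment would load the token from configuration or a secret store.

```java
import org.apache.pulsar.client.api.AuthenticationFactory;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class AuthenticatedPulsarClient {
    public static void main(String[] args) throws Exception {
        // URL, token, and topic are illustrative; do not hard-code secrets in real code.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar+ssl://pulsar.example.com:6651")        // TLS-enabled endpoint
                .authentication(AuthenticationFactory.token("eyJhbGciOi...")) // JWT auth
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/payments")
                .create();
        producer.send("payment-received");

        producer.close();
        client.close();
    }
}
```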

Beginning Hibernate 6: Java Persistence from Beginner to Pro

Get started with Hibernate, an open source Java persistence layer, and gain a clear introduction to the current standard for object-relational persistence in Java. This updated edition includes the new Hibernate 6.0 framework, which covers new configuration, new object-relational mapping changes, and enhanced integration with Spring, Spring Boot, Quarkus, and other Java frameworks. The book keeps its focus on Hibernate without wasting time on nonessential third-party tools, so you'll be able to immediately start building transaction-based engines and applications. Experienced authors Joseph Ottinger with Dave Minter and Jeff Linwood provide more in-depth examples than any other book for Hibernate beginners. They present their material in a lively, example-based manner—not a dry, theoretical, hard-to-read fashion. What You'll Learn Build enterprise Java-based transaction-type applications that access complex data with Hibernate Work with Hibernate 6 using a present-day build process Integrate into the persistence life cycle Search and query with the new version of Hibernate Keep track of versioned data with Hibernate Envers Who This Book Is For Programmers experienced in Java with databases (the traditional, or connected, approach), but new to open-source, lightweight Hibernate.
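
A minimal Hibernate 6 sketch of the ideas above: one annotated entity persisted through a SessionFactory. It assumes a hibernate.properties (or hibernate.cfg.xml) with connection settings is on the classpath; the entity and field names are invented for illustration.

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

@Entity
public class Book {
    @Id
    @GeneratedValue
    Long id;
    String title;

    public static void main(String[] args) {
        // Connection settings are read from hibernate.properties/hibernate.cfg.xml.
        SessionFactory sessionFactory = new Configuration()
                .addAnnotatedClass(Book.class)
                .buildSessionFactory();

        try (Session session = sessionFactory.openSession()) {
            session.beginTransaction();
            Book book = new Book();
            book.title = "Beginning Hibernate 6";
            session.persist(book);   // Hibernate 6 favors persist() over the legacy save()
            session.getTransaction().commit();
        }
        sessionFactory.close();
    }
}
```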

Cloud Native Integration with Apache Camel: Building Agile and Scalable Integrations for Kubernetes Platforms

Address the most common integration challenges by understanding the ins and outs of the available choices, with practical examples of how to create cloud-native applications using Apache Camel. Camel will be our main tool, but we will also see some complementary tools and plugins that can make our development and testing easier, such as Quarkus, and tools for more specific use cases, such as Apache Kafka and Keycloak. You will learn to connect with databases, create REST APIs, transform data, connect with message-oriented middleware (MOM), secure your services, and test using Camel. You will also learn software architecture patterns for integration and how to leverage container platforms, such as Kubernetes. This book is suitable for those who are eager to learn an integration tool that fits the Kubernetes world, and who want to explore the integration challenges that can be solved using containers. What You Will Learn Focus on how to solve integration challenges Understand the basics of Quarkus, as it's the foundation for the application Acquire a comprehensive view of Apache Camel Deploy an application in Kubernetes Follow good practices Who This Book Is For Java developers looking to learn Apache Camel; Apache Camel developers looking to learn more about Kubernetes deployments; software architects looking to study integration patterns for Kubernetes-based systems; system administrators (operations teams) looking to get a better understanding of how technologies are integrated.
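
To give a feel for Camel's Java DSL, here is a tiny standalone route sketch (using a plain DefaultCamelContext rather than the Quarkus runtime the book favors); the directory and endpoint names are placeholders.

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FileToLogRoute {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Poll a hypothetical inbox directory, transform the body,
                // and hand the result to a log endpoint.
                from("file:data/inbox?noop=true")
                    .routeId("inbox-to-log")
                    .transform(body().prepend("processed: "))
                    .to("log:integration?level=INFO");
            }
        });
        context.start();
        Thread.sleep(10_000);   // let the route run briefly in this standalone sketch
        context.stop();
    }
}
```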

Or: how to keep our traditional Java application up to date with everything big data. At Adyen we process tens of millions of transactions a day, a number that rises every day. This means that generating reports, training machine learning models or any other operation that requires a bird's-eye view of weeks or months of data requires the use of Big Data technologies. We recently migrated to Airflow for scheduling all batch operations on our on-premise Big Data cluster. Some of these operations require input from our merchants or our support team. Merchants can, for instance, subscribe to reports, choose their preferred time zone, and even specify which columns they want included. Once generated, these reports then need to become available in our customer portal. So how do we keep track in our Customer Area of which reports have been generated in Airflow? How do we launch ad-hoc backfills when one of our merchants subscribes to a new report? How do we integrate all of this into our existing monitoring pipeline? This talk will focus on how we have successfully integrated our big data platform with our existing Java web applications and how Airflow (with some simple add-ons) played a crucial role in achieving this.

R2DBC Revealed: Reactive Relational Database Connectivity for Java and JVM Programmers

Understand the newest trend in database programming for developers working in Java, Kotlin, Clojure, and other JVM-based languages. This book introduces Reactive Relational Database Connectivity (R2DBC), a modern way of connecting to and querying relational databases from Java and other JVM languages. The book begins by helping you understand not only what reactive programming is, but why it is necessary. Then building on those fundamentals, the book takes you into the world of databases and the newly released Reactive Relational Database Connectivity (R2DBC) specification. Examples in the book are worked using the freely available MariaDB database along with MariaDB’s vendor-implementation of the R2DBC service-provider interface (SPI). Following along with the examples and the provided example code helps prepare you to work with any of the growing number of R2DBC implementations for popular enterprise databases such as Oracle Database and SQL Server. You’ll be well prepared for what is becoming the future of database access from Java and other languages built on the JVM. What You Will Learn Understand why R2DBC was created and how it utilizes the Reactive Streams API Understand the components of the R2DBC service-provider interface Create and manage reactive database connections and connection pools using an R2DBC client Programmatically execute queries on a relational database using an R2DBC client Effectively utilize transactions using an R2DBC client Build relational database-driven applications that are event-driven and non-blocking Who This Book Is For Software developers building solutions using JVM languages and the JVM ecosystem, and developers who need an introduction to the R2DBC specification and reactive programming with relational databases and want to understand what Reactive Relational Database Connectivity is and why it came about. This book includes practical examples of using the R2DBC specification with Java and MariaDB that will provide developers with the knowledge they need to create their own solutions.
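
A brief sketch of the reactive access pattern described above, using the R2DBC SPI together with Project Reactor; the connection URL, credentials, and table are placeholder assumptions for a local MariaDB instance, not code from the book.

```java
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;
import reactor.core.publisher.Flux;

public class R2dbcQuickstart {
    public static void main(String[] args) {
        // The URL, credentials, and table are placeholders for a local MariaDB instance.
        ConnectionFactory connectionFactory =
                ConnectionFactories.get("r2dbc:mariadb://app_user:secret@localhost:3306/todo");

        Flux.usingWhen(
                connectionFactory.create(),                        // open a connection reactively
                connection -> Flux
                        .from(connection.createStatement("SELECT description FROM tasks").execute())
                        .flatMap(result -> result.map((row, metadata) ->
                                row.get("description", String.class))),
                Connection::close)                                 // non-blocking cleanup
            .doOnNext(description -> System.out.println("task: " + description))
            .blockLast();                                          // block only at the edge of this demo
    }
}
```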

Summary Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy-to-use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project's design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects, then this is definitely a conversation worth listening to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Airbyte is and the story behind it? Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space? How would you characterize your target users?

How have those personas informed the priorities and design of Airbyte? What do you see as the benefits and tradeoffs of a UI-oriented data integration platform as compared to a code-first approach?

What are the complex/challenging elements of data integration that make it such a slippery problem?
Motivation for creating open source ELT as a business
Can you describe how the Airbyte platform is implemented?

What was your motivation for choosing Java as the primary language?

Incidental complexity of forcing all connectors to be packaged as containers
Shortcomings of the Singer specification/motivation for creating a backwards-incompatible interface
Perceived potential for community adoption of the Airbyte specification
Tradeoffs of using JSON as the interchange format vs. e.g. protobuf/gRPC/Avro/etc.

information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)

Interfaces/extension points for integrating with other tools, e.g. Dagster
Abstraction layers for simplifying implementation of new connectors
Tradeoffs of storing all connectors in a monorepo with the Airbyte core

impact of community adoption/contributions

What is involved in setting up an Airbyte installation?
What are the available axes for scaling an Airbyte deployment?
Challenges of setting up and maintaining a CI environment for Airbyte
How are you managing governance and long-term sustainability of the project?
What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?
When is Airbyte the wrong choice?
What do you have planned for the future of the project?

Contact Info

Michel

LinkedIn @MichelTricot on Twitter michel-tricot on GitHub

John

LinkedIn @JeanLafleur on Twitter johnlafleur on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Airbyte Liveramp Fivetran

Podcast Episode

Stitch Data Matillion DataCoral

Podcast Episode

Singer Meltano

Podcast Episode

Airflow

Podcast.init Episode

Kotlin Docker Monorepo Airbyte Specification Great Expectations

Podcast Episode

Dagster

Data Engineering Podcast Episode Podcast.init Episode

Prefect

Podcast Episode

DBT

Podcast Episode

Kubernetes Snowflake

Podcast Episode

Redshift Presto Spark Parquet

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Mastering Kafka Streams and ksqlDB

Working with unbounded and fast-moving data streams has historically been difficult. But with Kafka Streams and ksqlDB, building stream processing applications is easy and fun. This practical guide shows data engineers how to use these tools to build highly scalable stream processing applications for moving, enriching, and transforming large amounts of data in real time. Mitch Seymour, data services engineer at Mailchimp, explains important stream processing concepts against a backdrop of several interesting business problems. You'll learn the strengths of both Kafka Streams and ksqlDB to help you choose the best tool for each unique stream processing project. Non-Java developers will find the ksqlDB path to be an especially gentle introduction to stream processing. Learn the basics of Kafka and the pub/sub communication pattern Build stateless and stateful stream processing applications using Kafka Streams and ksqlDB Perform advanced stateful operations, including windowed joins and aggregations Understand how stateful processing works under the hood Learn about ksqlDB's data integration features, powered by Kafka Connect Work with different types of collections in ksqlDB and perform push and pull queries Deploy your Kafka Streams and ksqlDB applications to production
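
For orientation, here is a hedged Kafka Streams sketch in Java: a stateless filter-and-transform topology of the kind a first stream processing application might use. The topic names, application id, and broker address are illustrative only, not examples from the book.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class TweetFilterApp {
    public static void main(String[] args) {
        // Topology: read from an input topic, keep matching records, transform, write out.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> tweets =
                builder.stream("tweets", Consumed.with(Serdes.String(), Serdes.String()));

        tweets
            .filter((key, value) -> value != null && value.contains("kafka"))
            .mapValues(value -> value.toUpperCase())
            .to("kafka-tweets");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tweet-filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the topology cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```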

MATLAB Recipes: A Problem-Solution Approach

Learn from state-of-the-art examples in robotics, motors, detection filters, chemical processes, aircraft, and spacecraft. With this book you will review contemporary MATLAB coding including the latest MATLAB language features and use MATLAB as a software development environment including code organization, GUI development, and algorithm design and testing. Features now covered include the new graph and digraph classes for charts and networks; interactive documents that combine text, code, and output; a new development environment for building apps; locally defined functions in scripts; automatic expansion of dimensions; tall arrays for big data; the new string type; new functions to encode/decode JSON; handling non-English languages; the new class architecture; the Mocking framework; an engine API for Java; the cloud-based MATLAB desktop; the memoize function; and heatmap charts. MATLAB Recipes: A Problem-Solution Approach, Second Edition provides practical, hands-on code snippets and guidance for using MATLAB to build a body of code you can turn to time and again for solving technical problems in your work. Develop algorithms, test them, visualize the results, and pass the code along to others to create a functional code base for your firm. What You Will Learn Get up to date with the latest MATLAB up to and including MATLAB 2020b Code in MATLAB Write applications in MATLAB Build your own toolbox of MATLAB code to increase your efficiency and effectiveness Who This Book Is For Engineers, data scientists, and students wanting a book rich in examples using MATLAB.

In this episode, Bryce and Conor talk about each of their favorite data structures. Date Recorded: 2020-11-28 Date Released: 2020-12-04 C++ | Containers OCaml | Containers Java | Collections Python | Collections Kotlin | Collections Scala | Collections Rust | Collections Go | Collections Haskell | Collections TS | Collections Ruby | Collections JS | Collections F# | Collection Types Racket | Data Structures Clojure | Data Structures What do you mean by “cache friendly”? - Björn Fahller - code::dive 2019 Alan J. Perlis’ Epigrams on Programming std::vector P1072 basic_string::resize_default_init std::array std::unique_ptr (Array Specialization) P0316 allocate_unique and allocator_delete thrust::allocate_unique Intro Song Info Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application

Gain a thorough knowledge of Lucene's capabilities and use it to develop your own search applications. This book explores the Java-based, high-performance text search engine library used to build search capabilities in your applications. Starting with the basics of Lucene and searching, you will learn about the types of queries used in it and also take a look at scoring models. Applying this basic knowledge, you will develop a hello world app using basic Lucene queries and explore functions like scoring and document level boosting. Along the way you will also uncover the concepts of partial searching and matching in Lucene and then learn how to integrate geographical information (geospatial data) in Lucene using spatial queries and n-dimensional indexing. This will prepare you to build a location-aware search engine with a representative data set that allows location constraints to be specified during a search. You'll also develop a text classifier using Lucene and Apache Mahout, a popular machine learning framework. After a detailed review of performance benchmarking and common issues associated with it, you'll learn some of the best practices of tuning the performance of your application. By the end of the book you'll be able to build your first Lucene patch, where you will not only write your patch, but also test it and ensure it adheres to community coding standards. What You'll Learn Master the basics of Apache Lucene Utilize different query types in Apache Lucene Explore scoring and document level boosting Integrate geospatial data into your application Who This Book Is For Developers wanting to learn the finer details of Apache Lucene by developing a series of projects with it.
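
A compact sketch of the index-then-search workflow described above, using Lucene's Java API with an in-memory directory; the field name and sample documents are invented for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();   // in-memory index for the sketch

        // Index a couple of documents.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            for (String text : new String[]{"Lucene in action", "Search engines with Java"}) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Query the index.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("lucene");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```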

Learning Spark, 2nd Edition

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to: Learn Python, SQL, Scala, or Java high-level Structured APIs Understand Spark operations and SQL Engine Inspect, tune, and debug Spark operations with Spark configurations and Spark UI Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka Perform analytics on batch and streaming data using Structured Streaming Build reliable data pipelines with open source Delta Lake and Spark Develop machine learning pipelines with MLlib and productionize models using MLflow
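
As a taste of the high-level Structured APIs from Java, here is a small local-mode sketch that reads a CSV file and runs a simple aggregation; the file path and column names are placeholder assumptions.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkQuickstart {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LearningSparkSketch")
                .master("local[*]")            // run locally for the sketch
                .getOrCreate();

        // File path and column names are placeholders.
        Dataset<Row> flights = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data/flights.csv");

        flights.filter(col("distance").gt(1000))
               .groupBy(col("origin"))
               .count()
               .orderBy(col("count").desc())
               .show(10);

        spark.stop();
    }
}
```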

Financial Times is increasing its digital revenue by allowing business people to make data-driven decisions. Providing an Airflow-based platform where data engineers, data scientists, BI experts and others can run language-agnostic jobs was a huge swing. One of the most successful steps in the platform's development was building our own execution environment, allowing stakeholders to self-deploy jobs without cross-team dependencies on top of the unlimited scale of Kubernetes. In this talk we share how we have integrated and extended Airflow at Financial Times. The main topics we will cover include: Providing team-level security isolation Removing cross-team dependencies Creating an execution environment for independently creating and deploying R, Python, Java, Spark, etc. jobs Reducing latency when sharing data between task instances Integrating all these features on top of Kubernetes

Spark in Action, Second Edition

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop. About the Technology Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem. About the Book Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms. What's Inside Writing Spark applications in Java Spark application architecture Ingestion through files, databases, streaming, and Elasticsearch Querying distributed datasets with Spark SQL About the Reader This book does not assume previous experience with Spark, Scala, or Hadoop. About the Author Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years. Quotes This book reveals the tools and secrets you need to drive innovation in your company or community. - Rob Thomas, IBM An indispensable, well-paced, and in-depth guide. A must-have for anyone into big data and real-time stream processing. - Anupam Sengupta, GuardHat Inc. This book will help spark a love affair with distributed processing. - Conor Redmond, InComm Product Control Currently the best book on the subject! - Markus Breuer, Materna IPS
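
To illustrate the "ingest, then query with Spark SQL" flow from Java, here is a hedged sketch that reads a JSON file, registers a temporary view, and queries it with plain SQL; the path, view name, and columns are invented for the example.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlIngestion {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkInActionSketch")
                .master("local[*]")
                .getOrCreate();

        // Ingest a JSON file (path is a placeholder) and expose it as a SQL view.
        Dataset<Row> readings = spark.read().json("data/satellite-readings.json");
        readings.createOrReplaceTempView("readings");

        // Query the distributed dataset with Spark SQL.
        Dataset<Row> hottest = spark.sql(
                "SELECT site, MAX(temperature) AS max_temp "
                + "FROM readings GROUP BY site ORDER BY max_temp DESC");
        hottest.show(5);

        spark.stop();
    }
}
```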

Cassandra: The Definitive Guide, 3rd Edition

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you’ll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This third edition—updated for Cassandra 4.0—provides the technical details and practical examples you need to put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra’s nonrelational design, with special attention to data modeling. If you’re a developer, DBA, or application architect looking to solve a database scaling issue or future-proof your application, this guide helps you harness Cassandra’s speed and flexibility. Understand Cassandra’s distributed and decentralized structure Use the Cassandra Query Language (CQL) and cqlsh—the CQL shell Create a working data model and compare it with an equivalent relational model Develop sample applications using client drivers for languages including Java, Python, and Node.js Explore cluster topology and learn how nodes exchange data
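
For a sense of the Java client driver mentioned above, here is a minimal sketch using the DataStax Java driver against a local node; the contact point, datacenter name, keyspace, and table are placeholder assumptions.

```java
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraQuickstart {
    public static void main(String[] args) {
        // Contact point, datacenter name, keyspace, and table are placeholders for a local node.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS hotel "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS hotel.guests "
                    + "(id uuid PRIMARY KEY, name text)");
            session.execute("INSERT INTO hotel.guests (id, name) VALUES (uuid(), 'Ada')");

            ResultSet rs = session.execute("SELECT name FROM hotel.guests");
            for (Row row : rs) {
                System.out.println(row.getString("name"));
            }
        }
    }
}
```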

Summary DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog

Interview

Introduction How did you get involved in the area of data management? For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with? What are the main components of your platform for managing that information? How are the data teams at DataDog organized and what are your primary responsibilities in the organization? What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?

What are some of the strategies which have proven to be most useful in overcoming those challenges?

Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met? Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information? Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered? What are some of the upcoming projects that you have planned for the upcoming months and years? What are some of the technologies, patterns, or practices that you are hoping to adopt?

Contact Info

LinkedIn @databuryat on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

DataDog Hadoop Hive Yarn Chef SRE == Site Reliability Engineer Application Performance Management (APM) Apache Kafka RocksDB Cassandra Apache Parquet data serialization format SLA == Service Level Agreement WatchDog Apache Spark

Podcast Episode

Apache Pig Databricks JVM == Java Virtual Machine Kubernetes SSIS (SQL Server Integration Services) Pentaho JasperSoft Apache Airflow

Podcast.init Episode

Apache NiFi

Podcast Episode

Luigi Dagster

Podcast Episode

Prefect

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Materialize is and the problems that you are aiming to solve with it?

What was your motivation for creating it?

What use cases does Materialize enable?

What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize? How does it fit into the broader ecosystem of data tools and platforms?

What are some of the use cases that Materialize is uniquely able to support? How is Materialize architected and how has the design evolved since you first began working on it? Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?

What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?

In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?

A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or