Java

Solving Data Lineage Tracking And Data Discovery At WeWork

2019-12-16 · Data Engineering Podcast Listen

podcast_episode

by Willy Lulciuc (WeWork) , Julien Le Dem (Astronomer) , Tobias Macey

AI/ML Airflow Analytics Big Data Dagster Data Engineering Data Management Data Modelling Data Quality Google Dataform dbt Delta +18 more

Summary

Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web-based transformation tool with built-in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built-in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities, it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Marquez is?

What was missing in existing metadata management platforms that necessitated the creation of Marquez?

How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?

How does it compare to the Amundsen platform that Lyft recently released?

What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see? What are some of the capabilities that are unique to Marquez and how are you using them at WeWork? What are the primary resource types that you support in Marquez?

What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?

Can you explain how Marquez is architected and how the design has evolved since you first began working on it?

Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?

What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?

How is the metadata itself stored and managed in Marquez?

How much up-front data modeling is necessary and what types of schema representations are supported?

Can you talk through the overall workflow of someone using Marquez in their environment?

What is involved in registering and updating datasets? How do you define and track the health of a given dataset? What are some of the interesting questions that can be answered from the information stored in Marquez?

What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases? For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it? What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform? When is Marquez the wrong choice for a metadata repository? What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem

@J_ on Twitter Email julienledem on GitHub

Willy

LinkedIn @wslulciuc on Twitter wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Links

Marquez

DataEngConf Presentation

WeWork Canary Yahoo Dremio Hadoop Pig Parquet

Podcast Episode

Airflow Apache Atlas Amundsen

Podcast Episode

Uber DataBook LinkedIn DataHub Iceberg Table Format

Podcast Episode

Delta Lake

Podcast Episode

Great Expectations data pipeline unit testing framework

Podcast.init Episode

Redshift SnowflakeDB

Podcast Episode

Apache Kafka Schema Registry

Podcast Episode

Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster

Podcast Episode

Luigi DBT

Podcast Episode

Thrift Protocol Buffers

The intro and outro music is from http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug?utm_source=rss&utm_medium=rss …

Monitoring and Managing the IBM Elastic Storage Server Using the GUI

2019-11-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Stefan Roth , Liju Jose , Przemyslaw Podfigurny , Alexander Wolf-Reber , Markus Rohwedder

ELK IBM data data-engineering

The IBM® Elastic Storage Server GUI provides an easy way to configure and monitor various features that are available with the IBM ESS system. It is a web application that runs on common web browsers, such as Chrome, Firefox, and Edge. The ESS GUI uses JavaScript and Ajax technologies to enable smooth and desktop-like interfacing. This IBM Redpaper publication provides a broad understanding of the architecture and features of the ESS GUI. It includes information about how to install and configure the GUI and in-depth information about the use of the GUI options. The primary audience for this paper includes experienced and new users of the ESS system.

SQL Server 2019 Revealed: Including Big Data Clusters and Machine Learning

2019-10-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bob Ward (Azure Data)

AI/ML Analytics Azure Big Data Data Analytics Data Lake ETL/ELT Hadoop HDFS Kubernetes Linux MongoDB +8 more

Get up to speed on the game-changing developments in SQL Server 2019. No longer just a database engine, SQL Server 2019 is cutting edge with support for machine learning (ML), big data analytics, Linux, containers, Kubernetes, Java, and data virtualization to Azure. This is not a book on traditional database administration for SQL Server. It focuses on all that is new for one of the most successful modernized data platforms in the industry. It is a book for data professionals who already know the fundamentals of SQL Server and want to up their game by building their skills in some of the hottest new areas in technology. SQL Server 2019 Revealed begins with a look at the project team's goal to integrate the world of big data with SQL Server into a major product release. The book then dives into the details of key new capabilities in SQL Server 2019 using a “learn by example” approach for Intelligent Performance, security, mission-critical availability, and features for the modern developer. Also covered are enhancements to SQL Server 2019 for Linux, along with a comprehensive look at SQL Server using containers and Kubernetes clusters. The book concludes by showing you how to virtualize your data access with Polybase to Oracle, MongoDB, Hadoop, and Azure, allowing you to reduce the need for expensive extract, transform, and load (ETL) applications. You will then learn how to take your knowledge of containers, Kubernetes, and Polybase to build a comprehensive solution called Big Data Clusters, which is a marquee feature of 2019. You will also learn how to gain access to Spark, SQL Server, and HDFS to build intelligence over your own data lake and deploy end-to-end machine learning applications.
What You Will Learn

Implement Big Data Clusters with SQL Server, Spark, and HDFS
Create a Data Hub with connections to Oracle, Azure, Hadoop, and other sources
Combine SQL and Spark to build a machine learning platform for AI applications
Boost your performance with no application changes using Intelligent Performance
Increase security of your SQL Server through Secure Enclaves and Data Classification
Maximize database uptime through online indexing and Accelerated Database Recovery
Build new modern applications with Graph, ML Services, and T-SQL Extensibility with Java
Improve your ability to deploy SQL Server on Linux
Gain in-depth knowledge to run SQL Server with containers and Kubernetes
Know all the new database engine features for performance, usability, and diagnostics
Use the latest tools and methods to migrate your database to SQL Server 2019
Apply your knowledge of SQL Server 2019 to Azure

Who This Book Is For

IT professionals and developers who understand the fundamentals of SQL Server and wish to focus on learning about the new, modern capabilities of SQL Server 2019. The book is for those who want to learn about SQL Server 2019 and the new Big Data Clusters and AI feature set, support for machine learning and Java, how to run SQL Server with containers and Kubernetes, and increased capabilities around Intelligent Performance, advanced security, and high availability.

Deep Learning for Search

2019-06-13 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Tommaso Teofili

AI/ML Data Science NLP TensorFlow data data-engineering search

Deep Learning for Search teaches you how to improve the effectiveness of your search by implementing neural network-based techniques. By the time you're finished with the book, you'll be ready to build amazing search engines that deliver the results your users need and that get better as time goes on!

About the Technology

Deep learning handles the toughest search challenges, including imprecise search terms, badly indexed data, and retrieving images with minimal metadata. And with modern tools like DL4J and TensorFlow, you can apply powerful DL techniques without a deep background in data science or natural language processing (NLP). This book will show you how.

About the Book

Deep Learning for Search teaches you to improve your search results with neural networks. You’ll review how DL relates to search basics like indexing and ranking. Then, you’ll walk through in-depth examples to upgrade your search with DL techniques using Apache Lucene and Deeplearning4j. As the book progresses, you’ll explore advanced topics like searching through images, translating user queries, and designing search engines that improve as they learn!

What's Inside

Accurate and relevant rankings
Searching across languages
Content-based image search
Search with recommendations

About the Reader

For developers comfortable with Java or a similar language and search basics. No experience with deep learning or NLP needed.

About the Author

Tommaso Teofili is a software engineer with a passion for open source and machine learning. As a member of the Apache Software Foundation, he contributes to a number of open source projects, ranging from topics like information retrieval (such as Lucene and Solr) to natural language processing and machine translation (including OpenNLP, Joshua, and UIMA). He currently works at Adobe, developing search and indexing infrastructure components, and researching the areas of natural language processing, information retrieval, and deep learning. He has presented search and machine learning talks at conferences including BerlinBuzzwords, International Conference on Computational Science, ApacheCon, EclipseCon, and others. You can find him on Twitter at @tteofili.

Quotes

A practical approach that shows you the state of the art in using neural networks, AI, and deep learning in the development of search engines. - From the Foreword by Chris Mattmann, NASA JPL
A thorough and thoughtful synthesis of traditional search and the latest advancements in deep learning. - Greg Zanotti, Marquette Partners
A well-laid-out deep dive into the latest technologies that will take your search engine to the next level. - Andrew Wyllie, Thynk Health
Hands-on exercises teach you how to master deep learning for search-based products. - Antonio Magnaghi, System1

Mastering Hadoop 3

2019-02-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Timothy Wong , Chanchal Singh , Manish Kumar

Flink Big Data Data Engineering Data Modelling Hadoop HDFS Cyber Security Spark data data-engineering

"Mastering Hadoop 3" is your in-depth guide to understanding and mastering the advanced features of the Hadoop ecosystem. With a focus on distributed computing and data processing, this book covers essential tools such as YARN, MapReduce, and Apache Spark to help you build scalable, efficient data pipelines. What this Book will help me do Gain a comprehensive understanding of Hadoop Distributed File System (HDFS) and YARN for effective resource management. Master data processing with MapReduce and learn to integrate with real-time processing engines like Spark and Flink. Develop and secure enterprise-grade Hadoop-based data pipelines by implementing robust security and governance measures. Explore techniques for batch data processing, data modeling, and designing applications tailored for Hadoop environments. Understand best practices for optimizing and troubleshooting Hadoop clusters for enhanced performance and reliability. Author(s) The authors, Timothy Wong, Chanchal Singh, and Manish Kumar, bring together years of experience in big data engineering, distributed systems, and enterprise application development. They aim to provide a clear pathway to mastering Hadoop ecosystem tools. Who is it for? This book is ideal for budding big data professionals who have some familiarity with Java and basic Hadoop concepts and wish to elevate their expertise. If you're a Hadoop career practitioner keen to expand your understanding of the ecosystem's advanced capabilities or a professional looking to implement Hadoop in organizational workflows, this book is well-suited for you.

Apache Spark Quick Start Guide

2019-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Akash Grade , Shrey Mehrotra

AI/ML API Big Data Python Scala Spark SQL Data Streaming apache-spark data data-engineering

Dive into the world of scalable data processing with the "Apache Spark Quick Start Guide." This book offers a foundational introduction to Spark, empowering readers to harness its capabilities for big data processing. With clear explanations and hands-on examples, you'll learn to implement Spark applications that handle complex data tasks efficiently. What this Book will help me do Understand and implement Spark's RDDs and DataFrame APIs to process large datasets effectively. Set up a local development environment for Spark-based projects. Develop skills to debug and optimize slow-performing Spark applications. Harness built-in modules of Spark for SQL, streaming, and machine learning applications. Adopt best practices and optimization techniques for high-performance Spark applications. Author(s) Shrey Mehrotra is a seasoned software developer with expertise in big data technologies, particularly Apache Spark. With years of hands-on industry experience, Shrey focuses on making complex technical concepts accessible to all. Through his writing, he aims to share clear, practical guidance for developers of all levels. Who is it for? This guide is perfect for big data enthusiasts and professionals looking to learn Apache Spark's capabilities from scratch. It's aimed at data engineers interested in optimizing application performance and data scientists wanting to integrate machine learning with Spark. A basic familiarity with either Scala, Python, or Java is recommended.

Java XML and JSON: Document Processing for Java SE

2019-01-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jeff Friesen

API JSON Oracle XML data data-engineering storage-formats

Use this guide to master the XML metalanguage and JSON data format along with significant Java APIs for parsing and creating XML and JSON documents from the Java language. New in this edition is coverage of Jackson (a JSON processor for Java) and Oracle’s own Java API for JSON processing (JSON-P), which is a JSON processing API for Java EE that also can be used with Java SE. This new edition of Java XML and JSON also expands coverage of DOM and XSLT to include additional API content and useful examples. All examples in this book have been tested under Java 11. In some cases, source code has been simplified to use Java 11’s var language feature. The first six chapters focus on XML along with the SAX, DOM, StAX, XPath, and XSLT APIs. The remaining six chapters focus on JSON along with the mJson, Gson, JsonPath, Jackson, and JSON-P APIs. Each chapter ends with select exercises designed to challenge your grasp of the chapter's content. An appendix provides the answers to these exercises.

What You'll Learn

Master the XML language
Create, validate, parse, and transform XML documents
Apply Java’s SAX, DOM, StAX, XPath, and XSLT APIs
Master the JSON format for serializing and transmitting data
Code against third-party APIs such as Jackson, mJson, Gson, JsonPath
Master Oracle’s JSON-P API in a Java SE context

Who This Book Is For

Intermediate and advanced Java programmers who are developing applications that must access data stored in XML or JSON documents. The book also targets developers wanting to understand the XML language and JSON data format.

Apache Kafka Quick Start Guide

2018-12-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Raúl Estrada

Kafka Data Streaming data data-engineering streaming-messaging

Dive into the world of Apache Kafka with this concise guide that focuses on its practical use for real-time data processing in distributed systems. You'll explore Kafka's capabilities, covering essentials like configuration, messaging, serialization, and handling complex data streams using Kafka Streams and KSQL. By the end, you'll be equipped to tackle real-world streaming challenges confidently. What this Book will help me do Understand how to set up and configure Apache Kafka for real-time processing environments. Master key concepts like message validation, enrichment, and serialization. Learn to use the Schema Registry for data validation and versioning. Gain hands-on experience with data streaming and aggregation using Kafka Streams. Develop skills in using KSQL for data manipulation and stream querying. Author(s) Raúl Estrada is an experienced software engineer with a deep understanding of distributed systems and real-time data processing. With expertise in Apache Kafka and other event-streaming platforms, Estrada approaches technical writing with an emphasis on clarity and practical application. Their passion for helping developers achieve success is reflected in their authoritative yet approachable style. Who is it for? This book is perfect for software engineers and backend developers interested in mastering real-time data processing using Apache Kafka. It is designed for readers who are eager to solve practical problems in distributed systems, irrespective of whether they have prior Kafka experience. Some familiarity with Java or other JVM languages will be helpful, although not strictly necessary. This is an ideal resource for learners seeking a hands-on, practical approach to Apache Kafka.

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

2018-12-03 · Data Engineering Podcast Listen

podcast_episode

by Patrick Hunt , Tobias Macey

Flink Data Engineering Data Management Apache HBase Kafka Kubernetes

Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Zookeeper is and how the project got started?

What are the main motivations for using a centralized coordination service for distributed systems?

What are the distributed systems primitives that are built into Zookeeper?

What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper? What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?

Can you discuss how Zookeeper is architected and how that design has evolved over time?

What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?

What are the scaling factors for Zookeeper?

What are the edge cases that users should be aware of? Where does it fall on the axes of the CAP theorem?

What are the main failure modes for Zookeeper?

How much of the recovery logic is left up to the end user of the Zookeeper cluster?

Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services? In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?

What are some of the cases where Zookeeper is the wrong choice?

How have the needs of distributed systems engineers changed since you first began working on Zookeeper? If you were to start the project over today, what would you do differently?

Would you still use Java?

What are some of the most interesting or unexpected ways that you have seen Zookeeper used? What do you have planned for the future of Zookeeper?

Contact Info

@phunt on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Zookeeper Cloudera Google Chubby Sourceforge HBase High Availability Fallacies of distributed computing Falsehoods programmers believe about networking Consul EtcD Apache Curator Raft Consensus Algorithm Zookeeper Atomic Broadcast SSD Write Cliff Apache Kafka Apache Flink

Podcast

PostgreSQL 11 Server Side Programming Quick Start Guide

2018-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Luca Ferrari (Bending Spoons)

Data Management data data-engineering postgresql relational-databases

PostgreSQL 11 Server Side Programming Quick Start Guide introduces you to the world of database programming directly at the database level. This book delves into the concepts of server-side programming, providing you with the necessary tools to author stored procedures, triggers, and extensions for your PostgreSQL instance. What this Book will help me do Learn how to create stored procedures and functions for efficient database logic. Understand how to use triggers and rules to maintain data integrity. Gain expertise in developing extensions to extend PostgreSQL functionality. Master techniques for handling inter-process communication and background workers. Explore custom data types and integration with programming languages like Java and Perl. Author(s) Luca Ferrari, a seasoned database administrator and developer, specializes in delivering insightful PostgreSQL training. With extensive experience in both database management and software development, Ferrari brings practical knowledge and real-world examples to guide readers through mastering PostgreSQL server-side programming. Who is it for? This book is tailored for database administrators, developers, and engineers who have a basic understanding of PostgreSQL and are looking to expand their knowledge into server-side programming. If you're aiming to implement advanced database functionality or streamline data management tasks in PostgreSQL, this book is for you. It is ideal for those who wish to apply database programming techniques to enterprise-grade challenges. Beginner-friendly but designed to empower professionals with actionable insights.

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

2018-11-19 · Data Engineering Podcast Listen

podcast_episode

by Fabian Hueske (Data Artisans) , Tobias Macey

Flink Cloud Computing Data Engineering Data Management Dataflow GCP GitHub Hadoop IBM Kafka Scala Spark +2 more

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Flink is and how the project got started? What are some of the primary ways that Flink is used? How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?

What are some use cases that Flink is uniquely qualified to handle?

Where does Flink fit into the current data landscape? How is Flink architected?

How has that architecture evolved? Are there any aspects of the current design that you would do differently if you started over today?

How does scaling work in a Flink deployment?

What are the scaling limits? What are some of the failure modes that users should be aware of?

How is the statefulness of a cluster managed?

What are the mechanisms for managing conflicts? What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose? Can state be shared across processes or tasks within a Flink cluster?

What are the comparative challenges of working with bounded vs unbounded streams of data? How do you handle out of order events in Flink, especially as the delay for a given event increases? For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it? What are some of the most challenging or complicated aspects of building and maintaining Flink? What are some of the most interesting or unexpected ways that you have seen Flink used? What are some of the improvements or new features that are planned for the future of Flink? What are some features or use cases that you are explicitly not planning to support? For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?

What do they find most interesting or exciting?

Contact Info

LinkedIn @fhueske on Twitter fhueske on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Flink Data Artisans IBM DB2 Technische Universität Berlin Hadoop Relational Database Google Cloud Dataflow Spark Cascading Java RocksDB Flink Checkpoints Flink Savepoints Kafka Pulsar Storm Scala LINQ (Language INtegrated Query) SQL Backpressure

Apache Hadoop 3 Quick Start Guide

2018-10-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Hrishikesh Vijay Karambelkar

Analytics Big Data Data Analytics Hadoop HDFS Hive Kafka Spark Data Streaming data data-engineering

Dive into the world of distributed data processing with the 'Apache Hadoop 3 Quick Start Guide.' This comprehensive resource equips you with the knowledge needed to handle large datasets effectively using Apache Hadoop. Learn how to set up and configure Hadoop, work with its core components, and explore its powerful ecosystem tools. What this Book will help me do Understand the fundamental concepts of Apache Hadoop, including HDFS, MapReduce, and YARN, and use them to store and process large datasets. Set up and configure Hadoop 3 in both developer and production environments to suit various deployment needs. Gain hands-on experience with Hadoop ecosystem tools like Hive, Kafka, and Spark to enhance your big data processing capabilities. Learn to manage, monitor, and troubleshoot Hadoop clusters efficiently to ensure smooth operations. Analyze real-time streaming data with tools like Apache Storm and perform advanced data analytics using Apache Spark. Author(s) The author of this guide, Vijay Karambelkar, brings years of experience working with big data technologies and Apache Hadoop in real-world applications. With a passion for teaching and simplifying complex topics, Vijay has compiled his expertise to help learners confidently approach Hadoop 3. His detailed, example-driven approach makes this book a practical resource for aspiring data professionals. Who is it for? This book is ideal for software developers, data engineers, and IT professionals who aspire to dive into the field of big data. If you're new to Apache Hadoop or looking to upgrade your skills to include version 3, this guide is for you. A basic understanding of Java programming is recommended to make the most of the topics covered. Embark on this journey to enhance your career in data-intensive industries.

Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library

2018-08-16 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Hien Luu

AI/ML Analytics Big Data Cloud Computing Databricks Hadoop Spark SQL Data Streaming apache-spark data data-engineering

Develop applications for the big data landscape with Spark and Hadoop. This book also explains the role of Spark in developing scalable machine learning and analytics applications with Cloud technologies. Beginning Apache Spark 2 gives you an introduction to Apache Spark and shows you how to work with it. Along the way, you’ll discover resilient distributed datasets (RDDs); use Spark SQL for structured data; and learn stream processing and build real-time applications with Spark Structured Streaming. Furthermore, you’ll learn the fundamentals of Spark ML for machine learning and much more. After you read this book, you will have the fundamentals to become proficient in using Apache Spark and know when and how to apply it to your big data applications. What You Will Learn Understand the Spark unified data processing platform How to run Spark in Spark Shell or Databricks Use and manipulate RDDs Deal with structured data using Spark SQL through its operations and advanced functions Build real-time applications using Spark Structured Streaming Develop intelligent applications with the Spark Machine Learning library Who This Book Is For Programmers and developers active in big data, Hadoop, and Java but who are new to the Apache Spark platform.

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

2018-07-08 · Data Engineering Podcast Listen

podcast_episode

by Andy LoPresto , Kevin Doran , Tobias Macey

Agile/Scrum Airflow Flink API Chef CSV Data Engineering Data Governance Data Management Dataflow DataOps DevOps +13 more

Summary

Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what NiFi is? What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code? How did you get involved with the project?

Where does it sit in the broader landscape of data tools?

Does the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?

How do you manage versioning and backup of data flows, as well as promoting them between environments?

One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?

What types of reporting are available across this information?

What are some of the use cases or requirements that lend themselves well to being solved by NiFi?

When is NiFi the wrong choice?

What is involved in deploying and scaling a NiFi installation?

What are some of the system/network parameters that should be considered? What are the scaling limitations?

What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community? What do you have planned for the future of NiFi?

Contact Info

Kevin Doran

@kevdoran on Twitter Email

Andy LoPresto

@yolopey on Twitter Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

NiFi HortonWorks DataFlow HortonWorks Apache Software Foundation Apple CSV XML JSON Perl Python Internet Scale Asset Management Documentum DataFlow NSA (National Security Agency) 24 (TV Show) Technology Transfer Program Agile Software Development Waterfall Spark Flink Kafka Oozie Luigi Airflow FluentD ETL (Extract, Transform, and Load) ESB (Enterprise Service Bus) MiNiFi Java C++ Provenance Kubernetes Apache Atlas Data Governance Kibana K-Nearest Neighbors DevOps DSL (Domain Specific Language) NiFi Registry Artifact Repository Nexus NiFi CLI Maven Archetype IoT Docker Backpressure NiFi Wiki TLS (Transport Layer Security) Mozilla TLS Observatory NiFi Flow Design System Data Lineage GDPR (General Data Protection Regulation)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Data Engineering Weekly with Joe Crobak - Episode 27

2018-04-15 · Data Engineering Podcast Listen

podcast_episode

by Joe Crobak (United States Digital Service (USDS)) , Tobias Macey

Analytics Flink API Amazon EMR Big Data Data Analytics Data Engineering Data Management Data Science ELK Hadoop Kubernetes +1 more

Summary

The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.

Interview

Introduction How did you get involved in the area of data management? What are some of the projects that you have been involved in that were most personally fulfilling?

As an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data? Healthcare.gov has a storied history, how did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to?

What was your motivation for starting a newsletter about the Hadoop space?

Can you speak to your reasoning for the recent rebranding of the newsletter?

How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it? After over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?

What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?

What is your workflow for finding and curating the content that goes into your newsletter? What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter? How has your experience managing the newsletter influenced your areas of focus in your work and vice-versa? What are your plans going forward?

Contact Info

Data Eng Weekly Email Twitter – @joecrobak Twitter – @dataengweekly

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

USDS National Labs Cray Amazon EMR (Elastic Map-Reduce) Recommendation Engine Netflix Prize Hadoop Cloudera Puppet healthcare.gov Medicare Quality Payment Program HIPAA NIST National Institute of Standards and Technology PII (Personally Identifiable Information) Threat Modeling Apache JBoss Apache Web Server MarkLogic JMS (Java Message Service) Load Balancer COBOL Hadoop Weekly Data Engineering Weekly Foursquare NiFi Kubernetes Spark Flink Stream Processing DataStax RSS The Flavors of Data Science and Engineering CQRS Change Data Capture Jay Kreps

Seven NoSQL Databases in a Week

2018-03-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Xun (Brian) Wu , Sudarshan Kadambi

Cassandra DynamoDB Apache HBase MongoDB Neo4j NoSQL Python RDBMS Redis data data-engineering nosql-databases

Learn the fundamentals of seven essential NoSQL databases in just one week with this book. Covering MongoDB, DynamoDB, Redis, Cassandra, Neo4j, InfluxDB, and HBase, you'll explore their functionalities and practical applications. Designed to give you a working understanding of NoSQL database types, this guide helps aspiring DBAs and developers comprehend and utilize modern data solutions. What this Book will help me do Master the fundamentals of MongoDB, including high-performance, high-availability, and scaling features. Gain hands-on experience with Neo4j to perform database queries and integrate with Python and Java applications. Learn efficient querying with Redis for storage and retrieval tasks. Understand Cassandra's powerful solution for scalable and fault-tolerant systems. Get well-versed with HBase for creating tables, and reading and writing data efficiently. Author(s) Sudarshan Kadambi and Xun (Brian) Wu bring a wealth of experience in database technologies. They have worked extensively in the software development and database management fields. With their practical and concise teaching approach, the authors make complex topics accessible for readers. Who is it for? This book is ideal for budding DBAs and developers looking to understand NoSQL databases. It is particularly useful for those transitioning from relational databases who want to learn about modern database technologies. Suitable for both beginners and those with some database knowledge, it aims to bridge skill gaps and expand the reader's technical expertise.

Database Refactoring Patterns with Pramod Sadalage - Episode 22

2018-03-12 · Data Engineering Podcast Listen

podcast_episode

by Pramod Sadalage , Tobias Macey

Agile/Scrum CI/CD Data Engineering Data Management DevOps Docker DWH GitHub Linux MongoDB Neo4j NoSQL +1 more

Summary

As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow.

Interview

Introduction How did you get involved in the area of data management? You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject? What are the characteristics of a database that make them more difficult to manage in an iterative context? How does the practice of refactoring in the context of a database compare to that of software? How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution? Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system? How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution? What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system? Looking back over the past 12 years, what has changed in the areas of database design and evolution?

How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases? What do you see as the biggest challenges facing us over the next few years?

Contact Info

Website pramodsadalage on GitHub @pramodsadalage on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Database Refactoring

Website Book

Thoughtworks Martin Fowler Agile Software Development XP (Extreme Programming) Continuous Integration

The Book Wikipedia

Test First Development DDL (Data Definition Language) DML (Data Modification Language) DevOps Flyway Liquibase DBMaintain Hibernate SQLAlchemy ORM (Object Relational Mapper) ODM (Object Document Mapper) NoSQL Document Database MongoDB OrientDB CouchBase CassandraDB Neo4j ArangoDB Unit Testing Integration Testing OLAP (On-Line Analytical Processing) OLTP (On-Line Transaction Processing) Data Warehouse Docker QA==Quality Assurance HIPAA (Health Insurance Portability and Accountability Act) PCI DSS (Payment Card Industry Data Security Standard) Polyglot Persistence Toplink Java ORM Ruby on Rails ActiveRecord Gem

Camel in Action, Second Edition

2018-02-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jonathan Anstey , Claus Ibsen

Cloud Computing Docker Kubernetes Cyber Security XML camel data data-engineering streaming-messaging

Camel in Action, Second Edition is the most complete Camel book on the market. Written by core developers of Camel and the authors of the highly acclaimed first edition, this book distills their experience and practical insights so that you can tackle integration tasks like a pro. About the Technology Apache Camel is a Java framework that implements enterprise integration patterns (EIPs) and comes with over 200 adapters to third-party systems. A concise DSL lets you build integration logic into your app with just a few lines of Java or XML. By using Camel, you benefit from the testing and experience of a large and vibrant open source community. About the Book Camel in Action, Second Edition is the definitive guide to the Camel framework. It starts with core concepts like sending, receiving, routing, and transforming data. It then goes in depth on many topics such as how to develop, debug, test, deal with errors, secure, scale, cluster, deploy, and monitor your Camel applications. The book also discusses how to run Camel with microservices, reactive systems, containers, and in the cloud. What's Inside Coverage of all relevant EIPs Camel microservices with Spring Boot Camel on Docker and Kubernetes Error handling, testing, security, clustering, monitoring, and deployment Hundreds of examples in Java and XML About the Reader Readers should be familiar with Java. This book is accessible to beginners and invaluable to experts. About the Authors Claus Ibsen is a senior principal engineer working for Red Hat specializing in cloud and integration. He has worked on Apache Camel for the last nine years where he heads the project. Claus lives in Denmark. Jonathan Anstey is an engineering manager at Red Hat and a core Camel contributor. He lives in Newfoundland, Canada. Quotes I highly recommend this book to anyone with even a passing interest in Apache Camel. Do take Camel for a ride...and don't get the hump! 
- From the Foreword by James Strachan, Creator of Apache Camel
Claus and Jon are great writers, relying on figures and diagrams where needed and presenting lots of code snippets and worked examples.
- From the Foreword by Dr. Mark Little, Technical Director of JBoss
The second edition of this all-time classic is an indispensable companion for your Apache Camel rides.
- Gregor Zurowski, Apache Camel Committer
The absolute best way to learn and use Camel - top to bottom, front to back, and all the way through. Camel is a fantastic tool - every Java coder should have a copy of this book.
- Rick Wagner, Red Hat
An excellent book and the definite reference for experienced engineers.
- Yan Guo, EventBrite

Mastering Apache Solr 7.x

2018-02-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Chintan Mehta , Dharmesh Vasoya , Sandeep Nair (Microsoft)

API BI JavaScript Python data data-engineering search solr

"Mastering Apache Solr 7.x" is your practical guide to building, advancing, and optimizing enterprise search solutions using Solr 7. With this book, you will harness the robust features of Solr, implement efficient search capabilities, and tackle complex business intelligence problems to achieve unparalleled search performance. What this Book will help me do Develop and implement efficient schemas using the Solr Schema API. Optimize enterprise search performance with advanced querying and scoring techniques. Implement fault-tolerant and distributed search systems using SolrCloud. Leverage Apache Tika for seamless data indexing and content extraction. Utilize programming languages like JavaScript, Python, and Ruby to integrate with Solr. Author(s) With years of experience in search technologies and deep expertise in Apache Solr, authors Sandeep Nair, Chintan Mehta, and Dharmesh Vasoya bring together a wealth of knowledge in this book. Their collaborative insights equip readers to master advanced Solr features, sharing practical examples and real-world applications with a passion for clarity and efficiency. Who is it for? This book is ideal for software developers, data engineers, and database architects who aim to design and implement effective enterprise search systems. It is tailored for readers with prior experience in Apache Solr or Java programming, focusing on those eager to enhance their search solution expertise. Achieve your advanced search system goals here.

SAS Viya

2018-02-08 · O'Reilly Data Science Books O'Reilly Amazon

book

by Kevin D. Smith , Xiangxiang Meng

AI/ML Analytics API Cloud Computing Python SAS analytics-platforms data data-science

Learn how to access analytics from SAS Cloud Analytic Services (CAS) using Python and the SAS Viya platform. SAS Viya: The Python Perspective is an introduction to using the Python client on the SAS Viya platform. SAS Viya is a high-performance, fault-tolerant analytics architecture that can be deployed on both public and private cloud infrastructures. While SAS Viya can be used by various SAS applications, it also enables you to access analytic methods from SAS, Python, Lua, and Java, as well as through a REST interface using HTTP or HTTPS. This book focuses on the perspective of SAS Viya from Python. SAS Viya is made up of multiple components. The central piece of this ecosystem is SAS Cloud Analytic Services (CAS). CAS is the cloud-based server that all clients communicate with to run analytical methods. The Python client is used to drive the CAS component directly using objects and constructs that are familiar to Python programmers. Some knowledge of Python would be helpful before using this book; however, there is an appendix that covers the features of Python that are used in the CAS Python client. Knowledge of CAS is not required to use this book. However, you will need to have a CAS server set up and running to execute the examples in this book. With this book, you will learn how to: Install the required components for accessing CAS from Python Connect to CAS, load data, and run simple analyses Work with CAS using APIs familiar to Python users Grasp general CAS workflows and advanced features of the CAS Python client SAS Viya: The Python Perspective covers topics that will be useful to beginners as well as experienced CAS users. It includes examples from creating connections to CAS all the way to simple statistics and machine learning, but it is also useful as a desktop reference.
