Kafka

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

2018-06-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Butch Quinto

Alteryx Analytics BI Big Data Cloud Computing Data Governance DataViz DWH Apache HBase HDFS MySQL Oracle +7 more

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book has an extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. What You’ll Learn Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing Turbocharge Spark with Alluxio, a distributed in-memory storage platform Deploy big data in the cloud using Cloudera Director Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard Who This Book Is For BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

2018-05-28 · Data Engineering Podcast Listen

podcast_episode

by Yair Weinberger (Alooma) , Tobias Macey

API Amazon EMR Kinesis Azure BigQuery Cassandra Cloud Computing Cloud Storage Data Collection Data Engineering Data Management Datadog +24 more

Summary

Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service

Interview

Introduction How did you get involved in the area of data management? What is Alooma and what is the origin story? How is the Alooma platform architected?

I want to go into stream VS batch here What are the most challenging components to scale?

How do you manage the underlying infrastructure to support your SLA of 5 nines? What are some of the complexities introduced by processing data from multiple customers with various compliance requirements?

How do you sandbox user’s processing code to avoid security exploits?

What are some of the potential pitfalls for automatic schema management in the target database? Given the large number of integrations, how do you maintain the

What are some challenges when creating integrations, isn’t it simply conforming with an external API?

For someone getting started with Alooma what does the workflow look like? What are some of the most challenging aspects of building and maintaining Alooma? What are your plans for the future of Alooma?

Contact Info

LinkedIn @yairwein on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Alooma Convert Media Data Integration ESB (Enterprise Service Bus) Tibco Mulesoft ETL (Extract, Transform, Load) Informatica Microsoft SSIS OLAP Cube S3 Azure Cloud Storage Snowflake DB Redshift BigQuery Salesforce Hubspot Zendesk Spark The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps RDBMS (Relational Database Management System) SaaS (Software as a Service) Change Data Capture Kafka Storm Google Cloud PubSub Amazon Kinesis Alooma Code Engine Zookeeper Idempotence Kafka Streams Kubernetes SOC2 Jython Docker Python Javascript Ruby Scala PII (Personally Identifiable Information) GDPR (General Data Protection Regulation) Amazon EMR (Elastic Map Reduce) Sequoia Capital Lightspeed Investors Redis Aerospike Cassandra MongoDB

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

2018-05-21 · Data Engineering Podcast Listen

podcast_episode

by Kamil Bajda-Pawlikowski (Starburst Data) , Tobias Macey

Analytics API Cassandra Data Engineering Data Management DWH Hadoop Hive Presto Redis SQL Teradata +1 more

Summary

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Presto is?

What are some of the common use cases and deployment patterns for Presto?

How does Presto compare to Drill or Impala? What is it about Presto that led you to building a business around it? What are some of the most challenging aspects of running and scaling Presto? For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?

How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?

What are some cases in which Presto is not the right solution? What types of support have you found to be the most commonly requested? What are some of the types of tooling or improvements that you have made to Presto in your distribution?

What are some of the notable changes that your team has contributed upstream to Presto?

Contact Info

Website E-mail Twitter – @starburstdata Twitter – @prestodb

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Starburst Data Presto Hadapt Hadoop Hive Teradata PrestoCare Cost Based Optimizer ANSI SQL Spill To Disk Tempto Benchto Geospatial Functions Cassandra Accumulo Kafka Redis PostGreSQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / {CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)?utm_source=rss&utm_medium=rss Support Data Engineering Podcast

Designing Event-Driven Systems

2018-05-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ben Stopford

API DevOps Data Streaming data data-engineering streaming-messaging

Many forces affect software today: larger datasets, geographical disparities, complex company structures, and the growing need to be fast and nimble in the face of change. Proven approaches such as service-oriented and event-driven architectures are joined by newer techniques such as microservices, reactive architectures, DevOps, and stream processing. Many of these patterns are successful by themselves, but as this practical ebook demonstrates, they provide a more holistic and compelling approach when applied together. Author Ben Stopford explains how service-based architectures and stream processing tools such as Apache Kafka can help you build business-critical systems. You’ll learn how to apply patterns including Event Sourcing and CQRS, and how to build multi-team systems with microservices and SOA using patterns such as "inside out databases" and "event streams as a source of truth." These approaches provide a unique foundation for how these large, autonomous service ecosystems can communicate and share data. Learn why streaming beats request-response based architectures in complex, contemporary use cases Understand why replayable logs such as Kafka provide a backbone for both service communication and shared datasets Explore how event collaboration and event sourcing patterns increase safety and recoverability with functional, event-driven approaches Build service ecosystems that blend event-driven and request-driven interfaces using a replayable log and Kafka’s Streams API Scale beyond individual teams into larger, department- and company-sized architectures, using event streams as a source of truth

JavaScript and JSON Essentials - Second Edition

2018-04-23 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Sai S Sriparasa , Bruno Joseph D'mello

JavaScript JSON data data-engineering storage-formats

Dive into "JavaScript and JSON Essentials" to discover how JSON works as a cornerstone in modern web development. Through hands-on examples and practical guidance, this book equips you with the knowledge to effectively use JSON with JavaScript for creating responsive, scalable, and capable web applications. What this Book will help me do Master JSON structures and utilize them in web development workflows. Integrate JSON data within Angular, Node.js, and other popular frameworks. Implement real-time JSON features using tools like Kafka and Socket.io. Understand BSON, GeoJSON, and JSON-LD formats for specialized applications. Develop efficient JSON handling for distributed and scalable systems. Author(s) None Joseph D'mello and Sai S Sriparasa are seasoned software developers and educators with extensive experience in JavaScript. Their expertise in web application development and JSON usage shines through in this book. They take a clear and engaging approach, ensuring that complex concepts are demystified and actionable. Who is it for? This book is best suited for web developers familiar with JavaScript who want to enhance their abilities to use JSON for building fast, data-driven web applications. Whether you're looking to strengthen your backend skills or learn tools like Angular and Kafka in conjunction with JSON, this book is made for you.

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

2018-04-01 · Data Engineering Podcast Listen

podcast_episode

by Patrick Cable (ThreatStack) , Pete Cheslock (ThreatStack) , Tobias Macey

API Cloud Computing Data Engineering Data Management Datadog GitHub Cyber Security

Summary

Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack

Interview

Introduction How did you get involved in the area of data management? Why don’t you start by explaining what ThreatStack does?

What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?

Can you describe the type(s) of data that you collect and how it is structured? What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?

How do you ensure a consistent format of the information that you receive? How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended? How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?

I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?

How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?

How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure? What are some of the most common vulnerabilities that you detect in your client’s infrastructure? For someone who wants to start using ThreatStack, what does the setup process look like? What have you found to be the most challenging aspects of building and managing the data processes in your environment? What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?

Contact Info

Pete Cheslock

@petecheslock on Twitter Website petecheslock on GitHub

Patrick Cable

@patcable on Twitter Website patcable on GitHub

ThreatStack

Website @threatstack on Twitter threatstack on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

ThreatStack SecDevO

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

2018-02-11 · Data Engineering Podcast Listen

podcast_episode

by Mike Freedman (Timescale) , Ajay Kulkarni (Timescale) , Tobias Macey

Amazon RDS Azure Cloud Computing Cloudflare Data Engineering Data Management Databricks DevOps Docker ELK GCP GitHub +14 more

Summary

As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Timescale is and how the project got started? The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options? In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices? How is Timescale implemented and how has the internal architecture evolved since you first started working on it?

What impact has the 10.0 release of PostGreSQL had on the design of the project? Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL?

For someone who wants to start using Timescale what is involved in deploying and maintaining it? What are the axes for scaling Timescale and what are the points where that scalability breaks down?

Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?

What has been the most challenging aspect of building and marketing Timescale? When is Timescale the wrong tool to use for time series data? One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus? What are some of the most interesting uses of Timescale that you have seen? Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health? What features or improvements do you have planned for future releases of Timescale?

Contact Info

Ajay

LinkedIn @acoustik on Twitter Timescale Blog

Mike

Website LinkedIn @michaelfreedman on Twitter Timescale Blog

Timescale

Website @timescaledb on Twitter GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Timescale PostGreSQL Citus Timescale Design Blog Post MIT NYU Stanford SDN Princeton Machine Data Timeseries Data List of Timeseries Databases NoSQL Online Transaction Processing (OLTP) Object Relational Mapper (ORM) Grafana Tableau Kafka When Boring Is Awesome PostGreSQL RDS Google Cloud SQL Azure DB Docker Continuous Aggregates Streaming Replication PGPool II Kubernetes Docker Swarm Citus Data

Website Data Engineering Podcast Interview

Database Indexing B-Tree Index GIN Index GIST Index STE Energy Redis Graphite Prometheus pg_prometheus OpenMetrics Standard Proposal Timescale Parallel Copy Hadoop PostGIS KDB+ DevOps Internet of Things MongoDB Elastic DataBricks Apache Spark Confluent New Enterprise Associates MapD Benchmark Ventures Hortonworks 2σ Ventures CockroachDB Cloudflare EMC Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data

The intro and outro music is from a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug?utm_source=rss&utm_medium=rss" target="_blank"…

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

2018-02-04 · Data Engineering Podcast Listen

podcast_episode

by Matteo Merli , Rajan Dhabalia , Tobias Macey

AI/ML Ansible API Data Engineering Data Management Data Science GitHub Linux Pub/Sub

Summary

One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Pulsar is and what the original inspiration for the project was? What have been some of the most challenging aspects of building and promoting Pulsar? For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components? What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to? What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison? The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka? One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged? When is Pulsar the wrong tool to use? What are some of the improvements or new features that you have planned for the future of Pulsar?

Contact Info

Matteo

merlimat on GitHub @merlimat on Twitter

Rajan

@dhabaliaraj on Twitter rhabalia on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar Publish-Subscribe Yahoo Streamlio ActiveMQ Kafka Bookkeeper SLA (Service Level Agreement) Write-Ahead Log Ansible Zookeeper Pulsar Deployme

Complete Guide to Open Source Big Data Stack

2018-01-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Michael Frampton

Big Data Cassandra Cloud Computing Cloud Storage DataViz Hadoop Spark apache-spark data data-engineering

See a Mesos-based big data stack created and the components used. You will use currently available Apache full and incubating systems. The components are introduced by example and you learn how they work together. In the Complete Guide to Open Source Big Data Stack, the author begins by creating a private cloud and then installs and examines Apache Brooklyn. After that, he uses each chapter to introduce one piece of the big data stack—sharing how to source the software and how to install it. You learn by simple example, step by step and chapter by chapter, as a real big data stack is created. The book concentrates on Apache-based systems and shares detailed examples of cloud storage, release management, resource management, processing, queuing, frameworks, data visualization, and more. What You’ll Learn Install a private cloud onto the local cluster using Apache cloud stack Source, install, and configure Apache: Brooklyn, Mesos, Kafka, and Zeppelin See how Brooklyn can be used to install Mule ESB on a cluster and Cassandra in the cloud Install and use DCOS for big data processing Use Apache Spark for big data stack data processing Who This Book Is For Developers, architects, IT project managers, database administrators, and others charged with developing or supporting a big data system. It is also for anyone interested in Hadoop or big data, and those experiencing problems with data size.

Wallaroo with Sean T. Allen - Episode 12

2017-12-25 · Data Engineering Podcast Listen

podcast_episode

by Sean T. Allen (Wallaroo Labs) , Tobias Macey

Ansible Flink Chef CI/CD Data Engineering Data Management Docker GitHub Linux Python Redis

Summary

Data oriented applications that need to operate on large, fast-moving sterams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale

Interview

Introduction How did you get involved in the area of data engineering? What is Wallaroo and how did the project get started? What is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on? Why did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented? How is Wallaroo architected internally to allow for distributed state management?

Is the state persistent, or is it only maintained long enough to complete the desired computation? If so, what format do you use for long term storage of the data?

What have been the most challenging aspects of building the Wallaroo platform? Which axes of the CAP theorem have you optimized for? For someone who wants to build an application on top of Wallaroo, what is involved in getting started? Once you have a working application, what resources are necessary for deploying to production and what are the scaling factors?

What are the failure modes that users of Wallaroo need to account for in their application or infrastructure?

What are some situations or problem types for which Wallaroo would be the wrong choice? What are some of the most interesting or unexpected uses of Wallaroo that you have seen? What do you have planned for the future of Wallaroo?

Contact Info

IRC Mailing List Wallaroo Labs Twitter Email Personal Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Wallaroo Labs Storm Applied Apache Storm Risk Analysis Pony Language Erlang Akka Tail Latency High Performance Computing Python Apache Software Foundation Beyond Distributed Transactions: An Apostate’s View Consistent Hashing Jepsen Lineage Driven Fault Injection Chaos Engineering QCon 2016 Talk Codemesh in London: How did I get here? CAP Theorem CRDT Sync Free Project Basho Wallaroo on GitHub Docker Puppet Chef Ansible SaltStack Kafka TCP Dask Data Engineering Episode About Dask Beowulf Cluster Redis Flink Haskell

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Apache Kafka 1.0 Cookbook

2017-12-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Raúl Estrada

ELK Hadoop Java Scala Spark Data Streaming data data-engineering streaming-messaging

Dive into the essential resource for mastering Apache Kafka with this cookbook of practical recipes. You'll explore the dynamic features of Kafka 1.0, integrate it with enterprise data solutions, and confidently manage messaging and streaming data in real-time. What this Book will help me do Effectively install and configure Apache Kafka in a professional environment. Implement Kafka producers and consumers to manage real-time data streams. Utilize Confluent platforms and Kafka streams for advanced data processing. Monitor Kafka clusters with tools like Graphite and Ganglia for optimal performance. Integrate Kafka seamlessly with tools such as Hadoop, Spark, and Elasticsearch. Author(s) None Estrada and None Zinoviev have extensive experience in enterprise data systems and have been dedicated contributors to the Apache Kafka ecosystem. Their combined expertise encompasses developing robust, real-time distributed systems and delivering insightful technical guidance. Through this book, they share their vast knowledge and practical solutions, tailored for both developers and administrators. Who is it for? This book is tailored for developers and administrators looking to enhance their expertise in Apache Kafka. Developers should be comfortable with Java or Scala to fully utilize examples, while administrators benefit from prior knowledge of Kafka operations. Ideal readers are those seeking actionable techniques to efficiently manage and integrate Kafka into their enterprise systems.

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

2017-12-10 · Data Engineering Podcast Listen

podcast_episode

by Ewen Cheslack-Postava (Confluent) , Tobias Macey

Avro CI/CD Data Engineering Data Management GitHub JSON Linux Parquet Protobuf

Summary

To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry

Interview

Introduction How did you get involved in the area of data engineering? What is the schema registry and what was the motivating factor for building it? If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas? How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options? Conversely, what would be involved in using a storage backend other than Kafka? What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure? What are some of the biggest challenges that you faced while designing and building the schema registry? What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations? What are some of the features or enhancements that you have in mind for future work?

Contact Info

ewencp on GitHub Website @ewencp on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Kafka Confluent Schema Registry Second Life Eve Online Yes, Virginia, You Really Do Need a Schema Registry JSON-Schema Parquet Avro Thrift Protocol Buffers Zookeeper Kafka Connect

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Practical Real-time Data Processing and Analytics

2017-09-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Prateek Bhati , Selva raj Ramasamy , Shilpi Saxena , Saurabh Gupta (The Modern Data Company)

Analytics Flink Data Engineering Java Spark data data-engineering real-time-analytics streaming-messaging

This book provides a comprehensive guide to real-time data processing and analytics using modern frameworks like Apache Spark, Flink, Storm, and Kafka. Through practical examples and in-depth explanations, you will learn how to implement efficient, scalable, real-time processing pipelines. What this Book will help me do Understand real-time data processing essentials and the technology stack Learn integration of components like Apache Spark and Kafka Master the concepts of stream processing with detailed case studies Gain expertise in developing monitoring and alerting solutions for real-time systems Prepare to implement production-grade real-time data solutions Author(s) Shilpi Saxena and Saurabh Gupta, the authors, are experienced professionals in distributed systems and data engineering, focusing on practical applications of real-time computing. They bring their extensive industry experience to this book, helping readers understand the complexities of real-time data solutions in an approachable and hands-on manner. Who is it for? This book is ideal for software engineers and data engineers with a background in Java who seek to develop real-time data solutions. It is suitable for readers familiar with concepts of real-time data processing, and enhances knowledge in frameworks like Spark, Flink, Storm, and Kafka. Target audience includes learners building production data solutions and those designing distributed analytics engines.

Kafka: The Definitive Guide

2017-09-19 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Neha Narkhede , Todd Palino , Gwen Shapira

API Big Data Data Streaming data data-engineering streaming-messaging

Every enterprise application creates data, whether it’s log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you’re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds. Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer. Understand publish-subscribe messaging and how it fits in the big data ecosystem. Explore Kafka producers and consumers for writing and reading messages Understand Kafka patterns and use-case requirements to ensure reliable data delivery Get best practices for building data pipelines and applications with Kafka Manage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasks Learn the most critical metrics among Kafka’s operational measurements Explore how Kafka’s stream delivery capabilities make it a perfect source for stream processing systems

Learning Spark SQL

2017-09-07 · O'Reilly SQL Books O'Reilly Amazon

book

by Aurobindo Sarkar

AI/ML Analytics API Big Data Java Python Scala Spark SQL Data Streaming apache spark

"Learning Spark SQL" takes you from data exploration to designing scalable applications with Apache Spark SQL. Through hands-on examples, you will comprehend real-world use cases and gain practical skills crucial for working with Spark SQL APIs, data frames, streaming data, and optimizing Spark applications. What this Book will help me do Understand the principles of Spark SQL and its APIs for building scalable distributed applications. Gain hands-on experience performing data wrangling and visualization using Spark SQL and real-world datasets. Learn how to design and optimize applications for performance and scalability with Spark SQL. Develop the skills to integrate Spark SQL with other frameworks like Apache Kafka for streaming analytics. Master the techniques required to architect machine learning and deep learning solutions using Spark SQL. Author(s) None Sarkar is an experienced technologist and trainer specializing in big data, streaming analytics, and scalable architectures using Apache Spark. With years of practical experience in implementing Spark solutions, Sarkar draws from real-world projects to provide readers with valuable insights. Sarkar's approachable and detailed writing style ensures readers grasp both the theory and the practice of Spark SQL. Who is it for? This book is ideal for software developers, data engineers, and architects aspiring to harness Apache Spark for robust, scalable applications. It suits readers with some SQL querying experience and a basic knowledge of programming in languages like Scala, Java, or Python. Whether you're a Spark newcomer or advancing your capabilities in scalable data processing, this resource will accelerate your learning journey.

Building Data Streaming Applications with Apache Kafka

2017-08-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Manisha Sethi , Anshul Joshi , Chanchal Singh , Manish Kumar

Data Engineering Java Spark Data Streaming data data-engineering streaming-messaging

Learn how to design and build efficient real-time streaming applications using Apache Kafka, a leading distributed streaming platform. This book provides comprehensive guidance on setting up Kafka clusters, developing producers and consumers, and integrating with frameworks like Spark, Storm, and Heron. By the end, you'll master the skills needed to create enterprise-grade data streaming solutions. What this Book will help me do Grasp the core concepts and components of Apache Kafka and its ecosystem. Develop robust Kafka producers and consumers to process real-time data streams. Design and implement streaming applications using Spark, Storm, and Heron. Plan Kafka deployments with a focus on scalability, capacity, and fault tolerance. Ensure secure data streaming with best practices for securing Apache Kafka. Author(s) The authors, None Singh and None Kumar, bring years of expertise in data engineering and distributed systems. Having worked extensively with streaming technologies like Apache Kafka, they aim to share their in-depth knowledge through practical examples and real-world scenarios. Their approach to teaching focuses on making complex concepts easily understandable. Who is it for? This book is ideal for software developers and data engineers who are eager to learn Apache Kafka for building streaming applications. Some experience with programming, particularly Java, will help readers get the most out of the material. If you are working on data-processing systems or looking to enhance your skills in real-time data handling, this book caters to your needs.

Mastering Apache Storm

2017-08-16 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ankit Jain

Big Data Hadoop Apache HBase Java Redis Data Streaming data data-engineering storm streaming-messaging

Mastering Apache Storm is your step-by-step guide to mastering real-time data streaming with this robust framework. You'll learn how to process big data efficiently and integrate Apache Storm with popular technologies like Kafka, HBase, and Redis to maximize its potential. This book walks you through from basic concepts to advanced implementations of Apache Storm in real-world scenarios. What this Book will help me do Understand the core features and operation of Apache Storm for real-time data streaming. Integrate Apache Storm with other Big Data frameworks like Kafka, HBase, Redis, and Hadoop. Effectively deploy and manage multi-node Apache Storm clusters in real-world environments. Monitor and analyze your data streams and system health effectively using built-in and external tools. Learn to implement fault-tolerant, scalable, and distributed stream processing applications in Apache Storm. Author(s) None Jain is an experienced software developer and technical instructor specializing in distributed systems and real-time data processing. With years of experience working with Apache Storm and related technologies, their teachings focus on practical, hands-on learning to equip readers with actionable skills. Who is it for? This book is ideal for Java developers aspiring to build expertise in real-time data streaming and distributed processing applications using Apache Storm. Beginners can start with the fundamentals provided, while those with prior knowledge can delve into intermediate and advanced implementations.

Astronomer with Ry Walker - Episode 6

2017-08-06 · Data Engineering Podcast Listen

podcast_episode

by Ry Walker (Astronomer) , Tobias Macey

Airflow Flink API Astronomer AWS Kinesis AWS Lambda Data Engineering Data Management Docker Druid ELK +12 more

Summary

Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.

Interview

Introduction How did you first get involved in the area of data management? What is Astronomer and how did it get started? Regulatory challenges of processing other people’s data What does your data pipelining architecture look like? What are the most challenging aspects of building a general purpose data management environment? What are some of the most significant sources of technical debt in your platform? Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them? There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management? What are some of the most interesting or unexpected uses of your platform that you are aware of?

Contact Information

Email @rywalker on Twitter

Links

Astronomer Kiss Metrics Segment Marketing tools chart Clickstream HIPAA FERPA PCI Mesos Mesos DC/OS Airflow SSIS Marathon Prometheus Grafana Terraform Kafka Spark ELK Stack React GraphQL PostGreSQL MongoDB Ceph Druid Aries Vault Adapter Pattern Docker Kinesis API Gateway Kong AWS Lambda Flink Redshift NOAA Informatica SnapLogic Meteor

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Apache Spark 2.x for Java Developers

2017-07-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Sourav Gulati (Databricks) , Sumit Kumar

AI/ML Analytics API Big Data CSV Java JSON Scala Spark SQL Data Streaming XML +3 more

Delve into mastering big data processing with 'Apache Spark 2.x for Java Developers.' This book provides a practical guide to implementing Apache Spark using the Java APIs, offering a unique opportunity for Java developers to leverage Spark's powerful framework without transitioning to Scala. What this Book will help me do Learn how to process data from formats like XML, JSON, CSV using Spark Core. Implement real-time analytics using Spark Streaming and third-party tools like Kafka. Understand data querying with Spark SQL and master SQL schema processing. Apply machine learning techniques with Spark MLlib to real-world scenarios. Explore graph processing and analytics using Spark GraphX. Author(s) None Kumar and None Gulati, experienced professionals in Java development and big data, bring their wealth of practical experience and passion for teaching to this book. With a clear and concise writing style, they aim to simplify Spark for Java developers, making big data approachable. Who is it for? This book is perfect for Java developers who are eager to expand their skillset into big data processing with Apache Spark. Whether you are a seasoned Spark user or first diving into big data concepts, this book meets you at your level. With practical examples and straightforward explanations, you can unlock the potential of Spark in real-world scenarios.

Streaming Data

2017-07-05 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Andrew Psaltis

Analytics Flink Spark Data Streaming data data-engineering streaming-architecture streaming-messaging

Streaming Data introduces the concepts and requirements of streaming and real-time data systems. The book is an idea-rich tutorial that teaches you to think about how to efficiently interact with fast-flowing data. About the Technology As humans, we're constantly filtering and deciphering the information streaming toward us. In the same way, streaming data applications can accomplish amazing tasks like reading live location data to recommend nearby services, tracking faults with machinery in real time, and sending digital receipts before your customers leave the shop. Recent advances in streaming data technology and techniques make it possible for any developer to build these applications if they have the right mindset. This book will let you join them. About the Book Streaming Data is an idea-rich tutorial that teaches you to think about efficiently interacting with fast-flowing data. Through relevant examples and illustrated use cases, you'll explore designs for applications that read, analyze, share, and store streaming data. Along the way, you'll discover the roles of key technologies like Spark, Storm, Kafka, Flink, RabbitMQ, and more. This book offers the perfect balance between big-picture thinking and implementation details. What's Inside The right way to collect real-time data Architecting a streaming pipeline Analyzing the data Which technologies to use and when About the Reader Written for developers familiar with relational database concepts. No experience with streaming or real-time applications required. About the Author Andrew Psaltis is a software engineer focused on massively scalable real-time analytics. Quotes The definitive book if you want to master the architecture of an enterprise-grade streaming application. - Sergio Fernandez Gonzalez, Accenture A thorough explanation and examination of the different systems, strategies, and tools for streaming data implementations. - Kosmas Chatzimichalis, Mach 7x A well-structured way to learn about streaming data and how to put it into practice in modern real-time systems. - Giuliano Araujo Bertoti, FATEC This book is all you need to understand what streaming is all about! - Carlos Curotto, Globant

talk-data.com

Activity Trend

Top Events

Top Speakers

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Designing Event-Driven Systems

JavaScript and JSON Essentials - Second Edition

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Complete Guide to Open Source Big Data Stack

Wallaroo with Sean T. Allen - Episode 12

Apache Kafka 1.0 Cookbook

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

Practical Real-time Data Processing and Analytics

Kafka: The Definitive Guide

Learning Spark SQL

Building Data Streaming Applications with Apache Kafka

Mastering Apache Storm

Astronomer with Ry Walker - Episode 6

Apache Spark 2.x for Java Developers

Streaming Data