IBM

Distributed In Memory Processing And Streaming With Hazelcast

2020-09-15 · Data Engineering Podcast Listen

podcast_episode

by Dale Kim (Hazelcast) , Tobias Macey

Flink Big Data Cloud Computing Data Engineering Data Management Kubernetes Python Spark Data Streaming

Summary In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get %50 off your first three months after the trial. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today! Your host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Hazelcast is and its origins? What are the benefits and tradeoffs of in-memory computation for data-intensive workloads? What are some of the common use cases for the Hazelcast in memory grid? How is Hazelcast implemented?

How has the architecture evolved since it was first created?

How is the Jet streaming framework architected?

What was the motivation for building it? How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?

How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems? How is the governance of the open source grid and Jet projects handled?

What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?

What is involved in building an application or workflow on top of Hazelcast? What are the common patterns for engineers who are building on top of Hazelcast? What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming? What are the scaling factors for Hazelcast?

What are the edge cases that users should be aware of?

What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used? When is Hazelcast Grid or Jet the wrong choice? What is in store for the future of Hazelcast?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

HazelCast Istanbul Apache Spark OrientDB CAP Theorem NVMe Memristors Intel Optane Persistent Memory Hazelcast Jet Kappa Architecture IBM Cloud Paks Digital Integration Hub (Gartner)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Building A New Foundation For CouchDB

2020-03-17 · Data Engineering Podcast Listen

podcast_episode

by Adam Kocoloski , Tobias Macey

AI/ML Analytics Big Data ClickHouse Cloud Computing Data Engineering Data Lake Data Management DevOps DWH GitHub Kubernetes +3 more

Summary CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and a HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer! Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB

Interview

Introduction How did you get involved in the area of data management? Can you starty by describing what CouchDB is?

How did you get involved in the CouchDB project and what is your current role in the community?

What are the use cases that it is well suited for? Can you share some of the history of CouchDB and its role in the NoSQL movement? How is CouchDB currently architected and how has it evolved since it was first introduced? What have been the benefits and challenges of Erlang as the runtime for CouchDB? How is the current storage engine implemented and what are its shortcomings? What problems are you trying to solve by replatforming on a new storage layer?

What were the selection criteria for the new storage engine and how did you structure the decision making process? What was the motivation for choosing FoundationDB as opposed to other options such as rocksDB, levelDB, etc.?

How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB? How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication? What will the migration path be for people running an existing installation? What are some of the biggest challenges that you are facing in rearchitecting the codebase? What new capabilities will the FoundationDB storage layer enable? What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?

What new capabilities or use cases do you anticipate once this migration is complete?

What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community? What is in store for the future of CouchDB?

Contact Info

LinkedIn @kocolosk on Twitter kocolosk on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Apache CouchDB FoundationDB

Podcast Episode

IBM Cloudant Experimental Particle Physics FPGA == Field Programmable Gate Array Apache Software Foundation CRDT == Conflict-free Replicated Data Type

Podcast Episode

Erlang Riak RabbitMQ Heisenbug Kubernetes Property Based Testing

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

2018-12-10 · Data Engineering Podcast Listen

podcast_episode

by Jean-Georges Perrin (Actian) , Tobias Macey

AI/ML Flink Data Engineering Data Management Databricks ETL/ELT GitHub Hadoop Kafka Microsoft MySQL Spark +3 more

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Spark is?

What are some of the main use cases for Spark? What are some of the problems that Spark is uniquely suited to address? Who uses Spark?

What are the tools offered to Spark users? How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm? For someone building on top of Spark what are the main software design paradigms?

How does the design of an application change as you go from a local development environment to a production cluster?

Once your application is written, what is involved in deploying it to a production environment? What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline? What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments? What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies? What are the limitations of the Spark programming model?

What are the cases where Spark is the wrong choice?

What was your motivation for writing a book about Spark?

Who is the target audience?

What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark? What advice do you have for anyone who is considering or currently using Spark?

Contact Info

@jgperrin on Twitter Blog

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

Apache Spark Spark In Action Book code examples in GitHub Informix International Informix Users Group MySQL Microsoft SQL Server ETL (Extract, Transform, Load) Spark SQL and Spark In Action‘s chapter 11 Spark ML and Spark In Action‘s chapter 18 Spark Streaming (structured) and Spark In Action‘s chapter 10 Spark GraphX Hadoop Jupyter

Podcast Interview

Zeppelin Databricks IBM Watson Studio Kafka Flink

P

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

2018-11-19 · Data Engineering Podcast Listen

podcast_episode

by Fabian Hueske (Data Artisans) , Tobias Macey

Flink Cloud Computing Data Engineering Data Management Dataflow GCP GitHub Hadoop Java Kafka Scala Spark +2 more

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Flink is and how the project got started? What are some of the primary ways that Flink is used? How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?

What are some use cases that Flink is uniquely qualified to handle?

Where does Flink fit into the current data landscape? How is Flink architected?

How has that architecture evolved? Are there any aspects of the current design that you would do differently if you started over today?

How does scaling work in a Flink deployment?

What are the scaling limits? What are some of the failure modes that users should be aware of?

How is the statefulness of a cluster managed?

What are the mechanisms for managing conflicts? What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose? Can state be shared across processes or tasks within a Flink cluster?

What are the comparative challenges of working with bounded vs unbounded streams of data? How do you handle out of order events in Flink, especially as the delay for a given event increases? For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it? What are some of the most challenging or complicated aspects of building and maintaining Flink? What are some of the most interesting or unexpected ways that you have seen Flink used? What are some of the improvements or new features that are planned for the future of Flink? What are some features or use cases that you are explicitly not planning to support? For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?

What do they find most interesting or exciting?

Contact Info

LinkedIn @fhueske on Twitter fhueske on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Flink Data Artisans IBM DB2 Technische Universität Berlin Hadoop Relational Database Google Cloud Dataflow Spark Cascading Java RocksDB Flink Checkpoints Flink Savepoints Kafka Pulsar Storm Scala LINQ (Language INtegrated Query) SQL Backpressure

MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24

2018-03-25 · Data Engineering Podcast Listen

podcast_episode

by Hitoshi Harada (Alpaca) , Christopher Ryan (Alpaca) , Tobias Macey

API Data Engineering Data Management Datadog GitHub Go Hierarchical Data Format

Summary

The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alapaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alapaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data

Interview

Introduction How did you get involved in the area of data management? What was your motivation for creating MarketStore? What are the characteristics of financial time series data that make it challenging to manage? What are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available? With MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?

What is the worst case scenario if there is a total failure in the data store? What guards have you built to prevent such a situation from occurring?

Since MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable? What were the most challenging aspects of building MarketStore and integrating it into the rest of your systems? Motivation for open sourcing the code? What is the next planned major feature for MarketStore, and what use-case is it aiming to support?

Contact Info

Christopher

Email

Hitoshi

Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

MarketStore

GitHub Release Announcement

Alpaca IBM DB2 GreenPlum Algorithmic Trading Backtesting OHLC (Open-High-Low-Close) HDF5 Golang C++ Timeseries Database List InfluxDB JSONRPC Slait CircleCI GDAX

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

talk-data.com

Activity Trend

Top Events

Top Speakers

Distributed In Memory Processing And Streaming With Hazelcast

Building A New Foundation For CouchDB

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24