Topic: API (Application Programming Interface)
Tags: integration, software_development, data_exchange
Activities (filtered by: Tobias Macey)

Summary

The data that is used in financial markets is time-oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable, the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.
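For a sense of how data in MarketStore is typically retrieved, here is a minimal sketch using the pymarketstore Python client against a locally running instance; the symbol, timeframe, and attribute group are illustrative, and exact client method names may vary between versions.

```python
import pymarketstore as pymkts

# Connect to a local MarketStore instance (default local endpoint).
client = pymkts.Client()

# Ask for the last 10 one-minute OHLCV bars for an illustrative symbol.
params = pymkts.Params('BTC', '1Min', 'OHLCV', limit=10)
reply = client.query(params)

# Results are keyed by symbol/timeframe/attribute group; convert to a DataFrame.
df = reply.first().df()
print(df.tail())
```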

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data.

Interview

Introduction
How did you get involved in the area of data management?
What was your motivation for creating MarketStore?
What are the characteristics of financial time series data that make it challenging to manage?
What are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available?
With MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?

What is the worst case scenario if there is a total failure in the data store? What guards have you built to prevent such a situation from occurring?

Since MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable?
What were the most challenging aspects of building MarketStore and integrating it into the rest of your systems?
Motivation for open sourcing the code?
What is the next planned major feature for MarketStore, and what use-case is it aiming to support?

Contact Info

Christopher

Email

Hitoshi

Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

MarketStore

GitHub Release Announcement

Alpaca IBM DB2 GreenPlum Algorithmic Trading Backtesting OHLC (Open-High-Low-Close) HDF5 Golang C++ Timeseries Database List InfluxDB JSONRPC Slait CircleCI GDAX

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.
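As a rough illustration of the core Elasticsearch workflow described here, the sketch below indexes a document and runs a full-text query with the official Python client; the index and field names are made up for the example, and the keyword arguments shown are those of the 8.x client (older versions use a single body argument).

```python
from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch node.
es = Elasticsearch("http://localhost:9200")

# Index a document; Elasticsearch creates the index on first write if it does not exist.
es.index(index="articles", id="1", document={
    "title": "Getting started with the Elastic Stack",
    "tags": ["search", "analytics"],
})

# Full-text search against the title field.
resp = es.search(index="articles", query={"match": {"title": "elastic"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```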

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems.

Interview

Introduction
How did you get involved in the area of data management?
The Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?
Beyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?
What are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?
What do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?
What are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise?
What are the biggest challenges facing Elastic as a company in the near to medium term?
Open source as a business model: https://www.elastic.co/blog/doubling-down-on-open?utm_source=rss&utm_medium=rss
What is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?

Contact Info

@xeraa on Twitter xeraa on GitHub Website Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Elastic Vienna – Capital of Austria What Is Developer Advocacy? NoSQL MongoDB Elasticsearch Cassandra Neo4J Hazelcast Apache Lucene Logstash Kibana Beats X-Pack ELK Stack Metrics APM (Application Performance Monitoring) GeoJSON Split Brain Elasticsearch Ingest Nodes PacketBeat Elastic Cloud Elasticon Kibana Canvas SwiftType

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both models, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
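To make the pub-sub model concrete, here is a minimal sketch using the pulsar-client Python library against a local broker; the topic and subscription names are placeholders for the example.

```python
import pulsar

# Connect to a local Pulsar broker.
client = pulsar.Client('pulsar://localhost:6650')

# Producer side: publish a message to a topic.
producer = client.create_producer('example-topic')
producer.send('hello pulsar'.encode('utf-8'))

# Consumer side: attach a named subscription, receive, and acknowledge.
# Acknowledged messages can be removed or retained by the broker, per retention policy.
consumer = client.subscribe('example-topic', subscription_name='example-subscription')
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```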

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA, March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%.
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%.
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pulsar is and what the original inspiration for the project was?
What have been some of the most challenging aspects of building and promoting Pulsar?
For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?
One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?
When is Pulsar the wrong tool to use?
What are some of the improvements or new features that you have planned for the future of Pulsar?

Contact Info

Matteo

merlimat on GitHub @merlimat on Twitter

Rajan

@dhabaliaraj on Twitter rhabalia on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar Publish-Subscribe Yahoo Streamlio ActiveMQ Kafka Bookkeeper SLA (Service Level Agreement) Write-Ahead Log Ansible Zookeeper Pulsar Deployment

Summary

Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.
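Apache Airflow shows up in the links for this episode and is one of the workflow engines the Astronomer platform runs for its users; as a hedged illustration of the kind of pipeline definition involved, here is a minimal Airflow DAG sketch with made-up task names, using Airflow 2.x syntax.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: in a real pipeline this would pull data from a source system.
    return {"rows": 42}


def load():
    # Placeholder: in a real pipeline this would write to a warehouse or data lake.
    print("loading")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```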

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.

Interview

Introduction
How did you first get involved in the area of data management?
What is Astronomer and how did it get started?
Regulatory challenges of processing other people’s data
What does your data pipelining architecture look like?
What are the most challenging aspects of building a general purpose data management environment?
What are some of the most significant sources of technical debt in your platform?
Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them?
There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?
What are some of the most interesting or unexpected uses of your platform that you are aware of?

Contact Information

Email @rywalker on Twitter

Links

Astronomer Kiss Metrics Segment Marketing tools chart Clickstream HIPAA FERPA PCI Mesos Mesos DC/OS Airflow SSIS Marathon Prometheus Grafana Terraform Kafka Spark ELK Stack React GraphQL PostgreSQL MongoDB Ceph Druid Aries Vault Adapter Pattern Docker Kinesis API Gateway Kong AWS Lambda Flink Redshift NOAA Informatica SnapLogic Meteor

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

If you like the features of Cassandra but wish it ran faster with fewer resources, then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.
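Because Scylla speaks CQL and the Cassandra wire protocol, existing Cassandra drivers connect to it unchanged; the sketch below uses the DataStax Python driver (cassandra-driver) against a local node, with a made-up keyspace and table for illustration.

```python
from uuid import uuid4

from cassandra.cluster import Cluster

# The standard Cassandra driver works against Scylla because Scylla implements the CQL protocol.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY, payload text)
""")

# Parameterized insert and a simple read back.
session.execute("INSERT INTO demo.events (id, payload) VALUES (%s, %s)", (uuid4(), 'hello scylla'))
for row in session.execute("SELECT id, payload FROM demo.events LIMIT 5"):
    print(row.id, row.payload)

cluster.shutdown()
```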

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Eyal Gutkind about ScyllaDB.

Interview

Introduction
How did you get involved in the area of data management?
What is ScyllaDB and why would someone choose to use it?
How do you ensure sufficient reliability and accuracy of the database engine?
The large draw of Scylla is that it is a drop-in replacement for Cassandra with faster performance and no requirement to manage the JVM. What are some of the technical and architectural design choices that have enabled you to do that?
Deployment and tuning
What challenges are introduced as a result of needing to maintain API compatibility with a different product?
Do you have visibility or advance knowledge of what new interfaces are being added to the Apache Cassandra project, or are you forced to play a game of keep up?
Are there any issues with compatibility of plugins for CassandraDB running on Scylla?
For someone who wants to deploy and tune Scylla, what are the steps involved?
Is it possible to join a Scylla cluster to an existing Cassandra cluster for live data migration and zero downtime swap?
What prompted the decision to form a company around the database?
What are some other uses of Seastar?

Keep in touch

Eyal

LinkedIn

ScyllaDB

Website @ScyllaDB on Twitter GitHub Mailing List Slack

Links

Seastar Project DataStax XFS TitanDB OpenTSDB KairosDB CQL Pedis

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast