Topic

JSON Schema

schemas data_validation data_modelling

Activities: 2 tagged

Activity Trend: 1 peak/qtr, 2020-Q1 to 2026-Q2

Top Events

O'Reilly Data Engineering Books: 3; Data Engineering Podcast: 2; dbt Coalesce 2022: 1; Data + AI Summit 2025: 1; dbt Coalesce 2023: 1

Top Speakers

Tobias Macey: 2; Dattatraya Walake (Databricks): 1; Mark Woolley: 1; Jan Steeman (ArangoDB): 1; Andy Armstrong: 1; Nigel Williams: 1; Rufus Credle: 1; Jan Stücke (ArangoDB): 1; Matthew Wilson: 1; Tom Marrs: 1; Geoff Pirie: 1; Inderpal Singh: 1

Activities

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

2018-06-04 · Data Engineering Podcast

podcast_episode

by Jan Steeman (ArangoDB) , Jan Stücke (ArangoDB) , Tobias Macey

API Data Engineering Data Management Data Modelling GitHub JSON Cyber Security postgresql

Summary

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports document, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.

Interview

Introduction

How did you get involved in the area of data management?

Can you give a high level description of what ArangoDB is and the motivation for creating it?

What is the story behind the name?

How is ArangoDB constructed?

How does the underlying engine store the data to allow for the different ways of viewing it?

What are some of the benefits of multi-model data storage?

When does it become problematic?

For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?

How does it compare to OrientDB?

What are the options for scaling a running system?

What are the limitations in terms of network architecture or data volumes?

One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?

What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?

What are some of the most interesting or surprising uses of this functionality that you have seen?

What are some of the most challenging technical and business aspects of building and promoting ArangoDB?

What do you have planned for the future of ArangoDB?

Contact Info

Jan Steemann

jsteemann on GitHub

@steemann on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

ArangoDB, Köln, Multi-model Database, Graph Algorithms, Apache 2, C++, ArangoDB Foxx, Raft Protocol, Target Partners, RocksDB, AQL (ArangoDB Query Language), OrientDB, PostgreSQL, OrientDB Studio, Google Spanner, 3-Tier Architecture, Thomson-Reuters, Arango Search, Dell EMC, Google S2 Index, ArangoDB Geographic Functionality, JSON Schema

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

2017-06-18 · Data Engineering Podcast

podcast_episode

by Tobias Macey , Justin Cunningham (Yelp)

Flink Avro Beam BI Data Engineering Data Management ETL/ELT JSON Kafka Linux Protobuf Redshift

Summary

Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Justin Cunningham about Yelp’s data pipeline.

Interview with Justin Cunningham

Introduction

How did you get involved in the area of data engineering?

Can you start by giving an overview of your pipeline and the type of workload that you are optimizing for?

What are some of the dead ends that you experienced while designing and implementing your pipeline?

As you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?

What are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?

What are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?

While you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?

Did you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed?

What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?

What advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project?

Keep in touch

Yelp Engineering Blog

Email

Links

Kafka, Redshift, ETL, Business Intelligence, Change Data Capture, LinkedIn Data Bus, Apache Storm, Apache Flink, Confluent, Apache Avro, Game Days, Chaos Monkey, Simian Army, PaaSta, Apache Mesos, Marathon, SignalFX, Sensu, Thrift, Protocol Buffers, JSON Schema, Debezium, Kafka Connect, Apache Beam
