talk-data.com



This November, we will be bringing you a night of data engineering extravaganza at Snowflake HQ 🙌 ✨ in collaboration with the Snowflake User Group.

6pm: Doors Open

🗣️The Speakers🗣️

Intelligent Snowflake Warehouse Management - Konrad Maliszewski, VP Technology at Crisp (Konrad's LinkedIn)

Konrad's session explores how Crisp automated Snowflake warehouse management to cut costs and boost efficiency. Using their open-source dbt package dbt-macro-polo, the team built a system that automatically selects the right warehouse size for each workload, adapts by environment, and removes manual scaling. The result: predictable performance, lower compute spend (up to 80% reduction), and a smoother developer experience across teams, all achieved through intelligent automation directly in dbt.
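The selection logic described, choosing a warehouse size per workload and capping non-production environments, can be sketched in a few lines. This is a hypothetical Python illustration of the idea only, not the actual dbt-macro-polo package (which is implemented as dbt macros); the thresholds, environment names, and warehouse names are invented.

```python
# Illustrative sketch of per-workload warehouse selection (NOT the
# dbt-macro-polo implementation; all thresholds and names are made up).

# hypothetical mapping: estimated rows to process -> warehouse size
SIZE_THRESHOLDS = [
    (1_000_000, "XSMALL"),
    (100_000_000, "MEDIUM"),
    (float("inf"), "LARGE"),
]

def pick_warehouse(estimated_rows: int, env: str) -> str:
    """Return a warehouse name for a model run.

    Non-production environments are capped at XSMALL so that dev and CI
    runs never burn large-warehouse credits.
    """
    if env != "prod":
        return "WH_XSMALL"
    for threshold, size in SIZE_THRESHOLDS:
        if estimated_rows <= threshold:
            return f"WH_{size}"

print(pick_warehouse(50_000, "dev"))          # -> WH_XSMALL
print(pick_warehouse(5_000_000_000, "prod"))  # -> WH_LARGE
```

In dbt itself this decision would live in a macro that sets the `snowflake_warehouse` model config, so the sizing rule travels with the project rather than with individual engineers.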

From Stadiums to Snowflake: Lessons in Real-Time Data at Global Scale - Sam Malcolm, Head of Architecture & Engineering at Centrus (Sam's LinkedIn)

Sam's session dives into lessons from large-scale live event data systems, handling over 10 billion data points per second for global tours like Beyoncé, Coldplay, and Glastonbury. He connects the extreme demands of real-time analytics and high-performance networking to modern cloud data practices, showing how the same principles of speed, resilience, and precision apply when designing reliable, scalable data platforms today.

Talks finish by 8pm.

You can sign up by responding to this event or via the Snowflake User Group.

🚨IMPORTANT: Please bring a valid form of ID.

See you all on the 18th November 🤩

Happy Networking 🍻

By attending this event, you agree to abide by our rules of conduct:

  • Respect others' opinions.
  • Keep it appropriate - no harassment of any sort.
  • If you see something or have a complaint, please reach out to one of the organisers or email [email protected].
Data Engineers London x Snowflake User Group Nov Event

In collaboration with Data Engineers London

Talks: "Intelligent Snowflake Warehouse Management" (Konrad Maliszewski, Crisp) and "From Stadiums to Snowflake: Lessons in Real-Time Data at Global Scale" (Sam Malcolm, Centrus). Full talk summaries above.

LSUG - Snowflake Warehouse Management & Real-Time Data at Massive Scale

Location: Beekeeper, Hardturmstrasse 181 · Zürich, ZH

Hey there, data enthusiasts and ClickHouse aficionados! We've got some exciting news to share - our next meetup is on the horizon, and it's going to be in Zurich! Get ready for a delightful mix of mind-boggling data tales, insightful conversations, and maybe even a surprise or two up our sleeves.

But here's the deal: to secure your spot, make sure you register ASAP!

Agenda:

  • 6:00 - 6:40 pm - Arrival and Check-in
  • 6:40 - 7:00 pm - Rapidata: Jorge Paravicini, Software Architect, and Luca Strebel, Founder. "Rapidata ClickHouse Journey: Observability with ClickHouse (Logs / Metrics / Traces)"
  • 7:00 - 7:20 pm - Gamera.app: Chris Naegelin, CTO. "Site Analytics: Cookieless, accurate, real-time, and never sampled. Built on ClickHouse"
  • 7:20 - 7:40 pm - GetInData: Krzysztof Zarzycki, CTO and Co-founder at GetInData, a part of Xebia. "Real-Time Analytics: Power of the KFC Stack (Kafka, Flink, and ClickHouse)"
  • 7:40 - 8:00 pm - Zondax: Juan Leni, CEO of Zondax. "What does indexing mean in the blockchain industry?"
  • 8:00 - 8:15 pm - "ClickHouse for Observability: Logs / Metrics / Traces" by Oussama Chakri, ClickHouse Solution Architect
  • 8:15 - 9:00 pm - Food, Snacks, and Conversation

If you are interested in speaking at a future event, please contact [email protected]

GetInData: Krzysztof Zarzycki. In the fast-paced world of modern business, where decisions need to be made at the speed of a drive-thru order, real-time analytics has become essential. This presentation explores the challenges of building real-time analytics systems and demonstrates how Kafka, Flink, and ClickHouse, the "KFC Stack", can transform your data from raw input into actionable insights with the efficiency of a well-oiled kitchen. We'll cover:

  • The critical role of real-time analytics in today's competitive landscape
  • Key technical and business challenges in implementing real-time data systems
  • Real-time Streaming architecture using KFC Stack
  • Case studies illustrating the application of the KFC stack in real-time analytics
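As a rough mental model of the KFC flow above, the toy sketch below uses in-memory stand-ins: a list for the Kafka topic, a generator for the Flink transform, and a dict for the ClickHouse aggregate. It shows only the shape of the pipeline, not any real client API, and the event schema is invented.

```python
# In-memory stand-ins for the three KFC components:
# "Kafka" = durable event log, "Flink" = streaming transform,
# "ClickHouse" = queryable aggregate store.

from collections import defaultdict

topic = [  # "Kafka": an append-only log of raw events (hypothetical schema)
    {"user": "a", "action": "click"},
    {"user": "b", "action": "click"},
    {"user": "a", "action": "purchase"},
]

def transform(events):
    # "Flink": a streaming transform that filters and enriches each event.
    for event in events:
        if event["action"] in {"click", "purchase"}:
            yield {**event, "weight": 5 if event["action"] == "purchase" else 1}

store = defaultdict(int)  # "ClickHouse": a running aggregate keyed by user
for row in transform(topic):
    store[row["user"]] += row["weight"]

print(dict(store))  # -> {'a': 6, 'b': 1}
```

The real stack replaces each stand-in with a durable, horizontally scalable component, but the division of responsibilities (log, transform, aggregate store) is the same.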

This presentation will be of particular interest to data engineers, analysts, IT architects, and anyone interested in using data to make faster and better business decisions. Whether you're a seasoned data professional or just starting your journey, this presentation will provide you with the knowledge and tools to build real-time analytics that deliver insights as quickly as your favorite fast-food order.

Zondax: Juan Leni. An outline of the talk:

  • Context: what indexing means in the blockchain industry
  • About Zondax: what we do
  • Before ClickHouse: how we were using Postgres and Timescale in the past, and what problems we had
  • ClickHouse: how great the change was, how long it took, and some stories
  • Infrastructure: cloud vs. self-hosting, and on-prem infrastructure
  • Technical challenges and nice solutions

Some interesting points we can touch on:

  • Postgres with millions of rows is not even able to count rows
  • Chains of materialized views (up to 8 chained MVs to calculate different statistics)
  • Some of them use an in-memory dictionary to translate values
  • Around 10B rows in the biggest tables
  • ClickHouse2Postgres databases to sync data from Postgres-only apps (for backwards compatibility)
  • Materialized views to generate stats over the ClickHouse2Postgres databases
  • Running on a cluster with three shards
Open Source Real-Time Data Warehouse & Real-Time Analytics

Hey there, data enthusiasts!

The need for real-time BI / analytics / monitoring is growing, whether for customer-facing analytics or for internal teams who require real-time responses to their queries.

This Meetup page is brought to you by Open Source ClickHouse to drive awareness and adoption of Real-Time Data Warehousing / Analytics / BI.

Get ready for a delightful mix of mind-boggling data tales, insightful conversations, and maybe even a surprise or two up our sleeves.

But here's the deal: to secure your spot, make sure you register ASAP!

Location: SQ Europe, Room Corradi, Square de Meeûs 35, 1000 Brussels. 15 min walk from Brussels Gare Central / 8 min by Bus 38. https://silversquare.eu/en/coworking-locations/brussels/european-quarter

Agenda:

  • 6:00 - 6:40 - Arrival and Check-in
  • 6:40 - 7:00 - "ClickHouse at Radisson Hotel Group", Aubry Van Nieuwenborgh, Director, Pulse and Analytics
  • 7:20 - 7:40 - "ClickHouse + Luzmo: Embed impactful insights into your SaaS product in days" by Haroen Vermylen, CTO, Luzmo
  • 7:40 - 8:00 - ClickHouse Roadmap Update + Q&A by ClickHouse
  • 8:00 - 8:30 - Food, Drinks, and Conversation

If you are interested in speaking at this, or a future event, please contact [email protected]

Open Source Real-Time Data Warehouse & Real-Time Analytics

We're excited to host the 4th Budapest dbt Meetup! All data enthusiasts are welcome » let's shape the Hungarian dbt community and enjoy talks, snacks & beer together!

We will be in person at Spot Budapest, an exciting venue in a stunning neighbourhood of downtown Budapest.

dbt Meetups are networking events open to all folks working with data! Talks predominantly focus on community members' experience with dbt; however, you'll also catch presentations on broader topics such as analytics engineering, data stacks, data ops, modeling, testing, and team structures.

🤝Organizer: Hiflylabs

🏠Venue Host: The Spot Budapest, 26 Király Street, Budapest 1061

https://create26.hu/

🍕Catering: beer&snacks

To attend, please read the Required Participation Language for In-Person Events with dbt Labs: https://bit.ly/3QIJXFb

📝Agenda

  • 6:00 - 6:30 | Registration (30 min)
  • 6:30 - 6:35 | Welcome Remarks (5 min)
  • 6:35 - 6:55 | Talk 1 (15-20 min)
  • 6:55 - 7:00 | Q&A (5 min)
  • 7:00 - 7:20 | Talk 2 (15-20 min)
  • 7:20 - 7:25 | Q&A (5 min)
  • 7:25 - 7:30 | Closing Remarks (5 min)
  • 7:30 - 8:30 | Networking - let's have a chat, beer and snacks!

🗣️Presentation #1: How Python on Databricks gives antigravity to dbt?

Zsombor will delve into dbt's second language: Python. Using Databricks as the warehouse, he'll illustrate the capabilities of dbt's Python models. The presentation will show how Python extends dbt's traditional SQL-based function set, highlighting use cases ranging from the practical to the cutting edge of Python models.

Speaker: Zsombor Földesi, Lead Data Engineer, Hiflylabs
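For a taste of the talk's subject: dbt Python models follow a `def model(dbt, session)` convention, where the function receives dbt's runtime context, references upstream models via `dbt.ref`, and returns a DataFrame. The sketch below assumes a pandas-returning setup purely for illustration (on Databricks the DataFrames would normally be Spark ones), and the stub class standing in for dbt's runtime context is hypothetical, included only so the file runs on its own.

```python
# Sketch of a dbt Python model. The model(dbt, session) signature is
# dbt's Python-model convention; the stub below stands in for the
# object dbt injects at runtime, so this file is self-contained.

import pandas as pd

def model(dbt, session):
    # In a real project dbt.ref("orders") returns the upstream relation;
    # here we assume it yields a pandas DataFrame for simplicity.
    orders = dbt.ref("orders")
    summary = (
        orders.groupby("customer_id", as_index=False)["amount"].sum()
        .rename(columns={"amount": "lifetime_value"})
    )
    return summary

# --- hypothetical stand-in for dbt's runtime context ---
class _FakeDbt:
    def ref(self, name):
        return pd.DataFrame(
            {"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]}
        )

result = model(_FakeDbt(), session=None)
print(result)
```

The payoff of the convention is that Python models slot into the same DAG as SQL models: anything that `dbt.ref`s this model sees an ordinary table.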

🗣️Presentation #2: Do you really need real-time data?

Real-time analytics can be useful, but building and maintaining real-time ETL pipelines is often time-consuming. For many scenarios, a batch pipeline implemented using the modern data stack can accomplish the same goal. In this presentation, Ben will share how we build batch and real-time pipelines and how we decide which one is appropriate for different scenarios.

Speaker: Ben Kulcsar, Senior Data Engineer, Hotjar

The language of the presentations is English.

The doors open at 6pm. Presentations begin at 6:30pm. Food and refreshments will be provided.

➡️ Join the dbt Slack community: https://www.getdbt.com/community/

🤝For the best Meetup experience, make sure to join the #local-budapest channel in dbt Slack (https://slack.getdbt.com/).

----------------------------------

dbt is an open source command-line tool that speaks the preferred language of data analysts everywhere—SQL. With dbt, analysts take ownership of the entire analytics engineering workflow, from writing data transformation code to deployment and documentation.

Learn more: https://www.getdbt.com/

Budapest dbt Meetup (in-person)
Yoav Cohen – co-founder and CTO @ Satori, Tobias Macey – host

Summary

As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

  • Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council today.
  • RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state-of-the-art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder
  • Are you tired of dealing with the headache that is the "Modern Data Stack"? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze, but it ends up being anything but that. Setting it up, integrating it, maintaining it - it's all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the "Modern Data Stack", give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.

Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization.

Interview

Introduction

How did you get involved in the area of data management?

Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?

How has the scope and complexity of implementing security controls on data systems changed in recent years?

In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?

What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?

How much of the problem is technical vs. procedural/organizational?

As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)

What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)
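To make the masking-vs-RBAC distinction in these questions concrete, here is a toy Python sketch, not any vendor's implementation: RBAC decides whether a role may read a column in the clear, while masking substitutes a redacted value instead of denying access outright. The roles, columns, and masking rule are all invented for the example.

```python
# Toy illustration of two data-security controls: role-based access
# control (which columns a role may read in the clear) and data masking
# (what a role sees for columns it may not read in the clear).

ROLE_GRANTS = {  # hypothetical role -> columns readable in the clear
    "analyst": {"country", "signup_date"},
    "support": {"country", "signup_date", "email"},
}

def mask_email(value: str) -> str:
    """Redact the local part of an email address."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"

def read_row(row: dict, role: str) -> dict:
    """Return the row as the given role is allowed to see it."""
    readable = ROLE_GRANTS.get(role, set())
    out = {}
    for col, val in row.items():
        if col in readable:
            out[col] = val                 # RBAC: allowed in the clear
        elif col == "email":
            out[col] = mask_email(val)     # masking instead of denial
        else:
            out[col] = "<redacted>"
    return out

row = {"email": "jane.doe@example.com", "country": "DE", "signup_date": "2023-01-05"}
print(read_row(row, "analyst"))
```

In practice, as the questions above suggest, these controls live at different layers: masking and encryption typically sit close to storage or in a query proxy, while RBAC is usually enforced by the warehouse itself.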

What are some of the ways that data security and organizational productivity are at odds with each other?

What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?

What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?

How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use?

How can education about the motivations for different security practices improve compliance and user experience?

What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology?

What are the areas of data security that still need improvements?

Contact Info

Yoav Cohen

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Satori

Podcast Episode

Data Masking

RBAC == Role Based Access Control

ABAC == Attribute Based Access Control

Gartner Data Security Platform Report

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Rudderstack: Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit RudderStack.com/DEP to learn more.

Data Council: Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: dataengineeringpodcast.com/data-council Promo Code: dataengpod20

TimeXtender: TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.

You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.

Go to dataengineeringpodcast.com/timextender today to get started for free!

Support Data Engineering Podcast

AI/ML Analytics CDP Data Engineering Data Management Data Science JavaScript Modern Data Stack Python Cyber Security
Priyendra Deshwal – guest @ NetSpring, Tobias Macey – host

Summary

With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

  • Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council today!
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Your host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what NetSpring is and the story behind it?

What are the activities that constitute "product analytics" and what are the roles/teams involved in those activities?

When teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ?

Can you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries?

How does a warehouse-native approach simplify that effort?

There are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem?

How is the NetSpring platform implemented to be able to best take advantage of modern warehouse technologies and the associated data stacks?

What are the pre-requisites for an organization's infrastructure/data maturity for being able to benefit from NetSpring?

How have the goals and implementation of the NetSpring platform evolved from when you first started working on it?

Can you describe the steps involved in integrating NetSpring with an organization's existing warehouse?

What are the signals that NetSpring uses to understand the customer journeys of different organizations?

How do you manage the variance of the data models in the warehouse while providing a consistent experience for your users?

Given that you are a product organization, how are you using NetSpring to power NetSpring?

What are the most interesting, innovative, or unexpected ways that you have seen NetSpring used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on NetSpring?

When is NetSpring the wrong choice?

What do you have planned for the future of NetSpring?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

NetSpring

ThoughtSpot

Product Analytics

Amplitude

Mixpanel

Customer Data Platform

GDPR

CCPA

Segment

Podcast Episode

Rudderstack

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

TimeXtender: TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.

You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.

Go to dataengineeringpodcast.com/timextender today to get started for free!

Rudderstack:

RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.

RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.

RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you'll never have to worry about API changes again.

Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.

Data Council: Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: dataengineeringpodcast.com/data-council Promo Code: dataengpod20

Support Data Engineering Podcast

AI/ML Amplitude Analytics API CDP Data Engineering Data Lake Data Management Data Science DWH ETL/ELT GDPR/CCPA Mixpanel Python Data Streaming Thoughtspot
Mark Van de Wiel – guest @ Fivetran, Tobias Macey – host

Summary

Data integration from source systems to their downstream destinations is the foundational step for any data product. With the increasing expectation for information to be instantly accessible, the need for reliable change data capture grows. The team at Fivetran have recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture from various sources.
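As background to the episode's topic: one of the simplest approaches to change data capture is cursor-based polling, where each sync fetches only rows modified since a high-water mark; log-based CDC instead reads the database's transaction log, avoiding repeated queries at the cost of more involved setup. The sketch below illustrates only the polling approach with an in-memory table, and the schema is invented.

```python
# Minimal sketch of cursor-based (polling) change data capture: each
# sync asks the source only for rows modified since the last
# high-water mark, then advances the mark.

source_table = [  # hypothetical source rows with a modification timestamp
    {"id": 1, "name": "alpha", "updated_at": 100},
    {"id": 2, "name": "beta", "updated_at": 105},
]

def sync_changes(table, high_water_mark):
    """Return rows changed since the cursor, plus the new cursor value."""
    changed = [r for r in table if r["updated_at"] > high_water_mark]
    new_mark = max((r["updated_at"] for r in changed), default=high_water_mark)
    return changed, new_mark

batch1, mark = sync_changes(source_table, high_water_mark=0)  # initial load
source_table.append({"id": 1, "name": "alpha2", "updated_at": 110})
batch2, mark = sync_changes(source_table, mark)               # only the change
print(len(batch1), len(batch2), mark)  # -> 2 1 110
```

Note one nuance the polling approach cannot see: hard deletes leave no row behind to carry an `updated_at`, which is one reason log-based approaches exist.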

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
  • You wake up to a Slack message from your CEO, who's upset because the company's revenue dashboard is broken. You're told to fix it before this morning's board meeting, which is just minutes away. Enter Metaplane, the industry's only self-serve data observability tool. In just a few clicks, you identify the issue's root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder.
  • Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I'm interviewing Mark Van de Wiel about Fivetran's implementation of change data capture.

Analytics AWS Azure BigQuery CDP Cloud Computing Dashboard Data Engineering Data Lake Data Management Data Quality Databricks ETL/ELT Fivetran GCP Java Kubernetes MongoDB MySQL postgresql Python Scala Snowflake Spark SQL Data Streaming
Adam Kocoloski – guest, Tobias Macey – host

Summary

CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve it for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission-critical project and the work being done to evolve it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show!
  • Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you're a listener for a special offer!
  • Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
  • You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I'm interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what CouchDB is?

How did you get involved in the CouchDB project and what is your current role in the community?

What are the use cases that it is well suited for?
Can you share some of the history of CouchDB and its role in the NoSQL movement?
How is CouchDB currently architected and how has it evolved since it was first introduced?
What have been the benefits and challenges of Erlang as the runtime for CouchDB?
How is the current storage engine implemented and what are its shortcomings?
What problems are you trying to solve by replatforming on a new storage layer?

What were the selection criteria for the new storage engine and how did you structure the decision making process?
What was the motivation for choosing FoundationDB as opposed to other options such as RocksDB, LevelDB, etc.?

How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?
How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?
What will the migration path be for people running an existing installation?
What are some of the biggest challenges that you are facing in rearchitecting the codebase?
What new capabilities will the FoundationDB storage layer enable?
What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?

What new capabilities or use cases do you anticipate once this migration is complete?

What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?
What is in store for the future of CouchDB?

Contact Info

LinkedIn
@kocolosk on Twitter
kocolosk on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Apache CouchDB
FoundationDB

Podcast Episode

IBM Cloudant
Experimental Particle Physics
FPGA == Field Programmable Gate Array
Apache Software Foundation
CRDT == Conflict-free Replicated Data Type

Podcast Episode

Erlang
Riak
RabbitMQ
Heisenbug
Kubernetes
Property Based Testing

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

AI/ML Analytics Big Data ClickHouse Cloud Computing Data Engineering Data Lake Data Management DevOps DWH GitHub IBM Kubernetes NoSQL Snowplow Data Streaming
Kent Graziano – chief technical evangelist @ SnowflakeDB, Tobias Macey – host

Summary Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up-front design to answer valuable questions. In this episode Kent Graziano shares his journey with Data Vault, explains how it allows for an agile approach to data warehousing, and walks through its core principles. If you’re struggling with unwieldy dimensional models, slow-moving projects, or challenges integrating new data sources, then listen in on this conversation and then give Data Vault a try for yourself.
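The "evolving a data model in place" idea rests on Data Vault's separation of concerns: hubs hold just business keys, links hold relationships between hubs, and satellites hold descriptive attributes as insert-only history. New sources and changed attributes become new rows, not destructive restructuring. A minimal sketch of the hub/satellite shape using in-memory SQLite (all table, column, and key names here are invented for the example):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,   -- hash of the business key
    customer_id TEXT NOT NULL,      -- the business key itself
    load_date   TEXT NOT NULL,
    record_src  TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk TEXT NOT NULL REFERENCES hub_customer,
    load_date   TEXT NOT NULL,
    name        TEXT,
    email       TEXT,
    record_src  TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")

def hash_key(business_key: str) -> str:
    """Deterministic surrogate key derived from the business key."""
    return hashlib.md5(business_key.encode()).hexdigest()

hk = hash_key("C-1001")
conn.execute("INSERT INTO hub_customer VALUES (?,?,?,?)",
             (hk, "C-1001", "2020-01-01", "crm"))
# A changed attribute is a new satellite row, never an update in place:
conn.execute("INSERT INTO sat_customer_details VALUES (?,?,?,?,?)",
             (hk, "2020-01-01", "Ada", "ada@example.com", "crm"))
conn.execute("INSERT INTO sat_customer_details VALUES (?,?,?,?,?)",
             (hk, "2020-02-01", "Ada", "ada@newmail.com", "crm"))
rows = conn.execute(
    "SELECT load_date, email FROM sat_customer_details ORDER BY load_date"
).fetchall()
print(rows)
```

Because the satellite keeps full history keyed by load date, "what did we know about this customer on a given day" is a plain query, which is part of what makes the approach agile: a new source system just adds its own satellite rather than forcing a redesign of existing tables.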

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?

What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?

What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or o

Agile/Scrum AI/ML Analytics Big Data ClickHouse Data Engineering Data Management Data Modelling Data Vault DevOps DWH Kubernetes Snowflake Data Streaming
Willy Lulciuc – guest @ WeWork, Julien Le Dem – creator of Parquet, Tobias Macey – host

Summary Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
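The core job of a metadata repository like the one described above is small to state: register datasets, record which job runs read and wrote which datasets, and then answer lineage and discovery questions from those records. A toy in-memory sketch of that idea (illustrative only; Marquez itself is a REST service backed by PostgreSQL, and the class and dataset names below are made up):

```python
from collections import defaultdict

class MetadataRepo:
    """Toy metadata repository: datasets, job runs, and lineage."""

    def __init__(self):
        self.datasets = {}                # dataset name -> description
        self.writers = defaultdict(set)   # dataset -> jobs that write it
        self.reads = defaultdict(set)     # job -> datasets it reads

    def register_dataset(self, name, description=""):
        self.datasets[name] = description

    def record_run(self, job, inputs, outputs):
        """Record one job run's input and output datasets."""
        for ds in inputs:
            self.reads[job].add(ds)
        for ds in outputs:
            self.writers[ds].add(job)

    def lineage(self, dataset):
        """All datasets this dataset transitively depends on."""
        seen, stack = set(), [dataset]
        while stack:
            for job in self.writers[stack.pop()]:
                for src in self.reads[job]:
                    if src not in seen:
                        seen.add(src)
                        stack.append(src)
        return seen

repo = MetadataRepo()
for name in ("raw.events", "staging.events", "analytics.daily_counts"):
    repo.register_dataset(name)
repo.record_run("clean_events", ["raw.events"], ["staging.events"])
repo.record_run("daily_rollup", ["staging.events"], ["analytics.daily_counts"])
print(sorted(repo.lineage("analytics.daily_counts")))
# → ['raw.events', 'staging.events']
```

The value discussed in the episode comes from integrating this bookkeeping into the tools that already run the jobs (e.g. Airflow), so the lineage graph stays current without anyone maintaining it by hand.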

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities, it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Marquez is?

What was missing in existing metadata management platforms that necessitated the creation of Marquez?

How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?

How does it compare to the Amundsen platform that Lyft recently released?

What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
What are the primary resource types that you support in Marquez?

What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?

Can you explain how Marquez is architected and how the design has evolved since you first began working on it?

Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?

What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?

How is the metadata itself stored and managed in Marquez?

How much up-front data modeling is necessary and what types of schema representations are supported?

Can you talk through the overall workflow of someone using Marquez in their environment?

What is involved in registering and updating datasets?
How do you define and track the health of a given dataset?
What are some of the interesting questions that can be answered from the information stored in Marquez?

What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
When is Marquez the wrong choice for a metadata repository?
What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem

@J_ on Twitter
Email
julienledem on GitHub

Willy

LinkedIn
@wslulciuc on Twitter
wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Marquez

DataEngConf Presentation

WeWork
Canary
Yahoo
Dremio
Hadoop
Pig
Parquet

Podcast Episode

Airflow Apache Atlas Amundsen

Podcast Episode

Uber DataBook
LinkedIn DataHub
Iceberg Table Format

Podcast Episode

Delta Lake

Podcast Episode

Great Expectations data pipeline unit testing framework

Podcast.init Episode

Redshift
SnowflakeDB

Podcast Episode

Apache Kafka Schema Registry

Podcast Episode

Open Tracing
Jaeger
Zipkin
DropWizard Java framework
Marquez UI
Cayley Graph Database
Kubernetes
Marquez Helm Chart
Marquez Docker Container
Dagster

Podcast Episode

Luigi
DBT

Podcast Episode

Thrift
Protocol Buffers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

AI/ML Airflow Analytics Big Data Dagster Data Engineering Data Management Data Modelling Data Quality Google Dataform dbt Delta Docker Dremio DWH ETL/ELT Git GitHub Hadoop Iceberg Java Kafka Kubernetes Luigi Parquet postgresql Protobuf Python Redshift SQL Data Streaming