At Criteo, we’ve relied on automatic aggregations for years. “Automatic aggregation” is the name we give to a system of recording rules that matches most metrics and removes certain dimensions, such as the instance emitting the metric, to reduce cardinality (i.e., the number of metrics) and thus make queries faster. What started as a workaround has become a key part of how we ensure backend stability and reliability at scale, with hundreds of millions of active metrics, all without requiring users to write a single recording rule. It also significantly reduces the cost of metrics storage. Internally, we call this approach zero-effort observability, since most teams never have to write or maintain recording rules. In this talk, Raphael will explain how our approach to automatic aggregations has evolved over time and how we’ve adapted it to fit naturally into our Prometheus-based stack. He will share the different implementations we’ve tried, the lessons we’ve learned, and how our latest version takes advantage of recent improvements in Prometheus (the new type label).
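To make the idea concrete, here is a minimal sketch (not Criteo's actual tooling) of how such "automatic aggregation" rules could be generated: for each metric name, emit a recording rule that sums the series without the per-instance label. The metric names, rule-group name, and naming convention below are illustrative assumptions.

```python
# Hypothetical sketch of generating "automatic aggregation" recording rules.
# For each metric, a rule aggregates away the per-instance dimension so that
# dashboards and alerts can query the cheaper, lower-cardinality series.
import yaml  # pip install pyyaml


def aggregation_rule(metric: str, drop_labels=("instance",)) -> dict:
    """Build one recording rule that sums `metric` without the given labels."""
    without = ", ".join(drop_labels)
    return {
        # The rule-name convention is an assumption; any unique name works.
        "record": f"{metric}:sum_without_{'_'.join(drop_labels)}",
        "expr": f"sum without ({without}) ({metric})",
    }


if __name__ == "__main__":
    # Illustrative metric names, not a real inventory.
    metrics = ["http_requests_total", "process_cpu_seconds_total"]
    group = {
        "groups": [
            {"name": "auto-aggregation", "rules": [aggregation_rule(m) for m in metrics]}
        ]
    }
    print(yaml.safe_dump(group, sort_keys=False))
```

The output is an ordinary Prometheus rule file; the interesting part of the approach described in the talk is doing this automatically for most metrics rather than by hand.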
Prometheus has become the go-to standard for metrics-based monitoring, but as environments grow in complexity and scale, teams often find themselves hitting its operational limits, especially around cardinality and long-term storage. This talk explores how VictoriaMetrics builds on Prometheus fundamentals to offer a more scalable and efficient alternative for teams managing high-ingestion workloads and demanding retention needs, without abandoning the familiar Prometheus ecosystem. I’ll dive into how VictoriaMetrics supports Prometheus-compatible scrape configurations and exporters, allowing seamless integration with existing workflows. The session will showcase practical strategies for setting up and tuning scrape jobs, managing cardinality through label analysis and relabeling, and using VictoriaMetrics’ UI and tools to gain insight into metric usage patterns. This talk is tailored for advanced users eager to push the boundaries of Prometheus-based observability, demonstrating how the core philosophy of Prometheus can be extended and elevated through the integration of high-performance systems like VictoriaMetrics.
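As a taste of the label-analysis workflow mentioned above, here is a small sketch that pulls cardinality statistics from the Prometheus-compatible TSDB status endpoint (also served by VictoriaMetrics) and prints the metric names with the most series. The server URL is an assumption for illustration.

```python
# Sketch: inspect which metric names contribute the most series (cardinality).
# Uses Prometheus' TSDB stats API (/api/v1/status/tsdb); VictoriaMetrics
# exposes a compatible endpoint. The URL below is an illustrative assumption.
import requests

PROM_URL = "http://localhost:9090"  # your Prometheus or vmselect address


def top_metrics_by_series(limit: int = 10) -> None:
    resp = requests.get(f"{PROM_URL}/api/v1/status/tsdb", timeout=10)
    resp.raise_for_status()
    stats = resp.json().get("data", {})
    for entry in stats.get("seriesCountByMetricName", [])[:limit]:
        print(f"{entry['name']}: {entry['value']} series")


if __name__ == "__main__":
    top_metrics_by_series()
```

Findings from this kind of analysis typically feed back into relabeling rules that drop or rewrite the offending labels at scrape time.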
Gardening meets Grafana in this beginner-friendly talk. Marie will walk you through building a complete IoT monitoring pipeline using a soil moisture sensor, Arduino-compatible hardware, and Grafana Cloud: setting up the tech (Arduino, Wi-Fi modules, and soil sensors), pushing metrics to the cloud with Prometheus and Grafana, and designing the dashboard.
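As a rough illustration of the "pushing metrics" step, here is a sketch using the Python prometheus_client and a Pushgateway; the sensor read is stubbed and the gateway address, job name, and label are assumptions (a Grafana Cloud setup would typically use remote_write or an agent instead).

```python
# Sketch: push a soil-moisture reading to a Prometheus Pushgateway.
# The sensor read is stubbed; on the real device it would come from the
# Arduino/ADC. Gateway address and metric names are illustrative assumptions.
import random

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def read_soil_moisture() -> float:
    """Stand-in for the actual sensor read (0.0 = dry, 1.0 = saturated)."""
    return random.random()


def push_reading(gateway: str = "localhost:9091") -> None:
    registry = CollectorRegistry()
    gauge = Gauge(
        "soil_moisture_ratio",
        "Soil moisture as a fraction of saturation",
        ["plant"],
        registry=registry,
    )
    gauge.labels(plant="basil").set(read_soil_moisture())
    push_to_gateway(gateway, job="garden_sensor", registry=registry)


if __name__ == "__main__":
    push_reading()
```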
The Kubernetes observability talk will cover how to monitor, trace, and troubleshoot applications in a Kubernetes environment. It will highlight key tools like Prometheus, Thanos, Grafana, and OpenTelemetry for tracking metrics, logs, and distributed traces. Topics include improving visibility into clusters and microservices, detecting anomalies, and ensuring reliability. The session will focus on best practices for proactive observability and efficient debugging to maintain the health of cloud-native applications.
Julius will give a quick update on the latest developments in Prometheus and then take a deep dive into the new experimental native histograms: Histograms are crucial for anyone who wants to track service latency and other numeric value distributions in Prometheus. However, the existing "legacy" histograms in Prometheus come with a number of painful drawbacks: they require manual and static bucket configuration, generate a separate time series for each configured histogram bucket, and thus require you to make hard tradeoffs between a histogram's resolution and cost. This is where the new "native" histogram metric type comes in. Native histograms allow you to track value distributions in higher detail at a significantly lower storage and processing cost, while also reducing the manual bucket configuration effort. Julius will explain how native histograms work, how they achieve these key benefits, and how you can use them in Prometheus today in an experimental fashion.
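The "legacy" drawback is easy to see in code. In this sketch with the Python client, every bucket boundary becomes its own time series (plus `_sum` and `_count`), which is exactly the resolution-vs-cost tradeoff native histograms are designed to remove; the bucket boundaries and port are illustrative guesses, and at the time of writing native-histogram emission is most mature in the Go client with the `--enable-feature=native-histograms` server flag.

```python
# Sketch: a "legacy" Prometheus histogram in the Python client. Each bucket
# below is exported as a separate http_request_duration_seconds_bucket series.
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)


@REQUEST_LATENCY.time()
def handle_request() -> None:
    time.sleep(0.01)  # stand-in for real work


if __name__ == "__main__":
    start_http_server(8000)  # scrape target on :8000/metrics
    while True:
        handle_request()
```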
Learn how Reddit uses a custom monitoring operator to manage Thanos and Prometheus to scale their metrics deployment beyond 45 million samples per second and 600 million active series. To achieve this they run thousands of Prometheus instances of varying sizes managed by their internally developed Kubernetes controller. They use Thanos for long-term storage and global single pane of glass querying across this massive deployment. Learn about the operator, other tools they've developed, and the challenges they've faced along the way.
Learn Grafana 10.x is your essential guide to mastering the art of data visualization and monitoring through interactive dashboards. Whether you're starting from scratch or updating your knowledge to Grafana 10.x, this book walks you through installation, implementation, data transformation, and effective visualization techniques. What this book will help you do: install and configure Grafana 10.x for real-time data visualization and analytics; create and manage insightful dashboards with Grafana's enhanced features; integrate Grafana with diverse data sources such as Prometheus, InfluxDB, and Elasticsearch; set up dynamic templated dashboards and alerting systems for proactive monitoring; and implement Grafana's user authentication mechanisms for enhanced security. About the author: Salituro is a seasoned expert in data analytics and observability platforms with extensive experience working with time-series data using Grafana. Their practical teaching approach and passion for sharing insights make this book an invaluable resource for both newcomers and experienced users. Who is it for? This book is perfect for business analysts, data visualization enthusiasts, and developers interested in analyzing and monitoring time-series data. Whether you're a newcomer or have some background knowledge, this book offers accessible guidance and advanced tips suitable for all levels. If you're aiming to efficiently build and utilize Grafana dashboards, this is the book for you.
Airflow has a built-in SLA alert mechanism: when the scheduler sees an SLA miss for a task, it sends an alert by email. The problem is that, while this email is nice, it doesn’t tell us when each task eventually succeeds. Moreover, even if a follow-up email were sent upon success after an SLA miss, it still wouldn’t give us a good view of the current status at any given time. To solve this, we developed SLAyer, an application that reads SLA-miss information from Airflow’s database and reports the current status to Prometheus, exposing metrics for each dag, task, and execution date currently in violation of its SLA.
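This is not SLAyer's actual code, but a minimal sketch of what the reporting side could look like with the Python prometheus_client: a labeled gauge per (dag, task, execution date) in violation, with the Airflow-database query stubbed out. Metric and label names are illustrative assumptions.

```python
# Sketch of a SLAyer-style exporter: surface currently violated SLAs as a
# labeled gauge that Prometheus can scrape and alert on. The Airflow-database
# query is stubbed; real code would read the sla_miss / task_instance tables.
import time

from prometheus_client import Gauge, start_http_server

SLA_MISS = Gauge(
    "airflow_task_sla_miss",
    "1 while a task run is past its SLA and not yet successful",
    ["dag_id", "task_id", "execution_date"],
)


def fetch_open_sla_misses():
    """Stand-in for querying Airflow's metadata database."""
    return [("example_dag", "load_table", "2024-01-01T00:00:00")]


def refresh() -> None:
    SLA_MISS.clear()  # drop label sets that are no longer in violation
    for dag_id, task_id, execution_date in fetch_open_sla_misses():
        SLA_MISS.labels(dag_id, task_id, execution_date).set(1)


if __name__ == "__main__":
    start_http_server(9102)
    while True:
        refresh()
        time.sleep(60)
```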
Summary
Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have remained high due to the level of technical understanding and operational capacity required to run it at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Astra platform is and the story behind it?
How does streaming fit into your overall product vision and the needs of your customers?
What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
What are the core use cases that you are aiming to support with Astra Streaming?
Can you describe the architecture and automation of your hosted platform for Pulsar?
What are the integration points that you have built to make it work well with Cassandra?
What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
What is the process for someone to adopt and integrate with your Astra Streaming service?
How do you handle migrating existing projects, particularly if they are using Kafka currently?
One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?
What are the ways that you are engaging with and supporting the Pulsar community?
What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?
What are the most interesting, innovative, or unexpected ways that you have seen Astra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
When is Astra the wrong choice?
What do you have planned for the future of Astra?
Contact Info
Prabhat
LinkedIn @prabhatja on Twitter prabhatja on GitHub
Jonathan
LinkedIn @spyced on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pulsar
Podcast Episode Streamnative Episode
Datastax Astra Streaming Datastax Astra DB Luna Streaming Distribution Datastax Cassandra Kesque (formerly Kafkaesque) Kafka RabbitMQ Prometheus Grafana Pulsar Heartbeat Pulsar Summit Pulsar Summit Presentation on Kafka Connectors Replicated Chaos Engineering Fallout chaos engineering tools Jepsen
Podcast Episode
Jack VanLightly
BookKeeper TLA+ Model
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
Summary
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year
Interview
Introduction
How did you get involved in the area of data management?
Can you refresh our memory about what TimescaleDB is?
How has the market for timeseries databases changed since we last spoke?
What has changed in the focus and features of the TimescaleDB project and company?
Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
What were the most challenging aspects of reaching that goal?
In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
Have you been able to leverage some of the native improvements to simplify your implementation?
Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?
What is in store for the future of the Timescale product and organization?
Contact Info
Ajay
@acoustik on Twitter LinkedIn
Mike
LinkedIn Website @michaelfreedman on Twitter
Timescale
Website Documentation Careers timescaledb on GitHub @timescaledb on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TimescaleDB Original Appearance on the Data Engineering Podcast 1.0 Release Blog Post PostgreSQL
Podcast Interview
RDS DB-Engines MongoDB IOT (Internet Of Things) AWS Timestream Kafka Pulsar
Podcast Episode
Spark
Podcast Episode
Flink
Podcast Episode
Hadoop DevOps PipelineDB
Podcast Interview
Grafana Tableau Prometheus OLTP (Online Transaction Processing) Oracle DB Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?
What are your goals for the platform?
There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
What are the shortcomings of a data lake architecture?
How is Upsolver architected?
How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?
What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?
Contact Info
Yoni
yoniiny on GitHub LinkedIn
Upsolver
Website @upsolver on Twitter LinkedIn Facebook
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver Data Lake Israeli Army Data Warehouse Data Engineering Podcast Episode About Data Curation Three Vs Kafka Spark Presto Drill Spot Instances Object Storage Cassandra Redis Latency Avro Parquet ORC Data Engineering Podcast Episode About Data Serialization Formats SSTables Run Length Encoding CSV (Comma Separated Values) Protocol Buffers Kinesis ETL DevOps Prometheus Cloudwatch DataDog InfluxDB SQL Pandas Confluent KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Summary
As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about TimescaleDB, a scalable timeseries database built on top of PostgreSQL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Timescale is and how the project got started?
The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?
How is Timescale implemented and how has the internal architecture evolved since you first started working on it?
What impact has the 10.0 release of PostgreSQL had on the design of the project?
Is Timescale compatible with systems such as Amazon RDS or Google Cloud SQL?
For someone who wants to start using Timescale what is involved in deploying and maintaining it?
What are the axes for scaling Timescale and what are the points where that scalability breaks down?
Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?
What has been the most challenging aspect of building and marketing Timescale?
When is Timescale the wrong tool to use for time series data?
One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
What are some of the most interesting uses of Timescale that you have seen?
Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
What features or improvements do you have planned for future releases of Timescale?
Contact Info
Ajay
LinkedIn @acoustik on Twitter Timescale Blog
Mike
Website LinkedIn @michaelfreedman on Twitter Timescale Blog
Timescale
Website @timescaledb on Twitter GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Timescale PostGreSQL Citus Timescale Design Blog Post MIT NYU Stanford SDN Princeton Machine Data Timeseries Data List of Timeseries Databases NoSQL Online Transaction Processing (OLTP) Object Relational Mapper (ORM) Grafana Tableau Kafka When Boring Is Awesome PostGreSQL RDS Google Cloud SQL Azure DB Docker Continuous Aggregates Streaming Replication PGPool II Kubernetes Docker Swarm Citus Data
Website Data Engineering Podcast Interview
Database Indexing B-Tree Index GIN Index GIST Index STE Energy Redis Graphite Prometheus pg_prometheus OpenMetrics Standard Proposal Timescale Parallel Copy Hadoop PostGIS KDB+ DevOps Internet of Things MongoDB Elastic DataBricks Apache Spark Confluent New Enterprise Associates MapD Benchmark Ventures Hortonworks 2σ Ventures CockroachDB Cloudflare EMC Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Summary
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.
Interview
Introduction
How did you first get involved in the area of data management?
What is Astronomer and how did it get started?
Regulatory challenges of processing other people’s data
What does your data pipelining architecture look like?
What are the most challenging aspects of building a general purpose data management environment?
What are some of the most significant sources of technical debt in your platform?
Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them?
There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?
What are some of the most interesting or unexpected uses of your platform that you are aware of?
Contact Information
Email @rywalker on Twitter
Links
Astronomer Kiss Metrics Segment Marketing tools chart Clickstream HIPAA FERPA PCI Mesos Mesos DC/OS Airflow SSIS Marathon Prometheus Grafana Terraform Kafka Spark ELK Stack React GraphQL PostGreSQL MongoDB Ceph Druid Aries Vault Adapter Pattern Docker Kinesis API Gateway Kong AWS Lambda Flink Redshift NOAA Informatica SnapLogic Meteor
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
There’s a lot of buzz around performance metrics right now, but the ones measured most often are the DORA metrics: deployment frequency, lead time, mean time to recovery, and change failure rate. The problem is that it can be expensive to collect them and prepare them for presentation automatically. This talk is about a lightweight and open source DORA metrics controller that works with any CI/CD tool and requires only Kubernetes, Prometheus and Grafana to run.
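This is not the controller from the talk, but a sketch of the kind of metric such a controller might export when it observes a finished deployment, so that Grafana can chart deployment frequency with a query like `sum by (service) (increase(deployments_total[1d]))`. Metric names, labels, and the event source are illustrative assumptions.

```python
# Sketch: export a deployment counter that DORA dashboards can be built on.
# The success/failure label also feeds a change-failure-rate panel.
from prometheus_client import Counter, start_http_server

DEPLOYMENTS = Counter(
    "deployments_total",
    "Completed deployments observed by the controller",
    ["service", "outcome"],  # outcome: success | failure
)


def on_deployment_finished(service: str, succeeded: bool) -> None:
    outcome = "success" if succeeded else "failure"
    DEPLOYMENTS.labels(service=service, outcome=outcome).inc()


if __name__ == "__main__":
    start_http_server(8080)  # scrape target on :8080/metrics
    on_deployment_finished("checkout", succeeded=True)  # demo event
    input("Serving /metrics on :8080; press Enter to exit\n")
```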
A hands-on, beginner-friendly dive into the world of MicroPython and real-time data visualization. This interactive session will introduce you to MicroPython and show you how to gather live data from sensors. We'll explore different methods to make that data available to Grafana or Prometheus, and demonstrate a dashboard that visualizes it all. We’ll then take things further by constructing a physical representation of a Grafana panel using MicroPython-powered hardware. Watch your dashboard data come to life in the real world! No programming or Grafana experience required. If you'd (optionally) like to join in the fun, bring a fully charged laptop with a USB A port: Mac users may need to bring their adapter. Come curious and leave inspired by what's possible when hardware meets observability!
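One way to make sensor data available to Prometheus, roughly in the spirit of this session, is to serve it in the text exposition format over a bare socket. The sketch below is standard CPython with a stubbed sensor read; a MicroPython board would use the same idea with its own socket/network modules and an ADC read, so treat the port and metric name as assumptions.

```python
# Sketch: serve a sensor reading in the Prometheus text exposition format
# over a bare socket so a scrape job can collect it.
import random
import socket


def read_sensor() -> float:
    """Stand-in for e.g. an ADC read of a soil-moisture or temperature sensor."""
    return 20.0 + random.random() * 5.0


def serve(port: int = 8000) -> None:
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        conn.recv(1024)  # ignore the request; always answer with metrics
        body = f"sensor_temperature_celsius {read_sensor()}\n"
        conn.sendall(
            b"HTTP/1.0 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\n\r\n"
            + body.encode()
        )
        conn.close()


if __name__ == "__main__":
    serve()
```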
One-step cloud observability approach for AWS with simplified instrumentation for cloud, OTel, and Prometheus data, with AWS engineers and customers sharing best practices for cloud-native observability.
Dive into monitoring modern AWS cloud environments with one-step cloud observability. Learn how to get started faster with simplified instrumentation for cloud data, OpenTelemetry (OTel), and Prometheus data. AWS engineers and joint customers will share best practices for cloud-native observability.