talk-data.com

Topic

Snowplow

behavioral_data analytics data_collection web_analytics digital_analytics

27 tagged

Activity Trend: peak of 4 activities per quarter, 2020-Q1 to 2026-Q1

Activities

27 activities · Newest first

AWS re:Invent 2025 - Keynote Customer - Condé Nast

Sanjay Bhakta details Condé Nast's complete digital reinvention by migrating 800+ properties to AWS infrastructure with partners like Databricks and Snowplow, transforming from data-rich/insights-poor to cloud-native, personalized content delivery.


In this session, we’ll walk through why we decided to abandon Redshift, driven by the need for elasticity, cost efficiency, and faster iteration. We'll also discuss how we designed a structured migration blueprint, including a vendor evaluation, proof of concept, and a four-pillared framework to guide our journey.

We’ll then dive into the areas where things got turbulent: why off-the-shelf translation tools failed, why LLMs didn’t help, and the specific technical challenges we encountered, from surrogate key issues to Stitch's unflattened data and Snowplow’s monolithic event tables.

We’ll explain how we kept the migration on track by turning our analysts into co-navigators, embedding them into validation loops to leverage domain knowledge, and how a parallel ingestion strategy helped stabilize progress during high-risk phases.

Finally, we’ll share what this migration unlocked for us: faster ingestion, better permission management and more. Expect real-world pitfalls, lessons learned, and actionable insights for your own migration journey.

We are entering the Era of Experience, where AI agents will transform customer journeys by learning directly from interactions. But most customer-facing agents today are “senseless,” lacking the real-time context needed to deliver relevant, empathetic, and valuable experiences. This session will explore how real-time streaming architectures and proprietary customer data can power the next generation of intelligent, perceptive agents.

Join Snowplow’s Jon Su as he unpacks:

  • Why brands risk commoditization if they rely on third-party agents
  • How real-time context enables smarter, more personalized customer interactions
  • The key ingredients for building agents that perceive, adapt, and self-optimize
  • How Snowplow Signals provides the real-time customer intelligence foundation for agentic applications

Discover how to shift from static personalization to adaptive, agent-driven experiences that improve customer satisfaction, loyalty, and business outcomes.

AI-powered development tools are accelerating development across the board, and analytics event implementation is no exception. Used carelessly, though, they can easily create organizational chaos. Same company, same prompt, completely different schemas: data teams can’t analyze what should be identical events across platforms.

The infrastructure assumptions that worked when developers shipped tracking changes in sprint cycles or quarters are breaking when they ship them multiple times per day. Schema inconsistency, cost surprises from experimental traffic, and trust erosion in AI-generated code are becoming the new normal.

Josh will demonstrate how Snowplow’s MCP (Model Context Protocol) server and data-structure toolchains enable teams to harness AI development speed while maintaining data quality and architectural consistency. Using Snowplow’s production approach of AI-powered design paired with deterministic implementation, teams get rapid iteration without the hallucination bugs that plague direct AI code generation.

Key Takeaways:

• How AI development acceleration is fragmenting analytics schemas within organizations

• Architectural patterns that separate AI creativity from production reliability

• Real-world implementation using MCP, Data Products, and deterministic code generation

In this episode, we explore how public media can build scalable, transparent, and mission-driven data infrastructure - with Emilie Nenquin, Head of Data & Intelligence at VRT, and Stijn Dolphen, Team Lead & Analytics Engineer at Dataroots.

Emilie shares how she architected VRT’s data transformation from the ground up: evolving from basic analytics to a full-stack data organization with 45+ specialists across engineering, analytics, AI, and user management. We dive into the strategic shift from Adobe Analytics to Snowplow, and what it means to own your data pipeline in a public service context. Stijn joins to unpack the technical decisions behind VRT’s current architecture, including real-time event tracking, metadata modeling, and integrating 70+ digital platforms into a unified ecosystem.

💡 Topics include:

• Designing data infrastructure for transparency and scale
• Building a modular, privacy-conscious analytics stack
• Metadata governance across fragmented content systems
• Recommendation systems for discovery, not just engagement
• The circular relationship between data quality and AI performance
• Applying machine learning in service of cultural and civic missions

Whether you're leading a data team, rethinking your stack, or exploring ethical AI in media, this episode offers practical insights into how data strategy can align with public value.

Leveling Up Gaming Analytics: How Supercell Evolved Player Experiences With Snowplow and Databricks

In the competitive gaming industry, understanding player behavior is key to delivering engaging experiences. Supercell, creators of Clash of Clans and Brawl Stars, faced challenges with fragmented data and limited visibility into user journeys. To address this, they partnered with Snowplow and Databricks to build a scalable, privacy-compliant data platform for real-time insights. By leveraging Snowplow’s behavioral data collection and Databricks’ Lakehouse architecture, Supercell achieved:

• Cross-platform data unification: A unified view of player actions across web, mobile and in-game
• Real-time analytics: Streaming event data into Delta Lake for dynamic game balancing and engagement
• Scalable infrastructure: Supporting terabytes of data during launches and live events
• AI & ML use cases: Churn prediction and personalized in-game recommendations

This session explores Supercell’s data journey and AI-driven player engagement strategies.

Sponsored by: Snowplow | Snowplow Signals: Powering Tomorrow’s Customer Experiences on Databricks

The web is on the verge of a major shift. Agentic applications will redefine how customers engage with digital experiences—delivering highly personalized, relevant interactions. In this talk, Snowplow CTO Yali Sassoon explores how Snowplow Signals enables agents to perceive users through short- and long-term memory, natively on the Databricks Data Intelligence Platform.

In his keynote talks at the Snowflake and Databricks Summits this year, Jensen Huang, the founder and CEO of NVIDIA, talked at length about how, to compete today, organizations have to build data flywheels: they take their proprietary business data, use AI on that data to build proprietary intelligence, use that insight to build proprietary products and services that their customers love, and use those to create more proprietary data to feed AIs to build more proprietary intelligence, and so on.

But what does this mean in practice? Jensen's example of NVIDIA is intriguing, but how can the rest of us build data flywheels in our own organizations? What practical steps can we take?

In this talk, Yali Sassoon, Snowplow cofounder and CPO, will start to answer these questions, drawing on examples from Snowplow customers in retail, media and technology that have successfully built customer data flywheels on top of their proprietary first-party customer data.

Embrace First-Party Customer Data for Marketing and Advertising using Data Cleanrooms

The digital marketing and advertising industry is going through revolutionary change in 2023, with technical, organisational, cultural and regulatory overhaul. As a result, measuring digital advertising effectiveness or coordinating and running highly targeted and effective ad campaigns is becoming more challenging than ever.

First-party customer behavioral data gives organizations a true competitive advantage and the ability to outperform their peers in the battle for customer attention and brand loyalty.

However, first party customer data is still used sparingly across the digital ad ecosystem, and there are few tools or frameworks to allow advertisers to unlock the value in what first party data they have.

This session will show you how Snowplow allows organizations to deeply understand their users' behavior and intent by creating the best quality behavioral data. It will also explain that when this is combined with the Databricks Lakehouse and data clean rooms, brands can now unlock insights that were previously unachievable, and activate their first party customer behavioral data into highly effective, personalized and creative ad campaigns.

In this session you will learn:

• Why first party data can be the ultimate in competitive advantage for digital advertisers
• How data clean rooms combined with Snowplow behavioral data enable better insights and more impactful ad targeting
• What specific marketing and advertising use cases are possible when utilizing a data clean room on top of the Databricks Lakehouse

Talk by: Jordan Peck

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: Snowplow | Revolutionize Your Customer Engagement Strategy w/ First-Party Customer Data

In today's highly competitive market, personalized experiences are the key to winning customer engagement and loyalty. But how can you deliver these experiences at scale? The answer lies in a single unified view of your customers, powered by rich first-party customer data. With complete 360 visibility into your customer's journey, you can predict their next best action and deliver the most relevant experience based on their unique needs and behaviors.

Join this session to learn how to unlock the full potential of your first-party customer data by removing technology barriers so your data team can collaborate seamlessly with your marketing team. Learn how to create a data-driven next-best action (NBA) strategy by building solutions that will set you apart in the competitive landscape and captivate your customers at every touchpoint.

In this session, you'll discover:

• The critical importance of personalized experiences in today's hyper-competitive market
• Proven strategies for building a data-driven NBA approach that drives results
• A live demo of how Snowplow and Databricks can be combined to produce powerful ML models for NBA, revolutionizing your customer data strategy
• Best practices for fostering strong collaboration between marketing and data teams to achieve business outcomes and deliver next-gen customer experiences

Don't miss out on this opportunity to unlock the full potential of your first-party customer data and revolutionize your customer engagement strategy.

Talk by: Yali Sassoon


As organizations of all sizes continuously look to drive value out of data, the modern data stack has emerged as a clear solution for getting insights into the hands of the organization. With the rapid pace of innovation not slowing down, the tools within the modern data stack have enabled data teams to drive faster insights, collaborate at scale, and democratize data knowledge. However, are tools alone enough to drive business value with data?

In the first of our four RADAR 2023 sessions, we look at the key drivers of value within the modern data stack through the minds of Yali Sassoon and Barr Moses.

Yali Sassoon is the Co-Founder and Chief Strategy Officer at Snowplow Analytics, a behavioral data platform that empowers data teams to solve complex data challenges. At Snowplow, Yali gets to combine his love of building things with his fascination with the ways in which people use data to reason. Barr Moses is CEO & Co-Founder of Monte Carlo. Previously, she was VP Customer Operations at customer success company Gainsight, where she helped scale the company 10x in revenue and, among other functions, built the data/analytics team.

Listen in as Yali and Barr outline how data leaders can drive value creation with data in 2023.

With the increasing rate at which new data tools and platforms are being created, the modern data stack risks becoming just another buzzword data leaders use when talking about how they solve problems.

Alongside the arrival of new data tools is the need for leaders to see beyond just the modern data stack and think deeply about how their data work can align with business outcomes; otherwise, they risk falling behind trying to create value from innovative but irrelevant technology.

In this episode, Yali Sassoon joins the show to explore what the modern data stack really means, how to rethink the modern data stack in terms of value creation, data collection versus data creation, and the right way businesses should approach data ingestion, and much more.

Yali is the Co-Founder and Chief Strategy Officer at Snowplow Analytics, a behavioral data platform that empowers data teams to solve complex data challenges. Yali is an expert in data with a background in both strategy and operations consulting, teaching companies how to use data properly to evolve their operations and improve their results.

Summary

A lot of the work that goes into data engineering is trying to make sense of the "data exhaust" from other applications and services. There is an undeniable amount of value and utility in that information, but it also introduces significant cost and time requirements. In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms.

Announcements

• Hello and welcome to the Data Engineering Podcast, the show about modern data management
• When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
• Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
• Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
• Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline

Competitive advantage hinges on predictive insights generated from AI! Build powerful data-driven

AI is central to unlocking competitive advantage. However, data science teams don’t have access to the consistently high-quality data required to build AI & ML data applications.

Instead, data scientists spend 80% of their time collecting, cleaning & preparing the data for analysis rather than building AI-data applications.

During this talk Snowplow introduces the concept of data creation. Create & deploy high-quality & predictive behavioral data in real-time to Databricks.

Learn how being equipped with AI-ready data in Databricks allows data science teams to focus on building AI data applications rather than data wrangling—dramatically accelerating the pace of data projects & improving model performance & managing data governance.

• How to execute more AI & data intensive applications in production using Databricks & Snowplow
• How to execute on each AI & data intensive application faster thanks to pre-validated & predictive data
• How data creation can solve for data governance


Summary

The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.

Announcements

• Hello and welcome to the Data Engineering Podcast, the show about modern data management
• What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
• When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
• You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
• Your host is Tobias Macey and today I’m interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures

Interview

• Introduction
• How did you get involved in the area of data management?
• Can you start by giving an overview of Redpoint Ventures and your role there?
• From an investor perspective, what is most appealing about the category of data-oriented businesses?
• What are the main sources of information that you rely on to keep up to date with what is happening in the data industry?
• What is your personal heuristic for determining the relevance of any given piece of information to decide whether it is worthy of further investigation?
• As someone who works closely with a variety of companies across different industry verticals and different areas of focus, what are some of the common trends that you have identified in the data ecosystem?
• In your article that covers the trends you are keeping an eye on for 2020 you call out 4 in particular, data quality, data catalogs, observability of what influences critical business indicators, and streaming data. Taking those in turn:
• What are the driving factors that influence data quality, and what elements of that problem space are being addressed by the companies you are watching?
• What are the unsolved areas that you see as being viable for newcomers?
• What are the challenges faced by businesses in establishing and maintaining data catalogs?
• What approaches are being taken by the companies who are trying to solve this problem?
• What shortcomings do you see in the available products?
• For gaining visibility into the forces that impact the key performance indicators (KPI) of businesses, what is lacking in the current approaches?
• What additional information needs to be tracked to provide the needed context for making informed decisions about what actions to take to improve KPIs?
• What challenges do businesses in this observability space face to provide useful access and analysis to this collected data?
• Streaming is an area that has been growing rapidly over the past few years, with many open source and commercial options. What are the major business opportunities that you see to make streaming more accessible and effective?
• What are the main factors that you see as driving this growth in the need for access to streaming data?
• With your focus on these trends, how does that influence your investment decisions and where you spend your time?
• What are the unaddressed markets or product categories that you see which would be lucrative for new businesses?
• In most areas of technology now there is a mix of open source and commercial solutions to any given problem, with varying levels of maturity and polish between them. What are your views on the balance of this relationship in the data ecosystem?
• For data in particular, there is a strong potential for vendor lock-in which can cause potential customers to avoid adoption of commercial solutions. What has been your experience in that regard with the companies that you work with?

Contact Info

• @AstasiaMyers on Twitter
• @astasia on Medium
• LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

• Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used.
• Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
• If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
• To help other people find the show please leave a review on iTunes and tell your friends and co-workers
• Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

• Redpoint Ventures
• 4 Data Trends To Watch in 2020
• Seagate
• Western Digital
• Pure Storage
• Cisco
• Cohesity
• Looker (Podcast Episode)
• DGraph (Podcast Episode)
• Dremio (Podcast Episode)
• SnowflakeDB (Podcast Episode)
• ThoughtSpot
• Tibco
• Elastic
• Splunk
• Informatica
• Data Council
• DataCoral
• Mattermost
• Bitwarden
• Snowplow (Podcast Interview; Interview About Snowplow Infrastructure)
• CHAOSSEARCH (Podcast Episode)
• Kafka Streams
• Pulsar (Podcast Interview; Followup Podcast Interview)
• Soda
• Toro
• Great Expectations
• Alation
• Collibra
• Amundsen
• DataHub
• Netflix Metacat
• Marquez (Podcast Episode)
• LDAP == Lightweight Directory Access Protocol
• Anodot
• Databricks
• Flink

a…

Summary

CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.

Announcements

• Hello and welcome to the Data Engineering Podcast, the show about modern data management
• When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
• Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
• Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
• You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
• Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB

Interview

Introduction How did you get involved in the area of data management? Can you starty by describing what CouchDB is?

How did you get involved in the CouchDB project and what is your current role in the community?

What are the use cases that it is well suited for? Can you share some of the history of CouchDB and its role in the NoSQL movement? How is CouchDB currently architected and how has it evolved since it was first introduced? What have been the benefits and challenges of Erlang as the runtime for CouchDB? How is the current storage engine implemented and what are its shortcomings? What problems are you trying to solve by replatforming on a new storage layer?

What were the selection criteria for the new storage engine and how did you structure the decision making process? What was the motivation for choosing FoundationDB as opposed to other options such as RocksDB, LevelDB, etc.?

How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB? How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication? What will the migration path be for people running an existing installation? What are some of the biggest challenges that you are facing in rearchitecting the codebase? What new capabilities will the FoundationDB storage layer enable? What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?

What new capabilities or use cases do you anticipate once this migration is complete?

What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community? What is in store for the future of CouchDB?

Contact Info

LinkedIn, @kocolosk on Twitter, kocolosk on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Apache CouchDB FoundationDB

Podcast Episode

IBM Cloudant Experimental Particle Physics FPGA == Field Programmable Gate Array Apache Software Foundation CRDT == Conflict-free Replicated Data Type

Podcast Episode

Erlang Riak RabbitMQ Heisenbug Kubernetes Property Based Testing

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


Summary Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what ksqlDB is? What are some of the use cases that it is designed for? How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize? What was the motivation for building a unified project for providing a database interface on the data stored in Kafka? How is ksqlDB architected?

If you were to rebuild the entire platform and its components from scratch today, what would you do differently?

What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?

What dialect of SQL is supported?

What ki

Summary Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Sean Knapp and Charlie Crocker about shadow IT in data and analytics.

Interview

Introduction How did you get involved in the area of data management? Can you start by sharing your definition of shadow IT? What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?

What are some of the roles in an organization that you have seen involved in these shadow IT projects?

What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?

What are some of the pitfalls that these solutions present as a result of their initial ease of use?

What are the benefits to the organization of individuals or teams building and managing their own solutions? What are some of the risks associated with these implementations of data collection, storage, man