The AI landscape is evolving at breakneck speed, with new capabilities emerging quarterly that redefine what's possible. For professionals across industries, this creates a constant need to reassess workflows and skills. How do you stay relevant when the technology keeps leapfrogging itself? What happens to traditional roles when AI can increasingly handle complex tasks that once required specialized expertise? With product-market fit becoming a moving target and new positions like forward-deployed engineers emerging, understanding how to navigate this shifting terrain is crucial. The winners won't just be those who adopt AI—but those who can continuously adapt as it evolves. Tomasz Tunguz is a General Partner at Theory Ventures, a $235m early-stage venture capital firm. He blogs at tomtunguz.com & co-authored Winning with Data. He has worked or works with Looker, Kustomer, Monte Carlo, Dremio, Omni, Hex, Spot, Arbitrum, Sui & many others. He was previously the product manager for Google's social media monetization team, including the Google-MySpace partnership, and managed the launches of AdSense into six new markets in Europe and Asia. Before Google, Tunguz developed systems for the Department of Homeland Security at Appian Corporation. In the episode, Richie and Tom explore the rapid investment in AI, the evolution of AI models like Gemini 3, the role of AI agents in productivity, the shifting job market, the impact of AI on customer success and product management, and much more. Links Mentioned in the Show: Theory VenturesConnect with TomTom’s BlogGavin Baker on MediumAI-Native Course: Intro to AI for WorkRelated Episode: Data & AI Trends in 2024, with Tom Tunguz, General Partner at Theory VenturesRewatch RADAR AI New to DataCamp? Learn on the go using the DataCamp mobile appEmpower your business with world-class data and AI skills with DataCamp for business
talk-data.com
Topic
Looker
78
tagged
Activity Trend
Top Events
On today's Promoted Episode of Experiencing Data, I’m talking with Lucas Thelosen, CEO of Gravity and creator of Orion, an AI analyst transforming how data teams work. Lucas was head of PS for Looker, and eventually became Head of Product for Google’s Data and AI Cloud prior to starting his own data product company. We dig into how his team built Orion, the challenge of keeping AI accurate and trustworthy when doing analytical work, and how they’re thinking about the balance of human control with automation when their product acts as a force multiplier for human analysts.
In addition to talking about the product, we also talk about how Gravity arrived at specific enough use cases for this technology that a market would be willing to pay for, and how they’re thinking about pricing in today’s more “outcomes-based” environment.
Incidentally, one thing I didn’t know when I first agreed to consider having Gravity and Lucas on my show was that Lucas has been a long-time proponent of data product management and operating with a product mindset. In this episode, he shares the “ah-hah” moment where things clicked for him around building data products in this manner. Lucas shares how pivotal this moment was for him, and how it helped accelerate his career from Looker to Google and now Gravity.
If you’re leading a data team, you’re a forward-thinking CDO, or you’re interested in commercializing your own analytics/AI product, my chat with Lucas should inspire you!
Highlights/ Skip to:
Lucas’s breakthrough came when he embraced a data product management mindset (02:43) How Lucas thinks about Gravity as being the instrumentalists in an orchestra, conducted by the user (4:31) Finding product-market fit by solving for a common analytics pain point (8:11) Analytics product and dashboard adoption challenges: why dashboards die and thinking of analytics as changing the business gradually (22:25) What outcome-based pricing means for AI and analytics (32:08) The challenge of defining guardrails and ethics for AI-based analytics products [just in case somebody wants to “fudge the numbers”] (46:03) Lucas’ closing thoughts about what AI is unlocking for analysts and how to position your career for the future (48:35)
Special Bonus for DPLC Community Members Are you a member of the Data Product Leadership Community? After our chat, I invited Lucas to come give a talk about his journey of moving from “data” to “product” and adopting a producty mindset for analytics and AI work. He was more than happy to oblige. Watch for this in late 2025/early 2026 on our monthly webinar and group discussion calendar.
Note: today’s episode is one of my rare Promoted Episodes. Please help support the show by visiting Gravity’s links below:
Quotes from Today’s Episode “The whole point of data and analytics is to help the business evolve. When your reports make people ask new questions, that’s a win. If the conversations today sound different than they did three months ago, it means you’ve done your job, you’ve helped move the business forward.” — Lucas
“Accuracy is everything. The moment you lose trust, the business, the use case, it's all over. Earning that trust back takes a long time, so we made accuracy our number one design pillar from day one.” — Lucas
“Language models have changed the game in terms of scale. Suddenly, we’re facing all these new kinds of problems, not just in AI, but in the old-school software sense too. Things like privacy, scalability, and figuring out who’s responsible.” — Brian
“Most people building analytics products have never been analysts, and that’s a huge disadvantage. If data doesn’t drive action, you’ve missed the mark. That’s why so many dashboards die quickly.” — Lucas
“Re: collecting feedback so you know if your UX is good: I generally agree that qualitative feedback is the best place to start, not analytics [on your analytics!] Especially in UX, analytics measure usage aspects of the product, not the subject human experience. Experience is a collection of feelings and perceptions about how something went.” — Brian
Links
Gravity: https://www.bygravity.com LinkedIn: https://www.linkedin.com/in/thelosen/ Email Lucas and team: [email protected]
Product managers for BI platforms have it easy. They "just" need to have the dev team build a tool that gives all types of users access to all of the data they should be allowed to see in a way that is quick, simple, and clear while preventing them from pulling data that can be misinterpreted. Of course, there are a lot of different types of users—from the C-level executive who wants ready access to high-level metrics all the way to the analyst or data scientist who wants to drop into a SQL flow state to everyone in between. And sometimes the tool needs to provide structured dashboards, while at other times it needs to be a mechanism for ad hoc analysis. Maybe the product manager's job is actually…impossible? Past Looker CAO and current Omni CEO Colin Zima joined this episode for a lively discussion on the subject! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
The modern data stack has transformed how organizations work with data, but are our BI tools keeping pace with these changes? As data schemas become increasingly fluid and analysis needs range from quick explorations to production-grade reporting, traditional approaches are being challenged. How can we create analytics experiences that accommodate both casual spreadsheet users and technical data modelers? With semantic layers becoming crucial for AI integration and data governance growing in importance, what skills do today's BI professionals need to master? Finding the balance between flexibility and governance is perhaps the greatest challenge facing data teams today. Colin Zima is the Co-Founder and CEO of Omni, a business intelligence platform focused on making data more accessible and useful for teams of all sizes. Prior to Omni, he was Chief Analytics Officer and VP of Product at Looker, where he helped shape the product and data strategy leading up to its acquisition by Google for $2.6 billion. Colin’s background spans roles in data science, analytics, and product leadership, including positions at Google, HotelTonight, and as founder of the restaurant analytics startup PrimaTable. He holds a degree in Operations Research and Financial Engineering from Princeton University and began his career as a Structured Credit Analyst at UBS. In the episode, Richie and Colin explore the evolution of BI tools, the challenges of integrating casual and rigorous data analysis, the role of semantic layers, and the impact of AI on business intelligence. They discuss the importance of understanding business needs, creating user-focused dashboards, and the future of data products, and much more. Links Mentioned in the Show: OmniConnect with ColinSkill Track: Design in Power BIRelated Episode: Self-Service Business Intelligence with Sameer Al-Sakran, CEO at MetabaseRegister for RADAR AI - June 26 New to DataCamp? Learn on the go using the DataCamp mobile appEmpower your business with world-class data and AI skills with DataCamp for business
In this episode of Hub & Spoken, Jason Foster speaks with Colin Zima, CEO and Co-founder of Omni, a modern business intelligence platform that combines the best of governance and usability. With a background spanning roles at Looker and Google, and two decades as both a data user and builder, Colin brings a unique perspective on the evolution of BI and the real role of AI in shaping its future. They explore why business intelligence remains critical for aligning organisations, how AI is raising the bar for access and self-service, and why semantics and business logic are more important than ever. The conversation challenges the notion that AI will replace dashboards, and instead focuses on how it can enhance accessibility, support different user needs, and empower data teams to work more efficiently. This episode is essential listening for business and data leaders thinking about the future of BI, the practical use of AI, and the role data teams play in delivering real value at speed. Tune in to hear how modern BI is evolving, and what leaders need to know to stay ahead. **** Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.
Jamie Davidson (Chief Product Officer at Omni, Former VP of Product at Looker) joins me to chat about "modern" data modeling, going from a startup to Google and back to a startup, and much more.
Omni: https://omni.co/
As we look back at 2024, we're highlighting some of our favourite episodes of the year, and with 100 of them to choose from, it wasn't easy! The four guests we'll be recapping with are: Lea Pica - A celebrity in the data storytelling and visualisation space. Richie and Lea cover the full picture of data presentation, how to understand your audience, how to leverage hollywood storytelling and more. Out December 19.Alex Banks - Founder of Sunday Signal. Adel and Alex cover Alex’s journey into AI and what led him to create Sunday Signal, the potential of AI, prompt engineering at its most basic level, chain of thought prompting, the future of LLMs and more. Out December 23.Don Chamberlin - The renowned co-inventor of SQL. Richie and Don explore the early development of SQL, how it became standardized, the future of SQL through NoSQL and SQL++ and more. Out December 26.Tom Tunguz - general Partner at Theory Ventures, a $235m VC firm. Richie and Tom explore trends in generative AI, cloud+local hybrid workflows, data security, the future of business intelligence and data analytics, AI in the corporate sector and more. Out December 30. Rapid change seems to be the new norm within the data and AI space, and due to the ecosystem constantly changing, it can be tricky to keep up. Fortunately, any self-respecting venture capitalist looking into data and AI will stay on top of what’s changing and where the next big breakthroughs are likely to come from. We all want to know which important trends are emerging and how we can take advantage of them, so why not learn from a leading VC. Tomasz Tunguz is a General Partner at Theory Ventures, a $235m early-stage venture capital firm. He blogs sat tomtunguz.com & co-authored Winning with Data. He has worked or works with Looker, Kustomer, Monte Carlo, Dremio, Omni, Hex, Spot, Arbitrum, Sui & many others. He was previously the product manager for Google's social media monetization team, including the Google-MySpace partnership, and managed the launches of AdSense into six new markets in Europe and Asia. Before Google, Tunguz developed systems for the Department of Homeland Security at Appian Corporation. In the episode, Richie and Tom explore trends in generative AI, the impact of AI on professional fields, cloud+local hybrid workflows, data security, and changes in data warehousing through the use of integrated AI tools, the future of business intelligence and data analytics, the challenges and opportunities surrounding AI in the corporate sector. You'll also get to discover Tom's picks for the hottest new data startups. Links Mentioned in the Show: Tom’s BlogTheory VenturesArticle: What Air Canada Lost In ‘Remarkable’ Lying AI Chatbot Case[Course] Implementing AI Solutions in BusinessRelated Episode: Making Better Decisions using Data & AI with Cassie Kozyrkov, Google's First Chief Decision ScientistSign up to RADAR: AI...
Rapid change seems to be the new norm within the data and AI space, and due to the ecosystem constantly changing, it can be tricky to keep up. Fortunately, any self-respecting venture capitalist looking into data and AI will stay on top of what’s changing and where the next big breakthroughs are likely to come from. We all want to know which important trends are emerging and how we can take advantage of them, so why not learn from a leading VC. Tomasz Tunguz is a General Partner at Theory Ventures, a $235m early-stage venture capital firm. He blogs sat tomtunguz.com & co-authored Winning with Data. He has worked or works with Looker, Kustomer, Monte Carlo, Dremio, Omni, Hex, Spot, Arbitrum, Sui & many others. He was previously the product manager for Google's social media monetization team, including the Google-MySpace partnership, and managed the launches of AdSense into six new markets in Europe and Asia. Before Google, Tunguz developed systems for the Department of Homeland Security at Appian Corporation. In the episode, Richie and Tom explore trends in generative AI, the impact of AI on professional fields, cloud+local hybrid workflows, data security, and changes in data warehousing through the use of integrated AI tools, the future of business intelligence and data analytics, the challenges and opportunities surrounding AI in the corporate sector. You'll also get to discover Tom's picks for the hottest new data startups. Links Mentioned in the Show: Tom’s BlogTheory VenturesArticle: What Air Canada Lost In ‘Remarkable’ Lying AI Chatbot Case[Course] Implementing AI Solutions in BusinessRelated Episode: Making Better Decisions using Data & AI with Cassie Kozyrkov, Google's First Chief Decision ScientistSign up to RADAR: AI Edition New to DataCamp? Learn on the go using the DataCamp mobile appEmpower your business with world-class data and AI skills with DataCamp for business
Summary
All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team! Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack
Interview
Introduction How did you get involved in the area of data management? A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation? Is it possible to completely avoid having to invest in a migration? What are the signals that point to the need for a migration?
What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one) What are some signals that a migration is not the right solution for a perceived problem?
Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution? What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)? What are some of the ways that a migration effort might fail? What are the major pitfalls that teams need to be aware of as they work through a data platform migration? What are the opportunities for automation during the migration process? What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations? What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migraitons?
Contact Info
Gleb
LinkedIn @glebmm on Twitter
Rob
LinkedIn RobGoretsky on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Datafold
Podcast Episode
Informatica Airflow Snowflake
Podcast Episode
Redshift Eventbrite Teradata BigQuery Trino EMR == Elastic Map-Reduce Shadow IT
Podcast Episode
Mode Analytics Looker Sunk Cost Fallacy data-diff
Podcast Episode
SQLGlot Dagster dbt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Hex: 
Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex} and get 30 days free!Rudderstack: 
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstackSupport Data Engineering Podcast
Summary
Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack Your host is Tobias Macey and today I'm interviewing Bob Muglia about his recent book about the idea of "Datapreneurs" and the role of data in the modern economy
Interview
Introduction How did you get involved in the area of data management? Can you describe what your concept of a "Datapreneur" is?
How is this distinct from the common idea of an entreprenur?
What do you see as the key inflection points in data technologies and their impacts on business capabilities over the past ~30 years? In your role as the CEO of Snowflake you had a first-row seat for the rise of the "modern data stack". What do you see as the main positive and negative impacts of that paradigm?
What are the key issues that are yet to be solved in that ecosmnjjystem?
For technologists who are thinking about launching new ventures, what are the key pieces of advice that you would like to share? What do you see as the short/medium/long-term impact of AI on the technical, business, and societal arenas? What are the most interesting, innovative, or unexpected ways that you have seen business leaders use data to drive their vision? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Datapreneurs book? What are your key predictions for the future impact of data on the technical/economic/business landscapes?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Datapreneurs Book SQL Server Snowflake Z80 Processor Navigational Database System R Redshift Microsoft Fabric Databricks Looker Fivetran
Podcast Episode
Databricks Unity Catalog RelationalAI 6th Normal Form Pinecone Vector DB
Podcast Episode
Perplexity AI
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Rudderstack: 
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstackSupport Data Engineering Podcast
Summary
Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions! Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence
Interview
Introduction How did you get involved in the area of data management? Can you describe what Omni Analytics is and the story behind it?
What are the core goals that you are trying to achieve with building Omni?
Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?
What are the technical and organizational anti-patterns that typically grow up around BI systems?
What are the elements that contribute to BI being such a difficult product to use effectively in an organization?
Can you describe how you have implemented the Omni platform?
How have the design/scope/goals of the product changed since you first started working on it?
What does the workflow for a team using Omni look like?
What are some of the developments in the broader ecosystem that have made your work possible?
What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?
What are the most interesting, innovative, or unexpected ways that you have seen Omni used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?
When is Omni the wrong choice?
What do you have planned for the future of Omni?
Contact Info
LinkedIn @cmerrick on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Omni Analytics Stitch RJ Metrics Looker
Podcast Episode
Singer dbt
Podcast Episode
Teradata Fivetran Apache Arrow
Podcast Episode
DuckDB
Podcast Episode
BigQuery Snowflake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Materialize: 
Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.
Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.
Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.
Go to materialize.comSupport Data Engineering Podcast
Today I’m chatting with Bruno Aziza, Head of Data & Analytics at Google Cloud. Bruno leads a team of outbound product managers in charge of BigQuery, Dataproc, Dataflow and Looker and we dive deep on what Bruno looks for in terms of skills for these leaders. Bruno describes the three patterns of operational alignment he’s observed in data product management, as well as why he feels ownership and customer obsession are two of the most important qualities a good product manager can have. Bruno and I also dive into how to effectively abstract the core problem you’re solving, as well as how to determine whether a problem might be solved in a better way.
Highlights / Skip to:
Bruno introduces himself and explains how he created his “CarCast” podcast (00:45) Bruno describes his role at Google, the product managers he leads, and the specific Google Cloud products in his portfolio (02:36) What Bruno feels are the most important attributes to look for in a good data product manager (03:59) Bruno details how a good product manager focuses on not only the core problem, but how the problem is currently solved and whether or not that’s acceptable (07:20) What effective abstracting the problem looks like in Bruno’s view and why he positions product management as a way to help users move forward in their career (12:38) Why Bruno sees extracting value from data as the number one pain point for data teams and their respective companies (17:55) Bruno gives his definition of a data product (21:42) The three patterns Bruno has observed of operational alignment when it comes to data product management (27:57) Bruno explains the best practices he’s seen for cross-team goal setting and problem-framing (35:30)
Quotes from Today’s Episode
“What’s happening in the industry is really interesting. For people that are running data teams today and listening to us, the makeup of their teams is starting to look more like what we do [in] product management.” — Bruno Aziza (04:29)
“The problem is the problem, so focus on the problem, decompose the problem, look at the frictions that are acceptable, look at the frictions that are not acceptable, and look at how by assembling a solution, you can make it most seamless for the individual to go out and get the job done.” – Bruno Aziza (11:28)
“As a product manager, yes, we’re in the business of software, but in fact, I think you’re in the career management business. Your job is to make sure that whatever your customer’s job is that you’re making it so much easier that they, in fact, get so much more done, and by doing so they will get promoted, get the next job.” – Bruno Aziza (15:41)
“I think that is the task of any technology company, of any product manager that’s helping these technology companies: don’t be building a product that’s looking for a problem. Just start with the problem back and solution from that. Just make sure you understand the problem very well.” (19:52)
“If you’re a data product manager today, you look at your data estate and you ask yourself, ‘What am I building to save money? When am I building to make money?’ If you can do both, that’s absolutely awesome. And so, the data product is an asset that has been built repeatedly by a team and generates value out of data.” – Bruno Aziza (23:12)
“[Machine learning is] hard because multiple teams have to work together, right? You got your business analyst over here, you’ve got your data scientists over there, they’re not even the same team. And so, sometimes you’re struggling with just the human aspect of it.” (30:30)
“As a data leader, an IT leader, you got to think about those soft ways to accomplish the stuff that’s binary, that’s the hard [stuff], right? I always joke, the hard stuff is the soft stuff for people like us because we think about data, we think about logic, we think, ‘Okay if it makes sense, it will be implemented.’ For most of us, getting stuff done is through people. And people are emotional, how can you express the feeling of achieving that goal in emotional value?” – Bruno Aziza (37:36)
Links As referenced by Bruno, “Good Product Manager/Bad Product Manager”: https://a16z.com/2012/06/15/good-product-managerbad-product-manager/ LinkedIn: https://www.linkedin.com/in/brunoaziza/ Bruno’s Medium Article on Competing Against Luck by Clayton M. Christensen: https://brunoaziza.medium.com/competing-against-luck-3daeee1c45d4 The Data CarCast on YouTube: https://www.youtube.com/playlist?list=PLRXGFo1urN648lrm8NOKXfrCHzvIHeYyw
Summary The data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support. Your host is Tobias Macey and today I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow ecosystem
Interview
Introduction How did you get involved in the area of data management? Can you describe what you are building at Voltron Data and the story behind it? What is the vision for the broader data ecosystem that you are trying to realize through your investment in Arrow and related projects?
How does your work at Voltron Data contribute to the realization of that vision?
What is the impact on engineer productivity and compute efficiency that gets introduced by the impedance mismatches between language and framework representations of data? The scope and capabilities of the Arrow project have grown substantially since it was first introduced. Can you give an overview of the current features and extensions to the project? What are some of the ways that ArrowVe and its related projects can be integrated with or replace the different elements of a data platform? Can you describe how Arrow is implemented?
What are the most complex/challenging aspects of the engineering needed to support interoperable data interchange between language runtimes?
How are you balancing the desire to move quickly and improve the Arrow protocol and implementations, with the need to wait for other players in the ecosystem (e.g. database engines, compute frameworks, etc.) to add support? With the growing application of data formats such as graphs and vectors, what do you see as the role of Arrow and its ideas in those use cases? For workflows that rely on integrating structured and unstructured data, what are the options for interaction with non-tabular data? (e.g. images, documents, etc.) With your support-focused business model, how are you approaching marketing and customer education to make it viable and scalable? What are the most interesting, innovative, or unexpected ways that you have seen Arrow used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arrow and its ecosystem? When is Arrow the wrong choice? What do you have planned for the future of Arrow?
Contact Info
Website wesm on GitHub @wesmckinn on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Voltron Data Pandas
Podcast Episode
Apache Arrow Partial Differential Equation FPGA == Field-Programmable Gate Array GPU == Graphics Processing Unit Ursa Labs Voltron (cartoon) Feature Engineering PySpark Substrait Arrow Flight Acero Arrow Datafusion Velox Ibis SIMD == Single Instruction, Multiple Data Lance DuckDB
Podcast Episode
Data Threads Conference Nano-Arrow Arrow ADBC Protocol Apache Iceberg
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Atlan: 
Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?
Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.
Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.a href="https://dataengineeringpodcast.com/montecarlo"…
Summary Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect. Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support. Your host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows
Interview
Introduction How did you get involved in the area of data management? Can you describe what AgileData is and the story behind it? What are the main industries and/or use cases that you are focused on supporting? The data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don’t map well to data engineering/analysis? One of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?
How do you design in affordances for refactoring of the data models without breaking downstream assets?
Another aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products? What are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products? In order to have a useful target to work toward it’s necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what’s hard vs. easy, etc.)
How do you work with the "customers" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?
What are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets? What are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data? What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData? When is agile the wrong choice for a data project? What do you have planned for the future of AgileData?
Contact Info
LinkedIn @shagility on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
AgileData OptimalBI How To Make Toast Data Mesh Information Product Canvas DataKitchen
Podcast Episode
Great Expectations
Podcast Episode
Soda Data
Podcast Episode
Google DataStore Unfix.work Activity Schema
Podcast Episode
Data Vault
Podcast Episode
Star Schema Lean Methodology Scrum Kanban
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Atlan: 
Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?
Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.
Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.Prefect: 
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit…
Summary Data lineage is something that has grown from a convenient feature to a critical need as data systems have grown in scale, complexity, and centrality to business. Alvin is a platform that aims to provide a low effort solution for data lineage capabilities focused on simplifying the work of data engineers. In this episode co-founder Martin Sahlen explains the impact that easy access to lineage information can have on the work of data engineers and analysts, and how he and his team have designed their platform to offer that information to engineers and stakeholders in the places that they interact with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing Martin Sahlen about his work on data lineage at Alvin and how it factors into the day-to-day work of data engineers
Interview
Introduction How did you get involved in the area of data management? Can you describe what Alvin is and the story behind it? What is the core problem that you are trying to solve at Alvin? Data lineage has quickly become an overloaded term. What are the elements of lineage that you are focused on addressing?
What are some of the other sources/pieces of information that you integrate into the lineage graph?
How does data lineage show up in the work of data engineers?
In what ways does your focus on data engineers inform the way that you model the lineage information?
As with every data asset/product, the lineage graph is only as useful as the data that it stores. What are some of the ways that you focus on establishing and ensuring a complete view of lineage?
How do you account for assets (e.g. tables, dashboards, exports, etc.) that are created outside of the "officially supported" methods? (e.g. someone manually runs a SQL create statement, etc.)
Can you describe how you have implemented the Alvin platform?
How have the design and goals shifted from when you first started exploring the problem?
What are the types of data systems/assets that you are focused on supporting? (e.g. data warehouses vs. lakes, structured vs. unstructured, which BI tools, etc.) How does Alvin fit into the workflow of data engineers and their downstream customers/collaborators?
What are some of the design choices (both visual and functional) that you focused on to avoid friction in the data engineer’s workflow?
What are some of the open questions/areas for investigation/improvement in the space of data lineage?
What are the factors that contribute to the difficulty of a truly holistic and complete view of lineage across an organization?
What are the most interesting, innovative, or unexpected ways that you have seen Alvin used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alvin? When is Alvin the wrong choice? What do you have planned for the future of Alvin?
Contact Info
LinkedIn @martinsahlen on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Alvin Unacast sqlparse Python library Cython
Podcast.init Episode
Antlr Kotlin programming language PostgreSQL
Podcast Episode
OpenSearch ElasticSearch Redis Kubernetes Airflow BigQuery Spark Looker Mode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
Summary A large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. Singlestore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode SVP of engineering Shireesh Thota describes the impact on your overall system architecture that Singlestore can have and the benefits of using a cloud-native database engine for your next application.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you becom
Summary The interfaces and design cues that a tool offers can have a massive impact on who is able to use it and the tasks that they are able to perform. With an eye to making data workflows more accessible to everyone in an organization Raj Bains and his team at Prophecy designed a powerful and extensible low-code platform that lets technical and non-technical users scale data flows without forcing everyone into the same layers of abstraction. In this episode he explores the tension between code-first and no-code utilities and how he is working to balance the strengths without falling prey to their shortcomings.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Raj Bains about how improving the user experience for data tools can make your work as a data engineer better and easier
Interview
Introduction How did you get involved in the area of data management? What are the broad categories of data tool designs that are available currently and how does that impact what is possible with them?
What are the points of friction that are introduced by the tools? Can you share some of the types of workarounds or wasted effort that are made necessary by those design elements?
What are the core design principles that you have built into Prophecy to address these shortcomings?
How do those user experience changes improve the quality and speed of work for data engineers?
How has the Prophecy platform changed since we last spoke almost a year ago? What are the tradeoffs of low code systems for productivity vs. flexibility and creativity? What are the most interesting, innovative, or unexpected approaches to developer experience that you have seen for data tools? What are the most interesting, unexpected, or challenging lessons that you have learned while working on user experience optimization for data tooling at Prophecy? When is it more important to optimize for computational efficiency over developer productivity? What do you have planned for the future of Prophecy?
Contact Info
LinkedIn @_raj_bains on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Prophecy
Podcast Episode
CUDA Clustrix Hortonworks Apache Hive Compilerworks
Podcast Episode
Airflow Databricks Fivetran
Podcast Episode
Airbyte
Podcast Episode
Streamsets Change Data Capture Apache Pig Spark Scala Ab Initio Type 2 Slowly Changing Dimensions AWS Deequ Matillion
Podcast Episode
Prophecy SaaS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
Summary Industrial applications are one of the primary adopters of Internet of Things (IoT) technologies, with business critical operations being informed by data collected across a fleet of sensors. Vopak is a business that manages storage and distribution of a variety of liquids that are critical to the modern world, and they have recently launched a new platform to gain more utility from their industrial sensors. In this episode Mário Pereira shares the system design that he and his team have developed for collecting and managing the collection and analysis of sensor data, and how they have split the data processing and business logic responsibilities between physical terminals and edge locations, and centralized storage and compute.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Mário Pereira about building a data management system for globally distributed IoT sensors at Vopak
Interview
Introduction How did you get involved in the area of data management? Can you describe what Vopak is and what kinds of information you rely on to power the business? What kinds of sensors and edge devices are you using?
What kinds of consistency or variance do you have between sensors across your locations?
How much computing power and storage space do you place at the edge?
What level of pre-processing/filtering is being done at the edge and how do you decide what information needs to be centralized? What are some examples of decision-making that happens at the edge?
Can you describe the platform architecture that you have built for collecting and processing sensor data?
What was your process for selecting and evaluating the various components?
How much tolerance do you have for missed messages/dropped data? How long are your data retention period
Summary Dan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing what your current relationship to the data ecosystem is and the cliffs-notes version of how you ended up there?
Dremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?
You were instrumental in crafting the vision behind "querying data in place," (what they called, federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach?
How well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper?
Following your work on Drill you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform.
Summary The predominant pattern for data integration in the cloud has become extract, load, and then transform or ELT. Matillion was an early innovator of that approach and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration workflows.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more. Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics
Interview
Introduction How did you get involved in the area of data management?