
Activities & events


Join fellow Airflow enthusiasts and leaders at Salisbury House for an evening of engaging talks, great food and drinks, and exclusive swag!

We'll start you off with a deep dive into the Airflow 2026 survey results, and finish off with a community member presentation on the Apache TinkerPop provider.

PRESENTATIONS

Talk #1: The State of Apache Airflow® 2026

Apache Airflow® continues to thrive as the world’s leading open-source data orchestration platform, with 30M downloads per month and over 3k contributors. 2025 marked a major milestone with the release of Airflow 3, which introduced DAG versioning, enhanced security and task isolation, assets, and more. These changes have reshaped how data teams build, operate, and govern their pipelines.

In this session, our speaker will share insights from the State of Airflow 2026 report, including:

  • Latest trends in how teams are using Airflow today
  • What’s next for the project and ecosystem
  • A discussion of emerging best practices and evolving use cases

Join us to hear directly from a leader in the community and discover how to get the most out of Airflow in the year ahead.

Talk #2: Building the Apache TinkerPop Provider for Airflow

Graph databases are powering everything from recommendation engines to fraud detection, but integrating graph operations into modern data pipelines has often required custom code and workarounds. Earlier this year, Ahmad built a new Apache TinkerPop provider for Airflow, making it easier than ever to orchestrate Gremlin queries, manage graph workloads, and connect Airflow to TinkerPop-enabled systems. In this session, you’ll learn:

  • What the TinkerPop provider does and why it matters for graph-based workloads
  • How to run Gremlin queries and manage graph jobs directly within Airflow
  • Real examples from the development process, including design decisions and lessons learned
  • How this provider opens the door for new use cases in graph analytics and data engineering

Join us to explore how Airflow and TinkerPop can work together to streamline graph workflows and unlock new patterns in modern data pipelines.
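
For a concrete flavor of what orchestrating Gremlin from Airflow looks like, here is a hedged sketch that predates the provider: a plain TaskFlow task using the gremlinpython client directly. The connection URL and the vertex-count query are placeholder assumptions, and the provider's own operators may differ.

```python
# Hedged sketch: running a Gremlin query from an Airflow DAG without the
# TinkerPop provider, using gremlinpython directly. The server URL and the
# vertex-count traversal are illustrative assumptions.
from datetime import datetime

from airflow.decorators import dag, task  # Airflow 2.x TaskFlow API


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def gremlin_vertex_count():
    @task
    def count_vertices() -> int:
        # Import inside the task so the scheduler can parse the DAG file
        # even when gremlinpython is only installed on workers.
        from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
        from gremlin_python.process.anonymous_traversal import traversal

        conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
        try:
            g = traversal().withRemote(conn)
            return g.V().count().next()   # simple sanity-check query
        finally:
            conn.close()

    count_vertices()


gremlin_vertex_count()
```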

AGENDA

  • 5:30-6 PM: Arrivals, networking, food & drinks
  • 6-7 PM: Presentations
  • 7-8 PM: Networking
The State of Airflow 2026: London Airflow Meetup!

These are the notes from the previous "How to Build a Portfolio That Reflects Your Real Skills" event:

Properties of an ideal portfolio repository:

  • Built to prove employable skills and readiness for real work
  • Fewer projects, carefully chosen to match job requirements
  • Clean, readable, refactored code, and follows best practices
  • Detailed READMEs (setup, features, tech stack, decisions, how to deploy, testing strategy, etc)
  • Logical, meaningful commits that show the development process ← you can follow the git history for important commits/features
  • Clear architecture (layers, packages, separation of concerns) ← use best practices
  • Unit and integration tests included and explained ← also describe them in the README
  • Proper validation, exceptions, and edge case handling
  • Polished, complete, production-like projects only
  • “Can this person work on our codebase?” ← the question reviewers will ask
  • Written for recruiters, hiring managers, and senior engineers
  • Uses industry-relevant and job-listed technologies ← the tech stack should match the CV
  • Well-scoped, realistic features similar to real products
  • Consistent style, structure, and conventions across projects
  • Environment variables, clear setup steps, sample configs
  • Minimal, justified dependencies with clear versioning
  • Proper logging, and meaningful log messages
  • No secrets committed, basic security best practices applied
  • Shows awareness of scaling, performance, and future growth ← at least include a “possible improvements” section in the README
  • A list of ADRs explaining design choices and trade-offs ← should be part of the documentation

📌 Backend & Frontend Portfolio Project Ideas

These projects are intentionally reusable across tech stacks. Following tutorials and reusing patterns is expected — what matters is:

  • understanding the architecture
  • explaining trade-offs
  • documenting decisions clearly

☕ Junior Java Backend Developer (Spring Boot)

1. Shop Manager Application

A monolithic Spring Boot app designed with microservice-style boundaries.

Features

  • Secure user registration & login
  • Role-based access control using JWT
  • REST APIs for:
      • Users
      • Products
      • Inventory
      • Orders
  • Automatic inventory updates when orders are placed
  • CSV upload for bulk product & inventory import
  • Clear service boundaries (UserService, OrderService, InventoryService, etc.)

Engineering Focus

  • Clean architecture (controllers, services, repositories)
  • Global exception handling
  • Database migrations (Flyway/Liquibase)
  • Unit & integration testing
  • Clear README explaining architecture decisions

2. Parallel Data Processing Engine

Backend service for processing large datasets efficiently (see the sketch after this project).

Features

  • Upload large CSV/log files
  • Split data into chunks
  • Process chunks in parallel using:
      • ExecutorService
      • CompletableFuture
  • Aggregate and return results

Demonstrates

  • Java concurrency
  • Thread pools & async execution
  • Performance optimization
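
The ExecutorService / CompletableFuture pairing named above maps onto thread pools and futures in any language. A minimal, language-agnostic sketch of the split/process/aggregate pattern, in Python for brevity; the chunk size and the per-chunk work (counting non-empty lines) are illustrative assumptions:

```python
# Split a large file into chunks, process them on a thread pool (Python's
# analogue of ExecutorService), and aggregate the results.
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterator


def chunks(lines: Iterator[str], size: int = 10_000) -> Iterator[list[str]]:
    """Lazily split an iterator of lines into fixed-size chunks."""
    while chunk := list(islice(lines, size)):
        yield chunk


def process(chunk: list[str]) -> int:
    """Per-chunk work: here, count non-empty lines (illustrative)."""
    return sum(1 for line in chunk if line.strip())


def count_records(path: str, workers: int = 8) -> int:
    with open(path) as f, ThreadPoolExecutor(max_workers=workers) as pool:
        # map() schedules each chunk onto the pool and yields results in order.
        return sum(pool.map(process, chunks(f)))
```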

3. Distributed Task Queue System

Simple async job processing system.

Features

  • One service submits tasks
  • Another service processes them asynchronously
  • Uses Kafka or RabbitMQ
  • Tasks: report generation, data transformation

Demonstrates

  • Message-driven architecture
  • Async workflows
  • Eventual consistency

4. Rate Limiting & Load Control Service

Standalone service that protects APIs from abuse (see the token bucket sketch after this project).

Features

  • Token bucket or sliding window algorithms
  • Redis-backed counters
  • Per-user or per-IP limits

Demonstrates

  • Algorithmic thinking
  • Distributed state
  • API protection patterns
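
For reference, a minimal single-process token bucket sketch, in Python for brevity. In the project as described, this state would live in Redis so limits hold across instances; the rate and capacity here are illustrative.

```python
# Token bucket: refill tokens proportionally to elapsed time, allow a request
# only if enough tokens remain. Single-process version; a real service would
# keep (tokens, updated) in Redis keyed per user or per IP.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


limiter = TokenBucket(rate=5, capacity=10)   # ~5 req/s with bursts of 10
print(limiter.allow())                        # True while tokens remain
```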

5. Search & Indexing Backend

Document or record search service (see the inverted index sketch after this project).

Features

  • In-memory inverted index
  • Text search, filters, ranking
  • Optional Elasticsearch integration

Demonstrates

  • Data structures
  • Read-optimized design
  • Trade-offs between custom vs external tools
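
A minimal sketch of the in-memory inverted index idea, in Python for brevity; the whitespace tokenizer and AND-only matching are deliberate simplifications, with ranking left as an extension.

```python
# Inverted index: token -> set of document ids. Search intersects the
# posting sets of all query tokens (AND semantics).
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query: str) -> set[int]:
        """Return ids of documents containing every query token."""
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()


idx = InvertedIndex()
idx.add(1, "graph databases power fraud detection")
idx.add(2, "fraud detection with rules")
print(idx.search("fraud detection"))  # {1, 2}
```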

6. Distributed Configuration & Feature Flag Service

Centralized config service for other apps.

Features

  • Key-value configuration store
  • Feature flags
  • Caching & refresh mechanisms

Demonstrates

  • Caching strategies
  • Consistency vs availability trade-offs
  • System design for shared services

🐹 Mid-Level Go Backend Developer (Non-Kubernetes)

1. High-Throughput Event Processing Pipeline

Multi-stage concurrent pipeline (see the worker pool sketch after this list).

Features

  • HTTP/gRPC ingestion
  • Validation & transformation stages
  • Goroutines & channels
  • Worker pools, batching, backpressure
  • Graceful shutdown
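
The goroutine-and-channel design called for above has a direct analogue in any language: a bounded queue gives backpressure, because producers block once workers fall behind. A minimal sketch in Python for brevity; the worker count, queue depth, and the print stand-in for the validate/transform stages are illustrative.

```python
# Bounded-queue worker pool: queue.Queue(maxsize=...) provides backpressure
# since put() blocks when the queue is full. One None sentinel per worker
# triggers graceful shutdown.
import queue
import threading


def worker(q: queue.Queue) -> None:
    while (item := q.get()) is not None:
        print(f"processed {item}")   # stand-in for validation/transformation
        q.task_done()
    q.task_done()                     # account for the shutdown sentinel


q: queue.Queue = queue.Queue(maxsize=100)   # the bound is the backpressure
workers = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for w in workers:
    w.start()

for event in range(1_000):
    q.put(event)           # blocks when workers lag behind ingestion

for _ in workers:
    q.put(None)            # one sentinel per worker
for w in workers:
    w.join()
```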

2. Distributed Job Scheduler & Worker System

Async job execution platform (see the retry/idempotency sketch after this list).

Features

  • Job scheduling & delayed execution
  • Retries & idempotency
  • Job states (pending, running, failed, completed)
  • Message queue or gRPC-based workers
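
A compact sketch of the retries-plus-idempotency combination, in Python for brevity: redelivered jobs become no-ops once their id is recorded, and failures retry with exponential backoff. The in-memory set and the hypothetical do_work body stand in for a persistent store and the real task.

```python
# At-least-once delivery made safe by an idempotency key: a completed job id
# is recorded, so redelivery is a no-op. "done" stands in for a DB table or
# Redis set; do_work is a hypothetical task body.
import time

done: set[str] = set()


def do_work(job_id: str) -> None:
    """Hypothetical task body; replace with report generation, etc."""
    print(f"working on {job_id}")


def run_job(job_id: str, attempts: int = 3) -> None:
    if job_id in done:                 # idempotency check
        return
    for attempt in range(1, attempts + 1):
        try:
            do_work(job_id)
            done.add(job_id)
            return
        except Exception:
            if attempt == attempts:
                raise                  # exhausted: route to a dead-letter queue
            time.sleep(2 ** attempt)   # exponential backoff before retrying
```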

3. In-Memory Caching Service

Redis-like cache written from scratch (see the sketch after this list).

Features

  • TTL support
  • Eviction strategies (LRU/LFU)
  • Concurrent-safe access
  • Optional disk persistence
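
A minimal TTL-plus-LRU sketch, in Python for brevity; the capacity and TTL are illustrative, and a real version would add a lock to meet the concurrent-safe access requirement.

```python
# TTL + LRU cache on OrderedDict: move_to_end() maintains recency order and
# popitem(last=False) evicts the least-recently-used entry. Expiry is checked
# lazily on read.
import time
from collections import OrderedDict
from typing import Any, Optional


class TTLLRUCache:
    def __init__(self, capacity: int = 1024, ttl: float = 60.0):
        self.capacity, self.ttl = capacity, ttl
        self.data: "OrderedDict[str, tuple]" = OrderedDict()

    def set(self, key: str, value: Any) -> None:
        self.data[key] = (time.monotonic() + self.ttl, value)
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the LRU entry

    def get(self, key: str) -> Optional[Any]:
        item = self.data.get(key)
        if item is None:
            return None
        expires, value = item
        if time.monotonic() > expires:      # lazy expiry on read
            del self.data[key]
            return None
        self.data.move_to_end(key)          # mark as recently used
        return value
```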

4. Rate Limiting & Traffic Shaping Gateway

Reverse-proxy-style rate limiter.

Features

  • Token bucket / leaky bucket
  • Circuit breakers
  • Redis-backed distributed limits

5. Log Aggregation & Query Engine

Incrementally built system.

Step-by-step

  1. REST API + Postgres (store logs, query logs)
  2. Optimize for massive concurrency
  3. Replace DB with in-memory data structures
  4. Add streaming endpoints using channels & batching

🐍 Mid-Level Python Backend Developer

1. Asynchronous Task Processing System

Async job execution platform (see the asyncio sketch after this list).

Features

  • Async API submission
  • Worker pool (asyncio or Celery-like)
  • Retries & failure handling
  • Job status tracking
  • Idempotency
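
A minimal asyncio worker-pool sketch covering submission, a worker pool, and job status tracking; retries and idempotency would layer on top. The worker count and the sleep-based workload are illustrative assumptions.

```python
# asyncio worker pool: jobs flow through an asyncio.Queue and a status dict
# tracks pending/running/completed/failed (a stand-in for a database table).
import asyncio

status: dict[int, str] = {}


async def worker(q: asyncio.Queue) -> None:
    while True:
        job_id = await q.get()
        status[job_id] = "running"
        try:
            await asyncio.sleep(0.1)        # stand-in for real work
            status[job_id] = "completed"
        except Exception:
            status[job_id] = "failed"
        finally:
            q.task_done()


async def main() -> None:
    q: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(q)) for _ in range(4)]
    for job_id in range(10):
        status[job_id] = "pending"
        await q.put(job_id)
    await q.join()                           # wait until every job finishes
    for w in workers:
        w.cancel()


asyncio.run(main())
print(status)
```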

2. Event-Driven Data Pipeline

Streaming data processing service.

Features

  • Event ingestion
  • Validation & transformation
  • Batching & backpressure handling
  • Output to storage or downstream services

3. Distributed Rate Limiting Service

API protection service.

Steps

  • Step 1: Use an existing rate-limiting library
  • Step 2: Implement token bucket / sliding window yourself

4. Search & Indexing Backend

Search system for logs or documents.

Features

  • Custom indexing or Elasticsearch
  • Filtering & time-based queries
  • Read-heavy optimization

5. Configuration & Feature Flag Service

Shared configuration backend.

Steps

  • Step 1: Use a caching library
  • Step 2: Implement your own cache (explain in README)

🟦 Mid-Level TypeScript Backend Developer

1. Asynchronous Job Processing System

Queue-based task execution.

Features

  • BullMQ / RabbitMQ / Redis
  • Retries & scheduling
  • Status tracking

2. Real-Time Chat / Notification Service

WebSocket-based system.

Features

  • Presence tracking
  • Message persistence
  • Real-time updates

3. Rate Limiting & API Gateway

API gateway with protections.

Features

  • Token bucket / sliding window
  • Response caching
  • Request logging

4. Search & Filtering Engine

Search backend for products, logs, or articles.

Features

  • In-memory index or Elasticsearch
  • Pagination & sorting

5. Feature Flag & Configuration Service

Centralized config management.

Features

  • Versioning
  • Rollout strategies
  • Caching

🟨 Mid-Level Node.js Backend Developer

1. Async Task Queue System

Background job processor.

Features

  • Bull / Redis / RabbitMQ
  • Retries & scheduling
  • Status APIs

2. Real-Time Chat / Notification Service

Socket-based system.

Features

  • Rooms
  • Presence tracking
  • Message persistence

3. Rate Limiting & API Gateway

Traffic control service.

Features

  • Per-user/API-key limits
  • Logging
  • Optional caching

4. Search & Indexing Backend

Indexing & querying service.


5. Feature Flag / Configuration Service

Shared backend for app configs.


⚛️ Mid-Level Frontend Developer (React / Next.js)

1. Dynamic Analytics Dashboard

Interactive data visualization app.

Features

  • Charts & tables
  • Filters & live updates
  • React Query / Redux / Zustand
  • Responsive layouts

2. E-Commerce Store

Full shopping experience.

Features

  • Product listings
  • Search, filters, sorting
  • Cart & checkout
  • SSR/SSG with Next.js

3. Real-Time Chat / Collaboration App

Live multi-user UI.

Features

  • WebSockets or Firebase
  • Presence indicators
  • Real-time updates

4. CMS / Blogging Platform

SEO-focused content app.

Features

  • SSR for SEO
  • Markdown or API-based content
  • Admin editing interface

5. Personalized Analytics / Recommendation UI

Data-heavy frontend.

Features

  • Filtering & lazy loading
  • Large dataset handling
  • User-specific insights

6. AI Chatbot App — “My House Plant Advisor”

LLM-powered assistant with production-quality UX.

Core Features

  • Chat interface with real-time updates
  • Input normalization & validation
  • Offensive content filtering
  • Unsupported query detection
  • Rate limiting (per user)
  • Caching recent queries
  • Conversation history per session
  • Graceful fallbacks & error handling

Advanced Features

  • Prompt tuning (beginner vs expert users)
  • Structured advice formatting (cards, bullets)
  • Local LLM support
  • Analytics dashboard (popular questions)
  • Voice input/output (speech-to-text, TTS)

✅ Final Advice

You do NOT need to build everything. Instead, pick 1–2 strong projects per role and focus on depth:

  • Explain the architecture clearly
  • Document trade-offs (why you chose X over Y)
  • Show incremental improvements
  • Prove you understand why, not just how

📌 Portfolio Quality Signals (Very Important)

  • Have a large, organic commit history → a single commit, or only a handful, is a strong indicator of copy-paste work.
  • Prefer 3–5 complex projects over 20 simple ones → Many tiny projects often signal shallow understanding.

🎯 Why This Helps in Interviews

Working on serious projects gives you:

  • Real hands-on practice
  • Concrete anecdotes (stories you can tell in interviews)
  • A safe way to learn technologies you don’t fully know yet
  • Better focus and long-term learning discipline
  • A portfolio that can be ported to another tech stack later (Java → Go, Node → Python, etc.)

🎥 Demo & Documentation Best Practices

  • Create a 2–3 minute demo / walkthrough video:
      • Show the app running
      • Explain what problem it solves
      • Highlight one or two technical decisions
  • At the top of every README:
      • Add a plain-English paragraph explaining what the project does
      • Assume the reader is a complete beginner

🤝 Open Source & Personal Projects (Interview Signal)

Always mention that you have contributed to Open Source or built personal projects.

  • Shows team spirit
  • Shows you can read, understand, and navigate an existing codebase
  • Signals that you can onboard into a real-world repository
  • Makes you sound like an engineer, not just a tutorial follower
[Notes] How to Build a Portfolio That Reflects Your Real Skills
Event The Pragmatic Engineer 2025-11-26
Johannes Dahse – VP of Code Security @ Sonar, Gergely Orosz – host

Brought to You By:

• Statsig — The unified platform for flags, analytics, experiments, and more. Statsig are helping make the first-ever Pragmatic Summit a reality. Join me and 400 other top engineers and leaders on 11 February, in San Francisco for a special one-day event. Reserve your spot here.

• Linear — The system for modern product development. Engineering teams today move much faster, thanks to AI. Because of this, coordination increasingly becomes a problem. This is where Linear helps fast-moving teams stay focused. Check out Linear.

As software engineers, what should we know about writing secure code? Johannes Dahse is the VP of Code Security at Sonar and a security expert with 20 years of industry experience. In today’s episode of The Pragmatic Engineer, he joins me to talk about what security teams actually do, what developers should own, and where real-world risk enters modern codebases. We cover dependency risk, software composition analysis, CVEs, dynamic testing, and how everyday development practices affect security outcomes. Johannes also explains where AI meaningfully helps, where it introduces new failure modes, and why understanding the code you write and ship remains the most reliable defense. If you build and ship software, this episode is a practical guide to thinking about code security under real-world engineering constraints.

Timestamps

(00:00) Intro
(02:31) What is penetration testing?
(06:23) Who owns code security: devs or security teams?
(14:42) What is code security?
(17:10) Code security basics for devs
(21:35) Advanced security challenges
(24:36) SCA testing
(25:26) The CVE Program
(29:39) The State of Code Security report
(32:02) Code quality vs security
(35:20) Dev machines as a security vulnerability
(37:29) Common security tools
(42:50) Dynamic security tools
(45:01) AI security reviews: what are the limits?
(47:51) AI-generated code risks
(49:21) More code: more vulnerabilities
(51:44) AI’s impact on code security
(58:32) Common misconceptions of the security industry
(1:03:05) When is security “good enough?”
(1:05:40) Johannes’s favorite programming language

The Pragmatic Engineer deepdives relevant for this episode:

• What is Security Engineering?
• Mishandled security vulnerability in Next.js
• Okta Schooled on Its Security Practices

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe

AI/ML Analytics JavaScript Marketing Cyber Security
Gergely Orosz – host, Laura Tacho – CTO @ DX

Supported by Our Partners

• Statsig — The unified platform for flags, analytics, experiments, and more.

• Graphite — The AI developer productivity platform.

There’s no shortage of bold claims about AI and developer productivity, but how do you separate signal from noise? In this episode of The Pragmatic Engineer, I’m joined by Laura Tacho, CTO at DX, to cut through the hype and share how well (or not) AI tools are actually working inside engineering orgs. Laura shares insights from DX’s research across 180+ companies, including surprising findings about where developers save the most time, why devs don’t use AI at all, and what kinds of rollouts lead to meaningful impact.

We also discuss:

• The problem with oversimplified AI headlines and how to think more critically about them
• An overview of the DX AI Measurement framework
• Learnings from Booking.com’s AI tool rollout
• Common reasons developers aren’t using AI tools
• Why using AI tools sometimes decreases developer satisfaction
• Surprising results from DX’s 180+ company study
• How AI-generated documentation differs from human-written docs
• Why measuring developer experience before rolling out AI is essential
• Why Laura thinks roadmaps are on their way out
• And much more!

Timestamps

(00:00) Intro
(01:23) Laura’s take on AI overhyped headlines
(10:46) Common questions Laura gets about AI implementation
(11:49) How to measure AI’s impact
(15:12) Why acceptance rate and lines of code are not sufficient measures of productivity
(18:03) The Booking.com case study
(20:37) Why some employees are not using AI
(24:20) What developers are actually saving time on
(29:14) What happens with the time savings
(31:10) The surprising results from the DORA report on AI in engineering
(33:44) A hypothesis around AI and flow state and the importance of talking to developers
(35:59) What’s working in AI architecture
(42:22) Learnings from WorkHuman’s adoption of Copilot
(47:00) Consumption-based pricing, and the difficulty of allocating resources to AI
(52:01) What DX Core 4 measures
(55:32) The best outcomes of implementing AI
(58:56) Why highly regulated industries are having the best results with AI rollout
(1:00:30) Indeed’s structured AI rollout
(1:04:22) Why migrations might be a good use case for AI (and a tip for doing it!)
(1:07:30) Advice for engineering leads looking to get better at AI tooling and implementation
(1:08:49) Rapid fire round

The Pragmatic Engineer deepdives relevant for this episode:

• AI Engineering in the real world
• Measuring software engineering productivity
• The AI Engineering stack
• A new way to measure developer productivity – from the creators of DORA and SPACE

See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe

AI/ML Analytics Marketing
Event Dbt Coalesce 2024 2024-10-16
Erica Louie – Sr. Analytics Manager @ dbt Labs, Andrew Escay – Lead Data Analyst @ dbt Labs

"Look at these beautiful dbt models! Why are we still experiencing the same friction with stakeholders?" This talk from experts at dbt Labs argues that we solved the first stage of building "Transformations" (via the 2024 State of Analytics Engineering report) and now we're now in the second stage: "The Philosophy of Transformations". And all roads lead to "metrics".

Speakers: Erica Louie, Sr. Analytics Manager, dbt Labs

Andrew Escay, Lead Data Analyst, dbt Labs

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale: https://www.getdbt.com/blog/coalesce-2024-product-announcements

Analytics Analytics Engineering Cloud Computing dbt
Rachel House – Senior Developer Advocate @ Great Expectations

The dbt Labs 2024 State of Analytics Engineering report highlights that stakeholder data literacy remains a problem in the modern data workplace. Data stakeholders and data professionals can both benefit from learning foundational data literacy concepts that foster their ability to reason about working with data in a business environment. In this talk, an expert from Great Expectations covers two key concepts that they've applied in their own career when framing data fundamentals: “the data supply chain” and “ML in a nutshell.”

Speaker: Rachel House, Senior Developer Advocate, Great Expectations

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale: https://www.getdbt.com/blog/coalesce-2024-product-announcements

AI/ML Analytics Analytics Engineering Cloud Computing dbt
Copenhagen dbt Meetup vol. 5 2024-10-10 · 15:30

Mark your calendars for the fifth dbt Meetup in Copenhagen! Thursday, the 10th of October 🧡

We have the speaker line-up ready for you, and we will have three exciting speakers!

Date: Thursday the 10th of October

🤝 Organizer: Intellishore

🏠 Venue Host: group.one - Kalvebod Brygge 24, 1560 København

Agenda & Speakers

17:30: Meet new people and catch up with old acquaintances over drinks and snacks 🥤

17:45: Event kick-off and Welcome - Hosted by Marie-Christin Henkelmann, Senior Consultant at Intellishore 💬

18:00-20:00 - Speaker sessions 💬

Deep Dives on CI and Orchestration @ group.one, by Emil Nilsson, Director of Data at group.one

In this session, group.one will share their experiences with building a data team in an M&A-driven organisation, discussing key decisions and learnings along the way. Hear about specific aspects of their data stack, including CI processes and the orchestration of dbt jobs.

Schedule-less Data Architecture @ Ageras, by August Hörlën, Senior Analytics Engineer at Ageras

Learn how Ageras transformed its data architecture by leveraging dynamic tables in Snowflake and dbt. This talk shows how they moved beyond traditional scheduling to create a responsive, schedule-less pipeline that ensures data is processed exactly when needed.

Late-stage transformations: Utilizing dbt Semantic Layer metrics, by Erica Louie, Sr. Analytics Manager at dbt Labs & Andrew Escay, Lead Data Analyst at dbt Labs

"Look at these beautiful dbt models! Why are we still experiencing the same friction with stakeholders?" This talk from experts at dbt Labs argues that we solved the first stage of building "Transformations" (per the 2024 State of Analytics Engineering report), and that we're now in the second stage: "The Philosophy of Transformations". And all roads lead to "metrics".

20:00: 🍕 Food & socializing - Continue the conversation over food and drinks.

20:45: Over and out 🙌🏼

➡️ Join the dbt Slack community: https://www.getdbt.com/community/

🤝 For the best Meetup experience, make sure to join the #local-denmark channel in dbt Slack (https://slack.getdbt.com/)

----------------------------------

dbt is the standard in data transformation, used by over 40,000 organizations worldwide. Through the application of software engineering best practices like modularity, version control, testing, and documentation, dbt’s analytics engineering workflow helps teams work more efficiently to produce data the entire organization can trust. Learn more: https://www.getdbt.com/

To attend, please read the Health and Safety Policy and Terms of Participation: https://www.getdbt.com/legal/health-and-safety-policy

Copenhagen dbt Meetup vol. 5

The results are in for The State of Analytics Engineering Report — the pains and gains shared by data practitioners and leaders in dbt Labs’ annual survey of this fast-changing space. Check out Erica, Adam and Amada as they discuss industry benchmarks, macro trends shaping the industry, and strategies for creating effective data organizations 🔥

If you're a first-timer, dbt Meetups are networking events open to all folks working with data!

🤝 Organizer: dbt Labs

🏠 Venue Host: Materialize (https://materialize.com/) - Thank you for hosting us in your office! 6th floor, 436 Lafayette St, New York, NY 10003

🍕 Catering: expect light food and beverages

📝 Agenda

5:30 - 6:15pm | Check-in, nametags and mingling
6:15 - 7:05pm | Presentation/Discussion of the State of the Analytics Engineering Report results with Amada Echeverria (Community @ dbt Labs), Adam Stone (Analytics Engineering Manager @ BDC, a Velir Company), and Erica Louie (Senior Manager of Analytics Engineering at dbt Labs)
7:05 - 7:20pm | Q&A
7:20 - 8:15pm | Networking, bites and drinks

Our venue has capacity limits, so please only RSVP if you intend to come. If your plans change, set your RSVP status on Meetup to "Not Going", or message Amada Echeverria on Meetup if you have an issue.

To attend, please read the Health and Safety Policy and Terms of Participation: https://bit.ly/4azcreT

➡️ Join the dbt Slack community: https://www.getdbt.com/community/

🤝 For the best Meetup experience, make sure to join the #local-nyc channel in dbt Slack (https://slack.getdbt.com/)

----------------------------------

dbt allows teams to ship trusted data products, faster. dbt is a data transformation framework that lets analysts and engineers collaborate using their shared knowledge of SQL. Through the application of software engineering best practices like modularity, version control, testing, and documentation, dbt’s analytics engineering workflow helps teams work more efficiently to produce data the entire organization can trust.

Learn more: https://www.getdbt.com/

The State of Analytics Engineering Report
Rob Goretsky – guest, Gleb Mezhanskiy – guest @ Datafold, Tobias Macey – host

Summary

All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!
  • Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack

Interview

Introduction

How did you get involved in the area of data management?

A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?

Is it possible to completely avoid having to invest in a migration?

What are the signals that point to the need for a migration?

What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one)

What are some signals that a migration is not the right solution for a perceived problem?

Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?

What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?

What are some of the ways that a migration effort might fail?

What are the major pitfalls that teams need to be aware of as they work through a data platform migration?

What are the opportunities for automation during the migration process?

What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?

What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?

Contact Info

Gleb

LinkedIn, @glebmm on Twitter

Rob

LinkedIn, RobGoretsky on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Datafold

Podcast Episode

Informatica, Airflow, Snowflake

Podcast Episode

Redshift, Eventbrite, Teradata, BigQuery, Trino, EMR (Elastic MapReduce), Shadow IT

Podcast Episode

Mode Analytics, Looker, Sunk Cost Fallacy, data-diff

Podcast Episode

SQLGlot, Dagster, dbt

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Hex: Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at dataengineeringpodcast.com/hex and get 30 days free!

RudderStack: Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Support Data Engineering Podcast

AI/ML Airflow Analytics Amazon EMR BigQuery Dagster Data Engineering Data Management Data Science Datafold dbt ELK GitHub Informatica Looker Python Redshift SaaS Snowflake SQL Teradata Trino
Yoav Cohen – co-founder and CTO @ Satori, Tobias Macey – host

Summary

As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today
  • RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder
  • Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
  • Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization

Interview

Introduction

How did you get involved in the area of data management?

Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?

How has the scope and complexity of implementing security controls on data systems changed in recent years?

In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?

What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?

How much of the problem is technical vs. procedural/organizational?

As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)

What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)

What are some of the ways that data security and organizational productivity are at odds with each other?

What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?

What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?

How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use? How can education about the motivations for different security practices improve compliance and user experience?

What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology? What are the areas of data security that still need improvements?

Contact Info

Yoav Cohen

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Satori

Podcast Episode

Data Masking, RBAC (Role-Based Access Control), ABAC (Attribute-Based Access Control), Gartner Data Security Platform Report

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

RudderStack: Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit RudderStack.com/DEP to learn more

Data Council: Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: dataengineeringpodcast.com/data-council Promo Code: dataengpod20

TimeXtender: TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.

You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.

Go to dataengineeringpodcast.com/timextender today to get started for free!

Support Data Engineering Podcast

AI/ML Analytics CDP Data Engineering Data Management Data Science JavaScript Modern Data Stack Python Cyber Security
Mark Van de Wiel – guest @ Fivetran, Tobias Macey – host

Summary

Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation for information to be instantly accessible drives the need for reliable change data capture. The team at Fivetran have recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture from various sources.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
  • You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
  • Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
  • Your host is Tobias Macey and today I’m interviewing Mark Van de Wiel about Fivetran’s implementation of change data capture

Analytics AWS Azure BigQuery CDP Cloud Computing Dashboard Data Engineering Data Lake Data Management Data Quality Databricks ETL/ELT Fivetran GCP Java Kubernetes MongoDB MySQL postgresql Python Scala Snowflake Spark SQL Data Streaming
Tommy Yionoulis – Founder @ OpsAnalitica, Tobias Macey – host

Summary

In order to improve efficiency in any business you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations it becomes even more challenging and important to gain insights into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi location businesses and their operational practices. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties and how business owners can use the provided analytics to support their staff in their duties.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
  • Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
  • You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial.

Analytics AWS Azure BigQuery CDP Cloud Computing Dashboard Data Collection Data Engineering Data Lake Data Management Data Quality Databricks ETL/ELT GCP Java Kubernetes MongoDB MySQL postgresql Python Scala Snowflake Spark SQL Data Streaming
Shruti Bhat – SVP of Product @ Rockset, Tobias Macey – host

Summary

Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset is doing to make it easier for engineers to build those experiences.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
  • Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
  • The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
  • Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
  • Your host is Tobias Macey and today I’m interviewing Shruti Bhat

Analytics AWS Azure BigQuery Cloud Computing Data Engineering Data Lakehouse Data Management Data Quality Databricks DWH GCP Java Kubernetes MongoDB MySQL postgresql Python Scala Snowflake Spark SQL
Chris Riccomini – guest @ WePay; LinkedIn, Tobias Macey – host

Summary

Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
  • Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
  • Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development

Interview

Introduction

How did you get involved in the area of data management?

What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?

What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?

The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent?

What are the principles that still need to be translated for data practitioners?

Which are subject to impedance mismatch and may never make sense to translate?

As someone with a software engineering background and extensive experience…

Analytics AWS Azure BigQuery CDP CI/CD Cloud Computing Data Engineering Data Lake Data Management Databricks ETL/ELT GCP IaC Java Kubernetes MongoDB MySQL postgresql Python Scala Snowflake Spark SQL Data Streaming
Akshay Deshpande – guest, Dan Weitzner – guest, Saket Saurabh – CEO @ Nexla, Maarten Masschelein – guest @ Soda Data, Tobias Macey – host

Summary

Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses working in the data management and analytics space, they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market and the challenges facing businesses and data professionals today.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you each describe what you view as the biggest challenge facing data professionals?
- Who are you building your solutions for, and what are the most common data management problems you are solving?
- What are the different components of data management, and why is it so complex? What, if anything, will simplify this process?
- The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?
- How has the data management space changed in recent times? Describe the current data management landscape and any key developments.
- From your perspective, what are the biggest challenges in the data management space today?
- What modern data management features are lacking in existing databases?
- Gartner imagines a future where data and analytics leaders need to be prepared to rely on data manage

AI/ML Analytics Data Engineering Data Governance Data Management DataOps GitHub Kubernetes Looker Modern Data Stack Fabric MLOps Redshift Cyber Security Snowflake SQL
Julien Le Dem – creator of Parquet, Tobias Macey – host

Summary

Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems, Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in co-operating on open standards.
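As a concrete taste of the specification, here is a minimal sketch using the openlineage-python client to emit the START and COMPLETE events that bracket a job run. The backend URL, namespace, and job name are invented, and the client API has evolved across releases, so treat this as illustrative rather than canonical; check the current OpenLineage docs before relying on it.

```python
# A minimal sketch of emitting OpenLineage run events. Assumes a
# compatible backend (e.g. Marquez) is listening on localhost:5000;
# the namespace, job name, and producer URL are invented.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))  # same runId ties both events together
job = Job(namespace="example_namespace", name="daily_orders_load")
producer = "https://example.com/my-pipeline"  # identifies the emitter

# One START and one COMPLETE event bracket the job run, so any
# OpenLineage-aware tool can reconstruct when the job ran and tie
# its inputs and outputs into a cross-system lineage graph.
for state in (RunState.START, RunState.COMPLETE):
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            producer=producer,
        )
    )
```

Because the event shape is defined by the spec rather than by any one vendor, the same events can be consumed by any OpenLineage-aware backend, which is the interoperability point the episode focuses on.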

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance, whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large-scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
- Your host is Tobias Macey and today I’m interviewing Julien Le Dem about OpenLineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools.

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the OpenLineage project is and the story behind it?
- What is the current state of t

AI/ML Analytics BigQuery Cloud Computing Data Engineering Data Management Data Quality DWH Kubernetes Redshift Snowflake Data Streaming