talk-data.com


This practical, in-depth guide shows you how to build modern, sophisticated data processes using the Snowflake platform and DataOps.live, the only platform that enables seamless DataOps integration with Snowflake. Designed for data engineers, architects, and technical leaders, it bridges the gap between DataOps theory and real-world implementation, helping you take control of your data pipelines to deliver more efficient, automated solutions. You'll explore the core principles of DataOps and how they differ from traditional DevOps, while gaining a solid foundation in the tools and technologies that power modern data management, including Git, dbt, and Snowflake. Through hands-on examples and detailed walkthroughs, you'll learn how to implement your own DataOps strategy within Snowflake and maximize the power of DataOps.live to scale and refine your DataOps processes. Whether you're just starting with DataOps or looking to refine and scale your existing strategies, this book, complete with practical code examples and starter projects, provides the knowledge and tools you need to streamline data operations, integrate DataOps into your Snowflake infrastructure, and stay ahead of the curve in the rapidly evolving world of data management.

What You Will Learn

  • Explore the fundamentals of DataOps, its differences from DevOps, and its significance in modern data management
  • Understand Git's role in DataOps and how to use it effectively
  • Know why dbt is preferred for DataOps and how to apply it
  • Set up and manage DataOps.live within the Snowflake ecosystem
  • Apply advanced techniques to scale and evolve your DataOps strategy

Who This Book Is For

Snowflake practitioners, including data engineers, platform architects, and technical managers, who are ready to implement DataOps principles and streamline complex data workflows using DataOps.live.

data data-engineering Snowflake Data Management DataOps dbt DevOps Git
O'Reilly Data Engineering Books

🌟 Session Overview 🌟

Session Name: Open Source Entity Resolution - Needs and Challenges
Speaker: Sonal Goyal
Session Description: Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems, and they have variations across fields, which makes it hard to combine them, especially with growing data volumes. This hurts customer analytics - establishing lifetime value, loyalty programs, or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce the right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.

With a modern data stack and DataOps, we have established patterns for the E and L in ELT for building data warehouses, data lakes, and delta lakes. However, the T - getting data ready for analytics - still needs a lot of effort. Modern tools like dbt are actively and successfully addressing this. What is also needed is a quick and scalable way to resolve entities to build the single source of truth of core business entities post Extraction and pre or post Loading.

This session would cover the problem of Entity Resolution, its practical applications and challenges in building an entity resolution system. It will also cover Zingg - an Open Source Framework for building Entity Resolution systems. (https://github.com/zinggAI/zingg/) 🚀 About Big Data and RPA 2024 🚀
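The matching problem the session describes can be sketched in a few lines. This is a toy illustration, not Zingg's actual approach (Zingg learns blocking and matching models); the records, field weights, and threshold below are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy customer records with variations across fields.
records = [
    {"id": 1, "name": "Sonal Goyal", "email": "sonal.goyal@example.com"},
    {"id": 2, "name": "Goyal, Sonal", "email": "sonal.goyal@example.com"},
    {"id": 3, "name": "Alex Smith", "email": "alex@example.com"},
]

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_pairs(records, threshold=0.6):
    """Return id pairs whose blended field similarity clears the threshold."""
    pairs = []
    for r1, r2 in combinations(records, 2):
        score = (0.5 * similarity(r1["name"], r2["name"])
                 + 0.5 * similarity(r1["email"], r2["email"]))
        if score >= threshold:
            pairs.append((r1["id"], r2["id"]))
    return pairs

print(match_pairs(records))
```

Even this sketch shows why the problem is hard at scale: naive pairwise comparison is quadratic in the number of records, which is why real systems add blocking and learned models.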

Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨

📅 Yearly Conferences: Curious about the evolution of big data and RPA? Check out our archive of past Big Data & RPA sessions. Watch the strategies and technologies evolve in our videos! 🚀
🔗 Find Other Years' Videos:
2023 Big Data Conference Europe: https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g
2022 Big Data Conference Europe Online: https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT
2021 Big Data Conference Europe Online: https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP

💡 Stay Connected & Updated 💡

Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!

🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/
👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/
🐦 Twitter: @BigDataConfEU, @europe_rpa
🔗 LinkedIn: https://www.linkedin.com/company/73234449/, https://www.linkedin.com/company/75464753/
🎥 YouTube: http://www.youtube.com/@DATAMINERLT

AI/ML Analytics Big Data Dashboard DataOps dbt ETL/ELT GitHub Marketing Modern Data Stack
DATA MINER Big Data Europe Conference 2020

dbt Meetups are networking events open to all folks working with data! Talks predominantly focus on community members’ experience with dbt, however, you’ll catch presentations on broader topics such as analytics engineering, data stacks, DataOps, modeling, testing, and team structures.

📍 Venue Host: Utopicus Habana (P.º de La Habana, 9, 11, 28036 Madrid) 🍕 Catering: Drinks & Pizza at the place of the event 🤝 Organizer: Astrafy is organizing this event, enabled by the community team at dbt Labs

To attend, please read the Health and Safety Policy and Terms of Participation: https://www.getdbt.com/legal/health-and-safety-policy

🗓️Agenda:

  • 18:45 - 19:00 | Welcome
  • 19:00 - 19:20 | Data validation with dbt: Going beyond testing (Alejandro de la Cruz Lopez - Astrafy)
  • 19:25 - 19:45 | How Airflow + dbt Work Together at Okta (Miquel Angel Andreu Febrer - Okta)
  • 19:50 - 20:10 | Leveraging column-level lineage to scale your dbt projects (Benoit Perigaud - dbt)
  • 20:15 - 22:00 | Networking with drinks & pizzas 🍺 🍕

🗣️Presentation #1: dbt can be leveraged for more than just basic testing. We will dive into advanced data validation techniques that ensure data quality beyond conventional testing in dbt, using Recce, an emerging tool that enables validation checks and smoother approval requests.
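As a rough illustration of validation beyond schema tests, the sketch below compares simple statistical profiles of the same model built in two environments, in the spirit of what tools like Recce automate in pull requests. The data, tolerance, and function names here are hypothetical.

```python
from statistics import mean

def profile(rows, column):
    """Summarize one column of a result set."""
    values = [r[column] for r in rows]
    return {"count": len(values), "mean": round(mean(values), 2)}

def validate(prod_rows, dev_rows, column, tolerance=0.05):
    """Pass each check if dev stays within tolerance of prod."""
    p, d = profile(prod_rows, column), profile(dev_rows, column)
    return {
        "row_count": abs(d["count"] - p["count"]) / p["count"] <= tolerance,
        "mean": abs(d["mean"] - p["mean"]) / abs(p["mean"]) <= tolerance,
    }

# A schema test would pass both builds; a drift check can still flag them.
prod = [{"amount": a} for a in (100, 110, 90, 105)]
dev = [{"amount": a} for a in (100, 110, 90, 104)]
print(validate(prod, dev, "amount"))
```

The point is that the comparison is between two builds of the same model, not between a model and a hand-written expectation, which is what makes it useful for approving changes.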

Speaker bio: Alejandro de la Cruz López is an experienced Data Engineer with a strong background in Data Science and Artificial Intelligence. He has led various data projects, optimizing systems and improving infrastructure for several organizations. Alejandro holds multiple professional certifications and has authored articles on data engineering practices. His work focuses on delivering efficient, scalable data solutions in the cloud.

---

🗣️Presentation #2: In this presentation, Miquel Angel will show how Okta dynamically builds all dbt DAGs from upstream to downstream based on tags and the dbt project structure, automates tests inside the dags, and uses the same warehouse configuration for both dbt runs and tests.
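The tag-driven DAG construction described above can be sketched with the standard library's graphlib. The model names, tags, and dependency layout below are invented; a real dbt project exposes this information through its manifest.json artifact.

```python
from graphlib import TopologicalSorter

# Hypothetical dbt-style project: each model lists its upstream deps and tags.
models = {
    "stg_users":    {"deps": [],             "tags": {"daily"}},
    "stg_events":   {"deps": [],             "tags": {"daily"}},
    "fct_sessions": {"deps": ["stg_events"], "tags": {"daily"}},
    "rpt_finance":  {"deps": ["stg_users"],  "tags": {"weekly"}},
}

def run_order(models, tag):
    """Models carrying the tag plus their upstreams, in dependency order."""
    selected = {name for name, m in models.items() if tag in m["tags"]}
    # Pull in upstream dependencies so the selected DAG is runnable end to end.
    frontier = list(selected)
    while frontier:
        for dep in models[frontier.pop()]["deps"]:
            if dep not in selected:
                selected.add(dep)
                frontier.append(dep)
    graph = {n: [d for d in models[n]["deps"] if d in selected] for n in selected}
    return list(TopologicalSorter(graph).static_order())

print(run_order(models, "daily"))
```

An orchestrator task per model can then be generated from this order, which is roughly the pattern the talk describes for building Airflow DAGs from tags.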

Speaker bio: Data Engineer specializing in ETL, big data processes, and DevOps

🗣️Presentation #3: Today, we have tools to enforce quality checks on projects at the model level, like dbt_project_evaluator. Those tools are indispensable for allowing teams to scale their dbt transformations, but while we've been focusing on rules at the model level, could we leverage column-level lineage (CLL) to also define rules at the column level? The idea of this talk is to build an open-source tool and present what problems it can solve.
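One way to picture a column-level rule: given column-level lineage (target column mapped to its source columns), enforce a policy such as "columns derived from PII must keep a PII tag". The table, column names, and rule below are invented for illustration.

```python
# Hypothetical column-level lineage: (table, column) -> upstream (table, column)s.
column_lineage = {
    ("dim_users", "email_hash"): [("raw_users", "email")],
    ("dim_users", "signup_day"): [("raw_users", "created_at")],
}
pii_columns = {("raw_users", "email")}      # columns known to hold PII
tagged_pii = {("dim_users", "email_hash")}  # downstream columns tagged as PII

def pii_violations(lineage, pii, tagged):
    """Columns that inherit PII upstream but are missing the PII tag."""
    return sorted(
        target for target, sources in lineage.items()
        if any(s in pii for s in sources) and target not in tagged
    )

print(pii_violations(column_lineage, pii_columns, tagged_pii))
```

Model-level tools cannot express this check at all, which is the gap the talk points at: the rule only makes sense once lineage is tracked per column.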

Speaker bio: Staff Analytics Engineer at dbt Labs

➡️ Join the dbt Slack community: https://www.getdbt.com/community/ 🤝For the best Meetup experience, make sure to join the #local-madrid channel in dbt Slack (https://slack.getdbt.com/). ---------------------------------- dbt is the standard in data transformation, used by over 40,000 organizations worldwide. Through the application of software engineering best practices like modularity, version control, testing, and documentation, dbt’s analytics engineering workflow helps teams work more efficiently to produce data the entire organization can trust. Learn more: https://www.getdbt.com/

Madrid dbt Meetup #5 (in-person)
Stockholm dbt Meetup 2024-05-30 · 15:30

What are dbt Meetups? dbt Meetups are networking events open to all folks working with data! Talks predominantly focus on community members' experience with dbt, however, you'll catch presentations on broader topics such as analytics engineering, data stacks, data ops, modeling, testing, and team structures.

🤝Organizer: Solita 🏠Venue Host: Solita, Lästmakargatan 10, 111 44 Stockholm SWEDEN 🍕Refreshments: Pizza & Drinks

📝Agenda
17.30 - 18.00 Registration and mingle
18.00 - 18.15 Welcome
18.15 - 18.45 🎤 Presentation: LLM-assisted generation of dbt models for enterprise-scale projects - Ludwig Sewall (Solita)

Migrating projects with 1000+ models from legacy ETL tools to dbt can be tedious, manual work. This talk will give you ideas and hands-on strategies for automating that process. We will cover everything from migration strategies and source code parsing to actual model generation in dbt.

Ludwig is an experienced architect and leader in the analytics space, and one of 81 Snowflake Data Superheroes. He has successfully developed and implemented petabyte-scale data solutions that serve thousands of users. Ludwig is deeply committed to fostering the Snowflake and dbt communities through his leadership of strategic partnerships at Solita.

18.45 - 19.15 🎤 Presentation: Leveraging your dbt metadata to build powerful automation - Fernando Brito (SELECT)

Every dbt project exposes metadata that can be used to introspect and augment the project itself. Be inspired to build your own automation by learning how Fernando developed an open-source project to automatically connect dashboards from a business intelligence tool with the dbt models they consume, exposing their relationships in dbt's data lineage.
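A minimal sketch of the metadata join this talk describes, assuming a heavily simplified manifest shape and hypothetical dashboard definitions (real dbt manifests and BI-tool APIs are far richer than this):

```python
import json

# Hypothetical, simplified slice of a dbt manifest: node id -> built relation.
manifest = {"nodes": {
    "model.shop.fct_orders": {"relation": "analytics.fct_orders"},
    "model.shop.dim_users":  {"relation": "analytics.dim_users"},
}}
# Hypothetical dashboard definitions exported from a BI tool.
dashboards = [
    {"name": "Revenue", "queries": ["select * from analytics.fct_orders"]},
]

def dashboards_per_model(manifest, dashboards):
    """Map each dbt model to the dashboards whose SQL references its relation."""
    usage = {}
    for node, meta in manifest["nodes"].items():
        hits = [d["name"] for d in dashboards
                if any(meta["relation"] in q for q in d["queries"])]
        if hits:
            usage[node] = hits
    return usage

print(json.dumps(dashboards_per_model(manifest, dashboards)))
```

In a real project the resulting mapping could be written back as dbt exposures, which is how dashboard relationships surface in dbt's lineage graph.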

Fernando is a Lead Data Engineer with a background in Software Engineering. He is passionate about DataOps and has built data platforms from scratch that serve terabytes of data to hundreds of internal users. Today he works as a consultant at SELECT, a cost management platform for data platforms.

19.15 - 19.45 🗣️🤝 Peer Exchange/ Community Discussion: Joint table discussion organised by moderator Kshitij Aranke (dbt Labs) and Solita. We will have a discussion where you choose 1 of 3 discussion groups to join: 1) Embracing AI, 2) Data Analytics at Scale, or 3) Analytics Engineering Best Practices

19.45 - 21.30 Networking, Food & Drinks

To attend, please read the Health and Safety Policy and Terms of Participation: getdbt.com/legal/health-and-safety-policy

➡️ Join the dbt Slack community: https://www.getdbt.com/community/ 🤝For the best Meetup experience, make sure to join the #local-stockholm channel in dbt Slack (https://slack.getdbt.com/). ---------------------------------- dbt allows teams to ship trusted data products, faster. dbt is a data transformation framework that lets analysts and engineers collaborate using their shared knowledge of SQL. Through the application of software engineering best practices like modularity, version control, testing, and documentation, dbt’s analytics engineering workflow helps teams work more efficiently to produce data the entire organization can trust.

Learn more: https://www.getdbt.com/

Stockholm dbt Meetup

Sponsored by Amdaris - https://amdaris.com Amdaris is your trusted partner for high velocity extended delivery teams. Location: Amdaris, Finzels Reach, Aurora, Bristol BS1 6BX

AGENDA
18:00 - 18:30 Meet & Greet
--------------
18:30 - 19:15 DataOps by Jonathan D'Aloia

In this session I will demonstrate how you can implement Private Link technology to secure your PaaS components across your data platform.

Private endpoints are talked about more and more, so as part of this session I will describe what Private Link technology is. I'll then go on to demonstrate the use cases for Private Link, the considerations that need to be taken into account, and some example architectures that can be used when implementing it across modern data platforms.

I will also show how Private Link can be deployed and implemented through Infrastructure as code to enable you to scale and deploy Private Link in a consistent and secure way.

In this session I'll cover what the Data Lakehouse architecture is, where it fits against existing architectures like a data warehouse, and why you should build one. We'll also cover the underlying technology options to arm you with all of the information you need to plan your next data platform.

-------------- 19:15 - 19:45 Pizza and Networking --------------

19:45 - 20:30 Introduction to Data Build Tools by James C Yarrow

DBT stands for Data Build Tools and has become a leading choice for Transformation in the ELT data stack. DBT is a transformation workflow that can generate boilerplate, modularize your logic and generate test and error logic on the data itself. We will see how this can help us manage our SQL and business logic.
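The modularization idea can be illustrated with a toy version of dbt-style ref() resolution: models reference each other by name, and the tool resolves names to concrete relations at compile time. The regex and relation mapping below are a deliberate simplification of what dbt's Jinja compilation actually does.

```python
import re

# Hypothetical mapping from model name to the relation it builds.
relations = {"stg_orders": "analytics.stg_orders"}

def compile_model(sql, relations):
    """Replace {{ ref('name') }} placeholders with fully qualified relations."""
    return re.sub(r"\{\{\s*ref\('(\w+)'\)\s*\}\}",
                  lambda m: relations[m.group(1)], sql)

model_sql = "select order_id, amount from {{ ref('stg_orders') }}"
print(compile_model(model_sql, relations))
```

Because every dependency goes through ref(), the tool knows the full dependency graph and can generate boilerplate, run order, and tests from it, which is the workflow the session walks through.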

We'll begin with a level-100 introduction and build up the concepts from there. This session expects you to be comfortable with SQL and to have a very basic understanding of Python.

-------------- 20:30 - Pub --------------

About Amdaris Of course, we’re obsessed with cutting-edge technology, but it’s how much we care about people that sets us apart. Whether that’s looking after our clients, our staff, or the next generation of tech talent, we know that exceptional software development is only possible with exceptional teamwork. Whether you need help extending your team, building your big idea, or supporting your applications, we offer a better way to do software.

DataOps and Data Build Tools DBT
Toby Mao – guest @ SQLMesh , Tobias Macey – host

Summary

Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in

Interview

Introduction How did you get involved in the area of data management? Can you describe what SQLMesh is and the story behind it?

DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?

What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?

How do those rough edges impact the productivity and effectiveness of teams using those tools?

Can you describe how SQLMesh is implemented?

How have the design and goals evolved since you first started working on it?

What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
For teams who have already invested in dbt, what is the migration path from or integration with dbt?
You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
What do you see as the potential benefits of integration with e.g. data-diff?
What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
When is SQLMesh the wrong choice?
What do you have planned for the future of SQLMesh?

Contact Info

tobymao on GitHub
@captaintobs on Twitter
Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

SQLMesh
Tobiko Data
SAS
AirBnB
Minerva
SQLGlot
Cron
AST == Abstract Syntax Tree
Pandas
Terraform
dbt

Podcast Episode

SQLFluff

Podcast.init Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra

AI/ML Airflow CDP Data Engineering Data Lake Data Management DataOps dbt GitHub ORC Pandas Python SAS SQL SQLFluff SQLMesh Data Streaming Terraform

Summary

This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years

Interview

Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role

Followed on from hype about "data science"

Hadoop era Streaming Lambda and Kappa architectures

Not really referenced anymore

"Big Data" era of capture everything has shifted to focusing on data that presents value

Regulatory environment increases risk, better tools introduce more capability to understand what data is useful

Data catalogs

Amundsen and Alation

Orchestration engine

Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc. Orchestration is now a part of most vertical tools

Cloud data warehouses
Data lakes
DataOps and MLOps
Data quality to data observability
Metadata for everything

Data catalog -> data discovery -> active metadata

Business intelligence

Read-only reports to metric/semantic layers
Embedded analytics and data APIs

Rise of ELT

dbt
Corresponding introduction of reverse ETL

What are the most interesting, unexpected, or challenging lessons that you have learned while running the podcast? What do you have planned for the future of the podcast?

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Materialize

Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.

Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.

Go to materialize.com

Support Data Engineering Podcast

AI/ML Airflow Alation Analytics API AWS Lambda BI Big Data Cloud Computing Dagster Data Engineering Data Management Data Quality Data Science DataOps dbt ETL/ELT Hadoop Luigi MLOps postgresql Prefect Python SQL Data Streaming
Itamar Ben Hemo – CEO and founder @ Rivery , Tobias Macey – host

Summary

Data engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done, which can introduce a significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben Hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. 
No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now! Your host is Tobias Macey and today I’m interviewing Itamar Ben Hemo about Rivery, a SaaS platform designed to provide an end-to-end solution for Ingestion, Transformation, Orchestration,

Airflow Analytics CI/CD Data Engineering Data Management Data Quality Datafold DataOps dbt GitHub Kubernetes Looker Modern Data Stack SaaS Snowflake SQL
Saket Saurabh – CEO @ Nexla , Avinash Shahdadpuri – CTO @ Nexla , Tobias Macey – host

Summary

The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. 
With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across businesses

Interview

Introduction How did you get involved in the area of data management? Can you describe what Nexla is and the story behind it? What are the major problems that Nexla is aiming to solve?

What are the components of a data platform that Nexla might replace?

What are the use cases and benefits of being able to publish data sets for use outside and across organizations? What are the different elements involved in implementing DataOps? How is the Nexla platform implemented?

What have been the most complex engineering challenges? How has the architecture changed or evolved since you first began working on it? What are some of the assumpt

AI/ML Airflow API Cloud Computing CSV Dashboard Data Engineering Data Management Data Quality DataOps dbt Hubspot Kubernetes Marketing Python SaaS SQL
Shevek – CTO @ Compilerworks , Tobias Macey – host

Summary

A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show!

Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps teams reacting to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator, such as Apache Airflow, and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.

We've all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.

Your host is Tobias Macey and today I'm interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code.
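The incremental-sync bookkeeping mentioned above boils down to tracking a high-water mark between runs. The sketch below is a hypothetical toy (the `orders` table, column names, and timestamps are invented, and real sync tools like Census add quota handling, retries, and schema mapping on top), but it shows the core pattern:

```python
import sqlite3

# Toy warehouse: an orders table with an updated_at timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 9.99, "2021-06-01"), (2, 19.99, "2021-06-02"), (3, 5.00, "2021-06-03")],
)

def sync_incrementally(conn, high_water_mark):
    """Fetch only rows changed since the last successful sync, returning
    the rows to push and the new high-water mark to persist."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at",
        (high_water_mark,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark

# First sync picks up everything after the stored mark...
rows, mark = sync_incrementally(conn, "2021-06-01")
# ...and a second sync with the new mark finds nothing new to push.
again, _ = sync_incrementally(conn, mark)
```

Persisting `mark` between runs is what keeps repeated syncs cheap and idempotent.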

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Compilerworks is and the story behind it?
What is a compiler?

How are you applying compilers to the challenges of data processing systems?

What are some use cases that Compilerworks is uniquely well suited to?
There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
Can you describe the design and implementation of the Compilerworks platform?

How has the system changed or evolved since you first began working on it?

What programming languages and SQL dialects do you currently support?

Which have been the most challenging to work with?
How do you handle verification/validation of the algebraic representation of SQL code, given the variability of implementations and the flexibility of the specification?

Can you talk through the process of getting Compilerworks
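To make the lineage-from-SQL idea discussed above concrete, here is a deliberately naive sketch: it uses a regex instead of a real compiler front end, recognizes only one `INSERT ... SELECT` statement shape, and tracks table-level rather than column-level flow. It is not how Compilerworks works, only an illustration of the input/output shape of lineage extraction:

```python
import re

def toy_lineage(sql):
    """Extract a (sources -> target) lineage edge from a single
    INSERT ... SELECT statement.  A real SQL compiler parses each
    dialect into a full AST and resolves lineage per column; this
    regex sketch only finds table names in one statement shape."""
    target = re.search(r"INSERT\s+INTO\s+(\w+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.I)
    return {"target": target.group(1), "sources": sorted(set(sources))}

edge = toy_lineage(
    "INSERT INTO revenue SELECT o.amount, c.region "
    "FROM orders o JOIN customers c ON o.customer_id = c.id"
)
# edge == {"target": "revenue", "sources": ["customers", "orders"]}
```

Chaining such edges across every statement in a codebase is what yields the end-to-end lineage graph; the hard part, as the interview questions suggest, is parsing every dialect faithfully.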

AI/ML Airflow API Cloud Computing CSV Dashboard Data Engineering Data Management Data Quality DataOps dbt Hubspot Kubernetes Marketing Python SaaS SQL
Maxime Beauchemin – guest , Kevin Stumpf – guest @ Tecton , Tobias Macey – host , Lior Gavish – co-founder @ Monte Carlo

Summary The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. DataOps is more than just a collection of tools: a proper DataOps approach depends on a number of organizational and conceptual changes. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page, then this conversation is for you.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature, which instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack's smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I'm interviewing Max Beauchemin, Lior Gavish, and Kevin Stumpf about the real-world challenges of embracing DataOps practices and systems, and how to keep things secure as you scale.
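The row-level half of the data-diff idea described above can be sketched in a few lines. This is a hypothetical toy, not Datafold's implementation: it keys two in-memory row sets on a primary key and reports what a code change added, removed, or altered:

```python
def data_diff(before, after, key="id"):
    """Compare two row sets keyed by a primary key and report added,
    removed, and changed rows -- a toy version of the regression-testing
    idea behind data-diffing tools."""
    b = {row[key]: row for row in before}
    a = {row[key]: row for row in after}
    return {
        "added": sorted(a.keys() - b.keys()),
        "removed": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }

old = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
new = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 3, "total": 5}]
diff = data_diff(old, new)
# diff == {"added": [3], "removed": [], "changed": [2]}
```

Production tools run the same comparison inside the warehouse and add the statistical summaries, since pulling full tables into memory does not scale.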

Interview

Introduction
How did you get involved in the area of data management?
Before we get started, can you each give your definition of what "DataOps" means to you?

How does this differ from "business as usual" in the data industry?
What are some of the things that DataOps isn't (despite what marketers might say)?

What are the biggest difficulties that you have faced in going from concept to production with a workflow or system intended to power self-serve access to other members

Airflow BI BigQuery CI/CD Cloud Computing Data Engineering Data Management Data Quality Datafold DataOps dbt DevOps DWH ETL/ELT Kubernetes Monte Carlo Redshift Cyber Security Snowflake Data Streaming