talk-data.com
Activities & events
Evolution & Future of Apache Iceberg (FR)
2025-06-19 · 19:30
Event: Iceberg
The Advent of the Open Data Lake (FR)
2025-06-19 · 18:55
NYC OpenLineage meetup at Datadog
2024-11-19 · 23:00
Data engineers and pipeline managers know that producing data lineage – end-to-end pipeline metadata instrumented at runtime or parsed at design time – is a heavy lift without a shared standard for lineage metadata. It requires duplication of effort across pipeline tooling, and deployment of new tools can break existing lineage workflows. Getting useful lineage can seem like a Sisyphean task. Enter OpenLineage, an increasingly adopted open standard for lineage metadata collection. It defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible by defining specific facets to enrich those entities.

Join us at the Datadog Headquarters in NYC on November 19th at 6:00 pm to learn more about the OpenLineage spec and integrations. You'll meet other members of the ecosystem, learn about the project's goals and fundamental design, and participate in a robust discussion about the future of the project.

Agenda:
- Julien Le Dem (Project lead) & Harel Shein (TSC member), OpenLineage
- Julian LaNeve (CTO) & Christine Shen (Software Engineer), Astronomer: Interpreting OpenLineage events to power Astro Observe
- Igor Kravchenko (Staff Engineer) & Ryan Warrier (Sr. Product Manager), Datadog: How OpenLineage fits into the Datadog ecosystem

Interpreting OpenLineage events to power Astro Observe: Astronomer recently announced Observe, an observability platform powered by OpenLineage. We've created experiences designed around the reliability and efficiency of datasets delivered by Apache Airflow. To do so, we leverage the Airflow OpenLineage integration to get telemetry and lineage out of Airflow regardless of where it's running. We've created a processor framework that runs custom business logic depending on the event type and metadata included. In this talk, we'll dig under the covers and show our processor framework that allows us to easily extend the Observe platform into new use cases.

How OpenLineage fits into the Datadog ecosystem: This talk explores how Datadog integrates with OpenLineage to provide observability solutions for data pipelines. The first part is a technical walkthrough of the architecture of how Datadog ingests, processes, and stores OpenLineage events. The second part covers a brief overview of Datadog's data observability capabilities, how Datadog sees OpenLineage providing value to users, and a demo of Airflow and dbt monitoring built using OpenLineage events.

Thank you to our sponsors: Datadog, LFAI & Data Foundation

Enter from 620 8th Ave, check in at the left desk and take the elevators up to the 45th floor.
Event: NYC OpenLineage meetup at Datadog
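The run/job/dataset model described above can be made concrete with a minimal lineage event. This is a sketch only: the field names follow the shape of the OpenLineage spec (event type, event time, run, job, inputs/outputs, and a facets map for extensions), but the namespace, job name, dataset names, and producer URL are invented for illustration.

```python
import json
from datetime import datetime, timezone

# A minimal OpenLineage-style RunEvent: a COMPLETE event for one job run.
# Names, namespaces, and the producer URL are illustrative placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime(2024, 11, 19, 23, 0, tzinfo=timezone.utc).isoformat(),
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "my-pipeline", "name": "daily_orders_load"},
    "inputs": [{"namespace": "postgres://warehouse", "name": "public.orders"}],
    "outputs": [
        {
            "namespace": "s3://lake",
            "name": "curated/orders",
            # Facets extend the core model with richer, tool-specific metadata.
            "facets": {"schema": {"fields": [{"name": "order_id", "type": "BIGINT"}]}},
        }
    ],
    "producer": "https://example.com/my-producer",
}

payload = json.dumps(event)  # what a client would POST to a lineage backend
```

Because the core entities are identified by consistent namespace/name pairs, events emitted by different tools for the same dataset can be stitched into one lineage graph.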
Cross-Platform Data Lineage with OpenLineage
2023-07-28
Willy Lulciuc, Julien Le Dem – Chief Architect @ Astronomer
There are more data tools available than ever before, and it is easier to build a pipeline than it has ever been. These tools and advancements have created an explosion of innovation, and data within today's organizations has become so distributed that it can't be contained within a single brain, a single team, or a single platform. Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark™, Flink®, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time.

In this session, we will show how to trace data lineage across Apache Spark and Apache Airflow. There will be a walk-through of the OpenLineage architecture and a live demo of a running pipeline with real-time data lineage.

Talk by: Julien Le Dem, Willy Lulciuc
Event: Databricks DATA + AI Summit 2023
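As a sketch of what wiring OpenLineage into Spark looks like in practice, the integration is enabled through Spark configuration that registers a listener and points it at a lineage backend. The listener class and key names below follow the OpenLineage Spark integration's documented settings, but exact keys can vary by version, and the endpoint URL and namespace here are placeholders.

```python
# Illustrative Spark conf for the OpenLineage Spark integration.
# Key names follow the OpenLineage docs; URL and namespace are placeholders.
openlineage_conf = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "http://localhost:5000",  # e.g. a Marquez API
    "spark.openlineage.namespace": "spark-jobs",
}

# In a PySpark session these would be applied when building the session, e.g.:
#   builder = SparkSession.builder.appName("lineage-demo")
#   for key, value in openlineage_conf.items():
#       builder = builder.config(key, value)

def as_submit_args(conf: dict) -> list:
    """Render the conf as spark-submit --conf arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]
```

With the listener registered, each Spark action emits run events without changes to job code, which is what lets lineage span Spark and Airflow in the same graph.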
Ten years of building open source standards: From Parquet to Arrow to OpenLineage | Astronomer
2023-05-11 · 18:59
Julien Le Dem – Chief Architect @ Astronomer
ABOUT THE TALK: Over the last decade, Julien Le Dem has been lucky enough to contribute a few successful open source projects to the data ecosystem. In this talk he shares the story of those contributions and what made their success possible, from the ideation process and early growth of the Apache Parquet columnar format to how it led to the creation of its in-memory alter-ego, Apache Arrow. Julien ends by showing how this experience enabled the success of OpenLineage, an LFAI & Data project that brings observability to the data ecosystem.

ABOUT THE SPEAKER: Julien Le Dem is the Chief Architect of Astronomer and Co-Founder of Datakin. He co-created Apache Parquet and is involved in several open source projects including OpenLineage, Marquez (LFAI & Data), Apache Arrow, Apache Iceberg, and a few others. Previously, he was a senior principal at WeWork, principal architect at Dremio, tech lead for Twitter's data processing tools, and principal engineer working on content platforms at Yahoo, where he received his Hadoop initiation.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.
Event: Data Council 2023
Julien Le Dem: Why Data Lineage Matters
2021-11-04 · 07:00
Julien has a unique history of building open frameworks that make data platforms interoperable. He's contributed in various ways to Apache Arrow, Apache Iceberg, Apache Parquet, and Marquez, and is currently leading OpenLineage, an open framework for data lineage collection and analysis. In this episode, Tristan & Julia dive into how open source projects grow to become standards, and why data lineage in particular is in need of an open standard. They also cover some of the compelling use cases for this data lineage metadata, and where you might be able to deploy it in your work.

For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.
Event: The Analytics Engineering Podcast
97 Things Every Data Engineer Should Know
2021-06-14
Tobias Macey – author
Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges. Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.

Topics include:
- The Importance of Data Lineage - Julien Le Dem
- Data Security for Data Engineers - Katharine Jarmul
- The Two Types of Data Engineering and Data Engineers - Jesse Anderson
- Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
- The End of ETL as We Know It - Paul Singman
- Building a Career as a Data Engineer - Vijay Kiran
- Modern Metadata for the Modern Data Stack - Prukalpa Sankar
- Your Data Tests Failed! Now What? - Sam Bail
Event: O'Reilly Data Engineering Books
Unlocking The Power of Data Lineage In Your Platform with OpenLineage
2021-05-18 · 14:00
Julien Le Dem – creator of Parquet, Tobias Macey – host
Summary: Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems, Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in co-operating on open standards.

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show!

RudderStack's smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you're flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn't disrupt legacy systems. High-growth startups use Molecula's feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you're a Data Engineering Podcast listener, and they'll send you a free t-shirt.

Your host is Tobias Macey and today I'm interviewing Julien Le Dem about OpenLineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools.

Interview:
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the OpenLineage project is and the story behind it?
- What is the current state of t
Solving Data Lineage Tracking And Data Discovery At WeWork
2019-12-16 · 13:00
Summary: Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it's not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you've got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they've got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show!

You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse?
Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web-based transformation tool with built-in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built-in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities, it's everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I'm interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem's metadata.

Interview:
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Marquez is?
- What was missing in existing metadata management platforms that necessitated the creation of Marquez?
- How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs? How does it compare to the Amundsen platform that Lyft recently released?
- What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
- What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
- What are the primary resource types that you support in Marquez?
- What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
- Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
- Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez? What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
- How is the metadata itself stored and managed in Marquez? How much up-front data modeling is necessary and what types of schema representations are supported?
- Can you talk through the overall workflow of someone using Marquez in their environment? What is involved in registering and updating datasets?
- How do you define and track the health of a given dataset?
- What are some of the interesting questions that can be answered from the information stored in Marquez?
- What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
- For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
- What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
- When is Marquez the wrong choice for a metadata repository?
- What do you have planned for the future of Marquez?
Contact Info: Julien Le Dem – @J_ on Twitter, Email, julienledem on GitHub; Willy – LinkedIn, @wslulciuc on Twitter, wslulciuc on GitHub

Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements: Thank you for listening! Don't forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links: Marquez DataEngConf Presentation WeWork Canary Yahoo Dremio Hadoop Pig Parquet Podcast Episode Airflow Apache Atlas Amundsen Podcast Episode Uber DataBook LinkedIn DataHub Iceberg Table Format Podcast Episode Delta Lake Podcast Episode Great Expectations data pipeline unit testing framework Podcast.init Episode Redshift SnowflakeDB Podcast Episode Apache Kafka Schema Registry Podcast Episode Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster Podcast Episode Luigi DBT Podcast Episode Thrift Protocol Buffers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
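The interview's question about registering and updating datasets can be sketched against Marquez's REST API. This is a sketch under assumptions: it builds (but does not send) a Marquez-style PUT request to `/api/v1/namespaces/{namespace}/datasets/{dataset}`; the endpoint path, body fields, server URL, and all names are illustrative and may differ across Marquez versions.

```python
import json
import urllib.request

# Hypothetical Marquez-style dataset registration: construct a PUT request
# describing a dataset in a namespace. Namespace, dataset, source, and
# field names are illustrative placeholders.
namespace, dataset = "food_delivery", "public.orders"
body = {
    "type": "DB_TABLE",
    "physicalName": "public.orders",
    "sourceName": "analytics_db",
    "fields": [
        {"name": "order_id", "type": "INTEGER"},
        {"name": "placed_on", "type": "TIMESTAMP"},
    ],
    "description": "All customer orders, updated hourly.",
}
request = urllib.request.Request(
    url=f"http://localhost:5000/api/v1/namespaces/{namespace}/datasets/{dataset}",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(request) would send it to a running Marquez server.
```

Because registration is an idempotent PUT keyed on namespace and name, re-running a pipeline simply updates the same dataset record rather than creating duplicates.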
Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8
2017-11-22 · 14:00
Summary: With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.

Preamble: Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure.

When you're ready to launch your next project you'll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.

Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers.

This is your host Tobias Macey and today I'm interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Interview:
- Introduction
- How did you first get involved in the area of data management?
- What are the main serialization formats used for data storage and analysis?
- What are the tradeoffs that are offered by the different formats?
- How have the different storage and analysis tools influenced the types of storage formats that are available?
- You've each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?
- Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?
- What are the switching costs involved in moving from one format to another after you have started using it in a production system?
- What are some of the new or upcoming formats that you are each excited about?
- How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?

Contact Information: Doug – cutting on GitHub, Blog, @cutting on Twitter; Julien – Email, @J_ on Twitter, Blog, julienledem on GitHub

Links: Apache Avro Apache Parquet Apache Arrow Hadoop Apache Pig Xerox Parc Excite Nutch Vertica Dremel White Paper Twitter Blog on Release of Parquet CSV XML Hive Impala Presto Spark SQL Brotli ZStandard Apache Drill Trevni Apache Calcite

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
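The row-versus-columnar distinction that separates Avro (record-oriented) from Parquet (column-oriented) can be illustrated with a toy sketch in plain Python: the same records laid out row-wise and column-wise, where the columnar layout lets a scan touch only the one column it needs. Real formats add encoding, compression, and on-disk page layout on top of this basic idea; the data here is invented.

```python
# Toy illustration of row-oriented vs column-oriented storage.
# Avro-style: one whole record at a time; Parquet-style: one column at a time.
records = [
    {"user": "ada", "clicks": 10, "country": "UK"},
    {"user": "bob", "clicks": 3, "country": "US"},
    {"user": "eve", "clicks": 7, "country": "US"},
]

# Row layout: reading any single field still walks whole records.
row_store = list(records)

# Columnar layout: each field's values are stored contiguously.
col_store = {key: [record[key] for record in records] for key in records[0]}

# A scan of total clicks touches every record in the row store...
total_row = sum(record["clicks"] for record in row_store)
# ...but only one list in the column store (column pruning), and contiguous
# same-typed values also compress and encode far better.
total_col = sum(col_store["clicks"])
```

This is why analytical queries that read a few columns of wide tables favor Parquet-style layouts, while record-at-a-time pipelines favor Avro-style layouts.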