talk-data.com
Activities & events
| Title & Speakers | Event |
|---|---|
|
WEBINAR "Differentially-Private Synthetic Data for Everyone"
2025-06-04 · 16:00
Pre-registration is REQUIRED. Add to your calendar: https://hubs.li/Q03lF-X-0

In this hands-on session, you'll learn how to generate high-quality synthetic data that preserves privacy using differential privacy techniques. We'll walk through how to train differentially private generative models with MOSTLY AI's open-source Synthetic Data SDK and explore how this method compares with traditional anonymization approaches in terms of both utility and risk. You'll gain practical insights into configuring privacy parameters, understanding the impact of privacy budgets, and evaluating synthetic data output. We'll also cover how to assess the fidelity of synthetic datasets using predictive and discriminative machine learning models, and how to create hybrid datasets that blend real and synthetic data for improved utility. Through live demonstrations and real-world examples, you'll develop a strong understanding of the privacy-utility trade-offs and how to confidently apply privacy-safe synthetic data in your own data science workflows.

Session Outline:
Lesson 1: Introduction to Differential Privacy. Get familiar with the core concepts of differential privacy and how it differs from traditional anonymization techniques. By the end of this lesson, you'll be able to explain what differential privacy is, what a privacy budget (epsilon) means, and why it provides stronger privacy guarantees than pseudonymization or masking.
Lesson 2: Setting Up and Using the Synthetic Data SDK. Learn how to install and configure MOSTLY AI's open-source Synthetic Data SDK to generate synthetic datasets with differential privacy enabled. You'll run the SDK in LOCAL mode using a prepared dataset, explore the configuration options for privacy settings, and review the structure of the synthetic output.
Lesson 3: Evaluating Utility vs. Privacy Trade-offs. Compare synthetic datasets generated with different privacy settings to understand how utility is affected by stricter privacy budgets. By the end of this lesson, you'll be able to evaluate the usefulness of differentially private synthetic data using predictive models and summary statistics.
Lesson 4: Creating Hybrid Datasets with Real and Synthetic Data. Explore how to combine real and synthetic data to create hybrid datasets that retain utility while improving privacy. You'll walk through a practical example and learn how to use synthetic data to augment or replace sensitive parts of your dataset.

Difficulty: Intermediate
Pre-reqs: This tutorial is designed for data engineers, data scientists, ML engineers, and analysts with basic Python skills and familiarity with working in Jupyter Notebooks. Attendees should have a general understanding of machine learning workflows and of working with tabular datasets (e.g., CSV files or pandas DataFrames). No prior experience with synthetic data is required. To participate fully in the hands-on exercises, attendees should have the following installed before the session: Python 3.11+ and Git. Installation and setup of the Synthetic Data SDK will be covered as part of the tutorial, but feel free to get started beforehand by visiting https://github.com/mostly-ai/mostlyai.

Speaker: Dr. Michael Platzer, Co-Founder and CTO of MOSTLY AI
Dr. Michael Platzer is co-founder and CTO of MOSTLY AI, a leader in privacy-safe synthetic data generation. He earned his degrees in mathematics and in business with distinction and led consumer analytics teams at global technology leaders before starting his venture to pioneer the field of synthetic data. His company's mission is to democratize data access and data insights in a safe and responsible way for everyone.
ODSC Links:
• Get free access to more talks/trainings like this at the Ai+ Training platform: https://hubs.li/H0Zycsf0
• ODSC blog: https://opendatascience.com/
• Facebook: https://www.facebook.com/OPENDATASCI
• Twitter: https://twitter.com/_ODSC & @odsc
• LinkedIn: https://www.linkedin.com/company/open-data-science
• Slack channel: https://hubs.li/Q038cQBy0
• Code of conduct: https://odsc.com/code-of-conduct/ |
WEBINAR "Differentially-Private Synthetic Data for Everyone"
|
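To make the privacy-budget idea from Lesson 1 concrete, here is a minimal sketch of the classic Laplace mechanism for releasing a count under epsilon-differential privacy. This is a conceptual illustration only; it is not the MOSTLY AI SDK, and the function names are ours:

```python
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponentials is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    scale = sensitivity / epsilon  # smaller epsilon -> larger noise scale
    return true_count + laplace_noise(scale)

# A stricter privacy budget (smaller epsilon) injects more noise, i.e. less
# utility: exactly the privacy-utility trade-off the session explores.
for eps in (10.0, 1.0, 0.1):
    print(f"epsilon={eps}: noisy count = {dp_count(1000, eps):.1f}")
```

Running the loop a few times makes the effect visible: at epsilon 10 the released count stays very close to 1000, while at epsilon 0.1 it can be off by tens.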
|
Introduction to Git for Open Source Science
2025-05-15 · 22:00
If you use Python as a researcher or for BI analysis, then you know Python is powerful. But if you want to take your work to the next level, learn software best practices with PyData NYC! You'll learn tools and concepts such as Git, version control, and open-source principles, and how to integrate them into your workflow. If you use Jupyter Notebooks and haven't used the CLI much, don't worry! We'll start with the github.com UI and end with some CLI. Join our speaker, Mars Lee, and learn how Git is used in open-source software! The event is hosted at 11 Times Square, Room: Imperial 5306. Thanks to our generous sponsors at Microsoft Reactor. |
Introduction to Git for Open Source Science
|
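The meetup above works from the github.com UI toward the command line. As a rough preview of the core CLI workflow it builds toward (init, stage, commit, inspect), here is a minimal sketch that drives Git from Python via subprocess; the file name and commit message are illustrative, and it assumes `git` is available on your PATH:

```python
import pathlib
import subprocess
import tempfile

def run(*args, cwd):
    """Run a git command in the given directory and return its stdout."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

# Create a throwaway repository and identify ourselves to git.
repo = pathlib.Path(tempfile.mkdtemp())
run("git", "init", cwd=repo)
run("git", "config", "user.email", "demo@example.com", cwd=repo)
run("git", "config", "user.name", "Demo", cwd=repo)

# The basic loop: edit a file, stage it, commit it.
(repo / "analysis.py").write_text("print('hello, PyData')\n")
run("git", "add", "analysis.py", cwd=repo)
run("git", "commit", "-m", "Add first analysis script", cwd=repo)

# Inspect history, the payoff of version control.
log = run("git", "log", "--oneline", cwd=repo)
print(log)
```

The same four steps (`git init`, `git add`, `git commit`, `git log`) are what the workshop demonstrates directly at the terminal.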
|
Version Your Data Lakehouse Like Your Software With Nessie
2024-03-10 · 15:45
Alex Merced – Developer Advocate @ Dremio
Tobias Macey – host
Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Join us at the top event for the global data community, Data Council Austin. From March 26th-28th, 2024, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering, and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular-priced and late-bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!

Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming O'Reilly book "Apache Iceberg: The Definitive Guide", about Nessie, a Git-like versioned catalog for data lakes using Apache Iceberg.

Interview: Introduction. How did you get involved in the area of data management? Can you describe what Nessie is and the story behind it? What are the core problems/complexities that Nessie is designed to solve? The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case? Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec? How do the versioning capabilities compare to/augment the data versioning in Iceberg? What are some of the sources of, and challenges in resolving, merge conflicts between table branches? Can you describe the architecture of Nessie? How have the design and goals of the project changed since it was first created? 
What is involved |
Data Engineering Podcast |
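The branch-and-merge semantics discussed in this episode can be pictured with a toy, in-memory model: a branch is a named reference to a set of table snapshots, a new branch copies another branch's state, and a merge publishes its snapshots back atomically. This is only a conceptual sketch of the git-like workflow, not the Nessie API; all class and table names are illustrative:

```python
from copy import deepcopy

class TinyCatalog:
    """A toy in-memory catalog with git-like branches over table snapshots."""

    def __init__(self):
        # branch name -> {table name: snapshot id}
        self.branches = {"main": {}}

    def create_branch(self, name, from_branch="main"):
        """Branch off an existing branch, inheriting its table state."""
        self.branches[name] = deepcopy(self.branches[from_branch])

    def commit(self, branch, table, snapshot):
        """Point a table on a branch at a new snapshot."""
        self.branches[branch][table] = snapshot

    def merge(self, source, target="main"):
        """Publish the source branch's tables into the target branch."""
        self.branches[target].update(self.branches[source])

cat = TinyCatalog()
cat.commit("main", "orders", "snapshot-1")
cat.create_branch("etl")                  # experiment in isolation
cat.commit("etl", "orders", "snapshot-2") # main is untouched meanwhile
print(cat.branches["main"]["orders"])     # still snapshot-1
cat.merge("etl")                          # publish the change atomically
print(cat.branches["main"]["orders"])     # now snapshot-2
```

The value of the pattern is the same as in software: readers on `main` never see half-finished ETL work, and the merge is the single point where changes become visible.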
|
Let's contribute to pandas - mentored workshop
2023-08-24 · 16:00
PyLadies Berlin are excited to bring you this open-source workshop dedicated to contributing to pandas. pandas is a data-wrangling library for Python widely adopted in the scientific computing community. In this session, you will be guided through making your own contributions to the project; no prior contributing experience is required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don't finish your contribution during the event, we hope you will continue to work on it after the workshop. pandas offers regular new-contributor meetings and has a Slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html

Facilitators:
Noa Tamir. Noa is a pandas core developer, member of the NumFOCUS Board of Directors, and PyLadies Berlin organizer. They are a Lead Data Science Coach at neuefische. BSc Physics, MSc Business and Economics, MRes Economics.
Patrick Hoefler. Patrick has been a member of the pandas core team since early 2021. He is currently working at Coiled as a Senior Software Engineer. He holds a master's degree in Mathematics and is currently studying towards a Software Engineering degree.

Sponsor: Spiced Academy - drinks and pizza

Requirements:
1. Bring your own laptop
2. Have a GitHub account: https://github.com
3. Have Git installed: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

Preparation (optional): For those who are keen on using the workshop to work on their contribution to pandas, you may want to start setting up your development environment in advance. This way, by the time you arrive you are ready to start picking issues and contributing. 
To get the most out of the session, it's encouraged (but not required) that you have a look at the contributing guide beforehand: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html, particularly the development environment instructions: https://pandas.pydata.org/docs/dev/development/contributing_environment.html

We also offer a development environment on Gitpod. It can take some minutes to load, but it provides an instant, fresh development environment for each new task directly from your browser, using VS Code. Please be aware that it can take longer to set up a development environment on a computer running Windows compared to macOS or Unix. We will guide you through the steps, and they are useful to learn for many open-source projects.

Gender policy: PyLadies aims to provide a friendly support network for women and a bridge to the larger Python world. Anyone with an interest in Python is encouraged to participate! By attending our event, you agree to the PyLadies Code of Conduct: http://www.pyladies.com/CodeOfConduct/ ❓ Can men attend ❓ Everyone is welcome :) If you identify as someone well-represented in open source and in tech, please be mindful of the space and privileges you have, and use them to support others.

Agenda:
Audience level: Everyone is welcome to attend this session! If you've never contributed to open-source software before, you will learn how to, and if you have experience contributing, you can either help mentor other attendees or work on more challenging contributions. It is useful to have some pandas, Git, and Python experience. If you don't have much experience with them, you might expect to spend time "learning by doing".

Contact: Interested in speaking at one of our events? Have a good idea for a Meetup? Get in touch with us at ([email protected])

Slack information: Slack: https://slackin.pyladies.com Channels: #city-berlin, #germany, #jobs-europe |
Let's contribute to pandas - mentored workshop
|
|
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
2018-01-29 · 03:00
Summary Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements: There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. 
Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future Interview Introduction How did you get involved in the area of data management? What is the Dat project and how did it get started? How have the grants to the Dat project influenced the focus and pace of development that was possible? Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project? Can you explain how the Dat protocol is designed and how it has evolved since it was first started? How does Dat manage conflict resolution and data versioning when replicating between multiple machines? One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions? One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made? 
How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via Dat, vs. the common three-tier architecture oriented around persistent databases? What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default? For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network? What have been the most challenging aspects of building and promoting Dat? What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of? Contact Info Dat datproject.org Email @dat_project on Twitter Dat Chat Danielle Email @daniellecrobins Joe Email @joeahand on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Dat Project Code For Science and Society Neuroscience Cell Biology OpenCon Mozilla Science Open Education Open Access Open Data Fortune 500 Data Warehouse Knight Foundation Alfred P. Sloan Foundation Gordon and Betty Moore Foundation Dat In The Lab Dat in the Lab blog posts California Digital Library IPFS Dat on Open Collective – COMING SOON! ScienceFair Stencila eLIFE Git BitTorrent Dat Whitepaper Merkle Tree Certificate Transparency Dat Protocol Working Group Dat Multiwriter Development – Hyperdb Beaker Browser WebRTC IndexedDB Rust C Keybase PGP Wire Zenodo Dryad Data Sharing Dataverse RSync FTP Globus Fritter Fritter Demo Rotonde how to Joe's website on Dat Dat Tutorial Data Rescue – NYTimes Coverage Data.gov Libraries+ Network UC Conservation Genomics Consortium Fair Data principles hypervision hypervision in browser The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA |
Data Engineering Podcast |
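Among the episode's links is the Merkle tree, the structure Dat's append-only logs rely on to verify replicated data. Here is a minimal sketch of the idea, assuming SHA-256 and duplicating the last node on odd-sized levels; real implementations (including Dat's) differ in the details:

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()

def merkle_root(chunks):
    """Fold a list of data chunks into a single Merkle root hash.

    Any change to any chunk changes the root, which lets a protocol
    verify a whole replicated dataset by comparing one small hash.
    """
    level = [h(c) for c in chunks]          # leaf hashes
    while len(level) > 1:
        if len(level) % 2:                  # pad odd levels by duplication
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) # hash adjacent pairs upward
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"chunk-a", b"chunk-b", b"chunk-c"])
tampered = merkle_root([b"chunk-a", b"chunk-B", b"chunk-c"])
print(root.hex())
print(root != tampered)  # a one-byte edit anywhere changes the root
```

The same property is why the interview pairs Merkle trees with certificate transparency: both use a compact root hash to make tampering with any part of a large log detectable.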