talk-data.com

Topic

GitHub

version_control collaboration code_hosting

661 tagged

Activity Trend: peak of 79 per quarter, 2020-Q1 to 2026-Q1

Activities

661 activities · Newest first

How to do real TDD in data science? A journey from pandas to polars with pelage!

In the world of data, inconsistencies and inaccuracies often make it hard to extract valuable insights, yet the number of robust tools and practices to address them remains limited. In particular, TDD remains difficult in data science even though it is standard in classic software development, partly because of poorly adapted tools and frameworks.

To address this issue, we released Pelage, an open-source Python package that facilitates data exploration and testing and relies on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformation, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence in your data transformations.

See website: https://alixtc.github.io/pelage/

PyPI in the face: running jokes that PyPI download stats can play on you

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, such as:
- How do we increase user awareness of best practices (please use Pipeline and cross-validation)?
- How do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting; TunedThresholdClassifier; PCA and a few other models can run on GPU)?
- Do users care more about new features from recent releases or consolidation of what already exists?
- How long should we support older versions of Python, NumPy, or SciPy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard; grasping the reality behind these metrics is often tricky.

ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, we have been developing ActiveTigger since 2023: a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad audience both within and outside the social sciences. Already used by a dynamic community in the social sciences, it has a stable version planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow: project creation, embeddings computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained models (BERT-like), prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible through both a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

Project repository: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Optimal Transport in Python: A Practical Introduction with POT

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
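To make the idea of an OT problem concrete, here is a minimal sketch of the simplest case: two one-dimensional samples of equal size, equal weights, and cost |x - y|. In that special case the optimal plan matches the sorted values in order, so no solver is needed. This uses only the standard library and is not POT's API; the function name `wasserstein_1d` is a hypothetical helper chosen for illustration.

```python
def wasserstein_1d(xs, ys):
    """Average transport cost between two equal-size 1D samples.

    Assumes uniform weights and cost |x - y|. Under those assumptions
    the optimal matching pairs the i-th smallest x with the i-th
    smallest y.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


# Shifting every point by 5 costs 5 per unit of mass.
shifted = wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0])

# Reordering a sample does not change its distribution, so the cost is 0.
same = wasserstein_1d([1.0, 2.0, 3.0], [3.0, 1.0, 2.0])
```

General OT problems (unequal weights, higher dimensions, other costs) need a proper solver, which is the role a library such as POT plays.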

From Jupyter Notebook to Publish-Ready Report: Effortless Sharing with Quarto

See how Quarto can transform your Jupyter notebooks into stakeholder-ready web pages or PDFs, published online with just one command. This session features practical demonstrations of publishing with quarto publish, applying custom styles tailored to your organization thanks to brand.yml, and leveraging new features for reproducible research.

Designed for anyone looking to share their work, this talk requires only basic Python and notebook familiarity. You’ll walk away with the skills to elevate your reporting workflow and share insights professionally.

In this episode, we talk with Micheal Lanham, an AI and software innovator with over two decades of experience spanning game development, fintech, oil and gas, and agricultural tech. Micheal shares his journey from building neural network-based games and evolutionary algorithms to writing influential books on AI agents and deep learning. He offers insights into the evolving AI landscape, practical uses of AI agents, and the future of generative AI in gaming and beyond.

TIMECODES
00:00 Micheal Lanham’s career journey and AI agent books
05:45 Publishing journey: AR, Pokémon Go, sound design, and reinforcement learning
10:00 Evolution of AI: evolutionary algorithms, deep learning, and agents
13:33 Evolutionary algorithms in prompt engineering and LLMs
18:13 AI agent books second edition and practical applications
20:57 AI agent workflows: minimalism, task breakdown, and collaboration
26:25 Collaboration and orchestration among AI agents
31:24 Tools and reasoning servers for agent communication
35:17 AI agents in game development and generative AI impact
38:57 Future of generative AI in gaming and immersive content
41:42 Coding agents, new LLMs, and local deployment
45:40 AI model trends and data scientist career advice
53:36 Cognitive testing, evaluation, and monitoring in AI
58:50 Publishing details and closing remarks

Connect with Micheal:
LinkedIn - https://www.linkedin.com/in/micheal-lanham-189693123/

Connect with DataTalks.Club:
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
GitHub: https://github.com/DataTalksClub
LinkedIn - / datatalks-club
Twitter - / datatalksclub
Website - https://datatalks.club/

In this episode, we talk with Daniel, an astrophysicist turned machine learning engineer and AI ambassador. Daniel shares his journey bridging astronomy and data science, how he leveraged live courses and public knowledge sharing to grow his skills, and his experiences working on cutting-edge radio astronomy projects and AI deployments. He also discusses practical advice for beginners in data and astronomy, and insights on career growth through community and continuous learning.

TIMECODES
00:00 Lunar eclipse story and Daniel’s astronomy career
04:12 Electromagnetic spectrum and MeerKAT data explained
10:39 Data analysis and positional cross-correlation challenges
15:25 Physics behind radio star detection and observation limits
16:35 Radio astronomy’s advantage and machine learning potential
20:37 Radio astronomy progress and Daniel’s ML journey
26:00 Python tools and experience with ZoomCamps
31:26 Intel internship and exploring LLMs
41:04 Sharing progress and course projects with orchestration tools
44:49 Setting up Airflow 3.0 and building data pipelines
47:39 AI startups, training resources, and NVIDIA courses
50:20 Student access to education, NVIDIA experience, and beginner astronomy programs
57:59 Skills, projects, and career advice for beginners
59:19 Starting with data science or engineering
1:00:07 Course sponsorship, data tools, and learning resources

Connect with Daniel:
LinkedIn - / egbodaniel

Connect with DataTalks.Club:
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
GitHub: https://github.com/DataTalksClub
LinkedIn - / datatalks-club
Twitter - / datatalksclub
Website - https://datatalks.club/

In this episode, Conor and Bryce chat with Sean Parent about AI and Cursor!

Link to Episode 253 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)

Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter

About the Guest: Sean Parent is a senior principal scientist and software architect managing Adobe's Software Technology Lab. Sean first joined Adobe in 1993 working on Photoshop and is one of the creators of Photoshop Mobile, Lightroom Mobile, and Lightroom Web. In 2009 Sean spent a year at Google working on Chrome OS before returning to Adobe. From 1988 through 1993 Sean worked at Apple, where he was part of the system software team that developed the technologies allowing Apple’s successful transition to PowerPC.

Show Notes
Date Recorded: 2025-08-21
Date Released: 2025-09-26
C++ Under the Sea
Better code
Adobe ASL Adam & Eve Architecture
Adobe Software Technology Lab
ASL Libraries
Rust Programming Language

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Model Context Protocol: Principles and Practice

Large‑language‑model agents are only as useful as the context and tools they can reach.

Anthropic’s Model Context Protocol (MCP) proposes a universal, bidirectional interface that turns every external system—SQL databases, Slack, Git, web browsers, even your local file‑system—into first‑class “context providers.”

In just 30 minutes we’ll step from high‑level buzzwords to hands‑on engineering details:

  • How MCP’s JSON‑RPC message format, streaming channels, and version‑negotiation work under the hood.
  • Why per‑tool sandboxing via isolated client processes hardens security (and what happens when an LLM tries rm -rf /).
  • Techniques for hierarchical context retrieval that stretch a model’s effective window beyond token limits.
  • Real‑world patterns for accessing multiple tools—Postgres, Slack, GitHub—and plugging MCP into GenAI applications.
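As a rough illustration of the JSON‑RPC framing mentioned in the list above, the sketch below builds a JSON‑RPC 2.0 request and unpacks a response using only the standard library. The `tools/call` method with `name` and `arguments` parameters follows the MCP specification as I understand it; the tool name `query_database`, its arguments, and the helper names are hypothetical, and transport, streaming, and version negotiation are left out entirely.

```python
import json
from itertools import count

_ids = count(1)  # each request in a session gets a fresh id


def make_request(method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": method,
        "params": params,
    }
    return json.dumps(message)


def parse_response(raw: str) -> dict:
    """Return the result of a JSON-RPC 2.0 response, raising on an error object."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        error = message["error"]
        raise RuntimeError(f"{error['code']}: {error['message']}")
    return message["result"]


# Hypothetical tool name and arguments, for illustration only.
call = make_request(
    "tools/call",
    {"name": "query_database", "arguments": {"sql": "SELECT 1"}},
)
```

A real client would send `call` over one of the protocol's transports and match the response to the request by `id`; an official SDK handles that plumbing.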

Expect code snippets and lessons from early adoption.

You’ll leave ready to wire your own services into any MCP‑aware model and level‑up your GenAI applications—without the N×M integration nightmare.

This talk dives into the challenge of measuring the causal impact of app installs on customer loyalty and value, a question at the heart of data-driven marketing. While randomized controlled trials are the gold standard, they’re rarely feasible in this context. Instead, we’ll explore how observational causal inference methods can be thoughtfully applied to estimate incremental value with careful consideration of confounding, selection, and measurement biases. This session is designed for data scientists, marketing analysts, and applied researchers with a working knowledge of statistics and causal inference concepts. We’ll keep the tone practical and informative, focusing on real-world challenges and solutions rather than heavy mathematical derivations.

Attendees will learn:
* How to design robust observational studies for business impact
* Strategies for covariate selection and bias mitigation
* The use of multiple statistical and design-based causal inference approaches
* Methods for validating and refuting causal claims in the absence of true randomization

We’ll share actionable insights, code snippets, and a GitHub repository with example workflows so you can apply these techniques in your own organization. By the end of the talk, you’ll be equipped to design more transparent and credible causal studies and make better decisions about where to invest your marketing dollars.

Requirements:
A basic understanding of causal inference and Python is recommended. Materials and relevant links will be shared during the session.
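To illustrate why confounding matters when estimating the effect of an app install from observational data, here is a minimal sketch on fabricated toy rows. It assumes a single observed confounder (a customer segment) and compares a naive difference in means with a stratified estimate. The data, the column meanings, and the function names are all hypothetical; this is not the speakers' method or their repository code.

```python
from collections import defaultdict
from statistics import mean

# (segment, installed_app, spend): fabricated toy rows.
# Heavy users both install more often and spend more, which confounds
# the naive comparison.
rows = (
    [("heavy", True, 120.0)] * 3
    + [("heavy", False, 100.0)]
    + [("light", True, 40.0)]
    + [("light", False, 20.0)] * 3
)


def naive_difference(rows):
    """Difference in mean spend, ignoring the segment."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def stratified_difference(rows):
    """Within-segment differences, weighted by segment size."""
    strata = defaultdict(lambda: {True: [], False: []})
    for segment, t, y in rows:
        strata[segment][t].append(y)
    total = len(rows)
    estimate = 0.0
    for groups in strata.values():
        if not groups[True] or not groups[False]:
            raise ValueError("a stratum lacks treated or control rows")
        size = len(groups[True]) + len(groups[False])
        effect = mean(groups[True]) - mean(groups[False])
        estimate += effect * size / total
    return estimate


naive = naive_difference(rows)
adjusted = stratified_difference(rows)
```

On this toy data the naive estimate is three times the stratified one, because installs are concentrated among heavy users. Stratification only removes bias from confounders that are observed and included, which is why covariate selection and refutation checks matter.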

This talk presents a technical case study of applying agentic AI systems to automate community operations at PyCon DE & PyData, treated as an open-source testbed. The key lesson is simple: AI only works when put on a leash. Reliable results required good architecture, a clear plan, and structured data models — from YAML and Pydantic schemas to reproducible pipelines with GitHub Actions. With that foundation, LLM agents supported logistics, FAQs, video processing, and scheduling; without it, they failed. By contrasting successes and failure modes across different coding agents, the talk demonstrates that robust design, validation, and controlled context are prerequisites for making agentic AI usable in real-world workflows.

Brought to You By:
• Statsig: The unified platform for flags, analytics, experiments, and more. Statsig built a complete set of data tools that allow engineering teams to measure the impact of their work. This toolkit is so valuable to so many teams that OpenAI, which was a huge user of Statsig, decided to acquire the company; the news was announced last week. Talk about validation! Check out Statsig.
• Linear: The system for modern product development. Here’s an interesting story: OpenAI switched to Linear as a way to establish a shared vocabulary between teams. Every project now follows the same lifecycle, uses the same labels, and moves through the same states. Try Linear for yourself.

What does it take to do well at a hyper-growth company? In this episode of The Pragmatic Engineer, I sit down with Charles-Axel Dein, one of the first engineers at Uber, who later hired me there. Since then, he’s gone on to work at CloudKitchens. He’s also been maintaining the popular Professional programming reading list GitHub repo for 15 years, where he collects articles that made him a better programmer.

In our conversation, we dig into what it’s really like to work inside companies that grow rapidly in scale and headcount. Charles shares what he’s learned about personal productivity, project management, incidents, interviewing, plus how to build flexible skills that hold up in fast-moving environments.

Jump to interesting parts:
• 10:41 – the reality of working inside a hyperscale company
• 41:10 – the traits of high-performing engineers
• 1:03:31 – Charles’ advice for getting hired in today’s job market

We also discuss:
• How to spot the signs of hypergrowth (and when it’s slowing down)
• What sets high-performing engineers apart beyond shipping
• Charles’s personal productivity tips, favorite reads, and how he uses reading to uplevel his skills
• Strategic tips for building your resume and interviewing
• How imposter syndrome is normal, and how leaning into it helps you grow
• And much more!

If you’re at a fast-growing company, considering joining one, or looking to land your next role, you won’t want to miss this practical advice on hiring, interviewing, productivity, leadership, and career growth.

Timestamps
(00:00) Intro
(04:04) Early days at Uber as engineer #20
(08:12) CloudKitchens’ similarities with Uber
(10:41) The reality of working at a hyperscale company
(19:05) Tenancies and how Uber deployed new features
(22:14) How CloudKitchens handles incidents
(26:57) Hiring during fast-growth
(34:09) Avoiding burnout
(38:55) The popular Professional programming reading list repo
(41:10) The traits of high-performing engineers
(53:22) Project management tactics
(1:03:31) How to get hired as a software engineer
(1:12:26) How AI is changing hiring
(1:19:26) Unexpected ways to thrive in fast-paced environments
(1:20:45) Dealing with imposter syndrome
(1:22:48) Book recommendations
(1:27:26) The problem with survival bias
(1:32:44) AI’s impact on software development
(1:42:28) Rapid fire round

The Pragmatic Engineer deepdives relevant for this episode:
• Software engineers leading projects
• The Platform and Program split at Uber
• Inside Uber’s move to the Cloud
• How Uber built its observability platform
• From Software Engineer to AI Engineer – with Janvi Kalra

Production and marketing by https://penname.co/.
For inquiries about sponsoring the podcast, email [email protected].

Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe

As AI systems evolve, the need for robust infrastructure increases. Enter Dapr Agents: an open-source framework for creating production-grade AI agent systems. Built on top of the Dapr framework, Dapr Agents empowers developers to build intelligent agents capable of collaborating in complex workflows, leveraging Large Language Models (LLMs), durable state, built-in observability, and resilient execution patterns. This workshop will walk through the framework’s core components and demonstrate, through practical examples, how it solves real-world challenges.

In this episode, Conor and Bryce chat with Sean Parent about Rust and AI!

Link to Episode 252 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)

Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter

About the Guest: Sean Parent is a senior principal scientist and software architect managing Adobe's Software Technology Lab. Sean first joined Adobe in 1993 working on Photoshop and is one of the creators of Photoshop Mobile, Lightroom Mobile, and Lightroom Web. In 2009 Sean spent a year at Google working on Chrome OS before returning to Adobe. From 1988 through 1993 Sean worked at Apple, where he was part of the system software team that developed the technologies allowing Apple’s successful transition to PowerPC.

Show Notes
Date Recorded: 2025-08-21
Date Released: 2025-09-19
C++ Under the Sea
Better code
Adobe ASL Adam & Eve Architecture
Adobe Software Technology Lab
ASL Libraries
Rust Programming Language

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

In this episode, Conor and Bryce interview Sean Parent about his upcoming keynote at C++ Under the Sea!

Link to Episode 251 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)

Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter

About the Guest: Sean Parent is a senior principal scientist and software architect managing Adobe's Software Technology Lab. Sean first joined Adobe in 1993 working on Photoshop and is one of the creators of Photoshop Mobile, Lightroom Mobile, and Lightroom Web. In 2009 Sean spent a year at Google working on Chrome OS before returning to Adobe. From 1988 through 1993 Sean worked at Apple, where he was part of the system software team that developed the technologies allowing Apple’s successful transition to PowerPC.

Show Notes
Date Recorded: 2025-08-21
Date Released: 2025-09-12
C++ Under the Sea
Are We There Yet? - The Future of C++ Software Development - Sean Parent - C++Now 2025
A Possible Future of Software Development - Sean Parent - Google Tech Talk 2008
Sean Parent Zurich C++ Meetup
careers.adobe.com

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

In this episode, Conor and Bryce interview Sean Parent about his thoughts on AI, its impact on the software industry and society, and more!

Link to Episode 250 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)

Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter

About the Guest: Sean Parent is a senior principal scientist and software architect managing Adobe's Software Technology Lab. Sean first joined Adobe in 1993 working on Photoshop and is one of the creators of Photoshop Mobile, Lightroom Mobile, and Lightroom Web. In 2009 Sean spent a year at Google working on Chrome OS before returning to Adobe. From 1988 through 1993 Sean worked at Apple, where he was part of the system software team that developed the technologies allowing Apple’s successful transition to PowerPC.

Show Notes
Date Recorded: 2025-08-21
Date Released: 2025-09-05
Snowcrash by Neal Stephenson
Tech Layoffs
lume
Wall-E
Altered Carbon
Terminator

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Docling: Get your documents ready for gen AI

Docling, an open source package, is rapidly becoming the de facto standard for document parsing and export in the Python community. Having earned close to 30,000 GitHub stars in less than one year, and now part of the LF AI & Data Foundation, Docling is redefining document AI with its ease and speed of use. In this session, we’ll introduce Docling and its features, including its use with various generative AI frameworks and protocols (e.g. MCP).

AI, data, and numbers, without uploads. Hash, mask, and redact PII, then run data analytics locally to save time and protect privacy. In this episode, we build a No-Upload AI Analyst that keeps your PII safe: HMAC SHA-256 hashing, masking, and redaction using policy presets and client-side transforms.

We’ll:
• Reframe the problem (insights > risk)
• Set four hard constraints (no uploads, local preferred, policy presets, human-readable audit)
• Use rules-first privacy + schema semantics
• Walk the 5-step workflow (paste headers → pick preset → set secret → transform → analyze)
• Show real-world cases (HIPAA/HITECH-aware analytics, FERPA contexts, product analytics)
• Share a checklist + quiz + local Streamlit approach

Perfect for data teams in healthcare, finance, education, and privacy-sensitive orgs.

Key takeaways
• Stop uploading customer data. Transform it client-side first.
• Use HMAC hashing to keep joins without exposing raw emails/IDs.
• Mask for human-readable UI; redact when you don’t need the field.
• Ship a data-handling report with every analysis.
• Run the app locally for maximum privacy.

Affiliate note: I record with Riverside (affiliate) and host on RSS.com (affiliate). Links in show notes.

Links
Blog version (free): https://mukundansankar.substack.com/p/the-no-upload-ai-analyst-v4-secure
Join the Discussion (comments hub): https://mukundansankar.substack.com/notes

Tools I use for my Podcast and Affiliate Partners
Recording Partner: Riverside → Sign up here (affiliate)
Host Your Podcast: RSS.com (affiliate)
Research Tools: Sider.ai (affiliate)
Sourcetable AI: Join Here (affiliate)

🔗 Connect with Me:
Free Email Newsletter
Website: Data & AI with Mukundan
GitHub: https://github.com/mukund14
Twitter/X: @sankarmukund475
LinkedIn: Mukundan Sankar
YouTube: Subscribe
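The hash, mask, and redact transforms described in the No-Upload AI Analyst episode above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the episode's actual app: the policy format, the column names, the normalization step, and the helper names are all hypothetical, and the secret shown is a placeholder that a real deployment would load from local configuration.

```python
import hashlib
import hmac


def hash_value(value: str, secret: bytes) -> str:
    """Keyed HMAC SHA-256 token: equal inputs give equal tokens, so joins still work."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain for a human-readable display."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def apply_policy(row: dict, policy: dict, secret: bytes) -> dict:
    """Apply 'hash', 'mask', or 'redact' per column; other columns pass through."""
    out = {}
    for column, value in row.items():
        action = policy.get(column, "keep")
        if action == "hash":
            out[column] = hash_value(value, secret)
        elif action == "mask":
            out[column] = mask_email(value)
        elif action == "redact":
            continue  # drop the field entirely
        else:
            out[column] = value
    return out


# Hypothetical preset and row, for illustration only.
secret = b"replace-with-a-locally-stored-secret"
policy = {"user_id": "hash", "email": "mask", "ssn": "redact"}
row = {"user_id": "U-1001", "email": "Ada@Example.com", "ssn": "123-45-6789", "plan": "pro"}
safe = apply_policy(row, policy, secret)
```

Using a keyed HMAC rather than a plain SHA-256 digest means someone without the secret cannot rebuild tokens by hashing guessed emails or IDs, while tables transformed with the same secret can still be joined on the token.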