talk-data.com
People (159 results)
Activities & events
| Title & Speakers | Event |
|---|---|
|
Building a multimodal lakehouse for AI (w/ Chang She)
2025-11-23 · 14:56
Chang She
– CEO
@ LanceDB
,
Tristan Handy
– CEO
@ dbt Labs
In this episode, Tristan Handy sits down with Chang She, a co-creator of pandas and now CEO of LanceDB, to explore the convergence of analytics and AI engineering. The team at LanceDB is rebuilding the data lake from the ground up with AI as a first principle, starting with a new AI-native file format called Lance. Tristan traces Chang's journey from one of the original contributors to the pandas library to builder of a new infrastructure layer for AI-native data. Learn why vector databases alone aren't enough, why agents require a new architecture, and how LanceDB is building an AI lakehouse for the future. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs. |
The Analytics Engineering Podcast |
|
Keynote: Chang She - Never Send a Human to do an Agent's Search
2025-11-08 · 17:00
Chang She
– CEO / Co-founder
@ LanceDB
Keynote by Chang She |
PyData Seattle 2025 |
|
PyCon 2025 Special Event: Hometown Heroes Hatchery Program
2025-05-17 · 17:45
PyCon US 2025 is coming to Pittsburgh this May 14–22, and PyData Pittsburgh is thrilled to be part of it! We’re hosting the Hometown Heroes Hatchery track on Saturday, May 17, a half-day event inside the conference celebrating the incredible work of Python developers, researchers, educators, and technologists from across our city. As part of PyCon’s Hatchery initiative, this track will feature presentations and lightning talks that highlight the creativity and impact of Pittsburgh’s Python community. If you're attending PyCon US 2025, we invite the PyData Pittsburgh community to join us at the Hometown Heroes track: come connect, engage, and help showcase the strength of our local tech scene.
Please note: you must be registered for PyCon US 2025 to attend this event, and all attendees and speakers are responsible for securing their own tickets. You can find registration details for the conference here: https://us.pycon.org/2025/attend/information/.
HOMETOWN HEROES HATCHERY PROGRAM - May 17th
TALK SCHEDULE:
Decoding Spatial Biology with Python: Multi-Modal Insights into Breast Cancer Progression
Time: 01:45 PM - 02:15 PM
Speakers: Alex C. Chang, CMU-Pitt (Graduate Student PhD, Computational Biology) and Brent Schlegel, University of Pittsburgh School of Medicine (Graduate Student PhD, Integrative Systems Biology)
Python has rapidly become a cornerstone of scientific computing, computational biology, and bioinformatics due to its ease of use and scalability for handling large datasets, qualities that are critical in today’s “big data” era of clinical and translational research. As computational resources and data collection methods continue to expand, we are now empowered to ask larger and more clinically relevant questions that enable us to dissect complex biological systems with unprecedented detail. However, this surge in data complexity brings new challenges, from the integration of diverse data modalities to the need for sophisticated analytical methods capable of untangling intricate biological signals from background noise. In this talk, we describe how Python not only meets these challenges but also drives innovation through the development of novel bioinformatics tools like CITEgeist, a case study in harnessing Python’s capabilities for multi-modal spatial transcriptomics. Biological datasets often face challenges of high sparsity and noise. CITEgeist harnesses Python’s robust ecosystem to provide an efficient, scalable pipeline that deconvolutes messy spatial signals into actionable, clinically relevant features.
Exploring Energy Burden in Pittsburgh Neighborhoods with Python
Time: 02:30 PM - 03:00 PM
Speakers: Ling Almoubayyed, SmithGroup, Inc. (Project Manager) and Husni Almoubayyed, Carnegie Learning
National-level energy studies consistently find that energy burdens are a significant challenge, and that lower-income neighborhoods sometimes end up paying more for energy in cities including Pittsburgh. Using Python, we were able to extract and analyze data on energy consumption in the City of Pittsburgh, along with real-estate and geographic information system (GIS) data, to compare trends in energy usage and burden across Pittsburgh neighborhoods and across different housing types. We present statistical analyses and Python visualizations describing these trends across different features such as housing price, size, and neighborhood.
Bottling Tesla's Solar: A Solar Dashboard with Python
Time: 03:15 PM - 03:45 PM
Speaker: Christopher Pitstick (Sr. SWE)
Tesla's Powerwall/Inverter solar ecosystem is powerful yet notoriously opaque. For home labbers, extracting meaningful data can be daunting, but not impossible. In this talk, I'll share my journey of developing a custom solar dashboard using Grafana and PyPowerwall, navigating the quirks and closed nature of Tesla's ecosystem along the way. The backend is all Python, so I will demo my server code and dashboard to show how I was able to find hundreds of kilowatt hours in lost solar production. We'll also do a deep dive into the way I altered the Python server code to be able to query multiple inverters at the same time with complex iptables rules. This presentation may conclude with the value of installing solar on your home, and how self-monitoring is a critical component of every nerd's arsenal.
Strategies for Eliciting Structured Outputs from LLMs
Time: 03:50 PM - 03:55 PM
Speaker: Utkarsh Tripathi, Solventum (Machine Learning Engineer)
This lightning talk will provide a concise yet comprehensive overview of techniques for extracting structured, predictable outputs from Large Language Models. I will compare and demonstrate multiple state-of-the-art libraries (such as BAML, Instructor, LangChain, and SGLang, plus how they work under the hood) and use pydantic, dataclasses, and similar tools to get structured outputs. We will explore practical examples of JSON schema enforcement, markdown formatting directives, and template-based approaches that dramatically improve downstream processing capabilities. The presentation will include code snippets and prompt templates that participants can immediately implement in their own projects.
Does Generative AI Know Statistics?
Time: 03:55 PM - 04:00 PM
Speaker: Louis Luangkesorn, Highmark Health (Lead Data Scientist)
Generative AI has promise to impact many fields of endeavor. But experience has shown that it often has problems with nuance and context. This talk discusses some experiences using Generative AI as an aid in applied analytics and walks through an example that illustrates working around its weaknesses and taking advantage of its capabilities.
Demystifying How Animal Behavior Affects Disease Spread Using Python
Time: 04:00 PM - 04:05 PM
Speaker: Carolyn Tett, University of Pittsburgh (Research Technician)
Not all individuals contribute equally to disease spread. During COVID-19, social distancing reduced transmission for some, while high-contact individuals increased disease spread. Preventative measures for massive disease outbreaks, however, cannot rely solely on data from rare epidemic events. Instead, disease ecologists study animal models to understand how host behavior theoretically drives disease outbreaks. Tracking animal movement and interactions is essential for identifying transmission-relevant behaviors. In lab experiments, video recordings provide an abundance of behavioral data, now efficiently processed through automation, and coding languages like Python enable large-scale data analysis. The Stephenson Lab at the University of Pittsburgh uses Raspberry Pis to autonomously record guppies infected with an ectoparasite. These parasites transmit primarily through instances of close contact between hosts. Through autonomous video recordings, we generated 1,300 hours of footage, equivalent to 54 consecutive days of observation. Given that each video captures six guppies, manually tracking behavior would take tens of billions of days. Instead, animal tracking software reduces this processing time to a mere few months.
The Many-Colored Functions of Async Python
Time: 04:15 PM - 04:45 PM
Speaker: Bryan C. Mills, Duolingo (Senior Software Engineer)
You might think of functions in async Python in terms of “synchronous” and “async”, but the possibility of binding objects (such as Locks) to the asyncio event loop adds a whole new dimension to consider. We'll examine six vibrant kinds of functions and how they interact! This talk will examine code examples of how to adapt each kind of function to call other kinds, suggest design patterns that minimize the complexity of dealing with different kinds (such as non-blocking context managers), and examine patterns or libraries to safely synchronize concurrent calls involving multiple kinds of function.
Automated Dependency Inference and its Applications
Time: 05:00 PM - 05:30 PM
Speaker: Jason R. Coombs, Microsoft (Principal Software Engineer)
Last summer, I launched the Coherent Software Development System (https://bit.ly/coherent-system) with the principle that one should not have to repeat oneself when developing more than one Python project. One of the key innovations of that system is coherent.deps, a system for deriving package dependencies from the imports that a project or script uses. I'll explore some of the background motivations from Google's monorepo, some prior art at Meta, and some of the approaches that failed (AI-based inference) before going into the details of the implementation (AST parsing, world-readable MongoDB database, Big Table query to PyPI downloads). I'll additionally talk about some of the applications of this generalized library (coherent.build, pip-run), some of the maintenance challenges (expensive query, refresh interval), and possible other applications (on-demand dependency loader).
SPEAKER BIOS:
Alex C. Chang
Alexander Chih-Chieh Chang is a fourth-year MSTP student in the CMU-Pitt Computational Biology Ph.D. Program, mentored by Drs. Lee and Oesterreich. He earned a BS/BA in Chemical and Biomolecular Engineering/Sociology from Johns Hopkins University in 2021. Previously, during his undergraduate research in the lab of Rong Li, Ph.D., he conducted large-scale genomic screens to study proteomic dysregulation and spent a gap year in the lab of Manish Aghi, MD, PhD, studying breast cancer metastasis to the brain. Currently, as a computational biologist and medical student, he coordinates the Hope for OTHERS tissue donation program in the Lee-Oesterreich Lab and computational research projects in breast cancer metastasis and genomic evolution.
Brent Schlegel
Brent Schlegel is a first-year PhD student in Integrative Systems Biology at the University of Pittsburgh School of Medicine, co-mentored by Drs. Adrian Lee and Steffi Oesterreich. He earned his AS in Mathematics and Sciences from CCAC (2019) and a BS in Computational Biology from Pitt (2021). Most recently, he worked as a Bioinformatics Analyst at the UPMC Children’s Hospital of Pittsburgh, where he specialized in the integrative analysis of large, complex biomedical datasets. Now, Brent combines data science, computational modeling, and multi-omic integration to tackle the systems biology of invasive lobular breast cancer, using patient-derived organoid models and leveraging “big data” to uncover hidden patterns and drive innovation in diagnosis and treatment.
Ling Almoubayyed
Ling is an experienced architecture and urban designer with extensive project management expertise. Specializing in urban design, planning, community engagement, and spatial analysis, she has successfully led projects ranging from individual buildings to comprehensive urban districts. Ling uses evidence-based design with data gathered through stakeholder engagement to identify the best design solutions to create built environments. She is currently a Project Manager with SmithGroup.
Husni Almoubayyed
Husni Almoubayyed is the Director of AI at Pittsburgh-based education technology company Carnegie Learning. Husni uses machine learning and data science methods to conduct research in education, specifically in topics such as personalization, equity, and predictive analytics. Prior to his work in education technology, Husni acquired a Ph.D. in Astrophysics from Carnegie Mellon University, where he worked on mitigating biases in astronomical data to advance understanding of dark energy. Needless to say, Python is Husni's favorite programming language, and PyCon is one of his favorite events of the year!
Christopher Pitstick
Christopher, a passionate software engineer who installed solar panels on his home in 2024, quickly immersed himself in system analysis to optimize performance, expertise that directly inspired this presentation. His programming journey began at age 12 with QBasic, igniting a lifelong passion that led to roles at industry giants including Microsoft, Amazon, and Argo AI before taking his current position at Latitude. Throughout his career, Christopher has mastered multiple programming languages from C++ to Perl and Python, approaching coding both as a profession and personal passion. As a dedicated neurodiversity advocate, he regularly shares his experiences through public speaking engagements, raising awareness and empowering others in the tech community.
Utkarsh Tripathi
Utkarsh Tripathi is a Machine Learning Engineer at Solventum, Inc., where he works on Solventum™ Fluency Align™ and Solventum™ Fluency Direct™: AI-powered clinical documentation tools that leverage conversational and generative AI, along with ambient intelligence, to automate medical documentation. These solutions help reduce administrative work and physician burnout, while improving the overall patient care experience. Utkarsh holds degrees in Electrical Engineering, Chemistry, and Computer Science from BITS Pilani and the University of Chicago.
Louis Luangkesorn
Dr. Louis Luangkesorn is a Lead Data Scientist at Highmark Health where he works on projects applying statistical, predictive, operations research, and Generative AI models in use cases involving human resources and healthcare. He has contributed code to SciPy and a book appendix porting a simulation textbook's examples to SimPy.
Carolyn Tett
Carolyn is an ecologist who specializes in animal behavior and disease ecology. She works with guppies and their ectoparasites to better understand how host contact rate and physiological status impact disease spread. She captures guppy behaviors on video and uses Python to automate the video processing. Using these outputs, she quantifies guppy social metrics and runs statistical models to predict behavior-mediated parasite spread.
Bryan C. Mills
Bryan maintains Python core services at Duolingo, and was formerly a maintainer on the Go project at Google.
Jason R. Coombs
Jason has been a passionate contributor to Python and open source software since the '90s, is a core contributor to Python, and maintains hundreds of packages in PyPI. |
PyCon 2025 Special Event: Hometown Heroes Hatchery Program
|
|
Machine Learning Interviews
2023-11-30
Susan Shu Chang
– author
As tech products become more prevalent today, the demand for machine learning professionals continues to grow. But the responsibilities and skill sets required of ML professionals still vary drastically from company to company, making the interview process difficult to predict. In this guide, data science leader Susan Shu Chang shows you how to tackle the ML hiring process. Having served as principal data scientist in several companies, Chang has considerable experience as both ML interviewer and interviewee. She'll take you through the highly selective recruitment process by sharing hard-won lessons she learned along the way. You'll quickly understand how to successfully navigate your way through typical ML interviews.
This guide shows you how to:
Explore various machine learning roles, including ML engineer, applied scientist, data scientist, and other positions
Assess your interests and skills before deciding which ML role(s) to pursue
Evaluate your current skills and close any gaps that may prevent you from succeeding in the interview process
Acquire the skill set necessary for each machine learning role
Ace ML interview topics, including coding assessments, statistics and machine learning theory, and behavioral questions
Prepare for interviews in statistics and machine learning theory by studying common interview questions |
O'Reilly AI & ML Books
|
|
Bringing AI to DuckDB with Lance columnar format for multi-modal AI – DuckCon #3 (San Francisco)
2023-08-10
Speaker: Chang She (LanceDB)
Slides: https://blobs.duckdb.org/events/duckcon3/chang-she-lancedb-bringing-ai-to-duckdb-with-lance-columnar-format.pdf |
DuckCon #3 San Francisco 2023 |
|
Vector Data Lakes
2023-07-26 · 21:04
Vector databases such as Elasticsearch and Pinecone offer fast ingestion and querying on vector embeddings with approximate nearest neighbor (ANN) search. However, they typically do not decouple compute and storage, making them hard to integrate into production data stacks. Because data storage in these databases is expensive and not easily accessible, data teams typically maintain ETL pipelines to offload historical embedding data to blob stores. When that data needs to be queried, it gets loaded back into the vector database in another ETL process. This is reminiscent of loading data from an OLTP database to cloud storage, then loading that data into an OLAP warehouse for offline analytics. Recently, “lakehouse” offerings allow direct OLAP querying on cloud storage, removing the need for the second ETL step. The same could be done for embedding data. While embedding storage in blob stores cannot satisfy the high TPS requirements in online settings, we argue it’s sufficient for offline analytics use cases like slicing and dicing data based on embedding clusters. Instead of loading the embedding data back into the vector database for offline analytics, we propose direct processing on embeddings stored in Parquet files in Delta Lake. You will see that offline embedding workloads typically touch a large portion of the stored embeddings without the need for random access. As a result, the workload is entirely bound by network throughput instead of latency, making it quite suitable for blob storage backends. On a test dataset of one billion vectors, ETL into cloud storage takes around one hour on a dedicated GPU instance, while batched nearest neighbor search can be done in under one minute with four CPU instances. We believe future “lakehouses” will ship with native support for these embedding workloads.
Talk by: Tony Wang and Chang She |
Databricks DATA + AI Summit 2023 |
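The Vector Data Lakes abstract argues that offline embedding workloads scan most of the stored vectors sequentially instead of doing random access. A minimal sketch of that access pattern, assuming nothing about the speakers' actual implementation: the function names are hypothetical, the rows live in memory instead of Parquet files, each batch stands in for one chunk read from blob storage, and plain Python arithmetic replaces the vectorized math a real workload would use.

```python
import heapq
import math
from typing import Iterable, Iterator, Sequence

Vector = Sequence[float]


def batches(rows: Iterable, size: int) -> Iterator[list]:
    """Group a sequential stream of rows into fixed-size batches."""
    batch: list = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(
    query: Vector,
    rows: Iterable[tuple[int, Vector]],
    k: int,
    batch_size: int = 1024,
) -> list[tuple[float, int]]:
    """Exact k-nearest-neighbor search using one full sequential scan."""
    # Max-heap emulated with negated distances; it never holds more than k items.
    best: list[tuple[float, int]] = []
    for batch in batches(rows, batch_size):
        for row_id, vec in batch:
            dist = euclidean(query, vec)
            if len(best) < k:
                heapq.heappush(best, (-dist, row_id))
            elif -dist > best[0][0]:
                # Closer than the current worst candidate, so swap it in.
                heapq.heapreplace(best, (-dist, row_id))
    return sorted((-neg, row_id) for neg, row_id in best)


# Toy (id, embedding) rows used only for illustration.
rows = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (0.0, 2.0)), (3, (3.0, 4.0))]
print(nearest_neighbors((0.0, 0.0), rows, k=2, batch_size=2))  # [(0.0, 0), (1.0, 1)]
```

Because every row is read exactly once and in order, total time depends on how fast batches arrive, not on per-lookup latency, which matches the abstract's point about throughput-bound workloads.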