talk-data.com

Topic: Python
Tags: programming_language, data_science, web_development
1446 tagged activities

Activity Trend: 185 peak/qtr (2020-Q1 to 2026-Q1)

Activities
1446 activities · Newest first

Jupyter notebooks have been around for a while now and are an approachable starting point for anyone interested in exploring data and programming. Jupyter has done a lot of good, but it was never really designed to be 'the' Python notebook: it has always aimed to support multiple languages, and that goal is reflected in many of its design decisions. And that makes you wonder: what if we ignored other languages and really only cared about Python and data tools? Would the notebook be different?

The goal of this talk is to demonstrate this idea by talking about widgets, reactive Python, modern data tooling, and the freedom to rethink your tools. There will also be demos. Lots. Of. Demos.

96 Common Challenges in Power Query: Practical Solutions for Mastering Data Transformation in Excel and Power BI

This comprehensive guide is designed to address the most frequent and challenging issues faced by users of Power Query, a powerful data transformation tool integrated into Excel, Power BI, and Microsoft Azure. By tackling 96 real-world problems with practical, step-by-step solutions, this book is an essential resource for data analysts, Excel enthusiasts, and Power BI professionals. It aims to enhance your data transformation skills and improve efficiency in handling complex data sets.

Structured into 12 chapters, the book covers specific areas of Power Query such as data extraction, referencing, column splitting and merging, sorting and filtering, and pivoting and unpivoting tables. You will learn to combine data from Excel files with varying column names, handle multi-row headers, perform advanced filtering, and manage missing values using techniques such as linear interpolation and K-nearest neighbors (K-NN) imputation. The book also dives into advanced Power Query functions such as Table.Group, List.Accumulate, and List.Generate, explored through practical examples such as calculating running totals and implementing complex grouping and iterative processes. Additionally, it covers crucial topics such as error-handling strategies, custom function creation, and the integration of Python and R with Power Query.

In addition to explaining how to use functions and the M language to solve real-world challenges, this book discusses optimization techniques for data cleaning processes and improving computational speed. It also compares the execution time of functions across different patterns and proposes the optimal approach based on these comparisons.

In today's data-driven world, mastering Power Query is crucial for accurate and efficient data processing. But as data complexity grows, so do the challenges and pitfalls that users face. This book serves as your guide through the noise and your key to unlocking the full potential of Power Query. You'll quickly learn to navigate and resolve common issues, enabling you to transform raw data into actionable insights with confidence and precision.

What You Will Learn
- Master data extraction and transformation techniques for various Excel file structures
- Apply advanced filtering, sorting, and grouping methods to organize and analyze data
- Leverage powerful functions such as Table.Group, List.Accumulate, and List.Generate for complex transformations
- Optimize queries to execute faster
- Create and utilize custom functions to handle iterative processes and advanced list transformations
- Implement effective error-handling strategies, including removing erroneous rows and extracting error reasons
- Customize Power Query solutions to meet specific business needs and share custom functions across files

Who This Book Is For
Aspiring and developing data professionals using Power Query in Excel or Power BI who seek practical solutions to enhance their skills and streamline complex data transformation workflows
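To make the missing-value techniques concrete in Python terms (the book itself works in Power Query's M language, so pandas and scikit-learn are stand-ins here), a minimal sketch with a hypothetical toy column:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical data): two gaps in the 'temp' column.
df = pd.DataFrame({
    "hour": [0.0, 1.0, 2.0, 3.0, 4.0],
    "temp": [10.0, None, 14.0, None, 20.0],
})

# Linear interpolation: fill each gap on the straight line between its neighbors.
df["temp_interp"] = df["temp"].interpolate(method="linear")

# K-NN imputation: estimate each missing value from the k most similar rows.
imputer = KNNImputer(n_neighbors=2)
df[["hour", "temp_knn"]] = imputer.fit_transform(df[["hour", "temp"]])

print(df)
```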

In response to the growing demand for integrating new data into our data platform, the Data Engineering Team at Okta has developed a solution utilizing Snowpark for Python to automate the construction of data pipelines. Discover how Okta's Zero Touch Platform creates end-to-end pipelines that ingest events arriving on S3 and transform data in Snowflake using Streams and Tasks. The platform features integrated capabilities to detect schema changes in data streams, facilitating automatic evolution of Snowflake table schemas. Crafted with privacy in mind, it also allows for data classification through tags and systematic masking of data using tag-based masking policies.
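As a rough illustration of the Streams-and-Tasks pattern the talk describes, here is a minimal Snowpark for Python sketch; the connection parameters and the table, task, and warehouse names are all hypothetical:

```python
from snowflake.snowpark import Session

# Hypothetical credentials; fill in with your own account details.
session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
}).create()

# A stream records row-level changes on the landing table.
session.sql(
    "CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events"
).collect()

# A task periodically drains the stream into the curated table.
session.sql("""
    CREATE OR REPLACE TASK load_events
      WAREHOUSE = transform_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
    AS
      INSERT INTO curated_events SELECT * FROM raw_events_stream
""").collect()
session.sql("ALTER TASK load_events RESUME").collect()
```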

Sponsored by: Redpanda | IoT for Fun & Prophet: Scaling IoT and predicting the future with Redpanda, Iceberg & Prophet

In this talk, we’ll walk through a complete real-time IoT architecture—from an economical, high-powered ESP32 microcontroller publishing environmental sensor data to AWS IoT, through Redpanda Connect into a Redpanda BYOC cluster, and finally into Apache Iceberg for long-term analytical storage. Once the data lands, we’ll query it using Python and perform linear regression with Prophet to forecast future trends. Along the way, we’ll explore the design of a scalable, cloud-native pipeline for streaming IoT data. Whether you're tracking the weather or building the future, this session will help you architect with confidence—and maybe even predict it.
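As a flavor of the forecasting step, a minimal Prophet sketch; the file name and column names are hypothetical, and Prophet expects a 'ds' timestamp column and a 'y' value column:

```python
import pandas as pd
from prophet import Prophet

# Hypothetical export of the sensor data after it lands in Iceberg.
df = pd.read_parquet("sensor_readings.parquet")
df = df.rename(columns={"measured_at": "ds", "temperature_c": "y"})

model = Prophet()  # linear trend growth is the default
model.fit(df)

# Forecast the next 48 hours at hourly resolution.
future = model.make_future_dataframe(periods=48, freq="h")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```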

What’s New in Apache Spark™ 4.0?

Join this session for a concise tour of Apache Spark™ 4.0's most notable enhancements:
- SQL features: ANSI mode by default, SQL scripting, SQL pipe syntax, SQL UDFs, session variables, view schema evolution, etc.
- Data types: VARIANT type, string collation
- Python features: Python data source, plotting API, etc.
- Streaming improvements: state store data source, state store checkpoint v2, arbitrary state v2, etc.
- Spark Connect improvements: more API coverage, thin client, unified Scala interface, etc.
- Infrastructure: better error messages, structured logging, new Java/Scala version support, etc.

Whether you're a seasoned Spark user or new to the ecosystem, this talk will prepare you to leverage Spark 4.0's latest innovations for modern data and AI pipelines.
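As a taste of one of these features, a minimal PySpark 4.0 sketch of the VARIANT type, using a toy JSON string: parse_json builds a VARIANT value and variant_get extracts typed fields by path:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('{"device": "esp32", "temp": 21.5}',)], ["raw"])

# parse_json builds a VARIANT value; variant_get extracts typed fields by path.
parsed = df.select(F.parse_json("raw").alias("v"))
parsed.select(
    F.variant_get("v", "$.device", "string").alias("device"),
    F.variant_get("v", "$.temp", "double").alias("temp"),
).show()
```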

Using Delta-rs and Delta-Kernel-rs to Serve CDC Feeds

Change data feeds are a common tool for synchronizing changes between tables and performing data processing in a scalable fashion. Serverless architectures offer a compelling solution for organizations looking to avoid the complexity of managing infrastructure. But how can you bring CDFs into a serverless environment? In this session, we'll explore how to integrate change data feeds into serverless architectures using Delta-rs and Delta-kernel-rs, open-source projects that allow you to read Delta tables and their change data feeds in Rust or Python. We'll demonstrate how to use these tools with Lakestore's serverless platform to easily stream and process changes.

You'll learn how to:
- Leverage Delta tables and CDFs in serverless environments
- Utilize Databricks and Unity Catalog without needing Apache Spark
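A minimal sketch of reading a change data feed through Delta-rs's Python bindings, assuming a recent release of the deltalake package (which exposes DeltaTable.load_cdf) and a hypothetical table with CDF enabled:

```python
from deltalake import DeltaTable

# Hypothetical table URI; the table must have change data feed enabled
# (delta.enableChangeDataFeed = true) for the versions you want to read.
dt = DeltaTable("s3://my-bucket/events")

# load_cdf returns an Arrow reader over the change feed, with _change_type,
# _commit_version, and _commit_timestamp metadata columns on each row.
reader = dt.load_cdf(starting_version=0)
for batch in reader:
    print(batch.to_pandas().head())
```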

Next-Gen Data Science: How Posit and Databricks Are Transforming Analytics at Scale

Modern data science teams face the challenge of navigating complex landscapes of languages, tools and infrastructure. Positron, Posit’s next-generation IDE, offers a powerful environment tailored for data science, seamlessly integrating with Databricks to empower teams working in Python and R. Now integrated within Posit Workbench, Positron enables data scientists to efficiently develop, iterate and analyze data with Databricks — all while maintaining their preferred workflows. In this session, we’ll explore how Python and R users can develop, deploy and scale their data science workflows by combining Posit tools with Databricks. We’ll showcase how Positron simplifies development for both Python and R and how Posit Connect enables seamless deployment of applications, reports and APIs powered by Databricks. Join us to see how Posit + Databricks create a frictionless, scalable and collaborative data science experience — so your teams can focus on insights, not infrastructure.

Looking for a practical workshop on building an AI Agent on Databricks? Well, we have just the thing for you. This hands-on workshop takes you through the process of creating intelligent agents that can reason their way to useful outcomes. You'll start by building your own toolkit of SQL and Python functions that give your agent practical capabilities. Then we'll explore how to select the right foundation model for your needs, connect your custom tools, and watch as your agent tackles complex challenges through visible reasoning paths. The workshop doesn't just stop at building: you'll dive into evaluation techniques using evaluation datasets to identify where your agent shines and where it needs improvement. After implementing and measuring your changes, we'll explore deployment strategies, including a feedback collection interface that enables continuous improvement and governance mechanisms to ensure responsible AI usage in production environments.
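As a rough sketch of the kind of tool the workshop starts with, here is a Unity Catalog Python function registered from a Databricks notebook; the catalog, schema, and function names are hypothetical, and a running spark session is assumed:

```python
# Registers a Python function in Unity Catalog that an agent can later call
# as a tool. Assumes a Databricks notebook session where `spark` exists.
spark.sql("""
CREATE OR REPLACE FUNCTION main.agent_tools.c_to_f(celsius DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
  return celsius * 9 / 5 + 32
$$
""")

# The agent (or any SQL user) can now invoke the tool directly.
spark.sql("SELECT main.agent_tools.c_to_f(21.5) AS fahrenheit").show()
```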

The Full Stack of Innovation: Building Data and AI Products With Databricks Apps

In this deep-dive technical session, Ivan Trusov (Sr. SSA @ Databricks) and Giran Moodley (SA @ Databricks) will explore the full-stack development of Databricks Apps, covering everything from frameworks to deployment. We'll walk through essential topics, including:
- Frameworks & tooling: Pythonic (Dash, Streamlit, Gradio) vs. JS + Python stack
- Development lifecycle: debugging, issue resolution and best practices
- Testing: unit, integration and load testing strategies
- CI/CD & deployment: automating with Databricks Asset Bundles
- Monitoring & observability: OpenTelemetry, metrics collection and analysis

Expect a highly practical session with several live demos, showcasing the development loop, testing workflows and CI/CD automation. Whether you're building internal tools or AI-powered products, this talk will equip you with the knowledge to ship robust, scalable Databricks Apps.
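For scale, the smallest possible app in one of the Pythonic stacks mentioned above: a hypothetical Streamlit hello-world (Dash and Gradio equivalents are similarly short):

```python
# app.py: a minimal Streamlit app, the simplest of the Pythonic stacks discussed.
import streamlit as st

st.title("Hello, Databricks Apps")
name = st.text_input("Your name", "world")
st.write(f"Hello, {name}!")
```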

Join us to see how the powerful combination of ThoughtSpot's agentic analytics platform and the Databricks Data Intelligence Platform is changing the game for data-driven organizations. We'll demonstrate how DataSpot breaks down technical barriers to insight. You'll learn how to get trusted, real-time answers thanks to the seamless integration between ThoughtSpot's semantic layer and Databricks Unity Catalog. This session is for anyone looking to leverage data more effectively, whether you're a business leader seeking AI-driven insights, a data scientist building models in Python, or a product owner creating intelligent applications.

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. As a result, many proprietary data sources without public SDKs remained disconnected from Spark, and data sources with Python SDKs couldn't harness Spark's distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
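A minimal sketch of the Python Data Source API the talk builds on: a toy 'Excel' source with a fixed schema and hard-coded rows (a real connector would parse the workbook from self.options, e.g. with openpyxl; a running SparkSession named spark is assumed):

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class ExcelDataSource(DataSource):
    """Toy version of the talk's connector; yields fixed rows instead of
    actually parsing a workbook."""

    @classmethod
    def name(cls):
        return "excel_demo"

    def schema(self):
        return "name string, value double"

    def reader(self, schema):
        return ExcelReader()

class ExcelReader(DataSourceReader):
    def read(self, partition):
        # A real implementation would open the file named in the options
        # and yield one tuple per spreadsheet row.
        yield ("a", 1.0)
        yield ("b", 2.0)

spark.dataSource.register(ExcelDataSource)
spark.read.format("excel_demo").load().show()
```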

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large-record data (e.g., images, tensors, embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics comparable to SQL workloads. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
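A minimal sketch of writing and reading a Lance dataset from Python, with hypothetical toy embeddings; the Spark integration described above layers the Python data source API on top of this storage:

```python
import lance
import pyarrow as pa

# Toy embeddings table; Lance stores large vector columns natively.
table = pa.table({
    "id": [1, 2],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})
lance.write_dataset(table, "embeddings.lance")

ds = lance.dataset("embeddings.lance")
print(ds.to_table().to_pandas())
```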

Using Clean Rooms for Privacy-Centric Data Collaboration

Databricks Clean Rooms make privacy-safe collaboration possible for data, analytics, and AI — across clouds and platforms. Built on Delta Sharing, Clean Rooms enable organizations to securely share and analyze data together in a governed, isolated environment — without ever exposing raw data. In this session, you’ll learn how to get started with Databricks Clean Rooms and unlock advanced use cases, including:
- Cross-platform collaboration and joint analytics
- Training machine learning and AI models
- Enforcing custom privacy policies
- Analyzing unstructured data
- Incorporating proprietary libraries in Python and SQL notebooks
- Auditing clean room activity for compliance

Whether you're a data scientist, engineer or data leader, this session will equip you to drive high-value collaboration while maintaining full control over data privacy and governance.

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.
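A minimal sketch of a Python UDTF of the kind the session covers, modeled on the PySpark documentation's example (a running SparkSession named spark is assumed):

```python
from pyspark.sql.functions import lit, udtf

# A table-valued function that expands an integer range into rows.
@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)

# Call it from the DataFrame API...
SquareNumbers(lit(1), lit(3)).show()

# ...or register it and invoke it as a TVF in SQL.
spark.udtf.register("square_numbers", SquareNumbers)
spark.sql("SELECT * FROM square_numbers(1, 3)").show()
```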

Summary

In this episode of the Data Engineering Podcast, Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms.
Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
- What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
- What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
- Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
- Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
- What are the foundational architectural modifications that you had to make to enable those capabilities?
- For the vector storage and indexing, what modifications did you have to make to Iceberg?
- What was your reasoning for not using a format like Lance?
- For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
- What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
- What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
- When is Starburst/lakehouse the wrong choice for a given AI use case?
- What do you have planned for the future of AI on Starburst?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Starburst
- Podcast Episode
- AWS Athena
- MCP == Model Context Protocol
- LLM Tool Use
- Vector Embeddings
- RAG == Retrieval Augmented Generation
- AI Engineering Podcast Episode
- Starburst Data Products
- Lance
- LanceDB
- Parquet
- ORC
- pgvector
- Starburst Icehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA