talk-data.com

Topic: API (Application Programming Interface)
Tags: integration, software_development, data_exchange
856 activities tagged

Activity Trend: peak of 65 activities per quarter, 2020-Q1 to 2026-Q1

Activities

856 activities · Newest first

Sponsored by: Boomi, LP | From Pipelines to Agents: Manage Data and AI on One Platform for Maximum ROI

In the age of agentic AI, competitive advantage lies not only in AI models, but in the quality of the data agents reason on and the agility of the tools that feed them. To fully realize the ROI of agentic AI, organizations need a platform that enables high-quality data pipelines and provides scalable, enterprise-grade tools. In this session, discover how a unified platform for integration, data management, MCP server management, API management, and agent orchestration can help you to bring cohesion and control to how data and agents are used across your organization.

Sponsored by: EY | Unlocking Value Through AI at Takeda Pharmaceuticals

In the rapidly evolving landscape of pharmaceuticals, the integration of AI and GenAI is transforming how organizations operate and deliver value. We will explore the profound impact of the AI program at Takeda Pharmaceuticals and the central role of Databricks. We will delve into eight pivotal AI/GenAI use cases that enhance operational efficiency across commercial, R&D, manufacturing, and back-office functions, including these capabilities:
Responsible AI Guardrails: Scanners that validate and enforce responsible AI controls on GenAI solutions
Reusable Databricks Native Vectorization Pipeline: A scalable solution enhancing data processing with quality and governance
One-Click Deployable RAG Pattern: Simplifying deployment for AI applications, enabling rapid experimentation and innovation
AI Asset Registry: A repository for foundational models, vector stores, and APIs, promoting reuse and collaboration

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
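To make the API shape concrete, here is a minimal sketch of a batch reader built on the Spark 4.0 Python Data Source API (pyspark.sql.datasource). It is not the talk's Excel connector: it yields hard-coded rows, and the format name and columns are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader


class GreetingDataSource(DataSource):
    """Toy batch source: two hard-coded rows instead of a real Excel file."""

    @classmethod
    def name(cls):
        return "greeting"  # used as the format name in spark.read.format(...)

    def schema(self):
        # A real connector could infer this from the source (e.g., sheet headers).
        return "name string, age int"

    def reader(self, schema):
        return GreetingReader(self.options)


class GreetingReader(DataSourceReader):
    def __init__(self, options):
        self.options = options

    def read(self, partition):
        # Yield plain tuples matching the declared schema.
        yield ("Alice", 30)
        yield ("Bob", 25)


spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(GreetingDataSource)
spark.read.format("greeting").load().show()
```

An Excel connector like the one in the talk would follow the same skeleton, with read() delegating to a Python Excel library and schema() inferring column types from the sheet.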

Sponsored by: Google Cloud | Building Powerful Agentic Ecosystems with Google Cloud's A2A

This session unveils Google Cloud's Agent2Agent (A2A) protocol, ushering in a new era of AI interoperability where diverse agents collaborate seamlessly to solve complex enterprise challenges. Join our panel of experts to discover how A2A empowers you to deeply integrate these collaborative AI systems with your existing enterprise data, custom APIs, and critical workflows. Ultimately, learn to build more powerful, versatile, and securely managed agentic ecosystems by combining specialized Google-built agents with your own custom solutions (Vertex AI or no-code). Extend this ecosystem further by serving these agents with Databricks Model Serving and governing them with Unity Catalog for consistent security and management across your enterprise.

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large records (e.g., images, tensors, embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics on par with SQL workloads. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
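As a rough illustration of pairing the two (not code from the talk), the sketch below hands a small PySpark DataFrame to Lance via Arrow and reads it back. It assumes the lance and pyarrow packages are installed; the path and column names are invented.

```python
import lance
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame with an embedding column, standing in for multimodal data.
df = spark.createDataFrame(
    [(1, "cat", [0.1, 0.2, 0.3]), (2, "dog", [0.9, 0.8, 0.7])],
    "id INT, label STRING, embedding ARRAY<FLOAT>",
)

# Convert through pandas/Arrow for simplicity; a production pipeline would
# write partitions in parallel (e.g., via the Python Data Source API).
arrow_table = pa.Table.from_pandas(df.toPandas())
lance.write_dataset(arrow_table, "/tmp/demo.lance", mode="overwrite")

# Read it back; Lance can also build vector and full-text indexes on top of this.
ds = lance.dataset("/tmp/demo.lance")
print(ds.to_table().to_pandas())
```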

Sponsored by: Fivetran | Scalable Data Ingestion: Building custom pipelines with the Fivetran Connector SDK and Databricks

Organizations have hundreds of data sources, some of which are very niche or difficult to access. Incorporating this data into your lakehouse requires significant time and resources, hindering your ability to work on more value-add projects. Enter the Fivetran Connector SDK: a powerful new tool that enables your team to create custom pipelines for niche systems, custom APIs, and sources with specific data filtering requirements, seamlessly integrating with Databricks. During this session, Fivetran will demonstrate how to:
(1) Leverage the Connector SDK to build scalable connectors, enabling the ingestion of diverse data into Databricks
(2) Gain flexibility and control over historical and incremental syncs, delete capture, state management, multithreaded data extraction, and custom schemas
(3) Use practical examples, code snippets, and architectural considerations to overcome data integration challenges and unlock the full potential of your Databricks environment
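For orientation, the sketch below follows the published Connector SDK "hello world" pattern; the table, fields, and cursor logic are invented, and the exact method signatures should be checked against Fivetran's current SDK documentation.

```python
from fivetran_connector_sdk import Connector
from fivetran_connector_sdk import Operations as op


def schema(configuration: dict):
    # Declare destination tables and their primary keys.
    return [{"table": "orders", "primary_key": ["order_id"]}]


def update(configuration: dict, state: dict):
    # Incremental sync: resume from the last checkpointed cursor.
    cursor = state.get("cursor", 0)
    # A real connector would page through a niche or custom API here;
    # hard-coded rows keep the example self-contained.
    for record in [{"order_id": cursor + 1, "amount": 9.99},
                   {"order_id": cursor + 2, "amount": 4.50}]:
        yield op.upsert(table="orders", data=record)
        cursor = record["order_id"]
    yield op.checkpoint(state={"cursor": cursor})


connector = Connector(update=update, schema=schema)

if __name__ == "__main__":
    connector.debug()  # local test run before deploying to Fivetran
```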

What’s New in PySpark: TVFs, Subqueries, Plots, and Profilers

PySpark’s DataFrame API is evolving to support more expressive and modular workflows. In this session, we’ll introduce two powerful additions: table-valued functions (TVFs) and the new subquery API. You’ll learn how to define custom TVFs using Python User-Defined Table Functions (UDTFs), including support for polymorphism, and how subqueries can simplify complex logic. We’ll also explore how lateral joins connect these features, followed by practical tools for the PySpark developer experience—such as plotting, profiling, and a preview of upcoming capabilities like UDF logging and a Python-native data source API. Whether you're building production pipelines or extending PySpark itself, this talk will help you take full advantage of the latest features in the PySpark ecosystem.
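As a small, self-contained illustration of the UDTF piece (not the session's code), the sketch below defines a Python UDTF, calls it directly, and registers it for SQL; the function and column names are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udtf


@udtf(returnType="word: string, length: int")
class SplitWords:
    def eval(self, text: str):
        # A UDTF can emit many output rows per input row.
        for word in text.split(" "):
            yield word, len(word)


spark = SparkSession.builder.getOrCreate()

# Call it directly as a relation-producing function...
SplitWords(lit("table valued functions in pyspark")).show()

# ...or register it and use it from SQL (including in lateral joins).
spark.udtf.register("split_words", SplitWords)
spark.sql("SELECT * FROM split_words('hello world')").show()
```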

Summary: In this episode of the Data Engineering Podcast, Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms.
Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.
Interview:
Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?
Contact Info: LinkedIn
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email [email protected] with your story.
Links: Starburst, Podcast Episode, AWS Athena, MCP == Model Context Protocol, LLM Tool Use, Vector Embeddings, RAG == Retrieval Augmented Generation, AI Engineering Podcast Episode, Starburst Data Products, Lance, LanceDB, Parquet, ORC, pgvector, Starburst Icehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

From Imperative to Declarative Paradigm: Rebuilding a CI/CD Infrastructure Using Hatch and DABs

Building and deploying PySpark pipelines to Databricks should be effortless. However, our team at FreeWheel has long struggled with a convoluted and hard-to-maintain CI/CD infrastructure. It followed an imperative paradigm, demanding that every project implement custom scripts to build artifacts and deploy resources, and resulting in redundant boilerplate code and awkward interactions with the Databricks REST API. We set out to rebuild it from scratch, following a declarative paradigm instead. We will share how we were able to eliminate thousands of lines of code from our repository, create a fully configuration-driven infrastructure where projects can be easily onboarded, and improve the quality of our codebase using Hatch and Databricks Asset Bundles as our tools of choice. In particular, DAB has made deploying across our 3 environments a breeze, and has allowed us to quickly adopt new features as soon as they are released by Databricks.

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

This session will show how we implemented a solution to support high-frequency data ingestion from smart meters. We implemented a robust API endpoint that interfaces directly with IoT devices. This API processes messages in real time from millions of distributed IoT devices and meters across the network. The architecture leverages cloud storage as a landing zone for the raw data, followed by a streaming pipeline built on Lakeflow Declarative Pipelines. This pipeline implements a multi-layer medallion architecture to progressively clean, transform and enrich the data. The pipeline operates continuously to maintain near real-time data freshness in our gold layer tables. These datasets connect directly to Databricks Dashboards, providing stakeholders with immediate insights into their operational metrics. This solution demonstrates how modern data architecture can handle high-volume IoT data streams while maintaining data quality and providing accessible real-time analytics for business users.
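To ground the description, here is a hedged sketch of the bronze and silver steps in the DLT-style Python API that Lakeflow Declarative Pipelines builds on; the paths, column names, and the quality expectation are placeholders, and the exact syntax may differ in your runtime.

```python
import dlt  # available inside a Lakeflow/DLT pipeline; `spark` is provided by the runtime
from pyspark.sql import functions as F


@dlt.table(comment="Raw smart-meter messages landed as JSON files")
def meter_readings_bronze():
    # Auto Loader incrementally picks up new files from the cloud-storage landing zone.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/iot/landing/meter_readings/")
    )


@dlt.table(comment="Cleaned readings with basic quality checks")
@dlt.expect_or_drop("non_negative_reading", "reading_kwh >= 0")
def meter_readings_silver():
    return (
        dlt.read_stream("meter_readings_bronze")
        .withColumn("event_ts", F.to_timestamp("event_time"))
        .dropDuplicates(["device_id", "event_ts"])
    )
```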

Simplify Data Ingest and Egress with the New Python Data Source API

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.
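The same API covers egress as well as ingest. The sketch below, loosely modeled on the documented writer pattern, sends each partition of a DataFrame to its own JSON-lines file; the format name, path option, and file layout are invented for illustration.

```python
import json
import os
from dataclasses import dataclass
from typing import Iterator

from pyspark import TaskContext
from pyspark.sql import Row, SparkSession
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage


@dataclass
class PartitionCommit(WriterCommitMessage):
    row_count: int


class JsonDirDataSource(DataSource):
    """Toy egress sink: each partition writes its rows to one JSON-lines file."""

    @classmethod
    def name(cls):
        return "json_dir"

    def writer(self, schema, overwrite: bool):
        return JsonDirWriter(self.options["path"])


class JsonDirWriter(DataSourceWriter):
    def __init__(self, path: str):
        self.path = path

    def write(self, rows: Iterator[Row]) -> WriterCommitMessage:
        # Runs once per partition on the executors (local disk when running locally).
        part_id = TaskContext.get().partitionId()
        os.makedirs(self.path, exist_ok=True)
        count = 0
        with open(os.path.join(self.path, f"part-{part_id}.json"), "w") as out:
            for row in rows:
                out.write(json.dumps(row.asDict()) + "\n")
                count += 1
        return PartitionCommit(row_count=count)


spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(JsonDirDataSource)
(spark.createDataFrame([(1, "a"), (2, "b")], "id INT, v STRING")
 .write.format("json_dir").option("path", "/tmp/json_dir_demo").mode("append").save())
```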

Turn Genie Into an Agent Using Conversation APIs

Transform your AI/BI Genie into a text-to-SQL powerhouse using the Genie Conversation APIs. This session explores how Genie functions as an intelligent agent, translating natural language queries into SQL to accelerate insights and enhance self-service analytics. You'll learn practical techniques for configuring agents, optimizing queries and handling errors — ensuring Genie delivers accurate, relevant responses in real time. A must-attend for teams looking to level up their AI/BI capabilities and deliver smarter analytics experiences.
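For orientation, here is a hedged sketch of driving Genie programmatically over REST. The endpoint paths and response fields are assumptions based on the Conversation API's documented shape and should be verified against the current Databricks REST reference; the space ID and question are placeholders.

```python
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]      # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
SPACE_ID = os.environ["GENIE_SPACE_ID"]   # the Genie space acting as the "agent"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Ask a natural-language question in a new conversation (path is an assumption).
start = requests.post(
    f"{HOST}/api/2.0/genie/spaces/{SPACE_ID}/start-conversation",
    headers=HEADERS,
    json={"content": "What were the top 5 products by revenue last quarter?"},
)
start.raise_for_status()
body = start.json()
conversation_id = body["conversation_id"]  # field names assumed; inspect the real payload
message_id = body["message_id"]

# 2. Poll the message until Genie has generated (and executed) the SQL.
while True:
    msg = requests.get(
        f"{HOST}/api/2.0/genie/spaces/{SPACE_ID}"
        f"/conversations/{conversation_id}/messages/{message_id}",
        headers=HEADERS,
    ).json()
    if msg.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# 3. The completed message carries the generated SQL and query results, which an
#    agent wrapper can hand back to the caller or to an orchestrating LLM.
print(msg)
```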

The Future of DSv2 in Apache Spark™

DSv2, Spark's next-generation Catalog API, is gaining traction among data source developers. It shifts complexity to Apache Spark™, improves connector reliability and unlocks new functionality such as catalog federation, MERGE operations, storage-partitioned joins, aggregate pushdown, stored procedures and more. This session covers the design of DSv2, its current strengths and gaps, and its evolving roadmap. It's intended for Spark users and developers working with data sources, whether custom-built or off-the-shelf.

Unified Advanced Analytics: Integrating Power BI and Databricks Genie for Real-time Insights

In today’s data-driven landscape, business users expect seamless, interactive analytics without having to switch between different environments. This presentation explores our web application that unifies a Power BI dashboard with Databricks Genie, allowing users to query and visualize insights from the same dataset within a single, cohesive interface. We will compare two integration strategies: one that leverages a traditional webpage enhanced by an Azure bot to incorporate Genie’s capabilities, and another that utilizes Databricks Apps to deliver a smoother, native experience. We use the Genie API to build this solution. Attendees will learn the architecture behind these solutions, key design considerations and challenges encountered during implementation. Join us to see live demos of both approaches, and discover best practices for delivering an all-in-one, interactive analytics experience.

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

Delta Lake is redesigning its Spark connector through the combination of three key technologies: First, we're updating our Spark APIs to DSv2 to achieve deeper catalog integration and improved integration with the Spark optimizer. Second, we're fully integrating on top of Delta Kernel to take advantage of its simplified abstraction of Delta protocol complexities, accelerating feature adoption and improving maintainability. Third, we are transforming Delta to become a catalog-aware lakehouse format with Catalog Commits, enabling more efficient metadata management, governance and query performance. Join us to explore how we're advancing Delta Lake's architecture, pushing the boundaries of metadata management and creating a more intelligent, performant data lakehouse platform.

As first-party data becomes increasingly valuable to organizations, Walmart Data Ventures is dedicated to bringing to life new applications of Walmart’s first-party data to better serve its customers. Through Scintilla, its integrated insights ecosystem, Walmart Data Ventures continues to expand its offerings to deliver insights and analytics that drive collaboration between our merchants, suppliers, and operators. Scintilla users can now access Walmart data using Cloud Feeds, based on Databricks Delta Sharing technologies. In the past, Walmart used API-based data sharing models, which required users to possess certain skills and technical attributes that weren’t always available. Now, with Cloud Feeds, Scintilla users can more easily access data without a dedicated technical team behind the scenes making it happen. Attendees will gain valuable insights into how Walmart has built its robust data sharing architecture and strategies to design scalable and collaborative data sharing architectures in their own organizations.
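Cloud Feeds are built on Delta Sharing, and the open-protocol side of that can be sketched generically (this is not Walmart's or Scintilla's actual interface): the delta-sharing Python connector reads a provider-issued profile file and loads shared tables. The share, schema, and table names below are invented.

```python
import delta_sharing

# The data provider issues a profile file containing the sharing server
# endpoint and a bearer token for the recipient.
profile = "config.share"

# Discover what the provider has shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into pandas (or use load_as_spark on a cluster).
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.pos.store_sales")
print(df.head())
```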

Delta Kernel for Rust and Java

Delta Kernel makes it easy for engines and connectors to read and write Delta tables. It supports many Delta features and underpins robust connectors, including DuckDB, ClickHouse, Spice AI and delta-dotnet. In this session, we'll cover lessons learned about how to build a high-performance library that lets engines integrate the way they want, while not having to worry about the details of the Delta protocol. We'll talk through how we streamlined the API, as well as its changes and underlying motivations. We'll discuss new highlight features such as write support and the ability to do CDF scans. Finally, we'll cover the future roadmap for the Kernel project and what you can expect from it over the coming year.

Gaining Insight From Image Data in Databricks Using Multi-Modal Foundation Model API

Unlock the hidden potential in your image data without specialized computer vision expertise! This session explores how to leverage Databricks' multi-modal Foundation Model APIs to analyze, classify and extract insights from visual content. Learn how Databricks provides a unified API to understand images using powerful foundation models within your data workflows. Key takeaways:
Implementing efficient workflows for image data processing within your Databricks lakehouse
Understanding multi-modal foundation models for image understanding
Integrating image analysis with other data types for business insights
Using OpenAI-compatible APIs to query multi-modal models (see the sketch below)
Building end-to-end pipelines from image ingestion to model deployment
Whether analyzing product images, processing visual documents or building content moderation systems, you'll discover how to extract valuable insights from your image data within the Databricks ecosystem.
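As a sketch of the OpenAI-compatible pattern mentioned in the takeaways (not the session's code), the snippet below sends a base64-encoded image to a multi-modal model behind a Databricks serving endpoint; the endpoint/model name and file path are placeholders, so check your workspace for the models actually available.

```python
import base64
import os

from openai import OpenAI

# Databricks serving endpoints expose an OpenAI-compatible surface, so the
# standard OpenAI client works once pointed at the workspace.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

with open("product_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="my-multimodal-endpoint",  # placeholder endpoint name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the defects visible in this product photo."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```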