talk-data.com

Topic: SQL (Structured Query Language)

Tags: database_language, data_manipulation, data_definition, programming_language

1751 activities tagged

Activity Trend: peak of 107 activities/quarter, 2020-Q1 to 2026-Q1

Activities

1751 activities · Newest first

Retail Genie: No-Code AI Apps for Empowering BI Users to be Self-Sufficient

Explore how Databricks AI/BI Genie revolutionizes retail analytics, empowering business users to become self-reliant data explorers. This session highlights no-code AI apps that create a conversational interface for retail data analysis. Genie spaces harness NLP and generative AI to convert business questions into actionable insights, bypassing complex SQL queries. We'll showcase retail teams effortlessly analyzing sales trends, inventory and customer behavior through Genie's intuitive interface. Witness real-world examples of AI/BI Genie's adaptive learning, enhancing accuracy and relevance over time. Learn how this technology democratizes data access while maintaining governance via Unity Catalog integration. Discover Retail Genie's impact on decision-making, accelerating insights and cultivating a data-driven retail culture. Join us to see the future of accessible, intelligent retail analytics in action.

Selectively Overwrite Data With Delta Lake’s Dynamic Insert Overwrite

Dynamic Insert Overwrite is an important Delta Lake feature that allows fine-grained updates by selectively overwriting specific rows, eliminating the need for full-table rewrites. For example, this capability is essential for:
dbt-Databricks incremental models/workloads, enabling efficient data transformations by processing only new or updated records
ETL for Slowly Changing Dimension (SCD) Type 2
In this lightning talk, we will:
Introduce Dynamic Insert Overwrite: understand its functionality and how it works
Explore key use cases: learn how it optimizes performance and reduces costs
Share best practices: discover practical tips for leveraging this feature on Databricks, including on the cutting-edge Serverless SQL Warehouses
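As a rough sketch of what selective overwrites look like in Databricks SQL (the table and column names here are hypothetical, not from the talk), dynamic partition overwrite and a REPLACE WHERE predicate each rewrite only the affected slice of the table:

```sql
-- With dynamic partition overwrite enabled, only partitions present
-- in the incoming batch are rewritten, not the whole table.
SET spark.sql.sources.partitionOverwriteMode = dynamic;

INSERT OVERWRITE TABLE sales
SELECT * FROM sales_updates;

-- Alternatively, REPLACE WHERE overwrites only rows matching a predicate,
-- which is the pattern incremental models and SCD backfills rely on.
INSERT INTO sales
REPLACE WHERE event_date >= '2024-01-01'
SELECT * FROM sales_updates;
```

Both forms avoid a full-table rewrite; the predicate variant is the finer-grained of the two.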

Using Clean Rooms for Privacy-Centric Data Collaboration

Databricks Clean Rooms make privacy-safe collaboration possible for data, analytics, and AI — across clouds and platforms. Built on Delta Sharing, Clean Rooms enable organizations to securely share and analyze data together in a governed, isolated environment — without ever exposing raw data. In this session, you’ll learn how to get started with Databricks Clean Rooms and unlock advanced use cases including:
Cross-platform collaboration and joint analytics
Training machine learning and AI models
Enforcing custom privacy policies
Analyzing unstructured data
Incorporating proprietary libraries in Python and SQL notebooks
Auditing clean room activity for compliance
Whether you're a data scientist, engineer or data leader, this session will equip you to drive high-value collaboration while maintaining full control over data privacy and governance.

What’s New in Security and Compliance on the Databricks Data Intelligence Platform

In this session, we’ll walk through the latest advancements in platform security and compliance on Databricks — from networking updates to encryption, serverless security and new compliance certifications across AWS, Azure and Google Cloud. We’ll also share our roadmap and best practices for how to securely configure workloads on Databricks SQL Serverless, Unity Catalog, Mosaic AI and more — at scale. If you're building on Databricks and want to stay ahead of evolving risk and regulatory demands, this session is your guide.

Your Wish is AI Command — Get to Grips With Databricks Genie

Picture the scene: you're exploring a deep, dark cave looking for insights to unearth when, in a burst of smoke, Genie appears and offers you not three but unlimited data wishes. This isn't a folk tale; it's the growing wave of Generative BI that is going to be part of analytics platforms. Databricks Genie is a tool powered by a SQL-writing LLM that redefines how we interact with data. We'll look at the basics of creating a new Genie room, scoping its data tables and asking questions. We'll help it out with some complex pre-defined questions and ensure it has the best chance of success. We'll give the tool a personality, set some behavioural guidelines and prepare some hidden easter eggs for our users to discover. Generative BI is going to be a fundamental part of the analytics toolset used across businesses. If you're using Databricks, you should be aware of Genie; if you're not, you should be planning your Generative BI roadmap. Either way, this session will answer your wishes.

Summary
In this episode of the Data Engineering Podcast, Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like, so you can stop fighting over column names and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Starburst
Podcast Episode
AWS Athena
MCP == Model Context Protocol
LLM Tool Use
Vector Embeddings
RAG == Retrieval Augmented Generation
AI Engineering Podcast Episode
Starburst Data Products
Lance
LanceDB
Parquet
ORC
pgvector
Starburst Icehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Enterprise Cost Management for Data Warehousing with Databricks SQL

This session shows you how to gain visibility into your Databricks SQL spend and ensure cost efficiency. Learn about the latest features to gain detailed insights into Databricks SQL expenses so you can easily monitor and control your costs. Find out how you can enable attribution to internal projects, understand the Total Cost of Ownership, set up proactive controls and find ways to continually optimize your spend.
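As a hedged sketch of the kind of spend attribution described above, Databricks system billing tables can be rolled up by a custom tag; the tag key, column names, and product filter here are illustrative and should be checked against the system tables reference:

```sql
-- Daily DBU consumption for SQL workloads, attributed to an
-- internal project via a hypothetical 'project' custom tag.
SELECT
  usage_date,
  custom_tags['project'] AS project,
  SUM(usage_quantity)    AS dbus_consumed
FROM system.billing.usage
WHERE billing_origin_product = 'SQL'
GROUP BY usage_date, custom_tags['project']
ORDER BY usage_date;
```

Joining such usage data against list prices is the usual next step for estimating Total Cost of Ownership.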

Sponsored by: RowZero | Spreadsheets in the modern data stack: security, governance, AI, and self-serve analytics

Despite the proliferation of cloud data warehousing, BI tools, and AI, spreadsheets are still the most ubiquitous data tool. Business teams in finance, operations, sales, and marketing often need to analyze data in the cloud data warehouse but don't know SQL and don't want to learn BI tools. AI tools offer a new paradigm but still haven't broadly replaced the spreadsheet. With new AI tools and legacy BI tools providing business teams access to data inside Databricks, security and governance are put at risk. In this session, Row Zero CEO, Breck Fresen, will share examples and strategies data teams are using to support secure spreadsheet analysis at Fortune 500 companies and the future of spreadsheets in the world of AI. Breck is a former Principal Engineer from AWS S3 and was part of the team that wrote the S3 file system. He is an expert in storage, data infrastructure, cloud computing, and spreadsheets.

Turn Genie Into an Agent Using Conversation APIs

Transform your AI/BI Genie into a text-to-SQL powerhouse using the Genie Conversation APIs. This session explores how Genie functions as an intelligent agent, translating natural language queries into SQL to accelerate insights and enhance self-service analytics. You'll learn practical techniques for configuring agents, optimizing queries and handling errors — ensuring Genie delivers accurate, relevant responses in real time. A must-attend for teams looking to level up their AI/BI capabilities and deliver smarter analytics experiences.

AI Meets SQL: Leverage GenAI at Scale to Enrich Your Data

This session is repeated. Integrating AI into existing data workflows can be challenging, often requiring specialized knowledge and complex infrastructure. In this session, we'll share how SQL users can leverage AI/ML to access large language models (LLMs) and traditional machine learning directly from within SQL, simplifying the process of incorporating AI into data workflows. We will demonstrate how to use Databricks SQL for natural language processing, traditional machine learning, retrieval augmented generation and more. You'll learn about best practices and see examples of solving common use cases such as opinion mining, sentiment analysis, forecasting and other common AI/ML tasks.
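A minimal sketch of calling AI functions directly from Databricks SQL, in the spirit of what the session describes; the table name and model endpoint are illustrative assumptions, not taken from the session:

```sql
-- Built-in AI functions applied row-by-row from plain SQL.
-- 'product_reviews' and the endpoint name are hypothetical.
SELECT
  review_text,
  ai_analyze_sentiment(review_text) AS sentiment,
  ai_query(
    'databricks-meta-llama-3-1-70b-instruct',
    CONCAT('Summarize this product review in one sentence: ', review_text)
  ) AS summary
FROM product_reviews;
```

The appeal of this pattern is that no pipeline or notebook is needed: the LLM call is just another expression in the SELECT list.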

Real-Time Mode Technical Deep Dive: How We Built Sub-300 Millisecond Streaming Into Apache Spark™

Real-time mode is a new low-latency execution mode for Apache Spark™ Structured Streaming. It can consistently provide p99 latencies less than 300 milliseconds for a broad set of stateless and stateful streaming queries. Our talk focuses on the technical aspects of making this possible in Spark. We’ll dive into the core architecture that enables these dramatic latency improvements, including a concurrent stage scheduler and a non-blocking shuffle. We’ll explore how we maintained Spark’s fault-tolerance guarantees, and we’ll also share specific optimizations we made to our streaming SQL operators. These architectural improvements have already enabled Databricks customers to build workloads with latencies up to 10x lower than before. Early adopters in our Private Preview have successfully implemented real-time enrichment pipelines and feature engineering for machine learning — use cases that were previously impossible at these latencies.

Geospatial Insights With Databricks SQL: Techniques and Applications

Spatial data is increasingly important, but working with it can be complex. In this session, we’ll explore how Databricks SQL supports spatial analysis and helps analysts and engineers get more value from location-based data. We’ll cover what’s coming in the Public Preview of Spatial SQL, when and how to use the new Geometry and Geography data types, and practical use cases for H3. You’ll also learn about common challenges with spatial data and how we're addressing them, along with a look at the near-term roadmap.
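A small sketch of the H3 pattern mentioned above, assuming a hypothetical sensor_readings table with longitude/latitude columns (function name per the Databricks H3 reference):

```sql
-- Bucket points into H3 cells at resolution 8 (roughly neighborhood
-- scale) and count observations per cell for a density view.
SELECT
  h3_longlatash3(longitude, latitude, 8) AS h3_cell,
  COUNT(*)                               AS point_count
FROM sensor_readings
GROUP BY h3_cell
ORDER BY point_count DESC;
```

Aggregating on H3 cell IDs turns expensive point-in-polygon work into a cheap GROUP BY, which is why it pairs well with clustering and pre-computed tables.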

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to these requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and Git Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.

Geo-Powering Insights: The Art of Spatial Data Integration and Visualization

In this presentation, we will explore how to leverage Databricks' SQL engine to efficiently ingest and transform geospatial data. We'll demonstrate the seamless process of connecting to external systems such as ArcGIS to retrieve datasets, showcasing the platform's versatility in handling diverse data sources. We'll then delve into the power of Databricks Apps, illustrating how you can create custom geospatial dashboards using various frameworks like Streamlit and Flask, or any framework of your choice. This flexibility allows you to tailor your visualizations to your specific needs and preferences. Furthermore, we'll highlight the Databricks Lakehouse's integration capabilities with popular dashboarding tools such as Tableau and Power BI. This integration enables you to combine the robust data processing power of Databricks with the advanced visualization features of these specialized tools.

SQL-First ETL: Building Easy, Efficient Data Pipelines With Lakeflow Declarative Pipelines

This session explores how SQL-based ETL can accelerate development, simplify maintenance and make data transformation more accessible to both engineers and analysts. We'll walk through how Databricks Lakeflow Declarative Pipelines and Databricks SQL warehouse support building production-grade pipelines using familiar SQL constructs. Topics include:
Using streaming tables for real-time ingestion and processing
Leveraging materialized views to deliver fast, pre-computed datasets
Integrating with tools like dbt to manage batch and streaming workflows at scale
By the end of the session, you’ll understand how SQL-first approaches can streamline ETL development and support both operational and analytical use cases.
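A minimal sketch of the two SQL-first constructs named above, streaming tables and materialized views, with hypothetical catalog, path, and column names:

```sql
-- Streaming table: incremental ingestion of new JSON files from a
-- (hypothetical) volume path; only unprocessed files are read.
CREATE OR REFRESH STREAMING TABLE raw_orders
AS SELECT *
FROM STREAM read_files('/Volumes/main/default/orders/', format => 'json');

-- Materialized view: a pre-computed aggregate over the streaming
-- table, refreshed incrementally by the pipeline.
CREATE OR REFRESH MATERIALIZED VIEW daily_order_totals
AS SELECT order_date, SUM(amount) AS total_amount
FROM raw_orders
GROUP BY order_date;
```

The declarative style means you state the desired tables and the engine works out scheduling, incremental processing, and dependencies between them.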

Unifying Data Delivery: Using Databricks as Your Enterprise Serving Layer

This session will take you on our journey of integrating Databricks as the core serving layer in a large enterprise, demonstrating how you can build a unified data platform that meets diverse business needs. We will walk through the steps for constructing a central serving layer by leveraging Databricks’ SQL Warehouse to efficiently deliver data to analytics tools and downstream applications. To tackle low latency requirements, we’ll show you how to incorporate an interim scalable relational database layer that delivers sub-second performance for hot data scenarios. Additionally, we’ll explore how Delta Sharing enables secure and cost-effective data distribution beyond your organization, eliminating silos and unnecessary duplication for a truly end-to-end centralized solution. This session is perfect for data architects, engineers and decision-makers looking to unlock the full potential of Databricks as a centralized serving hub.

Using Catalogs for a Well-Governed and Efficient Data Ecosystem

The ability to enforce data management controls at scale and reduce the effort required to manage data pipelines is critical to operating efficiently. Capital One has scaled its data management capabilities and invested in platforms to help address this need. In the past couple of years, the role of “the catalog” in a data platform architecture has transitioned from just providing SQL to providing a full suite of capabilities that can help solve this problem at scale. This talk will give insight into how Capital One is thinking about leveraging Databricks Unity Catalog to help tackle these challenges.

Improving User Experience and Efficiency Using DBSQL

To scale Databricks SQL to 2,000 users efficiently and cost-effectively, we adopted serverless, ensuring dynamic scalability and resource optimization. During peak times, resources scale up automatically; during low demand, they scale down, preventing waste. Additionally, we implemented a strong content governance model. We created continuous monitoring to assess query and dashboard performance, notifying users about adjustments and ensuring only relevant content remains active. If a query exceeds time or impact limits, access is reviewed and, if necessary, deactivated. This approach brought greater efficiency, cost reduction and an improved user experience, keeping the platform well-organized and high-performing.
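Continuous monitoring of the kind described can be sketched against the Databricks query history system table; the table, column names, and the one-minute threshold below are illustrative assumptions to be checked against the system tables reference:

```sql
-- Surface the slowest statements from the past week so heavy queries
-- can be reviewed and, if necessary, deactivated.
SELECT
  executed_by,
  statement_text,
  total_duration_ms / 1000.0 AS duration_s
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
  AND total_duration_ms > 60000
ORDER BY total_duration_ms DESC
LIMIT 20;
```

Feeding a query like this into an alerting job is one way to notify users about needed adjustments before costs accumulate.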

De-Risking Investment Decisions: QCG's Smarter Deal Evaluation Process Leveraging Databricks

Quantum Capital Group (QCG) screens hundreds of deals across the global Sustainable Energy Ecosystem, requiring deep technical due diligence. With over 1.5 billion records sourced from public, premium and proprietary datasets, their challenge was how to efficiently curate, analyze and share this data to drive smarter investment decisions. QCG partnered with Databricks & Tiger Analytics to modernize its data landscape. Using Delta tables, Spark SQL, and Unity Catalog, the team built a golden dataset that powers proprietary evaluation models and automates complex workflows. Data is now seamlessly curated, enriched and distributed — both internally and to external stakeholders — in a secure, governed and scalable way. This session explores how QCG’s investment in data intelligence has turned an overwhelming volume of information into a competitive advantage, transforming deal evaluation into a faster, more strategic process.

Pushing the Limits of What Your Warehouse Can Do Using Python and Databricks

SQL warehouses in Databricks can run more than just SQL. Join this session to learn how to get more out of your SQL warehouses, and any tools built on top of them, by leveraging Python. After attending this session, you will be familiar with Python user-defined functions and how to bring in custom dependencies from PyPI or as a custom wheel, and even how to securely invoke cloud services with performance at scale.
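A minimal sketch of a Python UDF callable from a SQL warehouse; the function name and catalog path are hypothetical:

```sql
-- A Unity Catalog Python UDF: the body is plain Python with the SQL
-- parameters bound by name, callable from any SQL statement.
CREATE OR REPLACE FUNCTION main.default.redact_email(s STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
import re
return re.sub(r'[\w.+-]+@[\w-]+\.[\w.]+', '[REDACTED]', s or '')
$$;

-- Use it like any built-in function:
SELECT main.default.redact_email(contact_info) FROM customers;
```

Because the function lives in Unity Catalog, the same governance (grants, lineage) applies to it as to tables and views.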