talk-data.com

Topic: Databricks

Tags: big_data, analytics, spark

Activity Trend: peak of 515 activities per quarter (2020-Q1 to 2026-Q1)

Activities: 1286 tagged activities · Newest first

It’s time for another episode of the Data Engineering Central Podcast. In this episode we cover:

* Apache Airflow vs Databricks Workflows
* End-of-Year Engineering Planning for 2025
* 10 Billion Row Challenge with DuckDB vs Daft vs Polars
* Raw Data Ingestion

As usual, the full episode is available to paid subscribers, and a shortened version to you freeloaders out there; don’t worry, I still love you though.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

In this episode, I had the pleasure of speaking with Ken Pickering, VP of Engineering at Going, about the intricacies of streaming data into a Trino and Iceberg lakehouse. Ken shared his journey from product engineering to becoming deeply involved in data-centric roles, highlighting his experiences in ecommerce and InsurTech. At Going, Ken leads the data platform team, focusing on finding travel deals for consumers, a task that involves handling massive volumes of flight data and event stream information.

Ken explained the dual approach of passive and active search strategies used by Going to manage the vast data landscape. Passive search involves aggregating data from global distribution systems, while active search is more transactional, querying specific flight prices. This approach helps Going sift through approximately 50 petabytes of data annually to identify the best travel deals.

We delved into the technical architecture supporting these operations, including the use of Confluent for data streaming, Starburst Galaxy for transformation, and Databricks for modeling. Ken emphasized the importance of an open lakehouse architecture, which allows for flexibility and scalability as the business grows.
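To make the streaming-ingestion pattern concrete, here is a minimal sketch of landing a Kafka (Confluent) topic into an Iceberg table with Spark Structured Streaming. The topic name, event schema, and table identifiers are hypothetical, and Going's actual pipeline is not public; this only illustrates the general shape of such a pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("flight-price-ingest").getOrCreate()

# Hypothetical schema for a flight-price event; the real payload is not public.
schema = StructType([
    StructField("origin", StringType()),
    StructField("destination", StringType()),
    StructField("price_usd", DoubleType()),
    StructField("observed_at", TimestampType()),
])

# Read the raw event stream from a Kafka/Confluent topic (names are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "flight-prices")
    .load()
)

# Parse the JSON payload and append it to an Iceberg table in the lakehouse.
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

(
    parsed.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/flight_prices")
    .toTable("lakehouse.bronze.flight_prices")
)
```

From a table like this, downstream engines (Trino via Starburst Galaxy, or Databricks for modeling) can all read the same Iceberg data, which is the flexibility Ken attributes to the open lakehouse approach.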

Ken also discussed the composition of Going's engineering and data teams, highlighting the collaborative nature of their work and the reliance on vendor tooling to streamline operations. He shared insights into the challenges and strategies of managing data life cycles, ensuring data quality, and maintaining uptime for consumer-facing applications.

Throughout our conversation, Ken provided a glimpse into the future of Going's data architecture, including potential expansions into other travel modes and the integration of large language models for enhanced customer interaction. This episode offers a comprehensive look at the complexities and innovations in building a data-driven travel advisory service.

Building Modern Data Applications Using Databricks Lakehouse

This book, "Building Modern Data Applications Using Databricks Lakehouse," provides a comprehensive guide for data professionals to master the Databricks platform. You'll learn to effectively build, deploy, and monitor robust data pipelines with Databricks' Delta Live Tables, empowering you to manage and optimize cloud-based data operations effortlessly. What this Book will help me do Understand the foundations and concepts of Delta Live Tables and its role in data pipeline development. Learn workflows to process and transform real-time and batch data efficiently using the Databricks lakehouse architecture. Master the implementation of Unity Catalog for governance and secure data access in modern data applications. Deploy and automate data pipeline changes using CI/CD, leveraging tools like Terraform and Databricks Asset Bundles. Gain advanced insights in monitoring data quality and performance, optimizing cloud costs, and managing DataOps tasks effectively. Author(s) Will Girten, the author, is a seasoned Solutions Architect at Databricks with over a decade of experience in data and AI systems. With a deep expertise in modern data architectures, Will is adept at simplifying complex topics and translating them into actionable knowledge. His books emphasize real-time application and offer clear, hands-on examples, making learning engaging and impactful. Who is it for? This book is geared towards data engineers, analysts, and DataOps professionals seeking efficient strategies to implement and maintain robust data pipelines. If you have a basic understanding of Python and Apache Spark and wish to delve deeper into the Databricks platform for streamlining workflows, this book is tailored for you.

Coalesce 2024: How Riot Games is building player-first gaming experiences with Databricks and dbt

Riot Games, creator of hit titles like League of Legends and Valorant, is building the ultimate gaming experience by using data and AI to deliver optimal player journeys. In this session, you'll learn how Riot's data platform team paired with analytics engineering, machine learning, and insights teams to integrate the Databricks Data Intelligence Platform and dbt Cloud, significantly maturing its data capabilities. The outcome: a scalable, collaborative analytics environment that serves millions of players worldwide.

You’ll hear how Riot Games:

- Centralized petabytes of game telemetry on Databricks for fast processing and analytics (see the sketch after this list)
- Modernized their data platform by integrating dbt Cloud, unlocking governance for modular, version-controlled data transformations and testing for a diverse set of user personas
- Uses generative AI to automate the enforcement of good documentation and quality code, and plans to use Databricks AI to further speed up its ability to unlock the value of data
- Deployed machine learning models for personalized recommendations and player behavior analysis
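As a rough illustration of the telemetry-centralization pattern above (not Riot's actual code; table, event, and column names are invented), a batch rollup over a Delta table of match events might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table holding raw match telemetry, one row per in-game event.
events = spark.read.table("telemetry.raw.match_events")

# Roll events up to per-player, per-day engagement metrics.
daily_player_stats = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("player_id", "event_date")
    .agg(
        F.countDistinct("match_id").alias("matches_played"),
        F.count("*").alias("events_emitted"),
        F.sum((F.col("event_type") == "win").cast("int")).alias("wins"),
    )
)

# Persist as a Delta table for downstream dbt models and ML features.
(
    daily_player_stats.write.format("delta")
    .mode("overwrite")
    .saveAsTable("telemetry.gold.daily_player_stats")
)
```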

You'll come away with practical insights on architecting a modern data stack that can handle massive scale while empowering teams across the organization. Whether you're in gaming or any data-intensive industry, you'll learn valuable lessons from Riot's journey to build a world-class data platform.

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

It’s time for another episode of the Data Engineering Central Podcast, our third one! Topics in this episode:

* Should you use DuckDB or Polars?
* Small Engineering Changes (PR Reviews)
* Daft vs Spark on Databricks with Unity Catalog (Delta Lake)
* Primary and Foreign Keys in the Lakehouse

Enjoy!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Coalesce 2024: Empowering data-driven success: How Chick-fil-A deploys dbt at scale

Dive into Chick-fil-A's data-centric culture, where efficient data transformation is pivotal for operational success. With data serving as a cornerstone, teams must rapidly process, clean, and validate vast volumes of data across diverse departments. Leveraging a robust toolkit including dbt and Databricks, Chick-fil-A is managing over 30 dbt projects seamlessly. Hear from the Chick-fil-A analytics engineering team as they explore the intricacies of scaling data transformation initiatives and the pivotal role dbt plays in overcoming challenges inherent in managing data at such scale. Gain insights into the specific obstacles addressed by dbt and how its features empower Chick-fil-A's teams to navigate complex data landscapes with agility and precision.
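At this scale, transformations live in dbt models. As a minimal, hypothetical sketch (not from Chick-fil-A's actual projects), a dbt Python model running on Databricks might look like this:

```python
# models/daily_sales.py
# A minimal dbt Python model as it might run on Databricks. Model, source,
# and column names here are hypothetical.
from pyspark.sql import functions as F

def model(dbt, session):
    # Materialize as a table; dbt handles the DDL against the warehouse.
    dbt.config(materialized="table")

    # dbt.ref() resolves another model in the project and returns a DataFrame.
    orders = dbt.ref("stg_orders")

    # Aggregate order totals per restaurant per day.
    return (
        orders.groupBy("restaurant_id", "order_date")
        .agg(F.sum("order_total").alias("daily_sales"))
    )
```

Because each model declares its upstream dependencies via ref(), dbt can build a correct run order across all 30+ projects, which is part of how teams scale transformation work without stepping on each other.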

Speakers:

Tony Yuan, Senior Principal Team Leader, Chick-fil-A

Dustin Dorsey, Principal Data Architect, Onix

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Businesses are collecting more data than ever before. But is bigger always better? Many companies are starting to question whether massive datasets and complex infrastructure are truly delivering results or just adding unnecessary costs and complications. How can you make sure your data strategy is aligned with your actual needs? What if focusing on smaller, more manageable datasets could improve your efficiency and save resources, all while delivering the same insights?

Ryan Boyd is the Co-Founder & VP, Marketing + DevRel at MotherDuck. Ryan started his career as a software engineer, but has since led DevRel teams for 15+ years at Google, Databricks, and Neo4j, where he developed and executed numerous marketing and DevRel programs. Prior to MotherDuck, Ryan worked at Databricks, where he focused the team on building an online community during the pandemic: helping to organize the content and experience for an online Data + AI Summit, establishing a regular cadence of video and blog content, launching the Databricks Beacons ambassador program, improving the time to an “aha” moment in the online trial, and launching a University Alliance program to help professors teach the latest in data science, machine learning, and data engineering.

In the episode, Richie and Ryan explore data growth and computation, the data 1%, the small data movement, data storage and usage, the shift to local and hybrid computing, modern data tools, the challenges of big data, transactional vs analytical databases, SQL language enhancements, simple and ergonomic data solutions, and much more.

Links Mentioned in the Show:
- MotherDuck
- The Small Data Manifesto
- Connect with Ryan
- Small Data SF conference
- Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
- Rewatch sessions from RADAR: AI Edition

New to DataCamp?
- Learn on the go using the DataCamp mobile app
- Empower your business with world-class data and AI skills with DataCamp for business
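To make the small-data argument concrete, here is a minimal DuckDB example, the kind of laptop-scale analytics the episode discusses (the Parquet file name is a placeholder):

```python
import duckdb

# Query a local Parquet file directly: no cluster, no ingestion pipeline.
con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT region,
           COUNT(*)    AS orders,
           SUM(amount) AS revenue
    FROM 'orders.parquet'
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchdf()  # returns a pandas DataFrame

print(result)
```

For datasets that fit on a single machine, which is most of them, this is often all the infrastructure an analytical workload needs.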

Databricks Data Intelligence Platform: Unlocking the GenAI Revolution

This book is your comprehensive guide to building robust Generative AI solutions using the Databricks Data Intelligence Platform. Databricks is the fastest-growing data platform offering unified analytics and AI capabilities within a single governance framework, enabling organizations to streamline their data processing workflows, from ingestion to visualization. Additionally, Databricks provides features to train a high-quality large language model (LLM), whether you are looking for Retrieval-Augmented Generation (RAG) or fine-tuning. Databricks offers a scalable and efficient solution for processing large volumes of both structured and unstructured data, facilitating advanced analytics, machine learning, and real-time processing. In today's GenAI world, Databricks plays a crucial role in empowering organizations to extract value from their data effectively, driving innovation and gaining a competitive edge in the digital age.

This book will not only help you master the Data Intelligence Platform but also help power your enterprise to the next level with a bespoke LLM unique to your organization. Beginning with foundational principles, the book starts with a platform overview and explores features and best practices for ingestion, transformation, and storage with Delta Lake. Advanced topics include leveraging Databricks SQL for querying and visualizing large datasets, ensuring data governance and security with Unity Catalog, and deploying machine learning and LLMs using Databricks MLflow for GenAI. Through practical examples, insights, and best practices, this book equips solution architects and data engineers with the knowledge to design and implement scalable data solutions, making it an indispensable resource for modern enterprises.

Whether you are new to Databricks and trying to learn a new platform, a seasoned practitioner building data pipelines, data science models, or GenAI applications, or even an executive who wants to communicate the value of Databricks to customers, this book is for you. With its extensive feature and best practice deep dives, it also serves as an excellent reference guide if you are preparing for Databricks certification exams.

What You Will Learn:

- Foundational principles of Lakehouse architecture
- Key features including Unity Catalog, Databricks SQL (DBSQL), and Delta Live Tables
- The Databricks Data Intelligence Platform and its key functionalities
- Building and deploying GenAI applications, from data ingestion to model serving
- Databricks pricing, platform security, DBRX, and many more topics

Who This Book Is For: Solution architects, data engineers, data scientists, Databricks practitioners, and anyone who wants to deploy their GenAI solutions with the Data Intelligence Platform. It is also a handbook for senior execs who need to communicate the value of Databricks to customers. People who are new to the Databricks platform and want comprehensive insights will find the book accessible.
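As one concrete flavor of the MLflow-for-GenAI workflow the book describes, here is a minimal sketch of logging a model and registering it in Unity Catalog. The catalog, schema, and model names are placeholders, and the model itself is a toy stand-in.

```python
import mlflow

# Register models in Unity Catalog rather than the workspace registry.
mlflow.set_registry_uri("databricks-uc")

class EchoModel(mlflow.pyfunc.PythonModel):
    """Toy stand-in for a real GenAI model wrapper."""
    def predict(self, context, model_input):
        return [f"echo: {x}" for x in model_input["prompt"]]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=EchoModel(),
        registered_model_name="main.genai.echo_model",  # catalog.schema.model
    )
```

Registering in Unity Catalog puts the model under the same governance umbrella as the data, which is the pattern the book's governance chapters build on.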

In his keynote talks at the Snowflake and Databricks summits this year, Jensen Huang, the founder and CEO of NVIDIA, talked at length about how, to compete today, organizations have to build data flywheels: take your proprietary business data, use AI on that data to build proprietary intelligence, use that intelligence to build proprietary products and services your customers love, and use those to create more proprietary data that feeds the AIs to build yet more proprietary intelligence, and so on.

But what does this mean in practice? Jensen's example of NVIDIA is intriguing, but how can the rest of us build data flywheels in our own organizations? What practical steps can we take?

In this talk, Yali Sassoon, Snowplow co-founder and CPO, will start to answer these questions, drawing on examples from Snowplow customers in retail, media, and technology that have successfully built customer data flywheels on top of their proprietary first-party customer data.

Join Scott Gamester as he challenges the outdated promises of legacy BI and self-service analytics tools. This session will explore the key issues that have hindered true data-driven decision-making and how modern solutions like Sigma Computing, Databricks, and Snowflake are redefining the landscape. Scott will demonstrate how integrating these platforms empowers business analysts, driving innovation at the edge and enabling AI-enhanced insights. Attendees will learn how these advancements are transforming business empowerment and fostering a new era of creativity and efficiency in analytics.

Have you ever received 50 emails from an Executive asking why a number changed on a business reporting dashboard? Come and hear how Databricks customers are experimenting with Gen BI to explain “The Why”.

• IT and data teams build dashboards, but the business continues to use Excel: the last-mile problem

• Our analytics tools don’t explain “The Why”

• GenAI allows us to blend context and semantics in a way we could never previously scale

We call this Generative BI, enabled by the Databricks Data Intelligence Platform.

Welcome to the Data Engineering Central Podcast, a no-holds-barred discussion on the Data Landscape. Welcome to Episode 01! In today’s episode we will talk about the following topics from the Data Engineering perspective:

* Snowflake vs Databricks
* Is Apache Spark being replaced?
* Notebooks in Production. Bad.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Perhaps the biggest complaint about generative AI is hallucination. If the text you want to generate involves facts (for example, a chatbot that answers questions), then hallucination is a problem. The solution is a technique called retrieval-augmented generation (RAG): you store facts in a vector database and retrieve the most appropriate ones to send to the large language model, helping it give accurate responses. So what goes into building vector databases, and how do they improve LLM performance so much?

Ram Sriharsha is currently the CTO at Pinecone. Before this role, he was the Director of Engineering at Pinecone and previously served as Vice President of Engineering at Splunk. He also worked as a Product Manager at Databricks. With a long history in the software development industry, Ram has held positions as an architect, lead product developer, and senior software engineer at various companies. Ram is also a long-time contributor to Apache Spark.

In the episode, Richie and Ram explore common use cases for vector databases, RAG in chatbots, steps to create a chatbot, static vs dynamic data, testing chatbot success, handling dynamic data, choosing language models, knowledge graphs, implementing vector databases, innovations in vector databases, the future of LLMs, and much more.

Links Mentioned in the Show:
- Pinecone
- Webinar - Charting the Path: What the Future Holds for Generative AI
- Course - Vector Databases for Embeddings with Pinecone
- Related Episode: The Power of Vector Databases and Semantic Search with Elan Dekel, VP of Product at Pinecone
- Rewatch sessions from RADAR: AI Edition

New to DataCamp?
- Learn on the go using the DataCamp mobile app
- Empower your business with world-class data and AI skills with DataCamp for business
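As a minimal sketch of the retrieval step described above, using the Pinecone Python client with a local embedding model (the index name, metadata field, and question are hypothetical, and the index is assumed to have been populated with vectors from the same embedding model):

```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("support-docs")  # hypothetical index of fact snippets

# The index must contain vectors produced by this same model (384 dimensions).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

question = "What is our refund policy?"

# Retrieve the stored facts most relevant to the question.
results = index.query(
    vector=embedder.encode(question).tolist(),
    top_k=3,
    include_metadata=True,
)

# Assemble the retrieved facts into context for the LLM prompt.
context = "\n".join(m["metadata"]["text"] for m in results["matches"])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Grounding the prompt in retrieved facts is what lets the LLM answer accurately instead of hallucinating.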

Databricks Customers at Data + AI Summit

At this year's event, over 250 customers shared their data and AI journeys. They showcased a wide variety of use cases, best practices, and lessons from their leadership and innovation with the latest data and AI technologies.

See how enterprises are leveraging generative AI in their data operations and how innovative data management and data governance are fueling organizations as they race to develop GenAI applications. https://www.databricks.com/blog/how-real-world-enterprises-are-leveraging-generative-ai

To see more real-world use cases and customer success stories, visit: https://www.databricks.com/customers

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don’t), where industry insights meet laid-back banter. Whether you’re a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let’s get into the heart of data, unplugged style!

In this episode, join us along with guests Vitale and David as we explore:

- Euro 2024 Predictions with AI: Using Snowflake's machine learning models for data-driven predictions and sharing our own predictions. Can animals predict wins better than ML models?
- Tech in football: From VAR to connected ball technology, is it all a good idea?
- Nvidia overtaking Apple and Microsoft as the biggest tech corporation: Discussing Nvidia's leap to surpass Apple and Microsoft, and the implications for the GPU market and AI development.
- Unity Catalog vs. Polaris: Comparing Unity+Delta with Polaris+Iceberg and their roles in data cataloging and management. Explore the details on GitHub Unity Catalog, YouTube, and insights on LinkedIn.
- Databricks Data and AI Summit recap: Discussing the biggest announcements from the summit, including Mosaic AI integration, serverless options, and the open-source Unity Catalog.
- Exploring BM25: Discussing the BM25 algorithm and its advancements over traditional TF-IDF for document classification (a tiny scoring sketch follows below).
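For a hands-on feel for that last topic, here is a tiny BM25 scoring sketch using the rank_bm25 library (the toy corpus is invented):

```python
from rank_bm25 import BM25Okapi

# Toy corpus: three tokenized "documents".
corpus = [
    "unity catalog governs tables in the lakehouse",
    "bm25 improves on tf-idf with term saturation and length normalization",
    "snowflake models predicted euro 2024 outcomes",
]
tokenized = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized)

# Score every document against a query; higher means more relevant.
query = "bm25 vs tf-idf ranking".split()
print(bm25.get_scores(query))
```

Unlike raw TF-IDF, BM25's term-frequency saturation and document-length normalization keep long or repetitive documents from dominating the ranking.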