talk-data.com

Topic: Spark (Apache Spark)
Tags: big_data, distributed_computing, analytics
Activity Trend: 71 peak/qtr (2020-Q1 to 2026-Q1)

Activities: 581 · Newest first

In this lab you'll help a coffee shop unify their operational and analytical workloads with Cosmos DB in Microsoft Fabric. You'll blend operational data with curated sources using cross-database SQL, stream and visualize real-time POS events, and create a gold layer for personalization. Finally, you'll implement reverse ETL to Cosmos for lightning-fast serving and train a lightweight Spark notebook model to deliver the right offer at the right time before your customer’s order is ready.

Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.

Real-time analytics and AI apps with Cosmos DB in Fabric

See how Cosmos DB in Fabric makes it easy to power AI-driven applications at scale. In this session, you’ll learn how to process customer data with Real-Time Intelligence, use Apache Spark to train ML-based recommendation engines, and combine vector search with Cosmos DB to deliver high-performance, personalized experiences. We’ll also cover collaborative filtering algorithms, blue-green deployments, User data functions, and Notebooks to build, test, and ship AI-enabled apps in real time.


I missed my parents, so I built an AI that talks like them. This isn’t about replacing people—it’s about remembering the voices that make us feel safe. In this 90-minute episode of Data & AI with Mukundan, we explore what happens when technology stops chasing efficiency and starts chasing empathy. Mukundan shares the story behind “What Would Mom & Dad Say?”, a Streamlit + GPT-4 experiment that generates comforting messages in the voice of loved ones. You’ll hear:

- The emotional spark that inspired the project
- The plain-English prompts anyone can use to teach AI empathy
- Boundaries & ethics of emotional AI
- How this project reframed loneliness, creativity, and connection

Takeaway: AI can’t love you—but it can remind you of the people who do. 🔗 Try the free reflection prompts below.

THE ONE-PROMPT VERSION: “What Would Mom & Dad Say?”

“You are speaking to me as one of my parents. Choose the tone I mention: either Mom (warm and reflective) or Dad (practical and encouraging). First, notice the emotion in what I tell you—fear, stress, guilt, joy, or confusion—and name it back to me so I feel heard. Then reply in 3 parts:

1. Start by validating what I’m feeling, in a caring way.
2. Share a short story, lesson, or perspective that fits the situation.
3. End with one hopeful or guiding question that helps me think forward.

Keep your words gentle, honest, and simple. No technical language. Speak like someone who loves me and wants me to feel calm and capable again.”

Join the Discussion (comments hub): https://mukundansankar.substack.com/notes

Tools I use for my Podcast and Affiliate Partners:
- Recording Partner: Riverside → Sign up here (affiliate)
- Host Your Podcast: RSS.com (affiliate)
- Research Tools: Sider.ai (affiliate)
- Sourcetable AI: Join Here (affiliate)

🔗 Connect with Me:
- Free Email Newsletter
- Website: Data & AI with Mukundan
- GitHub: https://github.com/mukund14
- Twitter/X: @sankarmukund475
- LinkedIn: Mukundan Sankar
- YouTube: Subscribe

Most AI pipelines still treat models like Python UDFs, just another function bolted onto Spark, Pandas, or Ray. But models aren’t functions: they’re expensive, stateful, and difficult to configure. In this talk, we’ll explore why this mental model breaks at scale and share practical patterns for treating models as first-class citizens in your pipelines.
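To make the anti-pattern concrete, here is a minimal, hypothetical PySpark sketch (the model class, load_expensive_model, and column names are stand-ins, not from the talk): instead of reloading a model inside a per-row UDF, the model is cached once per Python worker and reused across all batches via mapInPandas.

```python
# Hedged sketch: treat the model as long-lived state, not a per-row function.
from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello",), ("spark",)], ["text"])

_MODEL = None  # per-worker cache; survives across batches in one Python worker

def load_expensive_model():
    # Stand-in for an expensive, stateful load (e.g. reading weights from disk).
    class Model:
        def predict(self, texts):
            return [float(len(t)) for t in texts]
    return Model()

def get_model():
    global _MODEL
    if _MODEL is None:  # load once per executor process, not once per row
        _MODEL = load_expensive_model()
    return _MODEL

def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = get_model()  # the load cost is amortized over all batches
    for pdf in batches:
        pdf["score"] = model.predict(pdf["text"].tolist())
        yield pdf

scored = df.mapInPandas(predict, schema="text string, score double")
scored.show()
```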

PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built on Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining these two technologies unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver significant speedups over row-based Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be used with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.
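One way this combination can look in practice (a minimal sketch, assuming Spark 3.3+ for mapInArrow and Polars installed; the column names and the price adjustment are illustrative, not from the talk):

```python
# Hedged sketch: an Arrow-native transform in PySpark with Polars doing the
# work on each pyarrow.RecordBatch, avoiding row-by-row (de)serialization.
from typing import Iterator
import pyarrow as pa
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5)], "id long, price double")

def scale_prices(batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    for batch in batches:
        # Zero-copy: wrap the Arrow batch in a Polars DataFrame.
        pdf = pl.from_arrow(pa.Table.from_batches([batch]))
        out = pdf.with_columns((pl.col("price") * 1.1).alias("price"))
        # Hand the result back to Spark as Arrow record batches.
        yield from out.to_arrow().to_batches()

result = df.mapInArrow(scale_prices, schema="id long, price double")
result.show()
```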

talk
by Holden Karau (Fight Health Insurance)

In this talk, the somewhat biased Apache Spark PMC member Holden will explore the times when using Spark is more likely to lead to disappointment and pager alerts than success and promotions. We'll, of course, look at places where Spark can excel, but also explore heuristics like: if it fits in Excel, double-check whether you need Spark. By using Spark only when it's truly beneficial, you can demonstrate that elusive "thought leadership" that always seems to be required for the next level of promotion. We'll explore how some of Spark's largest disadvantages are changing, but also which ones are likely to stick around -- allowing you to seem like you have a magic tech eight-ball the next time someone asks you to design your analytics strategy. Come for a place to sit after lunch and stay for the OOM therapy.

Session: When Microsoft Fabric was released, it came with Apache Spark out of the box. Spark’s support for multiple programming languages opened up possibilities for creating data-driven and automated lakehouses. With Python Notebooks, we have a better tool for handling metadata, automation, and processing of more trivial workloads, while still having the option to use Spark Notebooks for more demanding processing. We will cover:

- The difference between Python Notebooks and a single-node Spark cluster, and why Spark Notebooks are more costly and less performant for certain types of workloads.
- When to use Python Notebooks and when to use Spark Notebooks.
- Where to use Python Notebooks in a metadata-driven Lakehouse.
- A brief introduction to tooling for moving workloads between Python Notebooks and Spark Notebooks.
- How to avoid overloading the Lakehouse tech stack with Python technologies.
- Costs.

In this demo, starting from a blank page and several data sources, we will go all the way to deploying a Data Analytics application augmented with LLMs, using these two products launched by OVHcloud in 2025.

OVHcloud DataPlatform: a unified solution that lets your teams manage your Data & Analytics projects end to end in self-service mode: from collecting all types of data, through exploration, storage, and transformation, to building dashboards shared via dedicated applications. A pay-as-you-go service to accelerate deployment and simplify the management of Data projects.

AI Endpoints: a serverless solution that lets developers easily integrate advanced AI capabilities into their applications. With more than 40 state-of-the-art open-source models, including LLMs and generative AI - for use cases such as conversational agents, voice models, code assistants, and more - AI Endpoints democratizes the use of AI, regardless of an organization’s size or sector.

All of this builds on the best open-source Data standards (Apache Iceberg, Spark, Superset, Trino, Jupyter Notebooks…) in environments that respect your technological sovereignty.

This presentation provides an overview of how NVIDIA RAPIDS accelerates data science and data engineering workflows end-to-end. Key topics include leveraging RAPIDS for machine learning, large-scale graph analytics, real-time inference, hyperparameter optimization, and ETL processes. Case studies demonstrate significant performance improvements and cost savings across various industries using RAPIDS for Apache Spark, XGBoost, cuML, and other GPU-accelerated tools. The talk emphasizes the impact of accelerated computing on modern enterprise applications, including LLMs, recommenders, and complex data processing pipelines.
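As a hedged illustration of the "RAPIDS for Apache Spark" piece (assuming the RAPIDS Accelerator jar is deployed on a GPU-equipped cluster; the app name and query are placeholders, not from the talk), enabling GPU acceleration is primarily a configuration change rather than a code rewrite:

```python
# Sketch: route supported Spark SQL/DataFrame operations onto the GPU via the
# RAPIDS Accelerator plugin. Requires the plugin jar and a CUDA-capable GPU.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-etl-sketch")
    # Load the RAPIDS Accelerator plugin (jar must be on the classpath).
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Toggle GPU execution on/off without changing query code.
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Ordinary Spark code; supported operators run GPU-accelerated transparently.
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```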

What does it really mean to be a data-driven organisation? In this session, Hazal Muhtar, Senior Director of Analytics at Wise, will share real-life examples of how customer data fuels product innovation, marketing and growth at one of the world’s leading fintech companies.

Hazal will explore how data is driving change across industries and reshaping the way businesses operate. She’ll also discuss the importance of data democratization—how giving teams across an organisation access to insights can spark innovation, accelerate decision-making and build a culture where data is a shared language.

Join this session to discover how Wise embraces data at scale, and leave with practical takeaways on how customer data can become your most powerful asset in unlocking transformation and growth.

Face To Face
by Cali Wood (AXA UK&I), Seana Tomlinson (Women in Data®), Katie Straker (Women in Data®), Caroline Carruthers (Carruthers and Jackson), Dr. Nadia Zaheer (Vanquis Banking Group)

Wherever you are in your career, this session will hopefully inspire you to take that next LEAP.  

Join us as we delve into mentorship circles, cross-disciplinary team challenges, and real-time feedback loops that accelerate personal skill growth and spark creative solutions to take back to your businesses and drive measurable impact.  

This focused 30-minute session will show you the power of the Women in Data® Leadership, Equity, Acceleration Programme (LEAP). We will explore its hands-on curriculum and how its dynamic peer network is empowering data professionals to collaborate, innovate, and lead.

Powered by: Women in Data®

In this talk, we will introduce Ordeq, a cutting-edge data pipeline development framework used by data engineers, scientists and analysts across ING. Ordeq helps you modularise pipeline logic and abstract IO, elevating projects from proof-of-concepts to maintainable production-level applications. We will demonstrate how Ordeq integrates seamlessly with popular data processing tools like Spark, Polars, Matplotlib, DSPy, and orchestration tools such as Airflow. Additionally, we will showcase how you can leverage Ordeq on public cloud offerings like GCP. Ordeq has zero dependencies and is available under the MIT license.

Discover how to build a powerful AI Lakehouse and unified data fabric natively on Google Cloud. Leverage BigQuery's serverless scale and robust analytics capabilities as the core, seamlessly integrating open data formats with Apache Iceberg and efficient processing using managed Spark environments like Dataproc. Explore the essential components of this modern data environment, including data architecture best practices, robust integration strategies, high data quality assurance, and efficient metadata management with Google Cloud Data Catalog. Learn how Google Cloud's comprehensive ecosystem accelerates advanced analytics, preparing your data for sophisticated machine learning initiatives and enabling direct connection to services like Vertex AI. 
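A rough sketch of the BigQuery-plus-managed-Spark integration described above (assumptions: a Dataproc cluster with the spark-bigquery connector installed; the project, dataset, table, and bucket names are placeholders):

```python
# Sketch: read BigQuery data into Spark on Dataproc, curate it, write it back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-lakehouse-sketch").getOrCreate()

events = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.events")  # placeholder table
    .load()
)

# Curate in Spark, then write back to BigQuery (needs a GCS staging bucket).
daily = events.groupBy("event_date").count()
(daily.write.format("bigquery")
    .option("table", "my_project.my_dataset.daily_counts")
    .option("temporaryGcsBucket", "my-staging-bucket")  # placeholder bucket
    .mode("overwrite")
    .save())
```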

So you’ve heard of Databricks, but you’re still not sure what all the fuss is about. Yes, you’ve heard it’s Spark, but then there’s this Delta thing that’s both a data lake and a data warehouse (isn’t that what Iceberg is?). And then there’s Unity Catalog, which isn’t just a catalog: it also does access management, and even surprising things like optimising your data and providing programmatic access to lineage and billing. But then serverless came out, and now you don’t even have to learn Spark? And of course there’s a bunch of AI stuff to use or create yourself. So why not spend 30 minutes learning the details of what Databricks does, and how it can turn you into a rockstar Data Engineer.