talk-data.com

Topic

API

Application Programming Interface (API)

integration software_development data_exchange

856

tagged

Activity Trend

65 peak/qtr
2020-Q1 2026-Q1

Activities

856 activities · Newest first

Learn Live: Build an AI-enabled chat w/ Azure OpenAI & Azure Cosmos DB | BRK404LL

Connect an existing ASP.NET Core Blazor web application to Azure Cosmos DB for NoSQL and Azure OpenAI using their .NET SDKs. Your code manages and queries items in an API for NoSQL container. Your code also sends prompts to Azure OpenAI and parses the responses. This LIVE session is presented by two experts, and our moderators will answer your questions directly in the chat.

Speakers: * B Jb * Julia Muiruri * Tim Fish * Akah Mandela Munab * Dmitrii Solovev * Jay Gordon * Julian Sharp * Konstantin Berezovsky * DE Producer 9 * Olivia Guzzardo

Session Information: This video is one of many sessions delivered for the Microsoft Ignite 2023 event. View sessions on-demand and learn more about Microsoft Ignite at https://ignite.microsoft.com

BRK404LL | English (US) | Data

MSIgnite

The need for an independent semantic layer continues to rise as data science gains traction in the enterprise. Its five primary elements—metrics, caching, metadata management, APIs, and access controls—support AI/ML use cases as part of data science projects. Published at: https://www.eckerson.com/articles/why-and-how-to-enable-data-science-with-an-independent-semantic-layer

Central application for all your dbt packages - Coalesce 2023

dbt packages are libraries for dbt. Packages can produce information about best practices for your dbt project (e.g., dbt project evaluator) and cloud warehouse cost overviews. Unfortunately, all these KPIs are stored in your data warehouse, and it can be painful and expensive to create data visualization dashboards. This application automatically builds dashboards from the dbt packages you are using. You just need to configure your dbt Cloud API key - that's it! In this session, you'll learn how.

Speaker: Adrien Boutreau, Head of Analytics Engineers, Infinite Lambda

Register for Coalesce at https://coalesce.getdbt.com
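The "configure your dbt Cloud API key" step boils down to an authenticated call against the dbt Cloud Administrative API. A minimal sketch in Python, using only the standard library (the account ID and token below are placeholders, and the jobs endpoint is just one example of what such an app would query):

```python
# Hypothetical sketch: querying the dbt Cloud Administrative API (v2) with a
# service token. Account ID and token values are placeholders.
import urllib.request

DBT_CLOUD_HOST = "https://cloud.getdbt.com"

def build_jobs_request(account_id: int, api_token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the jobs in a dbt Cloud account."""
    url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{account_id}/jobs/"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Token {api_token}",
            "Accept": "application/json",
        },
    )

# Sending it (requires a real account ID and service token):
#   import json
#   with urllib.request.urlopen(build_jobs_request(12345, "dbtc_...")) as resp:
#       jobs = json.load(resp)["data"]
```

The v2 API authenticates with a `Token <api-key>` header; the interesting payload sits under the `data` key of the JSON response.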

Summary

Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

As more people start using AI for projects, two things are clear: it’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.

Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?
What are the notable changes to the Decodable platform since we last spoke? (October 2021)
What are the industry shifts that have influenced the product direction?
What are the problems that customers are trying to solve when they come to Decodable?
When you launched, your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?
What are the developer experience challenges that are particular to working with streaming data?
How have you worked to address that in the Decodable platform and interfaces?
As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?

Contact Info

esammer on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Decodable

Podcast Episode

Understanding the Apache Flink Journey Flink

Podcast Episode

Debezium

Podcast Episode

Kafka Redpanda

Podcast Episode

Kinesis PostgreSQL

Podcast Episode

Snowflake

Podcast Episode

Databricks Startree Pinot

Podcast Episode

Rockset

Podcast Episode

Druid InfluxDB Samza Storm Pulsar

Podcast Episode

ksqlDB

Podcast Episode

dbt GitHub Actions Airbyte Singer Splunk Outbox Pattern

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Neo4j: NODES Conference

NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:
- Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript
- Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)
- Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)

Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to Neo4j.com/NODES today to see the full agenda and register!

Rudderstack:

Materialize:

You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.

That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.

Go to materialize.com today and get 2 weeks free!

Datafold:


Hands-On Web Scraping with Python - Second Edition

In "Hands-On Web Scraping with Python," you'll learn how to harness the power of Python libraries to extract, process, and analyze data from the web. This book provides a practical, step-by-step guide for beginners and data enthusiasts alike.

What this Book will help me do:
- Master the use of Python libraries like requests, lxml, Scrapy, and Beautiful Soup for web scraping.
- Develop advanced techniques for secure browsing and data extraction using APIs and Selenium.
- Understand the principles behind regex and PDF data parsing for comprehensive scraping.
- Analyze and visualize data using data science tools such as pandas and Plotly.
- Build a portfolio of real-world scraping projects to demonstrate your capabilities.

Author(s): Anish Chapagain, the author of "Hands-On Web Scraping with Python," is an experienced programmer and instructor who specializes in Python and data-related technologies. With his vast experience in teaching individuals from diverse backgrounds, Anish approaches complex concepts with clarity and a hands-on methodology.

Who is it for? This book is perfect for aspiring data scientists, Python beginners, and anyone who wants to delve into web scraping. Readers should have a basic understanding of how websites work, but no prior coding experience is required. If you aim to develop scraping skills and understand data analysis, this book is the ideal starting point.
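The core extraction loop the book teaches, pulling structured records out of HTML, can be sketched with nothing but the standard library's html.parser (the book itself reaches for requests and Beautiful Soup; the HTML below is a made-up example):

```python
# A minimal, stdlib-only sketch of HTML extraction: collect every link's
# href and text from a document.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []          # list of (href, text) pairs
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links.append((self._current_href, "".join(self._text_parts).strip()))
            self._current_href = None

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → [('/a', 'First'), ('/b', 'Second')]
```

Libraries like Beautiful Soup wrap this event-driven parsing behind a much friendlier tree-traversal API, which is what the book builds on.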

Introduction to Integration Suite Capabilities: Learn SAP API Management, Open Connectors, Integration Advisor and Trading Partner Management

Discover the power of SAP Integration Suite's capabilities with this hands-on guide. Learn how this integration platform (iPaaS) can help you connect and automate your business processes with integrations, connectors, APIs, and best practices for a faster ROI. Over the course of this book, you will explore the powerful capabilities of SAP Integration Suite, including API Management, Open Connectors, Integration Advisor, Trading Partner Management, Migration Assessment, and Integration Assessment. With detailed explanations and real-world examples, this book is the perfect resource for anyone looking to unlock the full potential of SAP Integration Suite. With each chapter, you'll gain a greater understanding of why SAP Integration Suite can be the proverbial Swiss Army knife in your toolkit for designing and developing enterprise integration scenarios, offering simplified integration, security, and governance for your applications. Author Jaspreet Bagga demonstrates how to create, publish, and monitor APIs with SAP API Management, and how to use its features to enhance your API lifecycle. He also provides a detailed walkthrough of how other capabilities of SAP Integration Suite can streamline your connectivity, design, development, and architecture methodology with a tool-based approach completely managed by SAP. Whether you are a developer, an architect, or a business user, this book will help you unlock the potential of SAP's Integration Suite platform and API Management, and accelerate your digital transformation.
What You Will Learn:
- Understand what APIs are, what they are used for, and why they are crucial for building effective and reliable applications
- Gain an understanding of SAP Integration Suite's features and benefits
- Study the SAP Integration assessment process, patterns, and much more
- Explore tools and capabilities other than Cloud Integration that address the full value chain of the enterprise integration components

Who This Book Is For: Web developers and application leads who want to learn SAP API Management.

Building Data Science Applications with FastAPI - Second Edition

Building Data Science Applications with FastAPI is your comprehensive guide to mastering the FastAPI framework to build efficient, reliable data science applications and APIs. You'll explore examples and projects that integrate machine learning models, manage databases, and leverage advanced FastAPI features like asynchronous I/O and WebSockets.

What this Book will help me do:
- Develop an understanding of the fundamentals and advanced features of the FastAPI framework, like dependency injection and type hinting.
- Learn how to integrate machine learning models into a FastAPI-based web backend effectively.
- Master concepts of authentication, database connections, and asynchronous programming in Python.
- Build and deploy two practical AI applications: a real-time object detection tool and a text-to-image generator.
- Acquire skills to monitor, log, and maintain software systems for optimal performance and reliability.

Author(s): François Voron is an experienced Python developer and data scientist with extensive knowledge of web frameworks including FastAPI. With years of experience designing and deploying machine learning and data science applications, François focuses on empowering developers with practical techniques and real-world applications. His guidance helps readers tackle contemporary challenges in software development.

Who is it for? This book is ideal for data scientists and software engineers looking to broaden their skillset by creating robust web APIs for data science applications. Readers are expected to have a working knowledge of Python and basic data science concepts, offering them a chance to expand into backend development. If you're keen to deploy machine learning models and integrate them seamlessly with web technologies, this book is for you. It provides both fundamental insights and advanced techniques to serve a broad range of learners.

We talked about:

Meryem's background
The constant evolution of startups
How Meryem became interested in LLMs
What is an LLM (generative vs non-generative models)?
Why LLMs are important
Open source models vs API models
What TitanML does
How fine-tuning a model helps in LLM use cases
Fine-tuning generative models
How generative models change the landscape of human work
How to adjust models over time
Vector databases and LLMs
How to choose an open source LLM or an API
Measuring input data quality
Meryem's resource recommendations

Links:

Website: https://www.titanml.co/ Beta docs: https://titanml.gitbook.io/iris-documentation/overview/guide-to-titanml... Using llama2.0 in TitanML Blog: https://medium.com/@TitanML/the-easiest-way-to-fine-tune-and-inference-llama-2-0-8d8900a57d57 Discord: https://discord.gg/83RmHTjZgf Meryem LinkedIn: https://www.linkedin.com/in/meryemarik/

Free MLOps course: https://github.com/DataTalksClub/mlops-zoomcamp Join DataTalks.Club: https://datatalks.club/slack.html Our events: https://datatalks.club/events.html

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

JetBlue has embarked over the past year on an AI and ML transformation. Databricks has been instrumental in this transformation due to the ability to integrate streaming pipelines, ML training using MLflow, ML API serving using the model registry, and more in one cohesive platform. Real-time streams of weather, aircraft sensors, FAA data feeds, JetBlue operations, and more feed the world's first AI and ML operating system orchestrating a digital twin, known as BlueSky, for efficient and safe operations. JetBlue has over 10 ML products (multiple models each) in production across multiple verticals, including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP, and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML engineering and Business Intelligence. To facilitate the rapid growth and faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features to empower data scientists including advanced analysts with the ability to train and deploy ML models in less than five lines of code.

Talk by: Derrick Olson and Rob Bajra

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Rapidly Implementing Major Retailer API at the Hershey Company

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven Lakehouse Architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business, covering key areas such as sales, supply chain, and retail field execution, and the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Auto Loader ingesting and shaping our data, Delta streaming processing the data through the lakehouse, and Databricks SQL providing a responsive serving layer. The session will include business commentary as well as cover the technical journey.

Talk by: Simon Whiteley and Jordan Donmoyer

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in the building at any one time, so we can target health and comfort conditions more suitably to areas that are more densely populated. Many of these buildings have Cisco IT infrastructure in them. Using their Wi-Fi access points and the RSSI signal strength from people’s laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data as a real-time streaming source. Honeywell HBT has utilized this stream of data by writing Delta Live Tables pipelines to consume this data source.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide this occupancy data as a service to our vertical offerings in commercial, health, real estate and education. We will discuss the benefits of using DLT to handle this sort of incoming stream data, and illustrate the pain points we had and the resolutions we undertook in successfully receiving the stream of Cisco data. We will illustrate how our DLT pipeline was designed, and how it scaled to deal with huge quantities of real-time streaming data.
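The aggregation at the heart of such a pipeline, counting distinct devices per area from the RSSI-derived events, is ordinary logic; the DLT-specific part is mostly declaration. A sketch with hypothetical field names (area_id, device_id):

```python
# Counting distinct devices per area -- the core occupancy computation.
# Field names here are hypothetical, not the actual Cisco Spaces schema.
from collections import defaultdict

def occupancy_by_area(readings):
    """readings: iterable of dicts with 'area_id' and 'device_id' keys."""
    devices = defaultdict(set)
    for r in readings:
        devices[r["area_id"]].add(r["device_id"])
    return {area: len(seen) for area, seen in devices.items()}

# In a Databricks notebook, the same transformation would be declared as a
# Delta Live Tables table, roughly:
#
#   import dlt
#   from pyspark.sql import functions as F
#
#   @dlt.table
#   def occupancy():
#       return (dlt.read_stream("cisco_spaces_events")
#               .groupBy("area_id")
#               .agg(F.approx_count_distinct("device_id").alias("occupancy")))

print(occupancy_by_area([
    {"area_id": "lobby", "device_id": "a"},
    {"area_id": "lobby", "device_id": "b"},
    {"area_id": "lab", "device_id": "a"},
]))  # → {'lobby': 2, 'lab': 1}
```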

Talk by: Paul Mracek and Chris Inkpen

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

D-Lite: Integrating a Lightweight ChatGPT-Like Model Based on Dolly into Organizational Workflows

DLite is a new instruction-following model developed by AI Squared by fine-tuning the smallest GPT-2 model on the Alpaca dataset. Despite having only 124 million parameters, DLite exhibits impressive ChatGPT-like interactivity and can be fine-tuned on a single T4 GPU for less than $15.00. Due to its small relative size, DLite can be run locally on a wide variety of compute environments, including laptop CPUs, and can be used without sending data to any third-party API. This lightweight property of DLite makes it highly accessible for personal use, empowering users to integrate machine learning models and advanced analytics into their workflows quickly, securely, and cost-effectively.

Leveraging DLite within AI Squared's platform can empower organizations to orchestrate the integration of Dolly/DLite into business workflows, creating personalized versions of Dolly/DLite, chaining models or analytics to contextualize Dolly/DLite responses/prompts, and curating new datasets leveraging real-time feedback.
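Instruction-tuned models like DLite expect prompts in the format they were fine-tuned on; for Alpaca-style data that is the published Alpaca template. A sketch (the model name in the comment is illustrative, and running it requires the transformers package):

```python
# Formatting an instruction into the standard Alpaca prompt template, the
# format DLite-style models are fine-tuned on.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_prompt(instruction: str) -> str:
    return ALPACA_TEMPLATE.format(instruction=instruction)

# Local inference with a small model might then look like (requires the
# `transformers` package; the checkpoint name is illustrative):
#
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="aisquared/dlite-v2-124m")
#   print(generator(format_prompt("Summarize what an API is."))[0]["generated_text"])
```

Because the model only has 124M parameters, that pipeline call is feasible on a laptop CPU, which is the accessibility point the talk emphasizes.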

Talk by: Jacob Renn and Ian Sotnek

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Event Driven Real-Time Supply Chain Ecosystem Powered by Lakehouse

As the backbone of Australia’s supply chain, the Australian Rail Track Corporation (ARTC) plays a vital role in the management and monitoring of goods transportation across 8,500 km of its rail network throughout Australia. ARTC provides weighbridges along its track which read train weights as trains pass at speeds of up to 60 kilometers an hour. This information is highly valuable and is required both by ARTC and its customers to provide accurate haulage weight details, analyze technical equipment, and help ensure wagons have been loaded correctly.

A total of 750 trains run across a network of 8,500 km in a day and generate real-time data at approximately 50 sensor platforms. With the help of Structured Streaming and Delta Lake, ARTC was able to analyze and store:

  • Precise train location
  • Weight of the train in real-time
  • Train crossing time to the second level
  • Train speed, temperature, sound frequency, and friction
  • Train schedule lookups

Once all the IoT data has been pulled together from an IoT event hub, it is processed in real-time using structured streaming and stored in Delta Lake. To understand the train GPS location, API calls are then made per minute per train from the Lakehouse. API calls are made in real-time to another scheduling system to lookup customer info. Once the processed/enriched data is stored in Delta Lake, an API layer was also created on top of it to expose this data to all consumers.
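A sketch of the per-event parsing that sits in front of a pipeline like this, with the Structured Streaming plumbing shown as comments (all field names and paths are hypothetical, not ARTC's actual schema):

```python
# Parsing one weighbridge reading and deriving total weight in tonnes.
# The JSON schema below is invented for illustration.
import json

def parse_weighbridge_event(raw: str) -> dict:
    event = json.loads(raw)
    return {
        "train_id": event["train_id"],
        "crossing_ts": event["crossing_ts"],
        "weight_tonnes": sum(event["axle_weights_kg"]) / 1000.0,
        "speed_kmh": event["speed_kmh"],
    }

# In the pipeline itself, equivalent logic runs inside Structured Streaming,
# reading from the IoT event hub and writing to Delta Lake, roughly:
#
#   (spark.readStream.format("eventhubs").options(**eh_conf).load()
#        .select(from_json(col("body").cast("string"), schema).alias("e"))
#        .select("e.*")
#        .writeStream.format("delta")
#        .option("checkpointLocation", "/chk/weighbridge")
#        .start("/delta/weighbridge"))

raw = ('{"train_id": "7AB2", "crossing_ts": "2023-01-01T02:03:04Z", '
       '"axle_weights_kg": [21000, 22000], "speed_kmh": 58}')
print(parse_weighbridge_event(raw)["weight_tonnes"])  # → 43.0
```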

The outcome: increased transparency on weight data, as it is now made available to customers; a digital data ecosystem that ARTC’s customers now use to meet their KPIs and planning; and the ability to determine temporary speed restrictions across the network to improve train scheduling accuracy and to schedule network maintenance based on train schedules and speed.

Talk by: Deepak Sekar and Harsh Mishra

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Lineage System Table in Unity Catalog

Unity Catalog provides fully automated data lineage for all workloads in SQL, R, Python, and Scala, and across all asset types at Databricks. The aggregated view has been available to end users through Data Explorer and the API. In this session, we are excited to share that lineage is now available via a Delta table in their UC metastore. It stores the full history of recent lineage records and is near real time. Additionally, customers can query it through a standard SQL interface. With that, customers can get significant operational insights about their workloads for impact analysis, troubleshooting, quality assurance, data discovery, and data governance.

Together with the system table platform effort, which provides query history, job run operational data, audit logs, and more, the lineage table will be a critical piece linking all the data assets and entity assets together, providing better lakehouse observability and unification to customers.

Talk by: Menglei Sun

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Processing Prescriptions at Scale at Walgreens

We designed a scalable Spark Streaming job to manage hundreds of millions of prescription-related operations per day at an end-to-end SLA of a few minutes and a lookup time of one second using Cosmos DB.

In this session, we will share not only the architecture, but also the challenges and solutions of using the Spark Cosmos connector at scale. We will discuss usages of the Aggregator API, custom implementations of the Cosmos DB connector, and the major roadblocks we encountered along with the solutions we engineered. In addition, we collaborated closely with the Cosmos development team at Microsoft and will share the new features which resulted. If you ever plan to use Spark with Cosmos, you won't want to miss these gotchas!
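For reference, the Spark 3 Cosmos DB connector is configured through spark.cosmos.* options and the cosmos.oltp format; a sketch with placeholder endpoint, key, database, and container names:

```python
# Placeholder configuration for the Azure Cosmos DB Spark connector.
# Endpoint, key, database, and container values are illustrative.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<key>",
    "spark.cosmos.database": "pharmacy",
    "spark.cosmos.container": "prescriptions",
}

# In a Spark job, a streaming DataFrame would then be written with:
#
#   (df.writeStream
#      .format("cosmos.oltp")
#      .options(**cosmos_config)
#      .option("checkpointLocation", "/chk/prescriptions")
#      .start())
```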

Talk by: Daniel Zafar

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

An API for Deep Learning Inferencing on Apache Spark™

Apache Spark is a popular distributed framework for big data processing. It is commonly used for ETL (extract, transform and load) across large datasets. Today, the transform stage can often include the application of deep learning models on the data. For example, common models can be used for classification of images, sentiment analysis of text, language translation, anomaly detection, and many other use cases. Applying these models within Spark can be done today with the combination of PySpark, Pandas_UDF, and a lot of glue code. Often, that glue code can be difficult to get right, because it requires expertise across multiple domains - deep learning frameworks, PySpark APIs, pandas_UDF internal behavior, and performance optimization.

In this session, we introduce a new, simplified API for deep learning inferencing on Spark, introduced in SPARK-40264 as a collaboration between NVIDIA and Databricks, which seeks to standardize and open source this glue code to make deep learning inference integrations easier for everyone. We discuss its design and demonstrate its usage across multiple deep learning frameworks and models.
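The API (pyspark.ml.functions.predict_batch_udf, added in Spark 3.4) centers on a user-supplied predict function over NumPy batches; Spark handles the batching and data conversion. A sketch in which the "model" simply doubles its input so the shape of the API stays visible:

```python
# The user-supplied side of predict_batch_udf: a factory that loads a model
# once and returns a function over NumPy batches. The "model" here is a
# stand-in that doubles its input.
import numpy as np

def make_predict_fn():
    # In real use this would load a deep learning model once per executor.
    def predict(inputs: np.ndarray) -> np.ndarray:
        return inputs * 2.0
    return predict

# On a Spark 3.4+ cluster, the wrapper is then applied as:
#
#   from pyspark.ml.functions import predict_batch_udf
#   from pyspark.sql.types import DoubleType
#
#   doubled = predict_batch_udf(make_predict_fn,
#                               return_type=DoubleType(),
#                               batch_size=1024)
#   df.withColumn("pred", doubled("value"))

print(make_predict_fn()(np.array([1.0, 2.5])))  # → [2. 5.]
```

Because the factory runs once per executor, expensive model loading is amortized across batches, which is the main piece of "glue code" the new API standardizes.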

Talk by: Lee Yang

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Building Apps on the Lakehouse with Databricks SQL

BI applications are undoubtedly one of the major consumers of a data warehouse. Nevertheless, the prospect of accessing data using standard SQL is appealing to many more stakeholders than just the data analysts. We’ve heard from customers that they experience an increasing demand to provide access to data in their lakehouse platforms from external applications beyond BI, such as e-commerce platforms, CRM systems, SaaS applications, or custom data applications developed in-house. These applications require an “always on” experience, which makes Databricks SQL Serverless a great fit.

In this session, we give an overview of the approaches available to application developers to connect to Databricks SQL and create modern data applications tailored to the needs of users across an entire organization. We discuss when to choose one of the Databricks native client libraries for languages such as Python, Go, or Node.js, and when to use the SQL Statement Execution API, the newest addition to the toolset. We also explain when ODBC and JDBC might not be the best for the task and when they are your best friends. Live demos are included.
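The SQL Statement Execution API is a plain REST interface: a POST to /api/2.0/sql/statements carrying the statement and the ID of the SQL warehouse to run it on. A sketch of the payload (warehouse ID and query are placeholders):

```python
# Building the request body for the Databricks SQL Statement Execution API.
# Warehouse ID and statement values are placeholders.
import json

def build_statement_payload(statement: str, warehouse_id: str) -> str:
    return json.dumps({
        "statement": statement,
        "warehouse_id": warehouse_id,
        "wait_timeout": "30s",   # wait synchronously up to 30 seconds
    })

# The request itself can be sent with any HTTP client:
#   POST https://<workspace-host>/api/2.0/sql/statements
#   Authorization: Bearer <token>
#   body: build_statement_payload("SELECT 1", "<warehouse-id>")

payload = build_statement_payload("SELECT 1", "abc123")
print(json.loads(payload)["statement"])  # → SELECT 1
```

This HTTP-only surface is what makes the API attractive for external applications that cannot, or do not want to, ship an ODBC/JDBC driver.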

Talk by: Adriana Ispas and Chris Stevens

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Python with Spark Connect

PySpark has accomplished many milestones, such as Project Zen, and has kept growing. We introduced the pandas API on Spark and hugely improved usability (error messages, type hints, etc.), and PySpark has become almost the de facto standard of distributed computing in Python. With this trend, PySpark use cases have also become very complicated, especially for modern data applications such as notebooks, IDEs, and even devices such as smart home devices leveraging the power of data, which virtually need a lightweight, separate client. However, today’s PySpark client is considerably heavy and does not allow, for example, separation from its scheduler, optimizer, and analyzer.

In Apache Spark 3.4, one of the key features we introduced in PySpark is the Python client for Spark Connect, a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API, with unresolved logical plans as the protocol. The separation between client and server allows Apache Spark and its open ecosystem to be leveraged from everywhere, and it can be embedded in modern data applications. In this talk, we introduce what Spark Connect is, the internals of Spark Connect with Python, how to use Spark Connect with Python from the end user's perspective, and what's next beyond Apache Spark 3.4.
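The client-server split described above can be sketched as follows. The helper formats a Spark Connect connection string (the `sc://` scheme); the host, port, and token values are illustrative, and actual usage requires a reachable Spark Connect server and pyspark 3.4 or later.

```python
# Minimal sketch of attaching a lightweight client via Spark Connect.

def remote_url(host: str, port: int = 15002, token: str = "") -> str:
    """Format a Spark Connect connection string (sc:// scheme).

    15002 is the default Spark Connect server port; parameters such as
    an auth token are appended after '/;'.
    """
    url = f"sc://{host}:{port}"
    if token:
        url += f"/;token={token}"
    return url

# With pyspark>=3.4 installed, the thin client attaches to the remote server:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(remote_url("localhost")).getOrCreate()
#   spark.range(5).show()  # DataFrame ops are shipped as unresolved plans
```

Because only the DataFrame API and plan protocol live on the client, this import footprint is far lighter than a full PySpark driver, which is what makes embedding in notebooks, IDEs, and devices practical.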

Talk by: Hyukjin Kwon and Ruifeng Zheng

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Streamlining API Deployment of ML Models Across Multiple Brands: Ahold Delhaize's Experience on Serverless

At Ahold Delhaize, we have 19 local brands. Most of our brands have common goals, such as providing personalized offers to their customers, a better search engine on e-commerce websites, and forecasting models to reduce food waste and ensure availability. As a central team, our goal is to standardize the way of working across all of these brands, including the deployment of machine learning models. To this end, we have adopted Databricks as our standard platform for our batch inference models.

However, API deployment for real-time inference models remained challenging due to the varying capabilities of our brands. Our attempts to standardize API deployments with different tools failed because of the complexity of our organization. Fortunately, Databricks has recently introduced a new feature: serverless API deployment. Since all our brands already use Databricks, this feature was easy to adopt. It allows us to reuse API deployments across all of our brands, significantly reducing time to market (from 6-12 months to one month), increasing efficiency, and reducing costs. In this session, you will see the solution architecture, a sample use case (a cross-sell model deployed to four different brands), and API deployment using Databricks serverless APIs with a custom model.
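To make the serverless deployment pattern concrete, the sketch below builds the request body a brand's application would send to a Databricks Model Serving endpoint. The endpoint name and feature fields are hypothetical; the invocation path and `dataframe_records` key should be verified against the Model Serving documentation for your workspace.

```python
import json

# Hypothetical endpoint name; each brand would reuse the same deployment recipe.
ENDPOINT = "cross-sell-recommender"


def build_invocation(records: list) -> dict:
    """Body for POST /serving-endpoints/<name>/invocations."""
    return {"dataframe_records": records}


payload = build_invocation([{"customer_id": 42, "basket_size": 3}])
print(json.dumps(payload))

# A client would POST the payload with a bearer token, e.g. with `requests`:
#   requests.post(f"{host}/serving-endpoints/{ENDPOINT}/invocations",
#                 headers={"Authorization": f"Bearer {token}"},
#                 json=payload)
```

Because every brand targets the same invocation contract, only the endpoint name and model differ per brand, which is what makes the central team's standardization feasible.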

Talk by: Maria Vechtomova and Basak Eskili

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

US Army Corps of Engineers: Enhanced Commerce & National Security Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types, with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, but the emergence of the cloud and data lakehouse architectures has empowered USACE to successfully move into the modern data era.

This session incorporates aspects of these topics:
* Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, data lake, Apache Iceberg, Data Mesh
* GIS: H3, Mosaic, spatial analysis
* Data Engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
* Data Streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
* ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
* Data Governance: security, compliance, RMF, NIST
* Data Sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc