talk-data.com talk-data.com

Topic

Azure

Microsoft Azure

cloud cloud_provider microsoft infrastructure

723

tagged

Activity Trend

278 peak/qtr
2020-Q1 2026-Q1

Activities

723 activities · Newest first

Applied Predictive Maintenance in Aviation: Without Sensor Data

We will show how using Azure Databricks Lakehouse is modernizing our data & analytics environment which has given us new capability to create custom predictive models for hundreds of families of aircraft components without sensor data. We currently have over 95% success rate with over $1.3 million in avoided operational impact costs in FY21.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Cloud and Data Science Modernization of Veterans Affairs Financial Service Center with Azure Databri

The Department of Veterans Affairs (VA) is home to over 420,000 employees, provides health care for 9.16 million enrollees and manages the benefits of 5.75 million recipients. The VA also hosts an array of financial management, professional, and administrative services at their Financial Service Center (FSC), located in Austin, Texas. The FSC is divided into various service groups organized around revenue centers and product lines, including the Data Analytics Service (DAS). To support the VA mission, in 2021 FSC DAS continued to press forward with their cloud modernization efforts, successfully achieving four key accomplishments:

Office of Community Care (OCC) Financial Time Series Forecast - Financial forecasting enhancements to predict claims CFO Dashboard - Productivity and capability enhancements for financial and audit analytics Datasets Migrated to the Cloud - Migration of on-prem datasets to the cloud for down-stream analytics (includes a supply chain proof-of-concept) Data Science Hackathon - A hackathon to predict bad claims codes and demonstrate DAS abilities to accelerate a ML use case using Databricks AutoML

This talk discusses FSC DAS’ cloud and data science modernization accomplishments in 2021, lessons learned, and what’s ahead.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

A Case Study in Rearchitecting an On-Premises Pipeline in the Cloud

This talk will give a detailed discussion of some of the many considerations that must be taken into account when rebuilding on-premises data pipelines in the cloud. I will give an initial overview of the original pipeline and the reasons that we chose to migrate this pipeline to Azure. Next, I will discuss the decisions that lead to the architecture we used to replace the original pipeline, and give a thorough overview of the new cloud pipeline, including design components and networking. I will also discuss the many lessons we learned along the way to successfully migrating this pipeline.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Improving Apache Spark Application Processing Time by Configurations, Code Optimizations, etc.

In this session, we'll go over several use-cases and describe the process of improving our spark structured streaming application micro-batch time from ~55 to ~30 seconds in several steps.

Our app is processing ~ 700 MB/s of compressed data, it has very strict KPIs, and it is using several technologies and frameworks such as: Spark 3.1, Kafka, Azure Blob Storage, AKS and Java 11.

We'll share our work and experience in those fields, and go over a few tips to create better Spark structured streaming applications.

The main areas that will be discussed are: Spark Configuration changes, code optimizations and the implementation of the Spark custom data source.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Improving patient care with Databricks

Learn how Wipro helped a world leader in medical technology to modernize its data used the PySpark interface on Azure Databricks to create reusable generic frameworks, including slowly changing dimensions (SCDs), data validation/reconciliation tools, and delta lake tables created from metadata.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Optimizing Speed and Scale of User-Facing Analytics Using Apache Kafka and Pinot

Apache Kafka is the de facto standard for real-time event streaming, but what do you do if you want to perform user-facing, ad-hoc, real-time analytics too? That's where Apache Pinot comes in.

Apache Pinot is a realtime distributed OLAP datastore, which is used to deliver scalable real time analytics with low latency. It can ingest data from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage) as well as streaming sources such as Kafka. Pinot is used extensively at LinkedIn and Uber to power many analytical applications such as Who Viewed My Profile, Ad Analytics, Talent Analytics, Uber Eats and many more serving 100k+ queries per second while ingesting 1Million+ events per second.

Apache Kafka's highly performant, distributed, fault-tolerant, real-time publish-subscribe messaging platform powers big data solutions at Airbnb, LinkedIn, MailChimp, Netflix, the New York Times, Oracle, PayPal, Pinterest, Spotify, Twitter, Uber, Wikimedia Foundation, and countless other businesses.

Come hear from Neha Power, Founding Engineer at a StarTree and PMC and committer of Apache Pinot, and Karin Wolok, Head of Developer Community at StarTree, on an introduction to both systems and a view of how they work together.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Accidentally Building a Petabyte-Scale Cybersecurity Data Mesh in Azure With Delta Lake at HSBC

Due to the unique cybersecurity challenges that HSBC faces daily - from high data volumes to untrustworthy sources to the privacy and security restrictions of a highly regulated industry - the resulting architecture was an unwieldy set of disparate data silos. So, how do we build a cybersecurity advanced analytics environment to enrich and transform these myriad data sources into a unified, well-documented, robust, resilient, repeatable, scalable, maintainable platform that will empower the cyber analysts of the future? That at the same time remains cost-effective and enables everyone from the less-technical junior reporting user to the senior machine learning engineers?

In this session, Ryan Harris, Principal Cybersecurity Engineer at HSBC, dives into the infrastructure and architecture employed, ranging from the landing zone concepts, secure access workstations, data lake structure, and isolated data ingestion, to the enterprise integration layer. In the process of building the data pipelines and lakehouses, we ended up building a hybrid data mesh leveraging Delta Lake. The result is a flexible, secure, self-service environment that is unlocking the capabilities of our humans.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Pushing the limits of scale/performance for enterprise-wide analytics: A fire-side chat with Akamai

With the world’s most distributed compute platform — from cloud to edge — Akamai makes it easy for businesses to develop and run applications, while keeping experiences closer to users and threats farther away. ​So when it was time to scale it’s legacy Hadoop-like infrastructure reaching its capacity limits, while keeping their global operations running uninterrupted, Akamai partnered with Microsoft and Databricks to migrate to Azure Databricks.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Your fastest path to Lakehouse and beyond

Azure Databricks is an easy, open, and collaborative service for data, analytics & AI use cases, enabled by Lakehouse architecture. Join this session to discover how you can get the most out of your Azure investments by combining the best of Azure Synapse Analytics, Azure Databricks and Power BI for building a complete analytics & AI solution based on Lakehouse architecture.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Building a Data Science as a Service Platform in Azure with Databricks

Machine learning in the enterprise is rarely delivered by a single team. In order to enable Machine Learning across an organisation you need to target a variety of different skills, processes, technologies, and maturity's. To do this is incredibly hard and requires a composite of different techniques to deliver a single platform which empowers all users to build and deploy machine learning models.

In this session we discuss how Databricks enabled a data science as a service platform for one of the UKs largest household insurers. We look at how this platform is empowering users of all abilities to build models, deploy models and realise and return on investment earlier.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Cloud Fetch: High-bandwidth Connectivity With BI Tools

Business Intelligence (BI) tools such as Tableau and Microsoft Power BI are notoriously slow at extracting large query results from traditional data warehouses because they typically fetch the data in a single thread through a SQL endpoint that becomes a data transfer bottleneck. Data analysts can connect their BI tools to Databricks SQL endpoints to query data in tables through an ODBC/JDBC protocol integrated in our Simba drivers. With Cloud Fetch, which we released in Databricks Runtime 8.3 and Simba ODBC 2.6.17 driver, we introduce a new mechanism for fetching data in parallel via cloud storage such as AWS S3 and Azure Data Lake Storage to bring the data faster to BI tools. In our experiments using Cloud Fetch, we observed a 10x speed-up in extract performance due to parallelism.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Detecting Financial Crime Using an Azure Advanced Analytics Platform and MLOps Approach

As gatekeepers of the financial system, banks play a crucial role in reporting possible instances of financial crime. At the same time, criminals continuously reinvent their approaches to hide their activities among dense transaction data. In this talk, we describe the challenges of detecting money laundering and outline why employing machine learning via MLOps is critically important to identify complex and ever-changing patterns.

In anti-money-laundering, machine learning answers to a dire need for vigilance and efficiency where previous-generation systems fall short. We will demonstrate how our open platform facilitates a gradual migration towards a model-driven landscape, using the example of transaction-monitoring to showcase applications of supervised and unsupervised learning, human explainability, and model monitoring. This environment enables us to drive change from the ground up in how the bank understands its clients to detect financial crime.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

How the Largest County in the US is Transforming Hiring with a Modern Data Lakehouse

Los Angeles County’s Department of Human Resources (DHR) is responsible for attracting a diverse workforce for the 37 departments it supports. Each year, DHR processes upwards of 400,000 applications for job opportunities making it one of the largest employers in the nation. Managing a hiring process of this scale is complex with many complicated factors such as background checks and skills examination. These processes, if not managed properly, can create bottlenecks and a poor experience for both candidates and hiring managers.

In order to identify areas for improvement, DHR set out to build detailed operational metrics across each stage of the hiring process. DHR used to conduct high level analysis manually using excel and other disparate tools. The data itself was limited, difficult to obtain, and analyze. In addition, it was taking analysts weeks to manually pull data from half a dozen siloed systems into excel for cleansing and analysis. This process was labor-intensive, inefficient, and prone to human error.

To overcome these challenges, DHR in partnership with Internal Services Department (ISD) adopted a modern data architecture in the cloud. Powered by the Azure Databricks Lakehouse, DHR was able to bring together their diverse volumes of data into a single platform for data analytics. Manual ETL processes that took weeks could now be automated in 10 minutes or less. With this new architecture, DHR has built Business Intelligence dashboards to unpack the hiring process to get a clear picture of where the bottlenecks are and track the speed with which candidates move through the process The dashboards allow the County departments innovate and make changes to enhance and improve the experience of potential job seekers and improve the timeliness of securing highly qualified and diverse County personnel at all employment levels.

In this talk, we’ll discuss DHR’s journey towards building a data-driven hiring process, the architecture decisions that enabled this transformation and the types of analytics that we’ve deployed to improve hiring efforts.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Setting up On Shelf Availability Alerts at Scale with Databricks and Azure

Tredence' s OSA accelerator is a robust quick-start guide that is the foundation for a full Out of Stock or Supply Chain solution. The OSA solution focuses on driving sales through improved stock availability on the shelves. The following components make up the OSA accelerator.

• Identifying OOS Situation: ML models to identify the Out-Of-Stock scenario in a store at a SKU level taking in account the level of phantom inventory • Identifying Off-Sales Behavior: ML models to identify the off-sale behavior of a SKU in particular which is attributable to phantom inventory, stock less than presentation stock or improper operations within the store • Smart Alerts: Alert mechanism for the store manager and merchandizing reps in order to maintain healthy stock in the store and increase the revenue

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Swedbank: Enterprise Analytics in Cloud

Swedbank is the largest bank in Sweden & third largest in Nordics. They have about 7-8M customers across retail, mortgage , and investment (pensions). One of the key drivers for the bank was to look at data across all silos and build analytics to drive their ML models - they couldn’t. That’s when Swedbank made a strategic decision to go to the cloud and make bets on Databricks, Immuta, and Azure.

-Enterprise analytics in cloud is an initiative to move Swedbanks on-premise Hadoop based data lake into the cloud to provide improved analytical capabilities at scale. The strategic goals of the “Analytics Data Lake” are: -Advanced analytics: Improve analytical capabilities in terms of functionality, reduce analytics time to market and better predictive modelling -A Catalyst for Sharing Data: Make data Visible, Accessible, Understandable, Linked, and Trusted Technical advancements: Future proof with ability to add new tools/libraries, support for 3rd party solutions for Deep Learning/AI

To achieve these goals, Swedbank had to migrate existing capabilities and application services to Azure Databricks & implement Immuta as its unified access control plane. A “data discovery” space was created for data scientists to be able to come & scan (new) data, develop, train & operationalise ML models. To meet these goals Swedbank requires dynamic and granular data access controls to both mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service data discovery & analytics. Protection of sensitive data is key to enable Swedbank to support key financial services use cases.

The presentation will focus on this journey, calling out key technical challenges, learning & benefits observed.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Vision AI—Animal Health Industry Use Cases Using Databricks on Azure

Vision AI and Azure Cognitive services can be applied in a variety of ways for healthcare, especially for Animal Health. Animal Diagnostics market size is valued at over USD 4.5 Billion in 2020 and is expected to grow at CAGR of 8.5% from 2021 to 2027(Markets&Markets Study).

The overall livestock advanced monitoring market is expected to grow from USD 1.4 billion in 2021 to USD 2.3 billion by 2026; it is expected to grow at a CAGR of 10.4% during 2021–2026.

We hope to showcase various uses of AI/ML for the care of livestock and companion animals to help assist vets and farm-owners. Live demos will include real life case studies and forward looking applications of the same using reinforced learning techniques and services.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Summary Data engineering is a large and growing subject, with new technologies, specializations, and "best practices" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of "Fundamentals of Data Engineering", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today we’re flipping the script. Joe Reis of Ternary Data will be interviewing me about my time as the host of this show and my perspectives on the data ecosystem

Interview

Introduction How did you get involved in the area of data management? Now I’ll hand it off to Joe…

Joe’s Notes

You do a lot of podcasts. Why? Podcast.init started in 2015, and your first episode of Data Engineering was published January 14, 2017. Walk us through the start of these podcasts. why not a data science podcast? why DE? You’ve published 306 of shows of the Data Engineering Podcast, plus 370 for the init podcast, then you’ve got a new ML podcast. How have you kept the motivation over the years? What’s the process for the show (finding guests, topics, etc….recording, publishing)? It’s a lot of work. Walk us through this process. You’ve done a ton of shows and have a lot of context with what’s going on in the field of both data engineering and Python. What have been some of the

The Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake

Design and implement a modern data lakehouse on the Azure Data Platform using Delta Lake, Apache Spark, Azure Databricks, Azure Synapse Analytics, and Snowflake. This book teaches you the intricate details of the Data Lakehouse Paradigm and how to efficiently design a cloud-based data lakehouse using highly performant and cutting-edge Apache Spark capabilities using Azure Databricks, Azure Synapse Analytics, and Snowflake. You will learn to write efficient PySpark code for batch and streaming ELT jobs on Azure. And you will follow along with practical, scenario-based examples showing how to apply the capabilities of Delta Lake and Apache Spark to optimize performance, and secure, share, and manage a high volume, high velocity, and high variety of data in your lakehouse with ease. The patterns of success that you acquire from reading this book will help you hone your skills to build high-performing and scalable ACID-compliant lakehouses using flexible and cost-efficient decoupled storage and compute capabilities. Extensive coverage of Delta Lake ensures that you are aware of and can benefit from all that this new, open source storage layer can offer. In addition to the deep examples on Databricks in the book, there is coverage of alternative platforms such as Synapse Analytics and Snowflake so that you can make the right platform choice for your needs. After reading this book, you will be able to implement Delta Lake capabilities, including Schema Evolution, Change Feed, Live Tables, Sharing, and Clones to enable better business intelligence and advanced analytics on your data within the Azure Data Platform. What You Will Learn Implement the Data Lakehouse Paradigm on Microsoft’s Azure cloud platform Benefit from the new Delta Lake open-source storage layer for data lakehouses Take advantage of schema evolution, change feeds, live tables, and more Writefunctional PySpark code for data lakehouse ELT jobs Optimize Apache Spark performance through partitioning, indexing, and other tuning options Choose between alternatives such as Databricks, Synapse Analytics, and Snowflake Who This Book Is For Data, analytics, and AI professionals at all levels, including data architect and data engineer practitioners. Also for data professionals seeking patterns of success by which to remain relevant as they learn to build scalable data lakehouses for their organizations and customers who are migrating into the modern Azure Data Platform.

Summary Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development

Interview

Introduction How did you get involved in the area of data management? What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense? What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in? The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent? What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate? As someone with a software engineering background and extensive e