talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

120

Filtering by: Spark

Sessions & talks

Showing 26–50 of 120 · Newest first

Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift

2023-07-26 · Watch video

Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud storage. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases in the cloud.

There are many use cases for Delta tables on AWS. AWS has invested heavily in this technology, and Delta Lake is now available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from sources such as on-premises databases, Amazon RDS, DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without coding expertise.

This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and how to query them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.
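
For orientation, a minimal sketch of reading and writing a Delta table from a Glue Spark job (assuming a Glue 4.0 job created with the --datalake-formats job parameter set to delta; bucket paths are hypothetical):

```python
# Minimal sketch, assuming an AWS Glue 4.0 Spark job with --datalake-formats=delta;
# bucket paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Ingest raw JSON and append it to a Delta table on S3.
df = spark.read.json("s3://my-raw-bucket/orders/")
df.write.format("delta").mode("append").save("s3://my-lake/delta/orders/")

# Time travel: read the table as of an earlier version.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("s3://my-lake/delta/orders/"))
```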

Talk by: Noritaka Sekiyama and Akira Ajisaka

Python with Spark Connect

2023-07-26 · Watch video
Ruifeng Zheng, Hyukjin Kwon (Databricks)

PySpark has accomplished many milestones, such as Project Zen, and keeps growing. We introduced the pandas API on Spark and hugely improved usability through better error messages, type hints, and more, and PySpark has become close to the de facto standard for distributed computing in Python. With this trend, PySpark use cases have also become very complicated, especially for modern data applications such as notebooks, IDEs, and even smart home devices that leverage the power of data and effectively need a lightweight, separate client. However, today’s PySpark client is considerably heavy and does not allow separation from the scheduler, optimizer, and analyzer, for example.

In Apache Spark 3.4, one of the key features we introduced in PySpark is the Python client for Spark Connect, which decouples Spark into a client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API, with unresolved logical plans as the protocol. The separation between client and server allows Apache Spark and its open ecosystem to be leveraged from everywhere, and it can be embedded in modern data applications. In this talk, we will introduce what Spark Connect is, the internals of Spark Connect with Python, how to use Spark Connect with Python from the end-user perspective, and what’s next beyond Apache Spark 3.4.
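
As a taste of the API (a minimal sketch, not from the talk; the host and port are placeholders), connecting with the Spark Connect Python client looks like this:

```python
# Minimal sketch of the Spark Connect Python client in Apache Spark 3.4.
# Assumes a Spark Connect server is already running; the host is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

# The client only builds unresolved logical plans; resolution and execution
# happen on the server.
df = spark.range(100).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```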

Talk by: Hyukjin Kwon and Ruifeng Zheng

Simon + Denny Live: Ask Us Anything

2023-07-26 · Watch video
Denny Lee (Databricks), Simon Whiteley (Advancing Analytics)

Simon and Denny have been discussing and debating all things Delta, Lakehouse and Apache Spark™ on their regular webshow. Whether you want advice on lake structures, want to hear their opinions on the latest trends and hype in the data world, or you simply have a tech implementation question to throw at two seasoned experts, these two will have something to say on the matter. In their previous shows, Simon and Denny focused on building out a sample lakehouse architecture, refactoring and tinkering as new features came out, but now we're throwing the doors open for any and every question you might have.

So if you've had a persistent question and think these two can help, this is the session for you. A question submission form will be shared prior to the event, so the team will be prepped with a whole bunch of topics to talk through. Simon and Denny want to hear your questions, which they can field by drawing from a wealth of industry experience, wide-ranging community engagement, and their differing perspectives as an external consultant and a Databricks insider, respectively. There's also a chance they'll get distracted and go way off track talking about coffee, sci-fi, nerdery or the English weather. It happens.

Talk by: Simon Whiteley and Denny Lee

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

2023-07-26 · Watch video

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types, with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, but the emergence of the cloud and data lakehouse architectures has empowered USACE to move successfully into the modern data era.

This session incorporates aspects of these topics:

  • Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, data lake, Apache Iceberg, Data Mesh
  • GIS: H3, Mosaic, spatial analysis
  • Data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
  • Data streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
  • ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
  • Data governance: security, compliance, RMF, NIST
  • Data sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

High Volume Intelligent Streaming with Sub-Minute SLA for Near Real-Time Data Replication

2023-07-26 · Watch video

Attend this session and learn about an innovative solution built around Databricks Structured Streaming and Delta Live Tables (DLT) to replicate thousands of tables from on-premises to cloud-based relational databases. Replicating on-premises data to cloud-based data lakes and data stores in near real time for consumption is a highly desirable pattern for many enterprises across industries.

This powerful architecture can offload legacy platform workloads and accelerate the cloud journey. The intelligent, cost-efficient solution leverages thread pools, multi-task jobs, Kafka, Apache Spark™ Structured Streaming, and DLT. This session will go into detail about problems, solutions, lessons learned, and best practices.
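
A hedged sketch of the thread-pool pattern described above (one Spark application driving many replication streams concurrently); the topics, paths, and trigger interval are illustrative assumptions, not the presenters' actual code:

```python
# Hedged sketch: one Spark application driving many table-replication streams
# concurrently via a thread pool. Error handling and retries are elided.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tables = ["orders", "customers", "payments"]  # thousands of tables in practice

def replicate(table: str):
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", f"cdc.{table}")
            .load()
            .writeStream.format("delta")
            .option("checkpointLocation", f"/chk/{table}")
            .trigger(processingTime="30 seconds")  # sub-minute SLA target
            .start(f"/delta/{table}"))

with ThreadPoolExecutor(max_workers=8) as pool:
    queries = list(pool.map(replicate, tables))

spark.streams.awaitAnyTermination()
```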

Talk by: Suneel Konidala and Murali Madireddi

Leveraging Machine Learning on Databricks to Deliver Best in Class Customer Engagement

2023-07-26 · Watch video
Ryan Kennedy, Raja Lanka (Morgan Stanley)

In today's competitive business environment, customer engagement is a top priority for organizations looking to retain and grow their customer base. In this session, we will showcase how we used Databricks, a powerful machine learning platform, to build and deploy distributed deep learning models using Apache Spark™ and Horovod for best-in-class customer engagement. We will discuss the challenges we faced and the solutions we implemented, including data preparation, model training, and model deployment. We will also share the results of our efforts, including increased customer retention and improved customer satisfaction. Attendees will walk away with practical tips and best practices for using Databricks to drive customer engagement in their own organizations. In this session we will:

  • Explore Morgan Stanley’s approach to best-in-class customer engagement
  • Discuss how data and technology were leveraged to help solve the business problem
  • Share our experience using Databricks to build and deploy machine learning models for customer engagement
  • Provide practical tips and best practices for using Databricks in a production environment
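
As context for the Horovod piece, a hedged sketch of Horovod's Spark Keras Estimator (the model, columns, store path, and DataFrames are hypothetical stand-ins, not Morgan Stanley's actual pipeline):

```python
# Hedged sketch of distributed Keras training with Horovod's Spark Estimator.
import tensorflow as tf
from horovod.spark.common.store import Store
from horovod.spark.keras import KerasEstimator

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

estimator = KerasEstimator(
    num_proc=4,                           # parallel training processes
    store=Store.create("/dbfs/horovod"),  # shared storage for intermediate data
    model=model,
    optimizer=tf.keras.optimizers.Adam(),
    loss="binary_crossentropy",
    feature_cols=["features"],
    label_cols=["engaged"],
    batch_size=256,
    epochs=5,
)

keras_model = estimator.fit(train_df)     # train_df: a Spark DataFrame
predictions = keras_model.transform(test_df)
```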

Talk by: Raja Lanka and Ryan Kennedy

Scaling AI Applications with Databricks, HuggingFace and Pinecone

2023-07-26 · Watch video

The production and management of large-scale vector embeddings can be a challenging problem, and the integration of Databricks, Hugging Face, and Pinecone offers a powerful solution. Vector embeddings, representations of data learned by machine learning models, have become an essential tool in the development of AI-powered applications. High-quality embeddings are unlocking use cases like semantic search, recommendation engines, and anomaly detection. Databricks' Apache Spark™ ecosystem together with Hugging Face's Transformers library enables large-scale embedding production using GPU processing, while Pinecone's vector database provides ultra-low-latency querying and upserting of billions of embeddings, allowing for high-quality embeddings at scale for real-time AI apps.

In this session, we will present a concrete use case of this integration in the context of a natural language processing application. We will demonstrate how Pinecone's vector database can be integrated with Databricks and Hugging Face to produce large-scale vector embeddings of text data and how these embeddings can be used to improve the performance of various AI applications. You will see the benefits of this integration in terms of speed, scalability, and cost efficiency. By leveraging the GPU processing capabilities of Databricks and the ultra low-latency querying capabilities of Pinecone, we can significantly improve the performance of NLP tasks while reducing the cost and complexity of managing large-scale vector embeddings. You will learn about the technical details of this integration and how it can be implemented in your own AI projects, and gain insights into the speed, scalability, and cost efficiency benefits of using this solution.
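
A hedged sketch of that pipeline (the API key, environment, index name, and model are placeholders, and the batching strategy is an illustrative assumption):

```python
# Hedged sketch: compute Hugging Face embeddings inside a Spark job and upsert
# them to Pinecone. `df` is assumed to be a Spark DataFrame with "id" and "text"
# columns; credentials and names are placeholders.
def embed_partition(rows):
    import pinecone
    from sentence_transformers import SentenceTransformer

    pinecone.init(api_key="YOUR_KEY", environment="us-west1-gcp")
    index = pinecone.Index("docs")
    model = SentenceTransformer("all-MiniLM-L6-v2")  # runs on GPU if available

    rows = list(rows)
    if rows:
        vectors = model.encode([r["text"] for r in rows]).tolist()
        index.upsert(vectors=[(r["id"], v) for r, v in zip(rows, vectors)])
    return iter([])

df.select("id", "text").rdd.mapPartitions(embed_partition).count()  # forces execution
```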

Talk by: Roie Schwaber-Cohen

Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect

2023-07-25 · Watch video
Martin Grund (Databricks), Stefania Leone (Databricks)

Over the past decade, developers, researchers, and the community at large have successfully built tens of thousands of data applications using Apache Spark™. Since then, the use cases and requirements of data applications have evolved: today, every application wants to leverage the power of data, from web services that run in application servers and interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices. However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer, and analyzer. This architecture makes it hard to address these new requirements, as there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, notebooks and programming languages. This session highlights how simple it is to connect to Spark using Spark Connect from any data applications or IDEs. We will do a deep dive into the architecture of Spark Connect and provide an outlook on how the community can participate in the extension of Spark Connect for new programming languages and frameworks bringing the power of Spark everywhere.
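
A minimal sketch of that workflow, assuming Spark 3.4 (the host is a placeholder): start a Spark Connect server on the cluster, then connect from any Python process.

```python
# Hedged sketch: the server is started on the cluster, e.g.
#
#   ./sbin/start-connect-server.sh \
#       --packages org.apache.spark:spark-connect_2.12:3.4.0
#
# and any Python process can then attach to it as a thin client.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark.example.com:15002").getOrCreate()
spark.sql("SELECT 1 AS ok").show()  # the plan is resolved and executed remotely
```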

Talk by: Martin Grund and Stefania Leone

Lakehouse / Spark AMA

2023-07-26 · Watch video
Hyukjin Kwon (Databricks), Martin Grund (Databricks), Wenchen Fan (Databricks)

Have some great questions about Apache Spark™ and Lakehouses? Well, come by and ask the experts your questions!

Talk by: Martin Grund, Hyukjin Kwon, and Wenchen Fan

Apache Spark™ Streaming and Delta Live Tables Accelerates KPMG Clients For Real Time IoT Insights

2023-07-26 · Watch video

Unplanned downtime in manufacturing costs firms up to a trillion dollars annually. Time that materials spend sitting on a production line is lost revenue; even just 15 hours of downtime a week adds up to over 800 hours of downtime yearly. The use of Internet of Things (IoT) devices can cut this time down by providing detailed machine metrics. However, IoT predictive maintenance is challenged by the lack of effective, scalable infrastructure and machine learning solutions: IoT data can run to multiple terabytes per day and arrive in a variety of formats, and without insights and analysis, this data becomes just another table.

The KPMG Databricks IoT Accelerator is a comprehensive solution that gives manufacturing plant operators a bird's-eye view of their machines' health and empowers proactive machine maintenance across their portfolio of IoT devices. The accelerator ingests IoT streaming data at scale and implements the Databricks medallion architecture, leveraging Delta Live Tables to clean and process data. Real-time machine learning models are developed from IoT machine measurements and managed in MLflow. The AI predictions and IoT device readings are compiled in the gold table, powering downstream dashboards such as Tableau. Dashboards inform machine operators not only of machines' ailments, but of actions they can take to mitigate issues before they arise. Operators can see fault history to aid in understanding failure trends, and can filter dashboards by fault type, machine, or specific sensor reading.
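
A minimal Delta Live Tables sketch of the bronze-to-silver step of a medallion flow like the one described (the broker, topic, and field names are illustrative placeholders, not the accelerator's actual pipeline):

```python
# Hedged DLT sketch; `spark` is provided by the DLT runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw IoT readings ingested from the device stream")
def bronze_readings():
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "iot-readings")
            .load())

@dlt.table(comment="Parsed, validated sensor readings")
@dlt.expect_or_drop("valid_temperature", "temperature IS NOT NULL")
def silver_readings():
    payload = F.col("value").cast("string")
    return (dlt.read_stream("bronze_readings")
            .select(
                F.get_json_object(payload, "$.device_id").alias("device_id"),
                F.get_json_object(payload, "$.temperature").cast("double").alias("temperature"),
            ))
```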

Talk by: MacGregor Winegard

Structured Streaming: Demystifying Arbitrary Stateful Operations

2023-07-26 · Watch video
Angela Chu (Databricks)

Let’s face it -- data is messy. And your company’s business requirements? Even messier. You’re staring at your screen, knowing there is a tool that will let you give your business partners the information they need as quickly as they need it. There’s even a Python version of it now. But…it looks kind of scary. You’ve never used it before, and you don’t know where to start. Yes, we’re talking about the dreaded flatMapGroupsWithState. But fear not - we’ve got you covered.

In this session, we’ll take a real-world use case and use it to show you how to break down flatMapGroupsWithState into its basic building blocks. We’ll explain each piece in both Scala and the newly released Python API, and at the end we’ll illustrate how it all comes together to enable the implementation of arbitrary stateful operations with Spark Structured Streaming.
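
For a flavor of the Python side, here is a hedged sketch using applyInPandasWithState, the Python counterpart shipped with Spark 3.4; it keeps a running event count per key, and the schemas and column names are illustrative:

```python
# Hedged sketch; `events` is assumed to be a streaming DataFrame with a "user" column.
from typing import Iterator
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_events(key, batches: Iterator[pd.DataFrame], state: GroupState):
    total = state.get[0] if state.exists else 0
    for pdf in batches:
        total += len(pdf)
    state.update((total,))  # persist the count between micro-batches
    yield pd.DataFrame({"user": [key[0]], "events": [total]})

out = (events.groupBy("user")
       .applyInPandasWithState(
           count_events,
           outputStructType="user STRING, events LONG",
           stateStructType="events LONG",
           outputMode="Update",
           timeoutConf=GroupStateTimeout.NoTimeout))
```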

Talk by: Angela Chu

Using NLP to Evaluate 100 Million Global Webpages Daily to Contextually Target Consumers

2023-07-26 · Watch video
Xuefu Wang, Mark Lee (Databricks)

This session will cover the challenges and the solution that The Trade Desk went through to scale their ML models for NLP for 100 million web pages per day.

TTD's contextual targeting team needs to analyze 100 million web pages per day, and fifty percent of those pages are non-English, meaning half of the content was not being properly analyzed and targeted intelligently. TTD attempted to build a model using Spark NLP; however, the package could not scale and was not cost-effective, with low GPU utilization and prohibitive cost. TTD engaged with Databricks in early 2022 to build an NLP model on Databricks, and our teams partnered closely together. We were able to build a solution using distributed inference (150-200 GPUs running at 80%+ utilization) that each day translates 50 million web pages spanning more than 35 languages, two hundred times faster and at a fraction of the cost. This solution enables TTD teams to standardize on English for contextual targeting ML models, so TTD can now be a one-stop shop for their customers' global advertising needs.
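
A hedged sketch of that distributed-inference pattern (the model name, GPU assumption, and caching approach are illustrative, not TTD's actual implementation): wrap a Hugging Face translation model in a pandas UDF so Spark fans the work out across GPU executors.

```python
# Hedged sketch; `pages` is assumed to be a Spark DataFrame with a "text" column.
import pandas as pd
from pyspark.sql.functions import pandas_udf

_translator = None

def _get_translator():
    # Cache the pipeline so each executor process loads the model only once.
    global _translator
    if _translator is None:
        from transformers import pipeline
        _translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en",
                               device=0)  # device=0 targets the local GPU
    return _translator

@pandas_udf("string")
def translate(texts: pd.Series) -> pd.Series:
    results = _get_translator()(texts.tolist())
    return pd.Series(r["translation_text"] for r in results)

pages = pages.withColumn("text_en", translate("text"))
```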

The Trade Desk is headquartered in Ventura, California. It is the largest independent demand-side platform in the world, competing against Google, Facebook, and others. Unlike traditional marketing, programmatic marketing is operated by real-time, split-second decisions based on user identity, device information, and other data points. It enables highly personalized consumer experiences and improves return-on-investment for companies and advertisers.

Talk by: Xuefu Wang and Mark Lee

Deep Dive into the New Features of Apache Spark™ 3.4

2023-07-25 · Watch video

Join us for this Technical Deep Dive session. In 2022, Apache Spark™ received the prestigious SIGMOD Systems Award, recognizing Spark as the de facto standard for data processing.

In this session, we will share the latest progress in the Apache Spark community. With tremendous contributions from the open source community, Spark 3.4 resolved more than 2,400 Jira tickets. We will talk about the major features and improvements in Spark 3.4: Spark Connect, numerous PySpark and SQL language features, engine performance enhancements, and operational improvements in Spark UX and error handling.
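
Two small examples of the PySpark and SQL additions mentioned above, as a hedged sketch (the data and names are illustrative): DataFrame.melt for wide-to-long reshaping, and parameterized SQL, both new in 3.4.

```python
# Assumes an active SparkSession `spark` running Spark 3.4.
df = spark.createDataFrame([(1, 10.0, 12.0), (2, 7.0, 9.0)], ["id", "q1", "q2"])

# DataFrame.melt (new in 3.4): wide-to-long reshaping without manual unions.
long_df = df.melt(ids=["id"], values=["q1", "q2"],
                  variableColumnName="quarter", valueColumnName="revenue")

# Parameterized SQL (new in 3.4): named parameters instead of string interpolation.
long_df.createOrReplaceTempView("revenue_long")
spark.sql("SELECT * FROM revenue_long WHERE revenue > :minimum",
          args={"minimum": "9.5"}).show()
```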

Talk by: Xiao Li and Daniel Tenedorio

Databricks Connect Powered by Spark Connect: Develop and Debug Spark From Any Developer Tool

2023-07-25 · Watch video
Martin Grund (Databricks), Stefania Leone (Databricks)

Spark developers want to develop and debug their code using their tools of choice and development best practices while ensuring high-production fidelity on the target remote cluster. However, Spark's driver architecture is monolithic, with no built-in capability to directly connect to a remote Spark cluster from languages other than SQL. This makes it hard to enable such interactive developer experiences from a user’s local IDE of choice. Spark Connect’s decoupled client-server architecture introduces remote connectivity to Spark clusters and with that, enables interactive development experience - Spark and its open ecosystem can be leveraged from everywhere.

In this session, we show how we leverage Spark Connect to build a completely redesigned version of Databricks Connect, a first-class IDE-based developer experience that offers interactive debugging from any IDE. We show how developers can easily ensure consistency between their local and remote environments, walk the audience through real-life examples of locally debugging code running on Databricks, and show how Databricks Connect integrates into the Databricks Visual Studio Code extension for an even better developer experience.
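
For orientation, a hedged sketch of connecting with the redesigned Databricks Connect client (the host, token, and cluster ID are placeholders for your own workspace):

```python
# Hedged sketch of the Databricks Connect "V2" client.
from databricks.connect import DatabricksSession

spark = (DatabricksSession.builder
         .remote(host="https://my-workspace.cloud.databricks.com",
                 token="dapi...",                   # placeholder personal access token
                 cluster_id="0123-456789-abcdef")   # placeholder cluster ID
         .getOrCreate())

# From here it is ordinary PySpark: set breakpoints in your IDE and step through.
spark.range(10).filter("id % 2 = 0").show()
```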

Talk by: Martin Grund and Stefania Leone

Photon for Dummies: How Does this New Execution Engine Actually Work?

2023-07-25 · Watch video
Holly Smith (Databricks)

Did you finish the Photon whitepaper and think, wait, what? I know I did; it’s my job to understand it, explain it, and then use it. If your role involves using Apache Spark™ on Databricks, then you need to know about Photon and where to use it. Join me, chief dummy, nay "supreme" dummy, as I break down this whitepaper into easy to understand explanations that don’t require a computer science degree. Together we will unravel mysteries such as:

  • Why is a Java Virtual Machine the current bottleneck for Spark enhancements?
  • What does vectorized even mean? And how was it done before?
  • Why is the relationship status between Spark and Photon "complicated?"

In this session, we’ll start with the basics of Apache Spark, the details we pretend to know, and where those performance cracks are starting to show through. Only then will we start to look at Photon: how it’s different, where the clever design choices are, and how you can make the most of it in your own workloads. I’ve spent over 50 hours going over the paper in excruciating detail, every reference and, in some instances, the references of the references, so that you don’t have to.
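
To make the "vectorized" question concrete, here is a toy Python illustration of the idea (Photon itself is native C++ code, so this is an analogy, not Photon):

```python
# Toy illustration: operate on a whole column of values at once instead of
# interpreting one row at a time.
import numpy as np

prices = np.random.rand(1_000_000)

# Row-at-a-time: one interpreter round trip per value.
total_rows = 0.0
for p in prices:
    total_rows += p * 1.2

# Vectorized: a single operation over the whole column runs in tight native loops.
total_vec = (prices * 1.2).sum()
```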

Talk by: Holly Smith

Top Mistakes to Avoid in Streaming Applications

2023-07-25 · Watch video

Are you a data engineer seeking to enhance the performance of your streaming applications? Join our session where we will share valuable insights and best practices gained from handling diverse customer streaming use cases using Apache Spark™ Structured Streaming.

In this session, we will delve into the common pitfalls that can hinder your streaming workflows. Learn practical tips and techniques to overcome these challenges during different stages of application development. By avoiding these errors, you can unlock faster performance, improved data reliability, and smoother data processing.

Don't miss out on this opportunity to level up your streaming skills and excel in your data engineering journey. Join us to gain valuable knowledge and practical techniques that will empower you to optimize your streaming applications and drive exceptional results.

Talk by: Vikas Reddy Aravabhumi

The English SDK for Apache Spark™

2023-07-25 · Watch video
Gengliang Wang (Databricks), Allison Wang (Databricks)

In the fast-paced world of data science and AI, we will explore how large language models (LLMs) can elevate the development process of Apache Spark applications.

We'll demonstrate how LLMs can simplify SQL query creation, data ingestion, and DataFrame transformations, leading to faster development and clearer code that's easier to review and understand. We'll also show how LLMs can assist in creating visualizations and clarifying data insights, making complex data easy to understand.

Furthermore, we'll discuss how LLMs can be used to create user-defined data sources and functions, offering a higher level of adaptability in Apache Spark applications.

Our session, filled with practical examples, highlights the innovative role of LLMs in the realm of Apache Spark development. We invite you to join us in this exploration of how these advanced language models can drive innovation and boost efficiency in the sphere of data science and AI.
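
A hedged sketch with pyspark-ai, the open-source English SDK for Apache Spark (it assumes an OpenAI API key in the environment; the URL and prompts are illustrative):

```python
# Hedged sketch; assumes OPENAI_API_KEY is set in the environment.
from pyspark_ai import SparkAI

spark_ai = SparkAI()
spark_ai.activate()  # adds the df.ai namespace to DataFrames

# Ingestion: build a DataFrame from a web page described in English.
df = spark_ai.create_df("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")

# Transformation and visualization driven by plain-English prompts.
top = df.ai.transform("the five companies with the highest revenue")
top.ai.plot("bar chart of revenue by company")
```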

Talk by: Gengliang Wang and Allison Wang

Live from the Lakehouse: Ethics in AI with Adi Polak & gaining from open source with Vini Jaiswal

2023-07-14 · Watch video
Vini Jaiswal (ByteDance), Adi Polak (Treeverse)

Hear from two guests: first, Adi Polak (VP of Developer Experience at Treeverse and author of the #1 new release Scaling Machine Learning with Spark) on how AI helps us be more productive; second, Vini Jaiswal (Principal Developer Advocate, ByteDance) on gaining from the open source community, overcoming scalability challenges, and taking innovation to the next stage. Hosted by Pearl Ubaru (Sr. Technical Marketing Engineer, Databricks).

Cutting the Edge in Fighting Cybercrime: Reverse-Engineering a Search Language to Cross-Compile

2022-07-22 · Watch video

Traditional Security Information and Event Management (SIEM) approaches do not scale well for data sources producing 30 TiB per day, which led HSBC to create a Cybersecurity Lakehouse with Delta and Spark. The platform overcomes several conventional technical constraints: traditional platforms limit the amount of data available for long-term analytics, and their query languages are difficult to scale and time-consuming to run. In this talk, we’ll learn how to implement (or actually reverse-engineer) a language with Scala and translate it into what Apache Spark understands: the Catalyst engine. We’ll guide you through the technical journey of building equivalents of a query language in Spark, and we’ll see how HSBC’s business benefited from this cutting-edge innovation: decreasing the time and resources needed for cyber data processing migration, improving cyber threat incident response, and quickly onboarding HSBC cyber analysts onto Spark with the Cybersecurity Lakehouse platform.
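
The talk's implementation is in Scala against Catalyst; as a toy illustration of the cross-compilation idea only (the mini-language, field names, and the events DataFrame are hypothetical), here is a sketch in PySpark:

```python
# Toy sketch: translate a tiny search expression into DataFrame operations, the
# same idea the talk applies to a full search language via Catalyst in Scala.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def compile_search(df: DataFrame, query: str) -> DataFrame:
    """Translate 'field=value field2=value2' into chained equality filters."""
    for term in query.split():
        field, value = term.split("=", 1)
        df = df.where(F.col(field) == value)  # each term becomes a Catalyst filter
    return df

suspicious = compile_search(events, "src_ip=10.0.0.5 action=denied")
suspicious.explain()  # inspect the plan the translated query produced
```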

Connecting the Dots with DataHub: Lakehouse and Beyond

2022-07-19 · Watch video

You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka or data ingestion systems produce bad data into the lakehouse? Can you be proactive when it comes to preventing bad data from affecting your business? How can you take advantage of automation to ensure that raw data assets become well maintained data products (clear ownership, documentation and sensitivity classification) without requiring people to do redundant work across operational, ingestion and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features and dashboards) within and across your production services, ingestion pipelines and data lakehouse? Data engineers struggle with data quality and data governance issues constantly interrupting their day and limiting their upside impact on the business.

In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models and features. The talk will include details of how DataHub extracts lineage automatically from Spark, schema and statistics from Delta Lake and shift-left strategies for developer-led governance.
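
A hedged sketch of the automatic Spark lineage capture mentioned above, using DataHub's Spark agent (the package version and DataHub endpoint are placeholders to adapt to your deployment):

```python
# Hedged sketch of DataHub's Spark lineage listener configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.41")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://datahub-gms:8080")
    .getOrCreate()
)

# Jobs run in this session now report dataset-level lineage to DataHub.
```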

Scaling Privacy: Practical Architectures and Experiences

2022-07-19 · Watch video

At Spark Data & AI 2021, we presented a use case around privacy in an insurance landscape using Privacera: Scaling Privacy in a Spark Ecosystem (https://www.youtube.com/watch?v=cjJEMlNcg5k). In one year, privacy and security have taken off as a major need to solve, and the ability to embed them into business processes to empower data democratization has become mandatory. The concept that data is a product is now commonplace, and the ability to rapidly innovate those products hinges on balancing a dual mandate. One mandate: move fast. The second mandate: manage privacy and security. How do we make this happen? Let's dig into the real details and experiences and show the blueprint for success.

Sound Data Engineering in Rust—From Bits to DataFrames

2022-07-19 · Watch video

Spark applications often need to query external data sources such as file-based data sources or relational data sources. In order to do this, Spark provides Data Source APIs to access structured data through Spark SQL.

Data Source APIs have optimization rules such as filter push down and column pruning to reduce the amount of data that needs to be processed and so improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down in both the JDBC and Parquet data sources.
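
A minimal sketch of what this looks like from the JDBC side (connection details are placeholders, and an active SparkSession `spark` is assumed): with aggregate push down enabled, the partial aggregation runs in the database and far less data crosses the wire.

```python
# Hedged sketch of aggregate push down through the JDBC source.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db:5432/shop")
          .option("dbtable", "orders")
          .option("pushDownAggregate", "true")
          .load())

# With push down enabled, look for "PushedAggregates" in the scan node: the
# partial aggregation runs inside the database, not in Spark.
orders.groupBy("country").count().explain()
```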

State-of-the-Art Natural Language Processing with Apache Spark NLP

2022-07-19 · Watch video

This session teaches how & why to use the open-source Spark NLP library. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of recent research advances. Spark NLP is the most widely used NLP library in the enterprise today; provides thousands of current, supported, pre-trained models for 200+ languages out of the box; and is the only open-source NLP library that can natively scale to use any Apache Spark cluster.

We’ll walk through Python code running common NLP tasks like document classification, named entity recognition, sentiment analysis, spell checking, question answering, and translation. The discussion of each task includes the latest advances in deep learning and transfer learning used to tackle it. We’ll also cover new free tools for data annotation, no-code active learning & transfer learning, easily deploying NLP models as production-grade services, and sharing models you’ve trained.
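
As a flavor of the library (a minimal sketch; the pretrained pipeline name is one of Spark NLP's published English pipelines), named entity recognition takes a few lines:

```python
# Minimal Spark NLP sketch of one task from the list above.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with Spark NLP on the classpath
pipeline = PretrainedPipeline("onto_recognize_entities_sm", lang="en")
result = pipeline.annotate("Databricks hosted the Data + AI Summit in San Francisco.")
print(result["entities"])  # e.g. ['Databricks', 'San Francisco']
```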

A Low-Code Approach to 10x Data Engineering

2022-07-19 · Watch video

Can we take Data Engineering on Spark 10x beyond where it is today?

Yes, we can enable 10x more users on Spark, and make them 10x more productive from day 1. Data engineering can run at scale, and it can still be 10x simpler and faster to develop, deploy, and manage pipelines.

Low code is the key. A modern data engineering platform built on low code will enable all data users, from new graduates to experts, to visually develop high-quality pipelines. With Visual = Code, the visual elements will be stored as PySpark code on Git and deployed using the best software practices taken from DevOps. Search and lineage help data engineers and their customers in analytics understand how each column value was produced, when it was updated, and the associated quality metric.

See how a complete, low-code data engineering platform can reduce complexity and effort, enabling you to rapidly deploy, scale, and use Spark, making data and analytics a strategic asset in your company.

How To Make Apache Spark on Kubernetes Run Reliably on Spot Instances

2022-07-19 · Watch video

Since the general availability of Apache Spark’s native support for running on Kubernetes with Spark 3.1 in March 2021, the Spark community is increasingly choosing to run on k8s to benefit from containerization, efficient resource sharing, and the tools of the cloud-native ecosystem.

Data teams are faced with complexities in this transition, including how to leverage spot VMs. These instances enable up to 90% cost savings but are not guaranteed to be available and face the risk of termination. This session will cover concrete guidelines on how to make Spark run reliably on spot instances, with code examples from real-world use cases.

Main topics (a configuration sketch follows this list):

  • Using spot nodes for Spark executors
  • Mixing instance types & sizes to reduce the risk of spot interruptions - cluster autoscaling
  • Spark 3.0: Graceful decommissioning - preserve shuffle files on executor shutdown
  • Spark 3.1: PVC reuse on executor restart - disaggregate compute & shuffle storage
  • What to look for in future Spark releases
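
A hedged sketch of the Spark settings these topics refer to (graceful decommissioning and PVC reuse); treat the exact flags as assumptions to verify against your Spark version:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Graceful decommissioning: migrate shuffle/RDD blocks off executors that
    # are about to lose their spot VM.
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    # PVC reuse: let replacement executors pick up the PersistentVolumeClaims
    # (and shuffle data) of terminated ones.
    .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
    .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
    .getOrCreate()
)
```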
