talk-data.com

Topic: Spark (Apache Spark)
Tags: big_data, distributed_computing, analytics
120 tagged activities

Activity Trend: 71 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results
Filtering by: Databricks DATA + AI Summit 2023

Top Mistakes to Avoid in Streaming Applications

Are you a data engineer seeking to enhance the performance of your streaming applications? Join our session where we will share valuable insights and best practices gained from handling diverse customer streaming use cases using Apache Spark™ Structured Streaming.

In this session, we will delve into the common pitfalls that can hinder your streaming workflows. Learn practical tips and techniques to overcome these challenges during different stages of application development. By avoiding these errors, you can unlock faster performance, improved data reliability, and smoother data processing.
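
One illustrative pitfall of this kind (a generic sketch in PySpark, not necessarily one of the session's examples; the broker, topic, and paths are made up) is running a streaming aggregation without a watermark, which lets state grow without bound:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-pitfall-demo").getOrCreate()

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # illustrative broker
              .option("subscribe", "clicks")                      # illustrative topic
              .load()
              .selectExpr("timestamp", "CAST(value AS STRING) AS payload"))

    # Without withWatermark, this aggregation keeps state for every window forever.
    counts = (events
              .withWatermark("timestamp", "10 minutes")           # bound late data and state
              .groupBy(F.window("timestamp", "5 minutes"))
              .count())

    (counts.writeStream
           .outputMode("update")
           .option("checkpointLocation", "/tmp/chk/clicks")       # always set a checkpoint
           .format("console")
           .start())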

Don't miss out on this opportunity to level up your streaming skills and excel in your data engineering journey. Join us to gain valuable knowledge and practical techniques that will empower you to optimize your streaming applications and drive exceptional results.

Talk by: Vikas Reddy Aravabhumi

The English SDK for Apache Spark™

In the fast-paced world of data science and AI, we will explore how large language models (LLMs) can elevate the development process of Apache Spark applications.

We'll demonstrate how LLMs can simplify SQL query creation, data ingestion, and DataFrame transformations, leading to faster development and clearer code that's easier to review and understand. We'll also show how LLMs can assist in creating visualizations and clarifying data insights, making complex data easy to understand.

Furthermore, we'll discuss how LLMs can be used to create user-defined data sources and functions, offering a higher level of adaptability in Apache Spark applications.
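
As a brief sketch of the workflow the English SDK enables (assuming the pyspark-ai package with an LLM backend configured through its SparkAI entry point; the prompts and sample DataFrame are illustrative, and method names should be checked against the project's documentation):

    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.appName("english-sdk-demo").getOrCreate()

    spark_ai = SparkAI()   # uses whatever LLM backend is configured for pyspark-ai
    spark_ai.activate()    # adds the .ai namespace to DataFrames

    df = spark.createDataFrame(
        [("US", 120.0), ("FR", 80.0), ("US", 45.5)],
        ["country", "revenue"],
    )

    # Describe the transformation in English; the LLM generates the Spark code.
    top = df.ai.transform("total revenue per country, highest first")
    top.show()
    df.ai.verify("revenue should never be negative")  # LLM-generated sanity check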

Our session, filled with practical examples, highlights the innovative role of LLMs in the realm of Apache Spark development. We invite you to join us in this exploration of how these advanced language models can drive innovation and boost efficiency in the sphere of data science and AI.

Talk by: Gengliang Wang and Allison Wang

Live from the Lakehouse: Ethics in AI with Adi Polak & gaining from open source with Vini Jaiswal

Hear from two guests. First, Adi Polak (VP of Developer Experience at Treeverse and author of the #1 new release Scaling ML with Spark) on how AI helps us be more productive. Second, Vini Jaiswal (Principal Developer Advocate, ByteDance) on gaining from the open source community, overcoming scalability challenges, and taking innovation to the next stage. Hosted by Pearl Ubaru (Sr. Technical Marketing Engineer, Databricks).

Cutting the Edge in Fighting Cybercrime: Reverse-Engineering a Search Language to Cross-Compile

Traditional Security Information and Event Management (SIEM) approaches do not scale well for data sources producing 30 TiB per day, which led HSBC to create a Cybersecurity Lakehouse with Delta and Spark. Building that platform meant overcoming several conventional technical constraints: traditional platforms limit the amount of data available for long-term analytics, and their query languages are difficult to scale and time-consuming to run. In this talk, we’ll learn how to implement (or rather reverse-engineer) a search language with Scala and translate it into what Apache Spark understands: the Catalyst engine. We’ll guide you through the technical journey of building equivalents of a query language in Spark. We’ll learn how HSBC’s business benefited from this cutting-edge innovation, including reduced time and resources for cyber data processing migration, improved cyber threat incident response, and fast onboarding of HSBC cyber analysts onto Spark with the Cybersecurity Lakehouse platform.
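
As an illustrative sketch (not HSBC's actual implementation, and in Python rather than Scala for brevity), a translator can map a simple field=value clause of a search language onto a Spark Column expression, which Catalyst then optimizes like any native filter; the grammar, table path, and field names here are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("search-to-spark").getOrCreate()

    def translate(clause: str):
        # Map a tiny "field=value" clause onto a Spark Column expression.
        field, value = clause.split("=", 1)
        return F.col(field.strip()) == value.strip()

    # Hypothetical Delta table of raw security events.
    events = spark.read.format("delta").load("/mnt/cyber/events")

    query = "src_ip=10.0.0.5"               # a clause in the source search language
    events.where(translate(query)).show()   # Catalyst optimizes the generated filter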

Connecting the Dots with DataHub: Lakehouse and Beyond

You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka, or data ingestion systems push bad data into the lakehouse? Can you be proactive when it comes to preventing bad data from affecting your business? How can you take advantage of automation to ensure that raw data assets become well-maintained data products (clear ownership, documentation, and sensitivity classification) without requiring people to do redundant work across operational, ingestion, and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features, and dashboards) within and across your production services, ingestion pipelines, and data lakehouse? Data engineers struggle with data quality and data governance issues that constantly interrupt their day and limit their upside impact on the business.

In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models and features. The talk will include details of how DataHub extracts lineage automatically from Spark, schema and statistics from Delta Lake and shift-left strategies for developer-led governance.
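
As a minimal sketch of enabling that automatic lineage capture (assuming the open-source datahub-spark-lineage agent; the package version and the DataHub endpoint below are placeholders), DataHub's Spark listener is attached when the session is built:

    from pyspark.sql import SparkSession

    # Package coordinates, the agent version, and the endpoint are placeholders;
    # check the DataHub docs for the agent matching your Spark release.
    spark = (
        SparkSession.builder.appName("datahub-lineage-demo")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .getOrCreate()
    )

    # Reads and writes in this session are now reported to DataHub as lineage.
    df = spark.read.table("bronze.events")
    df.groupBy("event_type").count().write.mode("overwrite").saveAsTable("silver.event_counts")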

Scaling Privacy: Practical Architectures and Experiences

At Data + AI Summit 2021, we presented a use case around privacy in an insurance landscape using Privacera: Scaling Privacy in a Spark Ecosystem (https://www.youtube.com/watch?v=cjJEMlNcg5k). In one year, privacy and security have taken off as major needs to solve, and the ability to embed them into business processes to empower data democratization has become mandatory. The idea that data is a product is now commonplace, and the ability to rapidly innovate on those products hinges on balancing a dual mandate. One mandate: move fast. Second mandate: manage privacy and security. How do we make this happen? Let's dig into the real details and experiences and show the blueprint for success.

Sound Data Engineering in Rust—From Bits to DataFrames

Spark applications often need to query external data sources such as file-based data sources or relational data sources. In order to do this, Spark provides Data Source APIs to access structured data through Spark SQL.

Data Source APIs have optimization rules, such as filter push down and column pruning, to reduce the amount of data that needs to be processed and improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down for both JDBC and Parquet.
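
A hedged illustration of enabling aggregate push down (configuration and option names should be verified against your Spark version; the paths and JDBC URL are illustrative), so that MIN/MAX/COUNT can be answered at the source instead of after a full scan:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("agg-pushdown-demo").getOrCreate()

    # Parquet: let the DataSource V2 reader answer MIN/MAX/COUNT from footer statistics.
    spark.conf.set("spark.sql.parquet.aggregatePushDown", "true")
    sales = spark.read.parquet("/data/sales")   # illustrative path
    sales.agg(F.min("amount"), F.max("amount"), F.count("*")).show()

    # JDBC: ask Spark to push the aggregate down to the remote database
    # (pushdown applies on the DataSource V2 JDBC path; option name per the docs).
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db:5432/shop")   # illustrative URL
              .option("dbtable", "orders")
              .option("pushDownAggregate", "true")
              .load())
    orders.groupBy("status").count().explain()   # the plan shows the pushed aggregate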

State-of-the-Art Natural Language Processing with Apache Spark NLP

This session teaches how & why to use the open-source Spark NLP library. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of recent research advances. Spark NLP is the most widely used NLP library in the enterprise today; provides thousands of current, supported, pre-trained models for 200+ languages out of the box; and is the only open-source NLP library that can natively scale to use any Apache Spark cluster.

We’ll walk through Python code running common NLP tasks like document classification, named entity recognition, sentiment analysis, spell checking, question answering, and translation. The discussion of each task includes the latest advances in deep learning and transfer learning used to tackle it. We’ll also cover new free tools for data annotation, no-code active learning & transfer learning, easily deploying NLP models as production-grade services, and sharing models you’ve trained.
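
For a flavor of the API, a minimal sketch assuming the spark-nlp package is installed and that a pretrained pipeline such as 'explain_document_dl' is available for download (adjust the name to the current model listings):

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    # Starts a Spark session with the Spark NLP jars attached.
    spark = sparknlp.start()

    # Download a general-purpose pretrained pipeline (pipeline name is illustrative).
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")

    result = pipeline.annotate("John Snow Labs released a new Spark NLP version in London.")
    print(result["entities"])   # named entities recognized in the sentence
    print(result["checked"])    # spell-checked tokens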

A Low-Code Approach to 10x Data Engineering

Can we take Data Engineering on Spark 10x beyond where it is today?

Yes, we can enable 10x more users on Spark, and make them 10x more productive from day 1. Data engineering can run at scale, and it can still be 10x simpler and faster to develop, deploy, and manage pipelines.

Low code is the key. A modern data engineering platform built on low code will enable all data users, from new graduates to experts, to visually develop high-quality pipelines. With Visual = Code, the visual elements are stored as PySpark code on Git and deployed using the best software practices taken from DevOps. Search and lineage help data engineers and their analytics customers understand how each column value was produced, when it was updated, and the associated quality metrics.

See how a complete, low-code data engineering platform can reduce complexity and effort, enabling you to rapidly deploy, scale, and use Spark, making data and analytics a strategic asset in your company.

How To Make Apache Spark on Kubernetes Run Reliably on Spot Instances

Since the general availability of Apache Spark’s native support for running on Kubernetes with Spark 3.1 in March 2021, the Spark community has increasingly chosen to run on k8s to benefit from containerization, efficient resource sharing, and the tools of the cloud-native ecosystem.

Data teams are faced with complexities in this transition, including how to leverage spot VMs. These instances enable up to 90% cost savings but are not guaranteed to be available and face the risk of termination. This session will cover concrete guidelines on how to make Spark run reliably on spot instances, with code examples from real-world use cases.

Main topics:
  • Using spot nodes for Spark executors
  • Mixing instance types & sizes to reduce risk of spot interruptions - cluster autoscaling
  • Spark 3.0: Graceful Decommissioning - preserve shuffle files on executor shutdown
  • Spark 3.1: PVC reuse on executor restart - disaggregate compute & shuffle storage
  • What to look for in future Spark releases
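
A hedged configuration sketch of the decommissioning and PVC-reuse features listed above (property names should be verified against your exact Spark release):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("spark-on-spot-demo")
        # Graceful decommissioning: migrate shuffle and cached blocks off a node
        # that received a spot termination notice instead of losing them.
        .config("spark.decommission.enabled", "true")
        .config("spark.storage.decommission.enabled", "true")
        .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
        .config("spark.storage.decommission.rddBlocks.enabled", "true")
        # Keep executor PersistentVolumeClaims and reattach them to replacement
        # executors so shuffle data on the PVC survives a spot interruption.
        .config("spark.kubernetes.driver.ownPersistentVolumeClaims", "true")
        .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
        .getOrCreate()
    )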

Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse

Serving patients in over 100 countries, Amgen is a leading global biotech company focused on developing therapies that have the power to save lives. Delivering on this mission requires our commercial teams to regularly meet with healthcare providers to discuss new treatments that can help patients in need. With the onset of the pandemic, when face-to-face interactions with doctors and other Healthcare Providers (HCPs) were severely impacted, Amgen had to rethink these interactions. With that in mind, the Amgen Commercial Data and Analytics team leveraged a modern data and AI architecture built on the Databricks Lakehouse to help accelerate its digital and data insights capabilities. This foundation enabled Amgen’s teams to develop a comprehensive, customer-centric view to support flexible go-to-market models and provide personalized experiences to our customers.

In this presentation, we will share our recent journey of taking an agile approach to bringing together over 2.2 petabytes of internally generated and externally sourced vendor data and onboarding it into our AWS Cloud and Databricks environments, enabling standardized, scalable, and robust capabilities that meet the business requirements of our fast-changing life sciences environment. We will share use cases of how we harmonized and managed our diverse sets of data to deliver efficiency, simplification, and performance outcomes for the business. We will cover the following aspects of our journey along with best practices we learned over time:
  • Our architecture to support Amgen’s Commercial Data & Analytics processing running constantly around the globe
  • Engineering best practices for building large-scale data lakes and analytics platforms, such as team organization, data ingestion and data quality frameworks, a DevOps toolkit, maturity frameworks, and more
  • Databricks capabilities adopted, such as Delta Lake, workspace policies, SQL workspace endpoints, and MLflow for model registry and deployment, as well as various tools built for Databricks workspace administration
  • Databricks capabilities being explored for the future, such as multi-task orchestration, container-based Apache Spark processing, Feature Store, Repos for Git integration, etc.
  • The types of commercial analytics use cases we are building on the Databricks Lakehouse platform

Attendees building global and enterprise-scale data engineering solutions to meet diverse sets of business requirements will benefit from learning about our journey. Technologists will learn how we addressed specific business problems via reusable capabilities built to maximize value.

Implementing an End-to-End Demand Forecasting Solution Through Databricks and MLflow

In retail, the right quantity at the right time is crucial for success. In this session we share how a demand forecasting solution helped some of our retailers to improve efficiencies and sharpen fresh product production and delivery planning.

With the setup in place, we train hundreds of models in parallel, at various levels including store level, product level, and the combination of the two. By leveraging Spark's distributed computation, we can do all of this in a scalable and fast way. Powered by Delta Lake, Feature Store, and MLflow, this session clarifies how we built a highly reliable ML factory.
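
A common pattern for this kind of parallel per-group training (a minimal sketch, not the presenters' actual code; the table, columns, and model choice are assumptions) is to fan out one model per store/product combination with applyInPandas and log each run to MLflow:

    import pandas as pd
    import mlflow
    from sklearn.linear_model import LinearRegression
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demand-forecast-demo").getOrCreate()

    # Assumed columns: store_id, product_id, week, price, demand.
    sales = spark.read.table("silver.sales")

    def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Train one model per (store, product) group on that group's history.
        model = LinearRegression().fit(pdf[["week", "price"]], pdf["demand"])
        # Assumes the MLflow tracking server is reachable from the executors.
        with mlflow.start_run():
            mlflow.log_param("store_id", str(pdf["store_id"].iloc[0]))
            mlflow.log_param("product_id", str(pdf["product_id"].iloc[0]))
            mlflow.log_metric("r2", model.score(pdf[["week", "price"]], pdf["demand"]))
        return pdf[["store_id", "product_id"]].head(1)

    # Spark runs each group's training as an independent task, in parallel.
    (sales.groupBy("store_id", "product_id")
          .applyInPandas(train_group, schema="store_id string, product_id string")
          .count())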

We show how this setup runs at various retailers and feeds accurate demand forecasts back to the ERP system, supporting the clients in their production planning and delivery. Through this session we want to inspire retailers & conference attendees to use data & AI not only to gain efficiency but also to decrease food waste.

Improving Apache Spark Application Processing Time by Configurations, Code Optimizations, etc.

In this session, we'll go over several use cases and describe the process of improving our Spark Structured Streaming application's micro-batch time from ~55 to ~30 seconds in several steps.

Our app processes ~700 MB/s of compressed data, has very strict KPIs, and uses several technologies and frameworks, such as Spark 3.1, Kafka, Azure Blob Storage, AKS, and Java 11.

We'll share our work and experience in those fields, and go over a few tips to create better Spark structured streaming applications.

The main areas discussed are Spark configuration changes, code optimizations, and the implementation of a custom Spark data source.

Improving Interactive Querying Experience on Spark SQL

As a data-driven company, Pinterest treats interactive querying on hundreds of petabytes of data as a common and important function. Interactive querying has different requirements and challenges from batch querying.

In this talk, we will cover various architectural alternatives one can choose from to perform interactive querying with Spark SQL. Through a discussion of those architectures' trade-offs and the requirements of interactive querying, we will elaborate on our design choice. We will share enhancements we made to open source projects including Apache Spark, Apache Livy, and Dr. Elephant, along with in-house technologies we built to improve the interactive querying experience at Pinterest. These include DDL query speed-ups, Spark session caching, Spark session sharing, Apache YARN diagnostic message improvements, query failure handling, and tuning recommendations. We will also discuss some challenges we faced along the way and future improvements we are working on.

Managing Straggler Executors at Apache Spark 3.3

Tuning high-performance Apache Spark applications to handle misbehaving executors is at best challenging and at worst impossible. Apache Spark does provide some built-in support to kill and recreate executors under certain conditions, such as long GC delays or application errors. However, this still leaves open various scenarios where slow-running executors can impact the overall performance of your application, even when you enable features such as task speculation. In this talk, we describe Apache Spark 3.3’s new feature, Executor Rolling. Apache Spark 3.3 (SPARK-37810) provides a built-in executor rolling driver plugin with three configurations.

  • spark.kubernetes.executor.rollInterval (default: '0s', which means disabled)
  • spark.kubernetes.executor.rollPolicy (default: OUTLIER)
  • spark.kubernetes.executor.minTasksPerExecutorBeforeRolling (default: 0)

This driver plugin tries to choose and decommission a single executor at every interval according to the given policy. The following are the built-in policies and their targets.

  • ID: An executor with the smallest executor ID
  • ADD_TIME: An executor with the smallest add-time
  • TOTAL_GC_TIME: An executor with the biggest GC time
  • TOTAL_DURATION: An executor with the biggest total task time
  • AVERAGE_DURATION: An executor with the biggest average task duration
  • FAILED_TASKS: An executor with the largest number of failed tasks
  • OUTLIER: An outlier executor or the biggest total task time

In short, Apache Spark 3.3 keeps the set of live executors fresh and greatly reduces the engineering burden of handling executors’ JVM misbehavior across diverse production jobs by applying the built-in executor rolling policies proactively.
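
A minimal sketch of enabling the plugin with the configurations listed above (the interval, policy, and task threshold are illustrative values for a Spark-on-Kubernetes application):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("executor-rolling-demo")
        # Every 30 minutes, decommission one executor chosen by the OUTLIER policy,
        # but only among executors that have already run at least 1000 tasks.
        .config("spark.kubernetes.executor.rollInterval", "1800s")
        .config("spark.kubernetes.executor.rollPolicy", "OUTLIER")
        .config("spark.kubernetes.executor.minTasksPerExecutorBeforeRolling", "1000")
        # Rolling decommissions executors, so decommissioning support is assumed on.
        .config("spark.decommission.enabled", "true")
        .getOrCreate()
    )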

Migrating Complex SAS Processes to Databricks - Case Study

Many federal agencies use SAS software for critical operational data processes. While SAS has historically been a leader in analytics, it has often been used by data analysts for ETL purposes as well. However, the demands of modern data science for ever-increasing volumes and types of data require a shift to modern cloud architectures and to modern data management tools and paradigms for ETL/ELT. In this presentation, we will provide a case study from the Centers for Medicare and Medicaid Services (CMS) detailing the approach and results of migrating a large, complex legacy SAS process to modern, open-source/open-standard technology - Spark SQL and Databricks - to produce results ~75% faster, without reliance on proprietary constructs of the SAS language, with more scalability, and in a manner that can more easily ingest old rules and better govern the inclusion of new rules and data definitions. Significant technical and business benefits derived from this modernization effort are described in this session.

Mosaic: A Framework for Geospatial Analytics at Scale

In this session we’ll present Mosaic, a new Databricks Labs project with a geospatial flavour.

Mosaic provides users of Spark and Databricks with a unified framework for distributing geospatial analytics. Users can choose to employ existing Java-based tools such as JTS or Esri's Geometry API for Java and Mosaic will handle the task of parallelizing these tools' operations: e.g. efficiently reading and writing geospatial data and performing spatial functions on geometries. Mosaic helps users scale these operations by providing spatial indexing capabilities (using, for example, Uber's H3 library) and advanced techniques for optimising common point-in-polygon and polygon-polygon intersection operations.
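
A small sketch of the intended usage (assuming the databricks-mosaic Python package; the function names follow its grid/ST conventions but should be verified against the project's documentation, and the table and columns are made up):

    from pyspark.sql import SparkSession, functions as F
    import mosaic as mos

    spark = SparkSession.builder.appName("mosaic-demo").getOrCreate()
    mos.enable_mosaic(spark)   # registers Mosaic's spatial functions on this session

    # Hypothetical point data with longitude/latitude columns.
    points = spark.read.table("bronze.gps_pings")

    indexed = points.withColumn(
        "cell_id",
        # Index each point into an H3 cell (resolution 9) for fast spatial joins.
        mos.grid_pointascellid(mos.st_point("longitude", "latitude"), F.lit(9)),
    )
    indexed.groupBy("cell_id").count().show()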

The development of Mosaic builds upon techniques developed with Ordnance Survey (the central hub for geospatial data across UK Government) and described in this blog post: https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html

Multimodal Deep Learning Applied to E-commerce Big Data

At Mirakl, we empower marketplaces with Artificial Intelligence solutions. Catalog data is an extremely rich source of e-commerce sellers' and marketplaces' products, which include images, descriptions, brands, prices, and attributes (for example, size, gender, material, or color). Such big volumes of data are suitable for training multimodal deep learning models and present several technical machine learning and MLOps challenges to tackle.

We will dive deep into two key use cases: deduplication and categorization of products. For categorization, the creation of quality multimodal embeddings plays a crucial role and is achieved through experimentation with transfer learning techniques on state-of-the-art models. Finding very similar or almost identical products among many millions can be a very difficult problem, and that is where our deduplication algorithm brings a fast and computationally efficient solution.

Furthermore, we will show how we deal with big volumes of products using robust and efficient pipelines: Spark for distributed and parallel computing, TFRecords to stream and ingest data optimally on multiple machines while avoiding memory issues, and MLflow for tracking experiments and metrics for our models.
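
A hedged sketch of that hand-off (assuming the spark-tensorflow-connector is on the classpath to provide the 'tfrecord' writer; the table, columns, paths, and run name are illustrative):

    import mlflow
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("catalog-tfrecords-demo").getOrCreate()

    # Hypothetical curated catalog table with text and image-embedding columns.
    catalog = spark.read.table("silver.catalog_products")

    # Write shards that TensorFlow workers can stream without loading everything into memory.
    (catalog.select("product_id", "title", "description", "image_embedding", "category")
            .write.format("tfrecord")           # provided by spark-tensorflow-connector
            .option("recordType", "Example")
            .mode("overwrite")
            .save("/mnt/ml/catalog_tfrecords"))

    # Track the export (and later training metrics) in MLflow.
    with mlflow.start_run(run_name="catalog-export"):
        mlflow.log_param("output_path", "/mnt/ml/catalog_tfrecords")
        mlflow.log_metric("num_products", catalog.count())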

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Microservices is an increasingly popular architecture much loved by application teams, for it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where all data from different services come together to join and aggregate. The data platform can serve as a single source of company facts, enable near real time analytics, and secure sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it has several challenges for evolving services: frequent schema changes; complex, unsupported DDL during migration; and automated deployments are but a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps with verifying that the events are being published based on the agreed contracts between data producers and consumers. It also provides a separation between internal service logic and the data consumed downstream. The services write their events to Kafka using the registered schemas with a specific topic based on the type of the event.

Data teams can leverage Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from the schema registry is used to validate the schema based on the provided version. A merge operation is sometimes needed to translate events into the final states of the records per business requirements.
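
A minimal sketch of that Bronze ingestion step, streaming a Kafka topic into a Delta table and merging each micro-batch into the latest record state (the brokers, topic, schema, and table names are assumptions, and schema-registry validation is simplified to a fixed schema here):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, LongType

    spark = SparkSession.builder.appName("kafka-bronze-demo").getOrCreate()

    # Illustrative event schema; in practice this is fetched from the schema registry.
    event_schema = (StructType()
        .add("order_id", StringType())
        .add("status", StringType())
        .add("updated_at", LongType()))

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # illustrative brokers
           .option("subscribe", "orders")                      # illustrative topic
           .load())

    events = raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e")).select("e.*")

    def upsert_to_bronze(batch_df, batch_id):
        # Merge each micro-batch so Bronze holds the latest state per order.
        # (A production job would first reduce the batch to the newest event per key.)
        bronze = DeltaTable.forName(spark, "bronze.orders")
        (bronze.alias("t")
               .merge(batch_df.alias("s"), "t.order_id = s.order_id")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    (events.writeStream
           .foreachBatch(upsert_to_bronze)
           .option("checkpointLocation", "/mnt/chk/bronze_orders")
           .start())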

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to enforce data quality and business rules. The pipeline allows Engineering and Analytics to collaborate by mixing Python and SQL. The refined data sets are then fed into AutoML for discovery and baseline modeling.
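
For instance, a Silver table with expectations might look like the following sketch (assuming a Delta Live Tables pipeline using the dlt Python module; the source table and rules are made up):

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Cleaned orders, refreshed in near real time")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
    @dlt.expect_or_drop("known_status", "status IN ('CREATED', 'PAID', 'SHIPPED')")
    def silver_orders():
        # Stream from the Bronze table produced by the ingestion job above.
        return (dlt.read_stream("bronze_orders")
                  .withColumn("processed_at", F.current_timestamp()))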

To expose Gold tables to more consumers, especially non-Spark users across clouds, data teams can implement Delta Sharing. Recipients can access Silver tables from a different cloud and build their own analytics data sets. Analytics teams can also access Gold tables via the pandas Delta Sharing client and BI tools.
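
On the recipient side, the pandas client from the delta-sharing package is enough to pull a shared table (a minimal sketch; the profile path and share/schema/table names are placeholders):

    import delta_sharing

    # The profile file is issued by the data provider and contains the sharing
    # endpoint and bearer token; share/schema/table names below are placeholders.
    profile = "/path/to/config.share"
    table_url = profile + "#analytics_share.gold.daily_revenue"

    # Load the shared Delta table straight into a pandas DataFrame, no Spark required.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())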

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks

Working with S3 is different from working with HDFS: the architecture of the object store makes the standard Spark file connector inefficient for S3.

There is a way to tackle this problem with a message queue that listens for changes in a bucket. But what if an additional message queue is not an option and you need to use Spark Streaming? You can use the standard file connector, but you quickly face performance degradation as the number of files in the source path grows.

We have seen this happen at Hunters, a security operations platform that works with a wide range of data sources.

We want to share a description of the problem and the solution we will open-source. The audience will learn how to configure it and make the best use of it. We will also discuss how to use metadata to boost the performance of discovering new files in the stream, and show the use case of utilizing CloudTrail's time metadata to efficiently collect logs for hunting cyber attacks.
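
For context, this is the baseline pattern whose listing cost grows with the number of objects under the prefix (a generic Structured Streaming file-source sketch, not the connector being open-sourced; the bucket, schema, and paths are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("s3-file-stream-demo").getOrCreate()

    schema = (StructType()
        .add("eventTime", TimestampType())
        .add("eventName", StringType())
        .add("sourceIPAddress", StringType()))

    # The standard file source lists the S3 prefix on every micro-batch to find
    # new files, which degrades as the number of objects under the prefix grows.
    logs = (spark.readStream
            .format("json")
            .schema(schema)
            .option("maxFilesPerTrigger", 500)
            .load("s3a://security-logs/cloudtrail/"))

    (logs.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/chk/cloudtrail_bronze")
         .start("/mnt/bronze/cloudtrail"))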
