talk-data.com

Topic: Spark (Apache Spark)

Tags: big_data, distributed_computing, analytics

581 activities tagged

Activity Trend: peak of 71 activities per quarter (2020-Q1 to 2026-Q1)

Activities

581 activities · Newest first

Fugue Tune: Distributed Hybrid Hyperparameter Tuning

Hyperparameter optimization on Spark is commonly memory-bound, where model training is done on data that doesn't fit on a single machine. We introduce Fugue Tune, an intuitive interface focused on compute-bound hyperparameter tuning that scales Hyperopt and Optuna by allowing them to leverage Spark and Dask without code changes.
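As a rough illustration of the compute-bound case the abstract describes, the sketch below shows a plain Optuna study whose trials each fit on one machine; Fugue Tune's own interface (not shown here, since its API is not spelled out in the abstract) is what would fan such trials out over a Spark or Dask cluster.

    # Minimal sketch of a compute-bound tuning objective of the kind Fugue Tune targets.
    # Plain Optuna is shown; Fugue Tune (not shown) distributes such trials over
    # Spark or Dask without rewriting the objective.
    import optuna
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score


    def objective(trial: optuna.Trial) -> float:
        # Each trial fits comfortably on one machine; only the search is parallelized.
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 50, 300),
            "max_depth": trial.suggest_int("max_depth", 3, 12),
        }
        X, y = load_digits(return_X_y=True)
        model = RandomForestClassifier(**params, random_state=0)
        return cross_val_score(model, X, y, cv=3).mean()


    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)  # the part a distributed tuner would parallelize
    print(study.best_params)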


ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

Find out how Gousto is developing its data pipelines at scale in a repeatable manner. At Gousto, we've developed Goustospark, a wrapper around PySpark that allows us to quickly and easily build data pipelines that are deployed into our Databricks environment.

This wrapper abstracts the repetitive components of all data pipelines, such as Spark configurations and metastore interactions. It allows a developer to simply specify the blueprint of the pipeline before turning their attention to more pressing issues, such as data quality and data governance, whilst enjoying a high level of performance and reliability.

In this session we will dive deep into the design patterns we followed and some unique approaches we've taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will even share some example Python code and snippets to help you build your own.
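To make the wrapper idea concrete, here is a purely hypothetical sketch of what such a "blueprint" wrapper could look like; the class and method names are illustrative, not Gousto's actual Goustospark API.

    # Hypothetical sketch of a thin PySpark pipeline wrapper in the spirit of the talk:
    # the developer declares the pipeline blueprint (source, transform, sink) and the
    # wrapper owns Spark configuration and metastore interactions.
    from pyspark.sql import DataFrame, SparkSession


    class Pipeline:
        def __init__(self, name: str, source_table: str, target_table: str):
            self.source_table = source_table
            self.target_table = target_table
            # Repetitive Spark configuration lives in one place.
            self.spark = (
                SparkSession.builder.appName(name)
                .config("spark.sql.shuffle.partitions", "200")
                .enableHiveSupport()
                .getOrCreate()
            )

        def transform(self, df: DataFrame) -> DataFrame:
            # Developers override this method with business logic only.
            return df

        def run(self) -> None:
            df = self.spark.table(self.source_table)
            result = self.transform(df)
            # Metastore interaction handled by the wrapper.
            result.write.mode("overwrite").saveAsTable(self.target_table)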


How AARP Services, Inc. automated SAS transformation to Databricks using LeapLogic

While SAS has been a standard in analytics and data science use cases, it is not cloud-native and does not scale well. Join us to learn how AARP automated the conversion of hundreds of complex data processing, model scoring, and campaign workloads to Databricks using LeapLogic, an intelligent code transformation accelerator that can transform legacy ETL, analytics, data warehouse, and Hadoop workloads to modern data platforms.

In this session, experts from AARP and Impetus will share how they collaborated with Databricks and how they were able to:
• Automate modernization of SAS marketing analytics based on coding best practices
• Establish a rich library of Spark and Python equivalent functions on Databricks with the same capabilities as SAS procedures, DATA step operations, macros, and functions
• Leverage Databricks-native services like Delta Live Tables to implement waterfall techniques for campaign execution and simplify pipeline monitoring


Welcome & Destination Lakehouse | Ali Ghodsi | Keynote | Data + AI Summit 2022

Join the Day 1 keynote to hear from Databricks co-founders - and original creators of Apache Spark and Delta Lake - Ali Ghodsi, Matei Zaharia, and Reynold Xin on how Databricks and the open source community are taking on the biggest challenges in data. The talks will address the latest updates on the Apache Spark and Delta Lake projects, the evolution of data lakehouse architecture, and how companies like Adobe and Amgen are using lakehouse architecture to advance their data goals.


Apache Spark Community Update | Reynold Xin | Streaming Lakehouse | Karthik Ramasamy

Data + AI Summit Keynote talks from Reynold Xin and Karthik Ramasamy


Databricks Meets Power BI

Databricks and Spark are becoming increasingly popular and are now used as a modern data platform to analyze real-time or batch data. In addition, Databricks offers great integration for machine learning developers.

Power BI, on the other hand, is a platform for easy graphical analysis of data: it brings hundreds of different data sources together, lets you analyze them in one place, and makes the results accessible on any device.

So let's just bring both worlds together and see how well Databricks works with Power BI.


From PostGIS to Spark SQL: The History and Future of Spatial SQL

In this talk, we'll review the major milestones that have defined Spatial SQL as the powerful tool for geospatial analytics that it is today.

From the early foundations of the JTS Topology Suite and GEOS, and their application in the PostGIS extension for PostgreSQL, to the latest implementations in Spark SQL using libraries such as the CARTO Analytics Toolbox for Databricks, Spatial SQL has been a key component of many geospatial analytics products and solutions. It leverages the computing power of different databases with SQL as the lingua franca, allowing easy adoption by data scientists, analysts, and engineers.

The latest innovation in this area is the CARTO Spatial Extension for Databricks, which makes the most of the near-unlimited scalability provided by Spark and the cutting-edge geospatial capabilities that CARTO offers.
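As a flavor of what Spatial SQL on Spark looks like, here is an illustrative query using PostGIS-style ST_ functions from PySpark. The function names and how they get registered vary by library (for example the CARTO Analytics Toolbox or Apache Sedona), so treat this as a sketch of the convention rather than any specific library's API.

    # Illustrative Spatial SQL on Spark: count visits falling inside each store's
    # catchment polygon. ST_ function names follow the PostGIS convention and are
    # assumed here; registration differs per library.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spatial-sql").getOrCreate()

    visits_per_store = spark.sql("""
        SELECT s.store_id, COUNT(*) AS visits
        FROM visits v
        JOIN stores s
          ON ST_Contains(ST_GeomFromWKT(s.catchment_wkt),
                         ST_Point(v.lon, v.lat))
        GROUP BY s.store_id
    """)
    visits_per_store.show()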


Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

Modern data lake architectures rely on object storage as the single source of truth. We use them to store an increasing amount of data, which is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility needed for data quality and resiliency.

lakeFS - an open source data version control system designed for data lakes - solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.

In this talk you'll learn about the challenges with using object storage for data lakes and how lakeFS enables you to solve them.

By the end of the session you’ll understand how lakeFS scales its Git-like data model to petabytes of data, across billions of objects - without affecting throughput or performance. We will also demo branching, writing data using Spark and merging it on a billion-object repository.
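For orientation, here is a hedged sketch of the Spark-side part of that demo pattern: writing to an isolated lakeFS branch through lakeFS's S3-compatible gateway, where the object path embeds repository and branch. The repository, branch, endpoint, and credentials below are illustrative placeholders.

    # Sketch: write to a lakeFS branch from Spark via the S3-compatible gateway,
    # using the s3a://<repository>/<branch>/<path> convention. Names are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lakefs-branch-write")
        # Point the S3A client at the lakeFS endpoint with lakeFS credentials.
        .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
        .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
        .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://my-repo/main/events/")          # read from main
    df.filter("event_date = '2022-06-01'") \
      .write.parquet("s3a://my-repo/experiment/events_daily/")     # write to a branch

    # Committing the branch and merging it back into main happens outside Spark,
    # e.g. with the lakectl CLI or the lakeFS API.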


How AT&T Data Science Team Solved an Insurmountable Big Data Challenge on Databricks

Data-driven personalization has been a seemingly insurmountable challenge for AT&T's data science team because of the size of its datasets and the complexity of the data engineering involved. These data preparation tasks often take several hours or days to complete, and some of them fail to complete at all, affecting productivity.

In this session, the AT&T Data Science team will talk about how the RAPIDS Accelerator for Apache Spark and the Photon runtime on Databricks can be leveraged to process these extremely large datasets, resulting in improved content recommendation, classification, and more while reducing infrastructure costs. The team will compare speedups and costs to the regular Databricks Runtime Apache Spark environment. The tested datasets range from 2 TB to 50 TB and consist of data collected over 1 to 31 days.

The talk will showcase the results from both RAPIDS accelerator for Apache Spark and Databricks Photon runtime.
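For context, the sketch below shows the kind of open-source Spark settings commonly used to enable the RAPIDS Accelerator on a GPU cluster. Exact jars, versions, and cluster-level setup (especially on Databricks) vary, and the job shown is only an illustrative data preparation step, not AT&T's workload.

    # Sketch of typical RAPIDS Accelerator settings on a GPU Spark cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("rapids-etl")
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # load the RAPIDS plugin
        .config("spark.rapids.sql.enabled", "true")              # run SQL operators on GPU
        .config("spark.rapids.sql.explain", "NOT_ON_GPU")        # log operators that fall back to CPU
        .getOrCreate()
    )

    # A typical data preparation step that the accelerator can run on GPUs.
    df = spark.read.parquet("s3://bucket/telemetry/")
    daily = df.groupBy("device_id", "event_date").count()
    daily.write.mode("overwrite").parquet("s3://bucket/telemetry_daily/")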


How Robinhood Built a Streaming Lakehouse to Bring Data Freshness from 24h to Less Than 15 Mins

Robinhood’s data lake is the bedrock foundation that powers business analytics, product experimentation, and other machine learning applications throughout our organization. Come join this session where we will share our journey of building a scalable streaming data lakehouse with Spark, Postgres and other leading open source technologies.

We will lay out our architecture in depth and describe how we perform CDC streaming ingestion and incremental processing of thousands of Postgres tables into our data lake.
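The abstract does not name the components beyond Spark and Postgres, but a common shape of such a pipeline is change events captured from Postgres (for example via Debezium) landing on Kafka and being ingested by a Structured Streaming job. The sketch below assumes that pattern; topic names, servers, and table paths are illustrative.

    # Sketch of one common CDC ingestion pattern: Kafka-borne Postgres change events
    # ingested into the lakehouse with Spark Structured Streaming.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

    changes = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "postgres.public.orders")   # one CDC topic per table
        .option("startingOffsets", "latest")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
    )

    (
        changes.writeStream
        .format("delta")                                  # or another open table format
        .option("checkpointLocation", "s3://lake/checkpoints/orders/")
        .trigger(processingTime="5 minutes")              # keeps freshness well under 15 minutes
        .start("s3://lake/raw/orders_changelog/")
    )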


Obfuscating Sensitive Information from Spark UI and Logs

The Spark UI and logs contain useful information but also include sensitive data that needs to be obfuscated.

To obfuscate the data, at Workday we have implemented methods for Apache Spark where the string representations of the TreeNode class can be configured to be obfuscated or not. To do this, we added a custom TreeNode printer for the UI and a custom log4j appender that uses a list of rules based on class name, package name, and log message regexes to decide whether to obfuscate, including output from third-party libraries. In the Spark UI and in the logging, this results in the obfuscation of Spark plans and column names.

In this talk we will go over the steps we have taken to implement these obfuscation methods and show what the result looks like in the Spark UI and logs. The methods have worked out well in production at Workday, and other companies can benefit from implementing them as well.
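The Workday implementation lives in the JVM (a TreeNode printer plus a custom log4j appender), but the rule-based idea can be sketched more simply. The following Python logging filter is a conceptual illustration only: rules keyed on logger name and message regexes decide whether a record is redacted before being emitted.

    # Conceptual sketch of rule-based log obfuscation; not the actual Spark/log4j code.
    import logging
    import re

    # (logger-name prefix, message pattern) pairs that trigger obfuscation.
    RULES = [
        ("org.apache.spark.sql", re.compile(r"== (Parsed|Analyzed|Optimized|Physical) ")),
        ("com.example.thirdparty", re.compile(r".")),
    ]


    class ObfuscatingFilter(logging.Filter):
        def filter(self, record: logging.LogRecord) -> bool:
            message = record.getMessage()
            for prefix, pattern in RULES:
                if record.name.startswith(prefix) and pattern.search(message):
                    # Replace the rendered message with a redacted placeholder.
                    record.msg, record.args = "[REDACTED: plan/column details]", ()
                    break
            return True  # always emit the record, possibly redacted


    handler = logging.StreamHandler()
    handler.addFilter(ObfuscatingFilter())
    logging.getLogger().addHandler(handler)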


Recent Parquet Improvements in Apache Spark

Apache Parquet is a very popular columnar file format supported by Apache Spark. In a typical Spark job, scanning Parquet files is sometimes one of the most time-consuming steps, as it incurs high CPU and I/O overhead. Therefore, optimizing Parquet scan performance is crucial to job latency and cost efficiency.

Spark currently has two Parquet reader implementations: a vectorized one and a non-vectorized one. The former was implemented from scratch and offers much better performance than the latter. However, it doesn't yet support complex types (e.g., array, list, map) and falls back to the latter when encountering them. In addition to the reader implementation, predicate pushdown is also crucial to Parquet scan performance, as it enables Spark to skip data that does not satisfy the predicates before the scan. Currently, Spark constructs the predicates itself and relies on Parquet-MR to do the heavy lifting, filtering based on various information such as statistics, dictionaries, bloom filters, or column indexes.

This talk will go through two recent improvements to Parquet scan performance: 1) vectorized read support for complex types, which allows Spark to achieve a 10x+ improvement when reading Parquet data of complex types, and 2) Parquet column index support, which enables Spark to leverage the Parquet column index feature during predicate pushdown. Last but not least, Chao will go over some future work items that can further enhance Parquet read performance.
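For reference, the sketch below shows session settings related to the two behaviors discussed. Config names can differ across Spark releases, so verify them against the documentation for your Spark version; the data paths and schema are illustrative.

    # Sketch of Parquet-scan settings related to the improvements described above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-scan").getOrCreate()

    # Vectorized reads for nested/complex types (introduced around Spark 3.3).
    spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

    # Predicate pushdown into the Parquet reader (statistics, dictionary, column index),
    # so non-matching pages and row groups can be skipped before the scan.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    df = spark.read.parquet("s3://bucket/events/")
    df.filter("event_type = 'purchase'").select("user.id", "items").explain()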


Apache Spark AQE SkewedJoin Optimization and Practice in ByteDance

Almost no distributed computing system can avoid data skew. If data skew is not dealt with, long-tail tasks will seriously slow down job execution or even cause the job to fail. In this talk, we will introduce how Spark AQE processes skewed joins and how we optimized the implementation based on the workloads at ByteDance. The main points are as follows:
1. Address the risks associated with increasing statistical accuracy, to solve the problem of failing to identify data skew.
2. Optimize the split logic for skewed data to achieve a better optimization effect.
3. Support more complex optimization scenarios than the community implementation, covering essentially all skewed-join scenarios.
By February 2021, Spark AQE SkewedJoin optimization covered 13,000+ Spark jobs per day at ByteDance, and the average performance of optimized Spark jobs increased by 35%.
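For readers who want to try the baseline behavior, the open source AQE skew-join knobs that this work builds on are shown below. ByteDance's internal enhancements are not part of these settings, and the threshold values are only illustrative (they match the community defaults).

    # Open source Spark 3 AQE skew-join settings (ByteDance's enhancements not included).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aqe-skew-join").getOrCreate()

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition is considered skewed if it is both skewedPartitionFactor times larger
    # than the median partition size and above the byte threshold.
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

    orders = spark.table("orders")          # large table, skewed on customer_id
    customers = spark.table("customers")
    orders.join(customers, "customer_id") \
          .write.mode("overwrite").saveAsTable("orders_enriched")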


Day 1 Morning Keynote | Data + AI Summit 2022

Day 1 Morning Keynote | Data + AI Summit 2022
Welcome & "Destination Lakehouse" | Ali Ghodsi
Apache Spark Community Update | Reynold Xin
Streaming Lakehouse | Karthik Ramasamy
Delta Lake | Michael Armbrust
How Adobe migrated to a unified and open data Lakehouse to deliver personalization at unprecedented scale | Dave Weinstein
Data Governance and Sharing on Lakehouse | Matei Zaharia
Analytics Engineering and the Great Convergence | Tristan Handy
Data Warehousing | Shant Hovespian
Unlocking the power of data, AI & analytics: Amgen's journey to the Lakehouse | Kerby Johnson

Get insights on how to launch a successful lakehouse architecture in Rise of the Data Lakehouse by Bill Inmon, the father of the data warehouse. Download the ebook: https://dbricks.co/3ER9Y0K


Media and Entertainment Experience at Data + AI Summit 2022

Welcome data teams and executives in Media and Entertainment! This year’s Data + AI Summit is jam-packed with talks, demos and discussions focused on how organizations are using data to personalize, monetize and innovate the audience experience. To help you take full advantage of the Communications, Media & Entertainment experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

Communications, Media & Entertainment Forum: Our capstone event for the industry at Summit, featuring fireside chats and panel discussions with HBO, Warner Bros. Discovery, LaLiga, and Condé Nast, followed by networking. More details in the agenda below.

Industry Lounge: Stop by our lounge located outside the Expo floor to meet with Databricks' industry experts and see solutions from our partners including Cognizant, Fivetran, Labelbox, and Lovelytics.

Session Talks: Over 10 technical talks on topics including Telecommunication Data Lake Management at AT&T, Data-Driven Futbol Analysis from LaLiga, Improving Recommendations with Graph Neural Networks from Condé Nast, Tools for Assisted Spark Version Migrations at Netflix, Real-Time Cost Reduction Monitoring and Alerting with HuuugeGames, and much more.


Moving from Apache Spark 2 to Apache Spark 3: Spark Version Upgrade at Scale in Pinterest

Apache Spark has become Pinterest’s dominant distributed batch processing framework, and Pinterest is migrating its Spark Platform and most production Spark jobs to Spark 3.

In this talk, we'll share how Pinterest performed the Spark 3 version migration at scale. Moving to Spark 3 is a huge version upgrade that brings many incompatibilities and major differences compared with Spark 2. We'll first introduce the motivation for the migration, then talk about the major challenges and the approaches we took: how we handled different Spark job types during the migration, how we addressed the incompatibilities between Spark 2 and Spark 3 (such as Scala version support), and how we efficiently and safely migrated our existing production Spark jobs at scale without impacting stability and SLOs with the help of our Auto Migration Service (AMS). We'll then discuss the resulting performance improvements and cost savings, as well as the future plans and improvements we'll work on.


Optimizing Incremental Ingestion in the Context of a Lakehouse

Incremental ingestion of data is often trickier than one would assume, particularly when it comes to maintaining data consistency: for example, specific challenges arise depending on whether the data is ingested in a streaming or a batched fashion. In this session we want to share the real-life challenges encountered when setting up an incremental ingestion pipeline in the context of a Lakehouse architecture.

We outline how we used recently introduced Databricks features, such as Auto Loader and Change Data Feed, in addition to more mature features, such as Spark Structured Streaming and the Trigger Once functionality. These allowed us to transform batch processes into a "streaming" setup without needing the cluster to run continuously. This setup, which we are keen to share with the community, does not require reloading large amounts of data and therefore represents a computationally, and consequently economically, cheaper solution.

In our presentation we dive deeper into each aspect of the setup, with extra focus on essential Auto Loader functionalities such as schema inference, recovery mechanisms, and file discovery modes.
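As a hedged sketch of the pattern described, the snippet below combines an Auto Loader stream (with a schema location for inference and evolution) and Trigger Once, so the "stream" can run as a scheduled batch and no cluster has to stay up. Paths and options are illustrative; check the Databricks documentation for your runtime.

    # Sketch: Auto Loader ingestion run as a Trigger Once job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-ingest").getOrCreate()

    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "s3://lake/_schemas/orders/")  # schema inference + evolution
        .load("s3://landing/orders/")
    )

    (
        raw.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://lake/_checkpoints/orders/")
        .trigger(once=True)                 # process everything new, then stop
        .start("s3://lake/bronze/orders/")
    )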


Presto 101: An Introduction to Open Source Presto

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can query data in place ad hoc, which helps reduce both time to discovery and the time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector bring added benefits around performance, scale, and ecosystem.

In this session, Philip and Rohan will introduce the Presto technology and share why it's becoming so popular: companies like Facebook, Uber, Twitter, Alibaba, and many more use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. We'll also show a quick demo of getting Presto running in AWS.


Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

In recent years, new privacy laws and regulations have brought a fundamental shift in the protection of data and privacy, posing new challenges to data applications. To resolve these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone and Opaque. However, to the best of our knowledge, none of them provides full protection to the data pipelines in Spark applications: an adversary may still get sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., supporting only Spark SQL.

In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum, and Intel SGX. It ensures that all Spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java, or Python can be migrated onto our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/Spark SQL workloads with our solution in an untrusted cloud environment and share real-world use cases for PPMLA.


ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

Data is the key component of any analytics, AI, or ML platform. Organizations may not be successful without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build data storage (Redshift) in a form that can be easily consumed by AI/ML programs, using AWS services in combination with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for over six years, with 400+ nightly batch jobs for about $1,000/month: (1) Spark on EC2, (2) the UI-based ETL tool with a Spark backend (on the same EC2 instances), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.
