talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 activities tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1286 activities · Newest first

Pushing the limits of scale/performance for enterprise-wide analytics: A fire-side chat with Akamai

With the world’s most distributed compute platform — from cloud to edge — Akamai makes it easy for businesses to develop and run applications while keeping experiences closer to users and threats farther away. So when its legacy Hadoop-like infrastructure was reaching its capacity limits, Akamai partnered with Microsoft and Databricks to migrate to Azure Databricks while keeping its global operations running uninterrupted.


PySpark in Apache Spark 3.3 and Beyond

PySpark has rapidly evolved with the momentum of Project Zen, introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, implemented visualization, and more. Most importantly, the Pandas API on Spark was introduced in Apache Spark 3.2, which exposes the pandas API running on Apache Spark, and it has gained a lot of popularity.

In Apache Spark 3.3, the Project Zen effort continued, and PySpark gained many cool changes such as more API coverage and a faster default index in the Pandas API on Spark, datetime.timedelta support, a new PyArrow batch interface, better autocompletion, a Python and Pandas UDF profiler, and new error classification.

In this talk, we will introduce what is new in PySpark in Apache Spark 3.3, and what is coming beyond Apache Spark 3.3 based on the current effort and roadmap in PySpark.
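
As a quick illustration of the Pandas API on Spark described above, here is a minimal, self-contained sketch; the data and column names are made up for illustration:

```python
# Minimal sketch: the pandas API running on Spark (illustrative data and columns).
import pyspark.pandas as ps

# Create a pandas-on-Spark DataFrame; operations execute on Spark under the hood.
psdf = ps.DataFrame({"item": ["a", "b", "a", "c"], "qty": [1, 2, 3, 4]})

# Familiar pandas-style transformations, distributed across the cluster.
summary = psdf.groupby("item")["qty"].sum().sort_index()
print(summary)

# Convert to a regular Spark DataFrame when needed.
sdf = psdf.to_spark()
sdf.show()
```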


Quick to Production with the Best of Both Apache Spark and Tensorflow on Databricks

Using TensorFlow with big datasets has been an impediment to building deep learning models because of the added complexity of running it in a distributed setting and the complicated MLOps code it requires. Recent advancements in TensorFlow 2 and some extension libraries for Spark have now simplified a lot of this. This talk focuses on how we can leverage the best of both Spark and TensorFlow to build machine learning and deep learning models with minimal MLOps code, letting Spark handle the grunt work and enabling us to focus more on feature engineering and building the model itself. This design also enables us to use any of the libraries in the TensorFlow ecosystem (like TensorFlow Recommenders) with the same boilerplate code.

For businesses like ours, fast prototyping and quick experimentation are key to building completely new experiences in an efficient and iterative way. It is always preferable to have tangible results before putting more resources into a project. This design provides us with that capability and lets us spend more time on research, building models, testing quickly, and rapidly iterating. It also gives us the flexibility to use our choice of framework at any stage of the machine learning lifecycle.

In this talk, we will go through some of the best and newest features of both Spark and TensorFlow, how to go from single-node training to distributed training with very few extra lines of code, how to leverage MLflow as a central model store, and finally, how to use these models for batch and real-time inference.
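
To make the "few extra lines of code" point concrete, here is a hedged sketch of one way to wrap single-node Keras training for distributed execution on a Spark cluster with the spark-tensorflow-distributor library; the dataset, model, and MLflow call are placeholders rather than the speakers' actual pipeline:

```python
# Hedged sketch (not the speakers' exact code): single-node Keras training wrapped
# so spark-tensorflow-distributor can fan it out across a Spark cluster.
from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
    import tensorflow as tf

    # Toy dataset and model; in practice this is the feature-engineered data.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x, y, epochs=1, batch_size=256)

    # A central model store such as MLflow would typically be used here, e.g.
    # mlflow.tensorflow.log_model(model, "model") (exact API varies by version).
    return model.get_weights()

# The same train() runs on a laptop or across the cluster; num_slots controls how
# many slots participate. Set use_gpu=True on GPU clusters.
weights = MirroredStrategyRunner(num_slots=2, use_gpu=False).run(train)
```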


Radical Speed on the Lakehouse: Photon Under the Hood

Many organizations are standardizing on the lakehouse; however, this new architecture poses challenges for the underlying query execution engine that accesses structured and unstructured data. The execution engine needs to provide the performance of a data warehouse and the scalability of data lakes. To ensure optimum performance, the Databricks Lakehouse Platform offers Photon. This next-gen vectorized query execution engine outperforms existing data warehouses on SQL workloads and implements a more general execution framework for efficient data processing with support for the Apache Spark™ API. With Photon, analytical queries are seeing a 3 to 5x speed increase, with a 40% reduction in compute hours for ETL workloads. In this session, we will dive into Photon, describe its integration with the Databricks Platform and Apache Spark™ runtimes, talk through customer use cases, and show how your SQL and DataFrame workloads can benefit from the performance of Photon.


Realize the Promise of Streaming with the Databricks Lakehouse Platform

Streaming is the future of all data pipelines and applications. It enables businesses to make data-driven decisions sooner and react faster, develop data-driven applications previously considered impossible, and deliver new and differentiated experiences to customers. However, many organizations have not realized the full promise of streaming because it requires them to completely redevelop their data pipelines and applications on new, complex, proprietary, and disjointed technology stacks.

The Databricks Lakehouse Platform is a simple, unified, and open platform that supports all streaming workloads, ranging from ingestion and ETL to event processing, event-driven applications, and ML inference. In this session, we will discuss the streaming capabilities of the Lakehouse Platform and demonstrate how easy it is to build end-to-end, scalable streaming pipelines and applications to fulfill the promise of streaming for your business. You will also hear from Erica Lee, VP of ML at Upwork, the world's largest Work Marketplace, as she shares how the Upwork team uses Databricks to enable real-time predictions by computing ML features in a continuous streaming manner.
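
As a flavour of what such a pipeline looks like, below is a minimal Structured Streaming sketch over Delta tables; the table names, schema, and checkpoint path are illustrative and not taken from the session:

```python
# Hedged sketch: a minimal end-to-end streaming pipeline on Delta tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-lakehouse-sketch").getOrCreate()

# Incrementally read new rows appended to a "bronze" Delta table.
events = spark.readStream.table("bronze_events")

# Light ETL: derive a minute bucket and aggregate continuously.
per_minute = (
    events
    .withColumn("minute", F.date_trunc("minute", F.col("event_time")))
    .groupBy("minute", "event_type")
    .count()
)

# Continuously write results to a "silver" table with exactly-once semantics.
query = (
    per_minute.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/per_minute")  # placeholder path
    .toTable("silver_event_counts")
)
query.awaitTermination()
```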


Real-Time Search and Recommendation at Scale Using Embeddings and Hopsworks

The dominant paradigm today for real-time personalized recommendations and personalized search is the retrieval and ranking architecture based on embeddings. It is a fan-out architecture where a single query produces a storm of requests on the backend: a single query will search through millions of items to retrieve hundreds of candidates, which are then enriched by a feature store and ranked so that only a few recommended items are presented to the user. A search should return in much less than 1 second. Retrieval and ranking architectures need significant infrastructure - an embeddings store and a feature store - to provide both the required scale and real-time performance.

In this talk, we will introduce an open-source, scalable retrieval and ranking serving architecture based on open-source technology: Hopsworks Feature Store, OpenSearch, and KServe. We will describe how to build and operate personalized search and recommendation systems using a retrieval model based on a two-tower embedding model and a ranking model based on gradient-boosted trees. We will also show how you can train your embeddings and build your embedding store index using Hopsworks and Apache Spark.

Attend this session to learn:

  • how to build a scalable, real-time retrieval and ranking recommender system using open-source platforms;
  • how to train item/user embedding models and ranking models;
  • how to put all these pieces together in an end-to-end solution for training and operating a scalable recommender/search engine.
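
For a concrete sense of the retrieval step, here is a hedged sketch of approximate nearest-neighbour candidate retrieval with OpenSearch's k-NN index via the Python client; the index name, field names, embedding dimension, and query vector are assumptions for illustration (the talk builds this on Hopsworks, OpenSearch, and KServe):

```python
# Hedged sketch of the retrieval step only: ANN search over item embeddings with
# OpenSearch's k-NN index. Names, dimension, and the query vector are illustrative.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Index with a knn_vector field holding item embeddings from the item tower.
client.indices.create(
    index="items",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "item_id": {"type": "keyword"},
                "embedding": {"type": "knn_vector", "dimension": 16},
            }
        },
    },
)

# At query time, embed the user/query with the query tower, then fetch top-k
# candidate items; these candidates are later enriched and re-ranked.
query_embedding = [0.1] * 16  # placeholder output of the query tower
response = client.search(
    index="items",
    body={"size": 100, "query": {"knn": {"embedding": {"vector": query_embedding, "k": 100}}}},
)
candidates = [hit["_source"]["item_id"] for hit in response["hits"]["hits"]]
```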


Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster

This talk discusses “software-defined assets”, a declarative approach to orchestration and data management that makes it drastically easier to trust and evolve datasets and ML models. Dagster is an open source orchestrator built for maintaining software-defined assets.

In traditional data platforms, code and data are only loosely coupled. As a consequence, deploying changes to data feels dangerous, backfills are error-prone and irreversible, and it’s difficult to trust data, because you don’t know where it comes from or how it’s intended to be maintained. Each time you run a job that mutates a data asset, you add a new variable to account for when debugging problems.

Dagster proposes an alternative approach to data management that tightly couples data assets to code - each table or ML model corresponds to the function that’s responsible for generating it. This results in a “Data as Code” approach that mimics the “Infrastructure as Code” approach that’s central to modern DevOps. Your git repo becomes your source of truth on your data, so pushing data changes feels as safe as pushing code changes. Backfills become easy to reason about. You trust your data assets because you know how they’re computed and can reproduce them at any time. The role of the orchestrator is to ensure that physical assets in the data warehouse match the logical assets that are defined in code, so each job run is a step towards order.

Software-defined assets are a natural approach to orchestration for the modern data stack, in part because dbt models are a type of software-defined asset.

Attendees of this session will learn how to build and maintain lakehouses of software-defined assets with Dagster.
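
A minimal sketch of what software-defined assets look like in Dagster, assuming illustrative asset names and logic rather than the presenter's example:

```python
# Minimal sketch of software-defined assets in Dagster: each asset is a function
# that declares what it produces and which upstream assets it depends on.
import pandas as pd
from dagster import asset, materialize

@asset
def raw_orders() -> pd.DataFrame:
    # In practice this would read from object storage or a warehouse.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 40.0]})

@asset
def order_summary(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Dagster wires the dependency from the argument name matching the upstream asset.
    return pd.DataFrame({"total_amount": [raw_orders["amount"].sum()]})

if __name__ == "__main__":
    # The orchestrator reconciles physical data with these definitions on each run.
    materialize([raw_orders, order_summary])
```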


Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

Data is the key component of any analytics, AI, or ML platform. Organizations cannot succeed without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build data storage (Redshift) in a form that can be easily consumed by AI/ML programs, using AWS services in combination with open-source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for over six years, with 400+ nightly batch jobs for about $1,000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.


Scaling AI Workloads with the Ray Ecosystem

Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and require distributed execution. Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales Python applications and ML workloads from a laptop to a cluster, with an emphasis on the unique performance challenges of ML/AI systems. It is now used in many production deployments.

This talk will give an overview of Ray, its architecture, core concepts, and primitives, such as remote tasks and actors; briefly discuss Ray's native libraries (Ray Tune, Ray Train, Ray Serve, Ray Datasets, RLlib); and touch on Ray's growing ecosystem for scaling your Python or ML workloads.
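
As a minimal illustration of the core primitives mentioned above, the sketch below uses a remote task and an actor; the workload itself is a toy example:

```python
# Minimal sketch of Ray's core primitives: remote tasks and actors.
import ray

ray.init()  # on a cluster, ray.init(address="auto") attaches to the running cluster

@ray.remote
def square(x: int) -> int:
    # A stateless task that Ray can schedule anywhere in the cluster.
    return x * x

@ray.remote
class Counter:
    # A stateful actor: a worker process that keeps state between calls.
    def __init__(self):
        self.count = 0

    def increment(self) -> int:
        self.count += 1
        return self.count

# Fan out tasks in parallel and gather the results.
print(ray.get([square.remote(i) for i in range(8)]))

counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```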

Through a demo using XGBoost for classification, we will demonstrate how you can scale training, hyperparameter tuning, and inference from a single node to a cluster, with a tangible performance difference when using Ray.

The takeaways from this talk are:

  • Ray architecture, core concepts, and Ray primitives and patterns
  • Why distributed computing will be the norm, not an exception
  • How to scale your ML workloads with Ray libraries: training on a single node vs. a Ray cluster, using XGBoost with and without Ray
  • Hyperparameter search and tuning, using XGBoost with Ray and Ray Tune
  • Inference at scale, using XGBoost with and without Ray


Scaling Deep Learning on Databricks

Training modern deep learning models in a timely fashion requires leveraging GPUs to accelerate the process. Ensuring that this expensive hardware is properly utilised and scales efficiently is complex, however. All the steps, from data storage and loading through preprocessing and finally distributing the model training process, require careful thought.

To reduce the cost of training a model, we need to ensure that we are making best use of our hardware resources. Typically, the GPUs that we rely on are memory constrained with much smaller amounts of VRAM being available relative to CPU RAM. As such we will need to leverage a variety of libraries to help ensure that we can keep our GPUs running.

Through the use of libraries like Petastorm to handle the data loading side, and PyTorch Lightning and Horovod to handle the model distribution side, we can leverage commodity Spark clusters to accelerate the training process for our deep learning models.
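
Below is a hedged sketch of the data-loading side with Petastorm, feeding a Spark DataFrame into a PyTorch DataLoader; the cache path and columns are placeholders, and the Horovod / PyTorch Lightning distribution step is omitted to keep it short:

```python
# Hedged sketch: stream a Spark DataFrame into a PyTorch DataLoader with Petastorm.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Petastorm materialises the DataFrame as Parquet in this cache directory.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

df = spark.range(1000).selectExpr("cast(id as float) as feature", "cast(id % 2 as long) as label")
converter = make_spark_converter(df)

# The converter batches data off the Spark cluster to keep the GPU fed.
with converter.make_torch_dataloader(batch_size=64) as dataloader:
    for batch in dataloader:
        features, labels = batch["feature"], batch["label"]
        # ... forward/backward pass of the model goes here ...
        break  # single batch, for illustration
```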


Scaling Salesforce In-Memory Streaming Analytics Platform for Trillion Events Per Day

In general, in-memory pipelines scale quite well in Spark if we apply the same processing logic to all records. But for Salesforce the major challenge is that we need to apply custom logic specific to a Log Record Type (LRT). The custom logic includes applying different schemas while processing each event. To perform such LRT-specific logic, we need a mechanism to collect LRT-specific data in memory so that we can apply the custom logic to each collection.

We normally get around 50K files in S3 every 5 minutes, and there are around 4 billion log events in those 50K files. One approach is to create a DataFrame from the 50K files, group events by LRT, and apply filters per LRT to create child DataFrames. A major challenge is that the LRT data distribution is very skewed, so we need an efficient in-memory partitioning strategy to distribute the data. Simply applying filters on the parent DataFrame also leaves many child DataFrames with empty partitions due to the large skew in the data distribution, which creates too many empty tasks while processing the child DataFrames. So we need a partitioning scheme that distributes data and filters by log type without creating unnecessary empty partitions in child DataFrames. We also need a scheduling algorithm that processes all child DataFrames while utilizing the cluster efficiently.

We have implemented a custom Spark Streaming source that reads SQS notifications and then reads the new files in S3, designed to scale with ingestion volume. This talk will cover how we performed a Spark range partition based on the size distribution of the incoming data and applied schema-specific transformation logic, and will explain the optimizations at various stages of processing that let us meet our latency goal.
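
As a rough illustration of the general idea (not Salesforce's exact implementation), the sketch below range-partitions the parent DataFrame before splitting it into per-LRT child DataFrames; the column names are hypothetical:

```python
# Hedged illustration only: range-partition the parent DataFrame by a skew-aware
# key, then filter per log record type so the child DataFrames do not inherit a
# sea of empty partitions. Column names ("log_record_type", "size_bucket") are
# hypothetical, as is the source path.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.json("s3://bucket/raw-logs/")  # placeholder source

# Range partitioning spreads skewed keys across a bounded number of partitions
# based on a sampled value distribution, instead of hashing everything blindly.
partitioned = events.repartitionByRange(200, F.col("log_record_type"), F.col("size_bucket"))

# One child DataFrame per LRT, each processed with its own schema and logic.
lrt_values = [r["log_record_type"] for r in partitioned.select("log_record_type").distinct().collect()]
children = {lrt: partitioned.filter(F.col("log_record_type") == lrt) for lrt in lrt_values}
```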


Scaling Up Machine Learning in Instacart Search for the 2020 Surge in Online Shopping

As the online grocery business accelerated in 2020, Instacart search, which supports one of the largest catalogs of grocery items in the world, started facing new challenges. We experienced a sudden surge in the number of users, retailers, and traffic to our search engine. As a result, the scale of our data grew manifold, and the predictive performance of our models started degrading due to a lack of historical data for the many new retailers and users that started using Instacart. New users searched for queries that we had never seen before. The new retailers on our platform were quite diverse, ranging from local grocery stores to office supplies, pharmacies, and Halloween stores, which are categories our models were never trained on. As our relatively small team of four engineers tried to build new models to address these issues, we faced a number of operational challenges.

This talk will focus on the challenges we encountered in this new world, including drift in our data and cold-start issues. We will cover the architecture of our search engine and the issues we faced in training and serving our ML models due to the increase in scale. We will talk about how we overcame these issues by using more sophisticated models that are trained and served on a more robust infrastructure and technical stack. We will also cover the iterations on our ML ranking models to adapt to this new world, through which we successfully improved the quality of search results and our revenue while operating in a robust production environment.


Scaling Your Workloads with Databricks Serverless

Databricks SQL provides a first-class user experience for BI and SQL directly on the lakehouse platform. But you still need to administer and maintain clusters of virtual machines. What if you could focus on your Databricks SQL queries and never need to worry about the underlying compute infrastructure? Learn how Databricks Serverless, built into the Databricks Lakehouse Platform, eliminates cluster management, provides instant compute, and lowers total cost of ownership for Databricks SQL. In this session, you will see demos, hear from customers, learn how Databricks Serverless works under the hood, be equipped with everything you need to get started – and ultimately get the best out of Databricks Serverless.


Search and Aggregations Made Easy with OpenSearch and NodeJS

In this session the audience will get both theoretical and practical knowledge on what OpenSearch is and how they can work with it by using its NodeJS client. This is a hands-on session where the audience is invited to follow along.

There will be an accompanying GitHub repository to allow the audience to follow me during or after the lecture.

During the session we will:

  • Go over the OpenSearch architecture.
  • Set up the cluster and prepare the NodeJS project.
  • Load sample data (we'll use a dataset with 20k recipes).
  • Explore different types of search queries: term-level, full-text, and boolean.
  • Explore different types of aggregations: metric, bucket, and pipeline.
  • Build visualisations with the help of OpenSearch Dashboards.
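
The session itself uses the NodeJS client, but the request bodies are the same regardless of client; below is an illustrative sketch with the Python client, assuming a hypothetical "recipes" index and field names:

```python
# Illustrative only: the same query/aggregation bodies the NodeJS client would send,
# shown via the Python client. Index and field names are made up.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# A full-text match query combined with a term-level filter inside a boolean query.
search_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "chicken soup"}}],
            "filter": [{"term": {"cuisine": "thai"}}],
        }
    }
}

# A bucket aggregation (per cuisine) with a metric aggregation (average calories) inside.
aggs_body = {
    "size": 0,
    "aggs": {
        "by_cuisine": {
            "terms": {"field": "cuisine"},
            "aggs": {"avg_calories": {"avg": {"field": "calories"}}},
        }
    },
}

print(client.search(index="recipes", body=search_body)["hits"]["total"])
print(client.search(index="recipes", body=aggs_body)["aggregations"]["by_cuisine"]["buckets"])
```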


Securing Databricks on AWS Using Private Link

Minimizing data transfers over the public internet is among the top priorities for organizations of any size, both for security and cost reasons. Modern cloud-native data analytics platforms need to support deployment architectures that meet this objective. For Databricks on AWS such an architecture is realized thanks to AWS PrivateLink, which allows computing resources deployed on different virtual private networks and different AWS accounts to communicate securely without ever crossing the public internet.

In this session, we provide a brief introduction to AWS PrivateLink and its main use cases in the context of a Databricks deployment: securing communications between the control and data planes and securely connecting to the Databricks web UI. We will then provide a step-by-step walkthrough of setting up PrivateLink connections for a Databricks deployment and demonstrate how to automate that process using AWS CloudFormation or Terraform templates.

In this presentation we will cover the following topics:

  • A brief introduction to AWS PrivateLink
  • How you can use PrivateLink to secure your AWS Databricks deployment
  • A step-by-step walkthrough of how to set up PrivateLink
  • How to automate and scale the setup using AWS CloudFormation or Terraform
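
The session automates this with CloudFormation or Terraform; as a rough illustration of one underlying step (creating an interface VPC endpoint), here is an equivalent boto3 sketch in which every ID and the service name are placeholders; the real Databricks endpoint service names are region-specific and documented by Databricks:

```python
# Rough boto3 equivalent of one step usually automated with CloudFormation/Terraform:
# creating an interface VPC endpoint. All IDs are placeholders, and the ServiceName
# below stands in for the region-specific Databricks endpoint service name.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                                # placeholder
    ServiceName="com.amazonaws.vpce.us-east-1.vpce-svc-EXAMPLE",  # placeholder
    SubnetIds=["subnet-0123456789abcdef0"],                       # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],                    # placeholder
    PrivateDnsEnabled=False,
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```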


Security Best Practices for Lakehouse

To learn more, visit the Databricks Security and Trust Center: https://www.databricks.com/trust

As you embark on a lakehouse project or evolve your existing data lake, you may want to improve your security posture and take advantage of new security features—there may even be a security team at your company that demands it! Databricks has worked with thousands of customers to securely deploy the Databricks Platform to meet their architecture and security requirements. While many organizations deploy security differently, we have found a common set of guidelines and features among organizations who require a high level of security. In this talk, we will detail the security features and architectural choices frequently used by these organizations and walk through a series of threat models for the risks that most concern security teams. While this session is great for people who already know Databricks, don’t worry, that knowledge isn’t required.

You will walk away with a full handbook detailing all of the concepts, configurations, and code from the session so that you can make immediate progress when you get back to the office. Security can be hard, but we’ve collected the hard work already done by some of the best in the industry, to make it easier. Come learn how.


Serverless Kafka and Apache Spark in a Multi-Cloud Data Lakehouse Architecture

Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.

This session explores different architectures for building serverless Kafka and Spark multi-cloud deployments across regions and continents. We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern data lakehouse. Real-world use cases show the joint value and explore the benefits of the Delta Lake integration.
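
As a small, hedged sketch of the Kafka-to-lakehouse integration described here, the snippet below lands a Kafka topic into a Delta table with Spark Structured Streaming; broker, topic, schema, and paths are placeholders:

```python
# Hedged sketch: land a Kafka topic into a Delta Lake table with Structured Streaming.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "orders")                     # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload into columns.
orders = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("o")).select("o.*")

(
    orders.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # placeholder path
    .toTable("bronze_orders")
)
```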


Simon Whiteley + Denny Lee Live Ask Me Anything

Simon and Denny Build A Thing is a live webshow, where Simon Whiteley (Advancing Analytics) and Denny Lee (Databricks) are building out a TV Ratings Analytics tool, working through the various challenges of building out a Data Lakehouse using Databricks. In this session, they'll be talking through their Lakehouse Platform, revisiting various pieces of functionality, and answering your questions, Live!

This is your chance to ask questions around structuring a lake for enterprise data analytics, the various ways we can use Delta Live Tables to simplify ETL, or how to get started serving out data using Databricks SQL. We have a whole load of things to talk through, but we want to hear YOUR questions, which we can field from industry experience, community engagement, and internal Databricks direction. There's also a chance we'll get distracted and talk about The Expanse for far too long.


Building Metadata and Lineage Driven Pipelines on Kubernetes

Machine learning plays a critical role in every industry amid its widespread adoption, and composing ML pipelines at a rapid pace is essential for success. However, an ML pipeline consists of several components and requires effort from different teams, including data engineers, data scientists, ML engineers, etc. A typical cooperation strategy is to define a sequence of tasks, coordinate the integration, test, apply fixes and enhancements, and repeat. ML pipeline components produced by this task-driven approach lack reusability and only add maintenance effort. Kubeflow Pipelines, a platform that makes deployments of ML pipelines on Kubernetes straightforward and scalable, provides a metadata- and lineage-driven approach to developing platform-independent and portable ML pipelines. Data linkage and propagation become crystal clear within ML pipelines, which also facilitates ML pipeline composition.
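
A minimal sketch of this declarative, lineage-driven style with the KFP v2 SDK, assuming illustrative component names and logic:

```python
# Hedged sketch with the KFP v2 SDK: two lightweight components composed into a
# pipeline, compiled to a spec that Kubeflow Pipelines runs on Kubernetes and
# tracks with metadata/lineage. Names and logic are illustrative.
from kfp import dsl, compiler

@dsl.component
def ingest(rows: int) -> int:
    # Placeholder for a data-engineering step.
    return rows * 2

@dsl.component
def train(rows: int) -> str:
    # Placeholder for a model-training step.
    return f"model trained on {rows} rows"

@dsl.pipeline(name="metadata-driven-sketch")
def pipeline(rows: int = 100):
    ingested = ingest(rows=rows)
    # Lineage between steps comes from wiring outputs to inputs.
    train(rows=ingested.output)

if __name__ == "__main__":
    compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")
```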


Building Production-Ready Recommender Systems with Feature Stores

Recommender systems are highly prevalent in modern applications and services but are notoriously difficult to build and maintain. Organizations face challenges such as complex data dependencies, data leakage, and frequently changing data/models. These challenges are compounded when building, deploying, and maintaining ML pipelines spans data scientists and engineers. Feature stores help address many of the operational challenges associated with recommender systems.

In this talk, we explore:

  • Challenges of building recommender systems
  • Strategies for reducing latency, while balancing requirements for freshness
  • Challenges in mitigating data quality issues
  • Technical and organizational challenges feature stores solve
  • How to integrate Feast, an open-source feature store, into an existing recommender system to support production systems (a minimal serving sketch follows this list)
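
A minimal sketch of the online-serving side with Feast, assuming a feature repository has already been defined and applied; the repo path, feature views, and entities are illustrative:

```python
# Hedged sketch: serving features to a recommender's ranking model at request time
# with Feast. Assumes `feast apply` has already registered the feature views; all
# names and the repo path are placeholders.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # placeholder repo path

# Low-latency lookup from the online store for the candidate items being ranked.
features = store.get_online_features(
    features=[
        "user_stats:purchase_count_7d",
        "item_stats:click_through_rate",
    ],
    entity_rows=[{"user_id": 42, "item_id": 1001}],
).to_dict()

print(features)
```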
