talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 tagged activities

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities: 1286 · Newest first

Apache Spark on Kubernetes—Lessons Learned from Launching Millions of Spark Executors

At Apple, data scientists and engineers run enormous Spark workloads to deliver amazing cloud services. Apple Cloud Services supports the ever-increasing scale of Spark workloads and resource requirements with a great user experience: from code to deployment management, one interface for all compute backends.

In this talk, Aaruna and Zhou will walk through the lessons learned and pitfalls encountered while supporting this service at Apple scale. We will share how Apple Cloud Services effectively orchestrates Spark applications, as well as the seamless switchover among different resource managers, be it Mesos or Kubernetes, on private cloud or on-premises infrastructure. We will also cover the monitoring system and how it helps tune Spark resource requirements based on actual execution analysis.
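The "one interface for all compute backends" idea can be sketched in miniature: a single submission helper translates a logical backend choice into spark-submit arguments. The backend names, master URLs, and configuration keys below are illustrative assumptions, not Apple's actual system.

```python
def build_submit_args(backend, app_jar, image=None):
    """Build spark-submit arguments for the chosen resource manager."""
    if backend == "kubernetes":
        return ["spark-submit",
                "--master", "k8s://https://k8s-apiserver:6443",
                "--conf", "spark.kubernetes.container.image=%s" % image,
                "--deploy-mode", "cluster", app_jar]
    if backend == "mesos":
        return ["spark-submit",
                "--master", "mesos://mesos-master:5050",
                "--deploy-mode", "cluster", app_jar]
    raise ValueError("unknown backend: %s" % backend)

# Switching backends changes only the submission plumbing, not the app.
print(build_submit_args("mesos", "app.jar")[2])  # mesos://mesos-master:5050
```

The point of such a layer is that the application itself never encodes which resource manager it runs on.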

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Apache Spark SQL Aggregate Improvement at Meta (Facebook)

Aggregate (group-by) is one of the most important SQL operations in data warehouses; it is required whenever we want aggregated insights from input datasets. Over the last year, we added a series of aggregate optimizations internally to Facebook's Spark SQL, and we have recently started contributing them back to Apache Spark.

(1) Sort aggregate (SPARK-32461): add code generation to improve query performance, replace hash with sort aggregate when the child is sorted, etc.
(2) Object hash aggregate (SPARK-34286): adaptive sort-based fallback based on JVM heap memory usage during query execution.
(3) Hash aggregate (SPARK-31973): adaptively bypass partial aggregation when the aggregate reduction ratio is low.
(4) Data source aggregate push down (SPARK-34960): push aggregates down to the ORC data source by utilizing column statistics.
(5) File statistics aggregate: aggregate output file (and column) statistics distributively when writing query output.

We'll take a deep dive into the features above and the lessons learned.
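The intuition behind the partial-aggregation bypass (SPARK-31973) can be sketched as follows; the function name and the 0.5 threshold here are illustrative assumptions, not Spark's actual implementation.

```python
def should_bypass_partial_agg(rows, key_fn, threshold=0.5):
    """Return True when map-side (partial) aggregation would reduce little data.

    The aggregate reduction ratio is distinct keys / input rows: a ratio near
    1.0 means almost every row has a unique key, so pre-aggregating before the
    shuffle buys almost nothing and just burns CPU and memory.
    """
    keys = [key_fn(r) for r in rows]
    reduction_ratio = len(set(keys)) / len(keys)
    return reduction_ratio > threshold

# Low-cardinality keys: partial aggregation is worthwhile.
rows = [("us", 1), ("us", 2), ("eu", 3), ("us", 4)]
print(should_bypass_partial_agg(rows, key_fn=lambda r: r[0]))  # False

# Nearly-unique keys: bypass and aggregate only after the shuffle.
rows = [(i, i) for i in range(100)]
print(should_bypass_partial_agg(rows, key_fn=lambda r: r[0]))  # True
```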


Auto Encoder Decoder-Based Anomaly Detection with the Lakehouse Paradigm

An auto-encoder-decoder is a deep learning neural network architecture with an hourglass shape: high-dimensional inputs are compressed into a latent space by the encoder, and the decoder mirrors the encoder architecture and reconstructs the input data from the latent space. Auto-encoder-decoder models are commonly used for anomaly detection: after training, the reconstruction error of normal data is minimized, so an anomaly can be detected when its reconstruction error rises above the "normal threshold". This presentation will demonstrate an auto-encoder-decoder anomaly detection solution built with the Lakehouse paradigm, from data management to post-deployment monitoring, covering the entire model life cycle. It will also highlight the flexibility and scalability that MLflow custom models and Pandas UDFs can bring when a large number of individual models need to be trained, deployed, and monitored in parallel.
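The thresholding step described above can be sketched without any deep learning framework: reconstruction errors on normal training data define the "normal threshold", and a sample is flagged when its error exceeds it. The 3-sigma rule below is an illustrative choice, not from the talk.

```python
import statistics

def normal_threshold(train_errors, k=3.0):
    """Mean + k standard deviations of reconstruction error on normal data."""
    mu = statistics.mean(train_errors)
    sigma = statistics.stdev(train_errors)
    return mu + k * sigma

def is_anomaly(error, threshold):
    """A sample is anomalous when its reconstruction error exceeds the threshold."""
    return error > threshold

# Reconstruction errors observed on held-out normal data after training.
train_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11]
thr = normal_threshold(train_errors)
print(is_anomaly(0.95, thr))  # True: error far above the normal range
print(is_anomaly(0.11, thr))  # False
```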


Automating Model Lifecycle Orchestration with Jenkins

A key part of the lifecycle involves bringing a model to production. In regular software systems, this is accomplished via a CI/CD pipeline such as one built with Jenkins. However, integrating Jenkins into a typical DS/ML workflow is not straightforward for X, Y, Z reasons. In this hands-on talk, I will discuss what Jenkins and CI/CD practices can bring to your ML workflows, demonstrate a few of these workflows, and share some best practices on how a bit of Jenkins can level up your MLOps processes.


Backfill Streaming Data Pipelines in Kappa Architecture

Streaming data pipelines can fail due to various reasons. Since the source data, such as Kafka topics, often have limited retention, prolonged job failures can lead to data loss. Thus, streaming jobs need to be backfillable at all times to prevent data loss in case of failures. One solution is to increase the source's retention so that backfilling is simply replaying source streams, but extending Kafka retention is very costly for Netflix's data sizes. Another solution is to utilize source data stored in DWH, commonly known as the Lambda architecture. However, this method introduces significant code duplication, as it requires engineers to maintain a separate equivalent batch job. At Netflix, we have created the Iceberg Source Connector to provide backfilling capabilities to Flink streaming applications. It allows Flink to stream data stored in Apache Iceberg while mirroring Kafka's ordering semantics, enabling us to backfill large-scale stateful Flink pipelines at low retention cost.
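The single-codepath idea behind the Kappa approach above can be sketched in miniature: the same pipeline logic consumes either the live stream or a replay of the same events from table storage, so no separate batch job exists to maintain. Sources are modeled as plain iterators here; the names are illustrative, not Netflix's connector API.

```python
def pipeline(events):
    """The one pipeline definition, shared by live and backfill runs."""
    for e in events:
        yield {"user": e["user"], "spend": e["amount"] * 1.0}

def live_source():
    # Stand-in for a Kafka consumer feeding recent events.
    yield from [{"user": "a", "amount": 3}]

def backfill_source():
    # Stand-in for replaying historical events from Iceberg in event order.
    yield from [{"user": "a", "amount": 1}, {"user": "b", "amount": 2}]

print(list(pipeline(live_source())))      # same code path...
print(list(pipeline(backfill_source())))  # ...replaying stored history
```

Because both sources present the same event shape and ordering semantics, the stateful logic in `pipeline` never needs a batch-mode rewrite.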


Batches, Streams, and Everything in between: Unifying Batch and Stream Storage with Apache Pulsar

Delta Lake and Lakehouse architectures have been instrumental technologies in providing a better foundation for dealing with streaming data and data deltas via an open industry standard. The rapid growth of the ecosystem is a testament to the success of this approach. However, challenges remain in building a data platform that allows teams to process all data via streams, regardless of the age of the data, while also being able to view all streams as tables without exporting data out of the streaming system. In this talk, we will take a hands-on look at how Apache Pulsar is building its core storage engine on the concepts of Lakehouse architectures, allowing teams to build data platforms that can manage data over its entire lifecycle and enabling data to be consumed as either a stream or a table. With these capabilities, we will show how Pulsar + Delta Lake empowers teams, regardless of toolset, to better focus on driving value from data, not just managing it.


Beyond Daily Batch Processing: Operational Trade-Offs of Microbatch, Incremental, and Real-Time

Are you considering converting some daily batch pipelines to a real-time system? Perhaps restating multiple days of batch data is becoming unscalable for your pipelines. Maybe a short SLA is music to your stakeholders' ears. If you're Flink-curious, or possibly just sick of pondering your late-arriving data, this discussion is for you.

On the Streaming Data Science and Engineering team at Netflix, we support business-critical daily batch, hourly batch, incremental, and real-time pipelines with a rotating on-call system. In this presentation, I'll discuss the trade-offs we experience between these systems, with an emphasis on operational support when things go sideways. I'll also share some lessons about "goodness of fit" per processing type among various workloads, with an eye toward keeping your data timely and your colleagues sane.


Beyond Monitoring: The Rise of Data Observability

"Why did our dashboard break?" "What happened to my data?" "Why is this column missing?" If you've been on the receiving end of these messages (and many others!) from downstream stakeholders, you're not alone. Data engineering teams spend 40 percent or more of their time tackling data downtime: periods when data is missing, erroneous, or otherwise inaccurate. As data systems become increasingly complex and distributed, this number will only increase. To address this problem, data observability is becoming an increasingly important part of the cloud data stack, helping engineers and analysts reduce time to detection and resolution for data incidents caused by faulty data, code, and operational environments. But what does data observability actually look like in practice? In this presentation, Barr Moses, CEO and co-founder of Monte Carlo, will present how some of today's best data leaders implement observability across their data lake ecosystem and share best practices for data teams seeking to achieve end-to-end visibility into their data at scale. Topics addressed will include: building automated lineage for Apache Spark, applying data reliability workflows, and extending beyond testing and monitoring to solve for unknown unknowns in your data pipelines.


Build an Enterprise Lakehouse for Free with Trino and Delta Lake

Delta Lake has quickly grown in usage across data lakes everywhere due to the growing use cases that require DML capabilities that Delta Lake brings. Outside of support for ACID transactions, users want the ability to interactively query the data in their data lake. This is where a query engine like Trino (formerly PrestoSQL) comes in. Starburst provides an enterprise version of the popular Trino MPP SQL query engine and has recently open sourced their Delta Lake connector.

In this talk, Tom and Claudius will discuss the connector, its features, and how their users are taking advantage of expanding the functionality of their data lakes with improved performance and the ability to handle colliding modifications. Get started with this feature-rich and open stack without the need for a multi-million-dollar budget.


Building a Lakehouse on AWS for Less with AWS Graviton and Photon

AWS Graviton processors are custom-designed by AWS to enable the best price performance for workloads in Amazon EC2. In this session we will review benchmarks that demonstrate how AWS Graviton based instances run Databricks workloads at a lower price and better performance than x86-based instances on AWS, and when combined with Photon, the new Databricks engine, the price performance gains are even greater. Learn how you can optimize your Databricks workloads on AWS and save more.


Building an Analytics Lakehouse at Grab

Grab shares the story of their Lakehouse journey, from the drivers behind their shift to this new paradigm, to lessons learned along the way. From a starting point of a siloed, data warehouse centric architecture that had inherent challenges with scalability, performance and data duplication, Grab has standardized upon Databricks to serve as an open and unified Lakehouse platform to deliver insights at scale, democratizing data through the rapid deployment of AI and BI use cases across their operations.


Building and Scaling Machine Learning-Based Products in the World's Largest Brewery

In this session, we will present how Anheuser-Busch InBev (Brazil) has been developing and growing an ML platform product to democratize and evolve AI usage across the full company. Our cutting-edge intelligence product offers a set of tools and processes to facilitate everything from exploratory data analysis to the development of state-of-the-art machine learning algorithms. We designed a simple, scalable, and performant product that covers the full data science/machine learning lifecycle, with process abstraction, a feature store, promptness to production, and pipeline orchestration. Today we maintain and continually evolve a solution that is used by cross-functional teams in several countries, helps data scientists create their solutions in a cooperative setting, and supports data engineers in monitoring the model pipelines.


Building an Operational Machine Learning Organization from Zero and Leveraging ML for Crypto Security

BlockFi is a cryptocurrency platform that allows its clients to grow wealth through various financial products, including loans, trading, and interest accounts. In this presentation, we will showcase our journey adopting Databricks to build an operational nerve center for analytics across the company. We will demonstrate how to build a cross-functional organization and solve key business problems to earn executive buy-in. We will showcase two of the early successes we've had using machine learning and data science to solve key business challenges in the domains of cyber security and IT operations. In the security domain, we will show how we use graph analytics to analyze millions of blockchain transactions to identify dust attacks and account takeovers and to flag risky transactions. The operational IT use case will show how we use SARIMAX, with hourly crypto prices and financial indicators, to forecast platform usage patterns and scale our infrastructure.


Building Enterprise Scale Data and Analytics Platforms at Amgen

Amgen has developed a suite of enterprise data and analytics platforms, powered by modern, cloud-native, and open source technologies, that have played a vital role in building game-changing analytics capabilities within the organization. Our platforms include a mature Data Lake with extensive self-service capabilities, a Data Fabric with semantically connected data, a Data Marketplace for advanced cataloging, and an intelligent enterprise search, among others, to solve a range of high-value business problems. In this talk, we (Amgen and our partner ZS Associates) will share learnings from our journey so far, present best practices for building enterprise-scale data and analytics platforms, and describe several business use cases and how we leverage modern technologies such as Databricks to enable our business teams. We will cover use cases related to Delta Lake, microservices, platform monitoring, fine-grained security, and more.


Practical Data Governance in a Large Scale Databricks Environment

Learn from two governance and data practitioners what it takes to do data governance at enterprise scale. This is critical: the power of data science lies in the ability to tap into any type of data source and turn it into pure value, yet it is often at odds with its key enablers, scale and governance, so we must continually find new ways to bring our focus back to unlocking the insights inside the data. In this session, we will share new agile practices for rolling out governance policies that balance governance and scale. We will unpack how to deliver centralized, fine-grained governance for ML and data transformation workloads that actually empowers data scientists in an enterprise Databricks environment, while ensuring privacy and compliance across hundreds of datasets. With automation being key to scale, we will also explore how we successfully automated security and governance.


Predicting and Preventing Machine Downtime with AI and Expert Alerts

John Deere's Expert Alerts is a proactive monitoring system that notifies dealers of potential machine issues. It allows technicians to diagnose issues remotely and fix them before they become a problem, thereby avoiding multiple trips by a repair technician and minimizing downtime. John Deere ingests petabytes of data every year from its connected machines across the globe. To improve the availability, uptime, and performance of John Deere machines globally, our data scientists perform machine data analysis on our data lake in an efficient and scalable manner. The result is dramatically improved mean time to repair, decreased downtime through predictive alerts, improved cost efficiency, improved customer satisfaction, and greater yields and results for John Deere's customers.

You will learn:
• What Expert Alerts are at John Deere and what challenges they seek to solve
• How John Deere migrated from a legacy alerting application to a flexible and scalable Lakehouse framework
• Getting stakeholder buy-in and converting business logic to AI
• Overcoming the scale problem: processing petabytes of data within SLAs
• What is next for Expert Alerts

Other resources:
• Two-Minute Overview of Expert Alerts: https://www.youtube.com/watch?v=yFnMhMhipXA
• Expert Alerts: Dealer Execution - John Deere: https://www.youtube.com/watch?v=2FGz0lx4UiM
• Ben Burgess FarmSight Services - Expert Alerts from John Deere: https://www.youtube.com/watch?v=BrQhX4oCsSw
• U.S. Farm Report Driving Technology: John Deere Expert Alerts: https://www.youtube.com/watch?v=h8IGtk61EDo


Predicting Repeat Admissions to Substance Abuse Treatment with Machine Learning

In our presentation, we will walk through a model created to predict repeat admissions to substance abuse treatment centers. The goal is to predict early who will be at high risk for relapse so care can be tailored to put additional focus on these patients. We used the Treatment Episode Data Set (TEDS) Admissions data set, which includes every publicly funded substance abuse treatment admission in the US.

While longitudinal data is not available in the data set, we were able to predict with 88% accuracy and an F-score of 0.85 which admissions were first or repeat admissions. Our solution used a scikit-learn random forest model and leveraged MLflow to track model metrics and choose the most effective model. Our pipeline tested over 100 models of different types, ranging from gradient-boosted trees to deep neural networks in TensorFlow.
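For reference, accuracy and F-score are both derived from the confusion matrix; the counts below are made up for illustration and are not the study's data.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts for a balanced 1000-admission evaluation set.
acc, f1 = accuracy_and_f1(tp=425, fp=75, fn=75, tn=425)
print(round(acc, 2), round(f1, 2))  # 0.85 0.85
```

Because F1 balances precision and recall, it can diverge from accuracy on imbalanced data, which is why the paragraph above reports both.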

To improve model interpretability, we used Shapley values to measure which variables were most important for predicting readmission. These model metrics along with other valuable data are visualized in an interactive Power BI dashboard designed to help practitioners understand who to focus on during treatment. We are in discussions with companies and researchers who may be able to leverage this model in substance abuse treatment centers in the field.


Presto On Spark: A Unified SQL Experience

Presto was originally designed to run interactive queries against data warehouses, but it has since evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. However, Presto doesn't scale to very large and complex batch pipelines. Presto Unlimited was designed to address such scalability challenges, but it didn't fully solve fault tolerance, isolation, and resource management.

Spark is the tool of choice across the industry for running large scale complex batch ETL pipelines. This motivated the development of Presto On Spark. Presto on Spark runs Presto as a library that is submitted with spark-submit to a Spark cluster. It leverages Spark for scaling shuffle, worker execution, and resource management. It thereby eliminates any query conversion between interactive and batch use cases. This solution helps enable a performant and scalable platform with seamless end-to-end experience to explore and process data.

Many analysts at Intuit use Presto to explore data in the data lake (S3) and Spark for batch processing. These analysts would previously spend several hours converting exploration SQL written for Presto into Spark SQL to operationalize and schedule it as data pipelines. Presto on Spark is now used by analysts at Intuit to run thousands of critical jobs; since no query conversion is required, it has improved analysts' productivity and empowered them to deliver insights at high speed.

Benefits from this session, attendees will learn:
• The Presto on Spark architecture
• When to use Spark's execution engine with Presto
• How Intuit runs thousands of Presto jobs daily leveraging the Databricks platform, which they can apply to their own work


Productionizing Ethical Credit Scoring Systems with Delta Lake, Feature Store and MLflow

Fairness, Ethics, Accountability and Transparency (FEAT) are must-haves for high-stakes machine learning models. In particular, models in the financial services industry, such as those that assign credit scores, can impact people's access to housing and utilities and even influence their social standing. Hence, model developers have a moral responsibility to ensure that models do not systematically disadvantage any one group. Nevertheless, implementing such models in industrial settings remains challenging: a lack of concrete guidelines, common standards, and technical templates makes evaluating models from a FEAT perspective infeasible. To address these implementation challenges, the Monetary Authority of Singapore (MAS) set up the Veritas Initiative to create a framework for operationalising the FEAT principles, so as to guide the responsible development of AIDA (Artificial Intelligence and Data Analytics) systems.

In January 2021, MAS announced the successful conclusion of Phase 1 of the Veritas Initiative. Deliverables included an assessment methodology for the Fairness principle and open source code for applying fairness metrics to two use cases: customer marketing and credit scoring. In this talk, we demonstrate how these open source examples, and their fairness metrics, might be put into production using open source tools such as Delta Lake and MLflow. Although the Veritas framework was developed in Singapore, the ethical framework is applicable across geographies.
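As a minimal illustration of what a fairness metric computes, here is a demographic parity sketch: the gap in positive-outcome (e.g. loan approval) rates between two groups. This example is an assumption for illustration only; the Veritas methodology defines its own metrics and thresholds.

```python
def positive_rate(outcomes):
    """Fraction of positive (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def demographic_parity_difference(group_a, group_b):
    """Absolute gap in positive-outcome rates; 0.0 means parity on this metric."""
    return abs(positive_rate(group_a) - positive_rate(group_b))

approvals_a = [1, 1, 0, 1, 0]  # 60% approved in group A
approvals_b = [1, 0, 0, 1, 0]  # 40% approved in group B
print(round(demographic_parity_difference(approvals_a, approvals_b), 2))  # 0.2
```

Monitoring such a metric in production alongside accuracy is the kind of holistic evaluation the talk argues for.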

By doing this, we illustrate how ethical principles can be operationalised, monitored and maintained in production, thus moving beyond only accuracy-based metrics of model performance and towards a more holistic and principled way of developing and productionizing machine learning systems.


Protecting PII/PHI Data in Data Lake via Column Level Encryption

A data breach is a concern for any company that collects data, including Northwestern Mutual. Every measure is taken to prevent identity theft and fraud for our customers; however, these measures are still insufficient if the security around them is not updated periodically. Multiple layers of encryption are the most common approach to avoiding breaches; however, unauthorized internal access to this sensitive data still poses a threat.

This presentation will walk you through the following steps:
- Designing encryption at the column level
- Protecting PII data that is used as a key for joins
- Enabling authorized users to decrypt data at run time
- Rotating the encryption keys when needed
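For the join-key step, deterministic tokenization lets encrypted columns still join correctly, since the same plaintext always maps to the same token. The HMAC-based stand-in below is an assumption for illustration and differs from the talk's Fernet/AES approach (HMAC is one-way, not decryptable); in practice the key would come from a secret store such as Databricks secrets.

```python
import hmac
import hashlib

KEY = b"rotate-me-via-a-secret-store"  # illustrative; never hard-code keys

def tokenize(pii_value):
    """Deterministically tokenize a PII value so it can serve as a join key."""
    return hmac.new(KEY, pii_value.encode(), hashlib.sha256).hexdigest()

# Both tables store the token instead of the raw email.
customers = {tokenize("alice@example.com"): "Alice"}
claims = [(tokenize("alice@example.com"), "claim-001")]

# The join works on tokens without ever exposing the plaintext PII.
for token, claim_id in claims:
    print(customers[token], claim_id)  # Alice claim-001
```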

At Northwestern Mutual, a combination of the Fernet and AES encryption libraries, user-defined functions (UDFs), and Databricks secrets was utilized to develop a process to encrypt PII information. Access was provided only to those with a business need to decrypt it, which helps mitigate the internal threat. This was done without duplicating data or metadata (views/tables). Our goal is to help you understand how you can build a secure data lake for your organization that eliminates threats of data breach, both internal and external. Associated blog: https://databricks.com/blog/2020/11/20/enforcing-column-level-encryption-and-avoiding-data-duplication-with-pii.html
