talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 activities tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities: 1286 · Newest first

How socat and UNIX Pipes Can Help Data Integration

Nearly every developer is familiar with creating a CLI. Containerized CLIs provide a flexible, cross-language standard with a low barrier to entry for open-source contributors. The ETL process can be reduced to two CLIs: one that reads data and one that writes data. While this interface is simple enough to implement on the contributor’s side, Kubernetes’ distributed nature makes orchestrating data transfer between the two CLIs an unsolved problem.

This talk describes a novel approach to reliably orchestrating CLIs on Kubernetes for data integration. Through this lens, we walk through the strategies we evaluated and the pros and cons of each architecture for horizontally scaling containerized data integration workflows on Kubernetes. We also cover the journey of implementing a TCP-based “process” abstraction over CLIs using socat and UNIX pipes. This same approach powers all of Airbyte’s Kubernetes deployments and helps sync TBs of data daily.
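
As a rough sketch of the pattern (not Airbyte's actual implementation), the snippet below joins a hypothetical source CLI to a hypothetical destination CLI with an ordinary OS pipe from Python; in the Kubernetes setup described above, socat bridges the same stdout/stdin streams over TCP between pods.

```python
import subprocess

# Hypothetical CLIs: `source-cli read` emits records on stdout and
# `dest-cli write` consumes records on stdin. On a single host the two
# processes can be joined with an ordinary OS pipe; across pods, socat
# would carry the same byte stream over TCP instead, e.g. (conceptually):
#   source pod: source-cli read | socat - TCP:dest-pod:9000
#   dest pod:   socat TCP-LISTEN:9000,reuseaddr - | dest-cli write
source = subprocess.Popen(
    ["source-cli", "read", "--config", "source.json"],
    stdout=subprocess.PIPE,
)
dest = subprocess.Popen(
    ["dest-cli", "write", "--config", "dest.json"],
    stdin=source.stdout,
)

source.stdout.close()  # so dest sees EOF once the source exits
dest.wait()
source.wait()
```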

How the Largest County in the US is Transforming Hiring with a Modern Data Lakehouse

Los Angeles County’s Department of Human Resources (DHR) is responsible for attracting a diverse workforce for the 37 departments it supports. Each year, DHR processes upwards of 400,000 job applications, making the County one of the largest employers in the nation. Managing a hiring process at this scale is complex, with many complicated factors such as background checks and skills examinations. These processes, if not managed properly, can create bottlenecks and a poor experience for both candidates and hiring managers.

In order to identify areas for improvement, DHR set out to build detailed operational metrics across each stage of the hiring process. DHR used to conduct high-level analysis manually using Excel and other disparate tools. The data itself was limited and difficult to obtain and analyze. In addition, analysts spent weeks manually pulling data from half a dozen siloed systems into Excel for cleansing and analysis. This process was labor-intensive, inefficient, and prone to human error.

To overcome these challenges, DHR, in partnership with the Internal Services Department (ISD), adopted a modern data architecture in the cloud. Powered by the Azure Databricks Lakehouse, DHR was able to bring its diverse volumes of data together into a single platform for data analytics. Manual ETL processes that took weeks can now be automated and completed in 10 minutes or less. With this new architecture, DHR has built business intelligence dashboards that unpack the hiring process, give a clear picture of where the bottlenecks are, and track the speed with which candidates move through the process. The dashboards allow County departments to innovate and make changes that enhance the experience of potential job seekers and improve the timeliness of securing highly qualified and diverse County personnel at all employment levels.

In this talk, we’ll discuss DHR’s journey towards building a data-driven hiring process, the architecture decisions that enabled this transformation and the types of analytics that we’ve deployed to improve hiring efforts.

How to Build a Complete Security and Governance Solution Using Unity Catalog

Unity Catalog unifies governance and security for Databricks in one place. It can store data classifications and privileges and enforce them.

This talk goes into the details of Unity Catalog and explains its core building blocks for security and governance. I will also explain how Privacera translates Apache Ranger policies into native Unity Catalog policies, how audits are collected from Unity Catalog and imported into Apache Ranger's centralized audit store, and how Privacera can extend Unity Catalog.
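
For orientation, Unity Catalog privileges are granted and revoked with SQL statements that the catalog then enforces everywhere. Below is a minimal sketch issued through spark.sql; the catalog, schema, table, and principal names are hypothetical, and the exact privilege names available depend on the Unity Catalog privilege model version in your workspace.

```python
# Minimal sketch, assuming a cluster or SQL warehouse attached to a
# Unity Catalog metastore and an existing `spark` session.
# `main.hr.employees`, `data-analysts` and `interns` are hypothetical.
spark.sql("GRANT SELECT ON TABLE main.hr.employees TO `data-analysts`")
spark.sql("REVOKE SELECT ON TABLE main.hr.employees FROM `interns`")

# Inspect the privileges Unity Catalog will enforce for the table.
spark.sql("SHOW GRANTS ON TABLE main.hr.employees").show(truncate=False)
```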

Ingesting data into Lakehouse with COPY INTO

COPY INTO is a popular data ingestion SQL command for Databricks users, especially customers using Databricks SQL. In this talk, we discuss the data ingestion use cases in Databricks and how COPY INTO fits your data ingestion needs. We will cover several new COPY INTO features and how to achieve the following use cases:

1. Loading data into a Delta table incrementally
2. Fixing errors in already-loaded data and helping with data cleansing
3. Evolving your schema over time
4. Previewing data before ingesting
5. Loading data from a third-party data source

In this session, we will demo the new features, discuss the architecture of the implementation, and show how other Databricks features use COPY INTO under the hood.
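
For context, here is a minimal COPY INTO sketch issued through spark.sql on Databricks. The target table, cloud path, and options are hypothetical placeholders, and the newer features discussed in the talk may use additional options not shown here.

```python
# Minimal sketch, assuming a Databricks runtime with a `spark` session and
# an existing (possibly empty) Delta table main.sales.orders. COPY INTO is
# idempotent: files that were already loaded are skipped on re-runs, which
# gives simple incremental loading.
spark.sql("""
  COPY INTO main.sales.orders
  FROM 's3://my-bucket/raw/orders/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')  -- let the table schema evolve
""")
```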

Integrating Apache Superset into a B2B Platform: Why and How

Our IT team builds a portal for managing a pizzeria franchise business. The portal is a rather large and unwieldy B2B system that has been in development for more than 10 years.

Our partners need dashboards to manage their business, and these dashboards must be fully integrated into the portal. This is a job for our data engineers!

In this talk, I will tell you how and why we chose Apache Superset, what difficulties we encountered during integration, and what refinements we had to make to achieve this goal.

Interactive Analytics on a Massive Scale Using Delta Lake

Interactive, near-real-time analytics is a common requirement for data teams across many fields.

In the field of web security, interactive analytics lets end users get real-time or historical insights about the state of their protected resources at any point in time and take action accordingly.

One of the hardest aspects of enabling interactive, near-real-time analytics at massive scale is achieving low response times. Scanning hundreds of terabytes of non-aggregated events stored in a Delta Lake and still returning an answer within a few seconds is a major challenge.

In this talk we will cover:
• How we built a 5 PB Delta Lake of non-aggregated security events
• The challenges we met along the way: reducing Delta log scans, improving cache affinity, reducing storage throttling errors, and more
• How we overcame them one by one
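
This is not the speakers' actual pipeline, but for orientation, here is a small PySpark sketch of the kind of layout and query pattern that keeps Delta log and data scanning small on a large, non-aggregated event table. The table path, column names, and maintenance command are hypothetical/illustrative and assume a Databricks runtime with a `spark` session.

```python
# Hypothetical non-aggregated security-event table stored as Delta.
events = spark.read.format("delta").load("s3://security-lake/events")

# Filtering on the partition column (event_date) and a clustered column
# lets Delta prune files using transaction-log statistics instead of
# scanning the whole table.
answer = (events
          .where("event_date = '2022-06-01' AND customer_id = 'c-123'")
          .groupBy("attack_type")
          .count())
answer.show()

# Periodic maintenance (Databricks): compact small files and cluster by a
# frequently filtered column so data skipping stays effective.
spark.sql(
    "OPTIMIZE delta.`s3://security-lake/events` ZORDER BY (customer_id)")
```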

Intermittent Demand Forecasting in Scale Using Meta-Modelling (Deep Auto Regressive Linear Dynamic

The presentation covers a novel demand forecasting solution for intermittent time series developed by Walmart, currently used to make granular demand predictions at scale across Walmart stores. The solution addresses the problem of forecasting for slow-moving items, which are characterized by intermittency in time, rendering traditional statistical and time-series models ineffective in these scenarios. It uses a meta-modelling approach combining Linear Dynamic Systems and Deep Auto-Regressive Recurrent Networks, scaled up to deliver accurate demand forecasts across ~35,000 SKUs and ~250 Walmart stores.

Lessons Learned Running RL Recommendation at Scale in Physical Retail Setting at Starbucks

This talk covers the shift in the quick-service restaurant (QSR) experience from static menu boards to dynamic, contextualized recommendations. The brain behind the system connects the Starbucks brand and culture with state-of-the-art AI techniques. We review some of the tactics and lessons learned from running an RL algorithm and deep item collaborative filtering in production for over a year.

MLflow Pipelines: Accelerating MLOps from Development to Production

MLOps is hard, and despite being a much-discussed emerging topic, there are no widely established approaches to it. What makes it even harder is that in many companies ownership of MLOps falls through the cracks between data science teams and production engineering teams. Data scientists are mostly focused on modeling the business problem and reasoning about data, features, and metrics, while production engineers are mostly focused on traditional DevOps for software development, ignoring ML-specific concerns such as ML development cycles, experiment tracking, and data/model validation. In this talk, we introduce MLflow Pipelines, an opinionated approach to MLOps. It provides predefined ML pipeline templates for common ML problems and opinionated development workflows to help data scientists bootstrap ML projects, accelerate model development, and ship production-grade code with little help from production engineers.
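
As a rough sketch of the intended workflow: the API names below are my recollection of the experimental MLflow Pipelines releases (later renamed MLflow Recipes), so treat the module path and method names as assumptions to verify against current MLflow documentation. The sketch also assumes you are working inside one of the predefined pipeline templates.

```python
# Sketch only: assumes a cloned MLflow Pipelines template repo (e.g. the
# regression template) with pipeline.yaml and profiles/ already present.
# Names reflect the experimental releases and may differ in current MLflow.
from mlflow.pipelines import Pipeline

p = Pipeline(profile="local")  # pick a profile defined under profiles/
p.run(step="train")            # run the pipeline up to the train step
p.inspect()                    # visualize step status and cached outputs
p.run()                        # run the remaining steps end to end
```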

Obfuscating Sensitive Information from Spark UI and Logs

The Spark UI and logs contain useful information but can also include sensitive data that needs to be obfuscated.

To obfuscate the data, at Workday we have implemented methods for Apache Spark where the string representations of the TreeNode class can be configured to be obfuscated or non-obfuscated. To do this, we added a custom TreeNode printer for the UI and a custom log4j appender, which uses a list of rules based on class name, package name, and log message regexes to decide whether to obfuscate output, including output from third-party libraries. In the Spark UI and in the logging, this results in the obfuscation of Spark plans and column names.
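
The talk's implementation lives in the JVM (a custom TreeNode printer plus a custom log4j appender), but as a language-neutral illustration of the rule-based idea, here is a small Python logging filter that redacts messages matching hypothetical logger-name and message regexes.

```python
import logging
import re


class ObfuscatingFilter(logging.Filter):
    """Rough analogue of a rule-based obfuscating appender: each rule
    pairs a logger-name regex with a message regex, and matching
    messages have the sensitive span masked before emission."""

    # Hypothetical rules; a real deployment would load these from config.
    RULES = [
        (re.compile(r"^query\.planner"), re.compile(r"Project \[.*\]")),
        (re.compile(r".*"), re.compile(r"column=\w+")),
    ]

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for name_rule, msg_rule in self.RULES:
            if name_rule.match(record.name) and msg_rule.search(message):
                record.msg = msg_rule.sub("[REDACTED]", message)
                record.args = ()
                break
        return True  # never drop the record; only its content is masked


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("query.planner")
logger.addFilter(ObfuscatingFilter())
logger.info("Project [column=ssn, column=salary] over parquet scan")
# The plan fragment and column names are replaced by "[REDACTED]".
```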

In this talk we will go over the steps we took to implement these obfuscation methods and show what the result looks like in the Spark UI and logs. The methods have worked well in production at Workday, and other companies can benefit from implementing them as well.

Recent Parquet Improvements in Apache Spark

Apache Parquet is a very popular columnar file format supported by Apache Spark. In a typical Spark job, scanning Parquet files is often one of the most time-consuming steps, as it incurs high CPU and I/O overhead. Therefore, optimizing Parquet scan performance is crucial to job latency and cost efficiency.

Spark currently has two Parquet reader implementations: a vectorized one and a non-vectorized one. The former was implemented from scratch and offers much better performance than the latter, but it does not yet support complex types (e.g., array, list, map) and falls back to the non-vectorized reader when encountering them. In addition to the reader implementation, predicate pushdown is also crucial to Parquet scan performance, as it enables Spark to skip data that does not satisfy the predicates before the scan. Currently, Spark constructs the predicates itself and relies on Parquet-MR to do the heavy lifting, filtering based on information such as statistics, dictionaries, bloom filters, and column indexes.

This talk will go through two recent improvements to Parquet scan performance: 1) vectorized read support for complex types, which allows Spark to achieve a 10x+ improvement when reading Parquet data with complex types, and 2) Parquet column index support, which enables Spark to leverage the Parquet column index feature during predicate pushdown. Last but not least, Chao will go over some future work items that can further enhance Parquet read performance.
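
For reference, both features correspond to configuration you can toggle when benchmarking. This is a hedged sketch: the config keys reflect my reading of Spark 3.3-era and Parquet-MR documentation and should be verified against the Spark version you run, and the data path and schema are hypothetical.

```python
# Hedged sketch; verify property names against your Spark/Parquet versions.

# Vectorized Parquet reads for nested/complex types.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

# Ordinary predicate pushdown into the Parquet reader.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

# Column-index based page skipping is a Parquet-MR feature, controlled by a
# Hadoop configuration key rather than a Spark SQL conf.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.filter.columnindex.enabled", "true")

# Hypothetical path and schema, just to show the filters being pushed down.
df = spark.read.parquet("s3://my-bucket/events/")
df.where("event_id = 12345").select("payload.items").explain()
```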

Setting up On Shelf Availability Alerts at Scale with Databricks and Azure

Tredence's OSA accelerator is a robust quick-start guide that forms the foundation for a full out-of-stock or supply chain solution. The OSA solution focuses on driving sales through improved stock availability on shelves. The following components make up the OSA accelerator:

• Identifying OOS situations: ML models to identify out-of-stock scenarios in a store at the SKU level, taking into account the level of phantom inventory
• Identifying off-sales behavior: ML models to identify the off-sale behavior of a particular SKU that is attributable to phantom inventory, stock below presentation stock, or improper operations within the store
• Smart alerts: an alert mechanism for store managers and merchandizing reps to maintain healthy stock in the store and increase revenue

Sink Framework Evolution in Apache Flink

Apache Flink is one of the most popular frameworks for unified stream and batch processing. Like every other big data framework, Apache Flink offers connectors for reading from and writing to external systems. We refer to connectors that write to external systems as sinks. Over the years, multiple frameworks for building sinks have existed inside Apache Flink. The Apache Flink community has also noticed the recent trend of ingesting real-time data directly into data lakes for further use. Therefore, with Apache Flink 1.15, we released the next iteration of our sink framework, designed to accommodate the needs of modern data lake connectors, e.g., lazy file compaction and user-defined shuffling.

In this talk, we first give a brief historical glimpse of the evolution of these frameworks, which started as little more than a simple map operation and grew into a custom operator model that simplifies two-phase commit semantics. Secondly, we take a deep dive into Apache Flink’s fault tolerance model to explain how the latest iteration of the sink framework supports exactly-once processing and the complex operations that matter for data lakes. In summary, this talk introduces the principles behind the sink framework in Apache Flink and gives developers a starting point for building a new connector for Apache Flink.

Analytics Engineering and the Great Convergence | Tristan Handy | Keynote | Data + AI Summit 2022

We've come a long way from the way data analysis used to be done. The emergence of the analytics engineering workflow, with dbt at its center, has helped usher in a new era of productivity. Not quite data engineering or data analysis, analytics engineering has enabled new levels of collaboration between two key sets of practitioners.

But that's not the only coming together happening right now. Enabled by the open lakehouse, the worlds of data analysis and AI/ML are also converging under a single roof, hinting at a new future of intertwined workloads and silo-free collaboration. It's a future that's tantalizing, and entirely within reach. Let's talk about making it happen.

Apache Spark AQE SkewedJoin Optimization and Practice in ByteDance

Data skew is unavoidable in almost all distributed computing systems. If it is not dealt with, long-tail tasks seriously slow down job execution or even cause the job to fail. In this talk, we introduce how Spark AQE handles skewed joins and how we optimized the implementation for the workloads at ByteDance. The main points are as follows:

1. Addressing the risks associated with increasing statistical accuracy, to solve the problem of data skew going unidentified.
2. Optimizing the split logic for skewed data to achieve a better optimization effect.
3. Supporting more complex optimization scenarios than the community's implementation, covering essentially all skewed-join scenarios.

As of February 2021, the Spark AQE SkewedJoin optimization covered 13,000+ Spark jobs per day at ByteDance, and the average performance of optimized Spark jobs increased by 35%.
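
The ByteDance changes described above are extensions on top of the standard open source AQE skew-join settings; as a point of reference, here is a minimal sketch of those knobs. The threshold values are illustrative rather than recommendations, and the joined DataFrames are hypothetical.

```python
# Standard open source Spark AQE skew-join settings (the ByteDance
# extensions in the talk build on top of these).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A shuffle partition is treated as skewed if it is this many times larger
# than the median partition size...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
# ...and also larger than this absolute threshold.
spark.conf.set(
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

# Skewed partitions are then split into smaller tasks at join time instead
# of being processed by a single long-tail task, e.g. (hypothetical data):
# orders.join(users, "user_id").count()
```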

Data-Centric AI Development: From Big Data to Good Data | Andrew Ng

Data-centric AI is a growing movement that shifts the engineering focus in AI systems from the model to the data. However, data-centric AI faces many open challenges, including measuring data quality; iterating on and engineering data as part of the ML project workflow; data management tools; crowdsourcing; data augmentation and data synthesis; and responsible AI. This talk names the key pillars of data-centric AI, identifies the trends in the data-centric AI movement, and sets a vision for taking ideas applied intuitively by a handful of experts and synthesizing them into tools that make their application systematic for all.

Day 1 Afternoon Keynote |  Data + AI Summit 2022

Day 1 Afternoon Keynote | Data + AI Summit 2022
• Supercharging our data architecture at Coinbase using Databricks Lakehouse | Eric Sun
• Partner Connect & Ecosystem Strategy | Zaheera Valani
• What are ELT and CDC, and why are all the cool kids doing it? | George Fraser
• Analytics without Compromise | Francois Ajenstat
• Fireside Chat with Zhamak Dehghani and Arsalan Tavakoli

Day 1 Morning Keynote | Data + AI Summit 2022

Day 1 Morning Keynote | Data + AI Summit 2022
• Welcome & "Destination Lakehouse" | Ali Ghodsi
• Apache Spark Community Update | Reynold Xin
• Streaming Lakehouse | Karthik Ramasamy
• Delta Lake | Michael Armbrust
• How Adobe migrated to a unified and open data Lakehouse to deliver personalization at unprecedented scale | Dave Weinstein
• Data Governance and Sharing on Lakehouse | Matei Zaharia
• Analytics Engineering and the Great Convergence | Tristan Handy
• Data Warehousing | Shant Hovespian
• Unlocking the power of data, AI & analytics: Amgen’s journey to the Lakehouse | Kerby Johnson

Get insights on how to launch a successful lakehouse architecture in Rise of the Data Lakehouse by Bill Inmon, the father of the data warehouse. Download the ebook: https://dbricks.co/3ER9Y0K
