talk-data.com

Topic: Databricks

Tags: big_data · analytics · spark

1286 tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities: 1286 activities · Newest first

Sound Data Engineering in Rust—From Bits to DataFrames

Spark applications often need to query external data sources, such as file-based or relational sources. To support this, Spark provides Data Source APIs for accessing structured data through Spark SQL.

The Data Source APIs include optimization rules, such as filter push down and column pruning, that reduce the amount of data to be processed and so improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down for both JDBC and Parquet.
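Aggregate push down of this kind is enabled per source. A minimal PySpark sketch of the relevant configuration follows; the connection URL and table names are placeholders, and the exact options available depend on your Spark version:

```python
# Sketch: enabling aggregate push down for Parquet and JDBC sources.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("agg-pushdown-demo")
    # Answer MIN/MAX/COUNT aggregates from Parquet metadata where possible
    .config("spark.sql.parquet.aggregatePushdown", "true")
    .getOrCreate()
)

# For JDBC sources, push down is requested per read via a source option
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")  # placeholder URL
    .option("dbtable", "orders")                            # placeholder table
    .option("pushDownAggregate", "true")
    .load()
)

# With push down in effect, the database computes the aggregate and only
# the aggregated rows cross the wire to Spark; explain() shows the plan.
df.groupBy("region").count().explain()
```

Since this is configuration rather than logic, the main thing to verify locally is the physical plan: with push down active, the scan node reports the pushed aggregate instead of a full column read.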

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

State-of-the-Art Natural Language Processing with Apache Spark NLP

This session teaches how & why to use the open-source Spark NLP library. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of recent research advances. Spark NLP is the most widely used NLP library in the enterprise today; provides thousands of current, supported, pre-trained models for 200+ languages out of the box; and is the only open-source NLP library that can natively scale to use any Apache Spark cluster.

We’ll walk through Python code running common NLP tasks like document classification, named entity recognition, sentiment analysis, spell checking, question answering, and translation. The discussion of each task includes the latest advances in deep learning and transfer learning used to tackle it. We’ll also cover new free tools for data annotation, no-code active learning & transfer learning, easily deploying NLP models as production-grade services, and sharing models you’ve trained.
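As a taste of the high-level API the session walks through, a pretrained Spark NLP pipeline can be loaded and applied in a few lines. This sketch assumes the spark-nlp package is installed and can download the pretrained model on first use:

```python
# Sketch: running a pretrained Spark NLP pipeline locally.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Provisions a local Spark session configured for Spark NLP
spark = sparknlp.start()

# "explain_document_dl" bundles tokenization, spell checking,
# part-of-speech tagging, and named entity recognition
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("John Snow Labs is based in Delaware.")
print(result["entities"])  # named entities recognized in the sentence
```

The same `annotate` call scales out unchanged when applied to a DataFrame column on a cluster, which is the library's core selling point.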

Technical and Tactical Football Analysis Through Data

How LaLiga uses and combines event and tracking data to implement novel analytics and metrics, helping analysts better understand the technical and tactical aspects of their clubs.

This presentation will explain how these data are processed and then used to create metrics and analytical models.

The Databricks Notebook: Front Door of the Lakehouse

One of the greatest data challenges organizations face is the sprawl of disparate toolchains, multiple vendors, and siloed teams. This can result in each team working on their own subset of data, preventing the delivery of cohesive and comprehensive insights and inhibiting the value that data can provide. This problem is not insurmountable, however; it can be fixed by a collaborative platform that enables users of all personas to discover and share data insights with each other.

Whether you're a marketing analyst or a data scientist, the Databricks Notebook is that single platform that lets you tap into the awesome power of the Lakehouse. The Databricks Notebook supercharges data teams’ ability to collaborate, explore data, and create data assets like tables, pipelines, reports, dashboards, and ML models—all in the language of users’ choice.

Join this session to discover how the Notebook can unleash the power of the Lakehouse. You will also learn about new data visualizations, the introduction of ipywidgets and bamboolib, workflow automation and orchestration, CI/CD, and integrations with MLflow and Databricks SQL.

Turning Fan Data Into an Asset

The sports industry is evolving, with organizations investing heavily into their tech stack. Simply connecting data, however, is not enough. Unexpected insights can come from anyone — but only if they have tools they can easily use.

Enter Pumpjack Dataworks, the world’s leading fan data refinery. This platform enables clubs, leagues, and sponsors to discover digital revenue channels by joining consumer data while assuring its protection. Powered by Databricks’ Delta Sharing protocol and governed by Immuta’s data access platform, this solution democratizes fan data, making it immediately accessible.

Join us to learn how automating access control provides scalable, rule-driven assurance that data is properly managed and analyzed.

You’ll discover:
• How Pumpjack Dataworks uses attribute-based access control to meet privacy regulations and data sharing requirements
• How Databricks and Immuta’s partnership enables robust governance
• Why ABAC is critical for scale within modern data stacks
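The platforms named here are commercial products, but the attribute-based access control idea itself is simple to illustrate. In this toy sketch (all names hypothetical), the access decision is a predicate over user and resource attributes rather than a per-user grant list:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    name: str
    attributes: frozenset  # e.g. {"region:EU", "role:analyst"}

@dataclass(frozen=True)
class Resource:
    name: str
    required: frozenset    # attributes a user must hold to read this resource

def can_read(user: User, resource: Resource) -> bool:
    # ABAC: the decision depends only on attributes, not identity, so
    # onboarding a new user or dataset needs no per-user grant updates.
    return resource.required <= user.attributes

fan_pii = Resource("fan_pii", frozenset({"role:analyst", "purpose:marketing"}))
alice = User("alice", frozenset({"role:analyst", "purpose:marketing", "region:EU"}))
bob = User("bob", frozenset({"role:engineer"}))

print(can_read(alice, fan_pii))  # → True
print(can_read(bob, fan_pii))    # → False
```

This is why ABAC scales: the policy count grows with the number of rules, not with users × datasets.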

A Case Study in Rearchitecting an On-Premises Pipeline in the Cloud

This talk will give a detailed discussion of some of the many considerations that must be taken into account when rebuilding on-premises data pipelines in the cloud. I will give an initial overview of the original pipeline and the reasons we chose to migrate it to Azure. Next, I will discuss the decisions that led to the architecture we used to replace the original pipeline, and give a thorough overview of the new cloud pipeline, including design components and networking. I will also discuss the many lessons we learned along the way to successfully migrating this pipeline.

A Low-Code Approach to 10x Data Engineering

Can we take Data Engineering on Spark 10x beyond where it is today?

Yes, we can enable 10x more users on Spark, and make them 10x more productive from day 1. Data engineering can run at scale, and it can still be 10x simpler and faster to develop, deploy, and manage pipelines.

Low code is the key. A modern data engineering platform built on low code will enable all data users, from new graduates to experts, to visually develop high-quality pipelines. With Visual = Code, the visual elements will be stored as PySpark code on Git and deployed using the best software practices taken from DevOps. Search and lineage help data engineers and their customers in analytics understand how each column value was produced, when it was updated, and the associated quality metric.

See how a complete, low-code data engineering platform can reduce complexity and effort, enabling you to rapidly deploy, scale, and use Spark, making data and analytics a strategic asset in your company.

Data Governance and Sharing on Lakehouse | Matei Zaharia | Keynote Data + AI Summit 2022

Data + AI Summit Keynote talk from Matei Zaharia on Data Governance and Sharing on Lakehouse

How To Make Apache Spark on Kubernetes Run Reliably on Spot Instances

Since Apache Spark’s native support for running on Kubernetes became generally available with Spark 3.1 in March 2021, the Spark community has increasingly chosen to run on k8s to benefit from containerization, efficient resource-sharing, and the tools of the cloud-native ecosystem.

Data teams are faced with complexities in this transition, including how to leverage spot VMs. These instances enable up to 90% cost savings but are not guaranteed to be available and face the risk of termination. This session will cover concrete guidelines on how to make Spark run reliably on spot instances, with code examples from real-world use cases.

Main topics:
• Using spot nodes for Spark executors
• Mixing instance types & sizes to reduce the risk of spot interruptions - cluster autoscaling
• Spark 3.0: Graceful Decommissioning - preserve shuffle files on executor shutdown
• Spark 3.1: PVC reuse on executor restart - disaggregate compute & shuffle storage
• What to look for in future Spark releases
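The decommissioning and PVC-reuse topics above correspond to session-level configuration. This sketch shows the relevant knobs as they appear in recent Spark releases; property names and defaults vary by version, so verify them against your own Spark documentation:

```python
# Sketch: Spark-on-Kubernetes settings relevant to spot executors.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Graceful decommissioning: on an interruption notice, migrate data
    # off the doomed executor instead of losing it outright
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    # PVC reuse: let the driver own executor PVCs and re-attach them to
    # replacement executors, so shuffle files survive an executor restart
    .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
    .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
    .getOrCreate()
)
```

The same properties can equally be passed as `--conf` flags to `spark-submit`; the builder form is shown only for compactness.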

How To Use Databricks SQL for Analytics on Your Lakehouse

Most organizations run complex cloud data architectures that silo applications, users, and data. As a result, most analysis is performed with stale data and there isn’t a single source of truth of data for analytics.

Join this interactive follow-along deep dive demo to learn how Databricks SQL allows you to operate a multicloud lakehouse architecture that delivers data warehouse performance at data lake economics — with up to 12x better price/performance than traditional cloud data warehouses. Now data analysts and scientists can work with the freshest and most complete data and quickly derive new insights for accurate decision-making.

Here’s what we’ll cover:
• Managing data access and permissions and monitoring how the data is being used and accessed in real time across your entire lakehouse infrastructure
• Configuring and managing compute resources for fast performance, low latency, and high user concurrency to your data lake
• Creating and working with queries, dashboards, query refresh, troubleshooting features and alerts
• Creating connections to third-party BI and database tools (Power BI, Tableau, DbVisualizer, etc.) so that you can query your lakehouse without making changes to your analytical and dashboarding workflows
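On the connectivity point, tools are not limited to BI clients: plain Python can query a Databricks SQL warehouse through the databricks-sql-connector package. In this sketch the hostname, HTTP path, token, and table name are all placeholders:

```python
# Sketch: querying a Databricks SQL warehouse from Python.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="dapi...",                                        # placeholder
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT region, COUNT(*) AS orders "
            "FROM sales.orders GROUP BY region"
        )
        for row in cursor.fetchall():
            print(row)
```

The connector speaks standard DB-API-style cursors, so existing Python analytics code can usually be pointed at the lakehouse with only a connection-string change.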

How unsupervised machine learning can scale data quality monitoring in Databricks

Technologies like Databricks Delta Lake and Databricks SQL enable enterprises to store and query their data. But existing rules and metrics approaches to monitoring the quality of this data are tedious to set up and maintain, fail to catch unexpected issues, and generate false positive alerts that lead to alert fatigue.

In this talk, Jeremy will describe a set of fully unsupervised machine learning algorithms for monitoring data quality at scale in Databricks. He will cover how the algorithms work, their strengths and weaknesses, and how they are tested and calibrated.

Participants will leave this talk with an understanding of unsupervised data quality monitoring, its strengths and weaknesses, and how to begin monitoring data using it in Databricks.
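The talk's specific algorithms are the speaker's own. As a minimal illustration of the unsupervised idea, though, one can flag a daily metric (here, a table's row counts) that falls far outside its own recent distribution, with no hand-written rule or threshold per table:

```python
import statistics

def flag_anomalies(history, threshold=3.0):
    """Return indices of values beyond `threshold` robust z-scores.

    Uses median/MAD rather than mean/stddev so a single bad day does
    not mask itself by inflating the baseline it is judged against.
    """
    median = statistics.median(history)
    mad = statistics.median(abs(x - median) for x in history) or 1.0
    return [i for i, x in enumerate(history)
            if abs(x - median) / (1.4826 * mad) > threshold]

# Daily row counts for a table; day 6 silently collapsed to near zero.
counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_870, 130, 10_100]
print(flag_anomalies(counts))  # → [6]
```

Production systems add seasonality handling, per-column metrics, and alert calibration on top, but the core move is the same: learn the baseline from the data itself.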

Lessons Learned from Deidentifying 700 Million Patient Notes

Providence embarked on an ambitious journey to de-identify all our clinical electronic medical record (EMR) data to support medical research and the development of novel treatments. This talk shares how this was done for patient notes and how you can achieve the same.

First, we built a deidentification pipeline using pre-trained deep learning models, fine-tuned on our own data. We then developed an innovative methodology to evaluate reidentification risk, as American healthcare law (HIPAA) requires that de-identified data have a “very low” risk of reidentification but does not specify a standard. Our next challenge was to annotate a dataset large enough to produce meaningful statistics and improve the fine-tuning of our model. Finally, through experimentation and iteration, we achieved a level of performance that would safeguard patient privacy while minimizing information loss. Our technology partner provided the computing power to efficiently process hundreds of millions of records of historical data and incremental daily loads.

Through this endeavor, we have learned many lessons that we will share:

• Evaluating risk of reidentification to meet HIPAA requirements
• Annotating samples of data to create labeled datasets
• Performing experiments and evaluating performance
• Fine-tuning pre-trained models with your own data
• Augmenting models with rules and other tricks
• Optimizing clusters to process very large volumes of text data
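Providence's pipeline is built on fine-tuned deep learning models; the "rules and other tricks" that augment such models can be sketched far more simply. This toy regex redactor shows the shape of a rule-based baseline only — the patterns are illustrative and nowhere near HIPAA Safe Harbor coverage:

```python
import re

# Illustrative patterns only; a real pipeline relies on NER models plus
# many more rules to approach the required identifier coverage.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d+\b"),
}

def redact(note: str) -> str:
    """Replace each matched identifier with a typed placeholder tag."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"<{label}>", note)
    return note

note = "Pt seen 03/14/2022, MRN: 884512, callback 555-867-5309."
print(redact(note))  # → "Pt seen <DATE>, <MRN>, callback <PHONE>."
```

Typed placeholders (rather than deletion) are what keep information loss low: downstream researchers still know a date or phone number was present, just not which one.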

We will also present speed and throughput metrics from running our pipeline, which you can use to benchmark similar projects.

Achieve Machine Learning Hyper-Productivity with Transformers and Hugging Face

According to the latest State of AI report, "transformers have emerged as a general-purpose architecture for ML. Not just for Natural Language Processing, but also Speech, Computer Vision or even protein structure prediction." Indeed, the Transformer architecture has proven very efficient on a wide variety of Machine Learning tasks. But how can we keep up with the frantic pace of innovation? Do we really need expert skills to leverage these state-of-the-art models? Or is there a shorter path to creating business value in less time? In this code-level talk, we'll gradually build and deploy a demo involving several Transformer models. Along the way, you'll learn about the portfolio of open source and commercial Hugging Face solutions, and how they can help you become hyper-productive, delivering high-quality Machine Learning solutions faster than ever before.
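The "shorter path" the abstract alludes to is typified by the Transformers `pipeline` API, which hides model selection, tokenization, and post-processing behind one call. A minimal sketch, assuming the transformers package is installed and a default model can be downloaded on first run:

```python
# Sketch: the high-level Transformers pipeline API.
from transformers import pipeline

# With no model specified, a default pretrained checkpoint is fetched
classifier = pipeline("sentiment-analysis")

result = classifier("This session was absolutely worth attending!")
print(result)
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] -- exact score varies by model
```

Swapping the task string (`"translation"`, `"question-answering"`, etc.) or passing an explicit `model=` name is usually all that changes between use cases, which is the productivity argument in a nutshell.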

Administrator Best Practices and Tips for Future-Proofing your Databricks Account

You’ve started using Databricks. Your initial set of users, access controls, and workspace are configured. Everything is great, but you’re onboarding more and more teams to the platform. That initial pilot has become an enterprise deployment. How do you make the most of the Databricks administration capabilities to operate at scale? In this session, you will receive best practice guidance from our product team, see demos of some of the new capabilities, and get a sneak peek into the upcoming roadmap. You will also learn how other customers set up their Databricks deployments. If you are a Databricks administrator, this session will help you speed up onboarding for new users and give them a positive experience, while following best practices to secure and manage your Databricks workspace.

Advanced Migrations: From Hive to SparkSQL

Learn how Pinterest moved over 6,000 Hive queries to SparkSQL, achieved a 2x runtime-weighted speedup, and made significant savings in compute resources. To do migrations at this scale, companies often take one of two approaches: either employ hundreds of engineers to migrate manually, or completely change the query engine to be compatible with Hive, both of which take significant engineering time. In this session you will learn how Pinterest took a hybrid approach, and the tools and tricks Pinterest used to safely migrate thousands of queries at scale.
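Pinterest's actual tooling is not detailed here, but a common trick in this kind of migration is a dual-run harness: execute each query on both engines and diff the results before cutover. A plain-Python sketch with hypothetical inputs, tolerating row-order and floating-point drift between engines:

```python
def rows_match(hive_rows, spark_rows, float_tol=1e-6):
    """Compare two query result sets order-insensitively."""
    def normalize(rows):
        # Sort rows so engines may return them in any order
        return sorted(
            tuple(round(v, 9) if isinstance(v, float) else v for v in row)
            for row in rows
        )
    a, b = normalize(hive_rows), normalize(spark_rows)
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            if isinstance(va, float) and isinstance(vb, float):
                if abs(va - vb) > float_tol:   # engines round differently
                    return False
            elif va != vb:
                return False
    return True

# Same aggregate computed by both engines, rows in different order
hive  = [("us", 1042, 3.140000001), ("eu", 980, 2.71)]
spark = [("eu", 980, 2.71), ("us", 1042, 3.14)]
print(rows_match(hive, spark))  # → True
```

Run over a sample of production queries, a harness like this turns "is the migration safe?" into a measurable pass rate rather than a judgment call.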

AI powered Assortment Planning Solution

For shop owners to maximize revenue, they need to ensure that the right products are available on the right shelf at the right time. So, how does one assort the right mix of products to maximize profit and reduce inventory pressure? Today, these decisions are led by human knowledge of trends and input from salespeople. This is error-prone and cannot scale with a growing product assortment and varying demand patterns. Mindtree has analyzed this problem and built a cloud-based AI/ML solution that provides contextual, real-time insights and optimizes inventory management. In this presentation, you will hear our solution approach to help a global CPG organization promote new products, increase demand across their product offerings, and drive impactful insights. You will also learn about the technical solution architecture, orchestration of product and KPI generation using Databricks, AI/ML models, and heterogeneous cloud platform options for deployment and rollout, scale-up, and scale-out.

A Practitioner's Guide to Unity Catalog—A Technical Deep Dive

As a practitioner, managing and governing data assets and ML models in the data lakehouse is critical for your business initiatives to be successful. With Databricks Unity Catalog, you have a unified governance solution for all data and AI assets in your lakehouse, giving you much better performance, management, and security on any cloud. When deploying Unity Catalog for your lakehouse, you must be prepared with best practices to ensure a smooth governance implementation. This session will cover key considerations for a successful implementation, such as:
• How to manage Unity Catalog’s metastore and understand various usage patterns
• How to use identity federation to assign account principals to a Databricks workspace
• Best practices for leveraging cloud storage, managed tables, and external tables with Unity Catalog
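Unity Catalog organizes assets in a three-level namespace (catalog.schema.table), and day-to-day governance is expressed as SQL. A sketch of the kind of statements involved, run from a Databricks notebook where `spark` is predefined; the catalog, schema, table, and group names are placeholders:

```python
# Sketch: creating a namespace and granting access in Unity Catalog.
statements = [
    "CREATE CATALOG IF NOT EXISTS sales",
    "CREATE SCHEMA IF NOT EXISTS sales.bronze",
    # Grants cascade down the hierarchy: the group needs USE on the
    # catalog and schema before table-level SELECT is effective
    "GRANT USE CATALOG ON CATALOG sales TO `data-analysts`",
    "GRANT USE SCHEMA ON SCHEMA sales.bronze TO `data-analysts`",
    "GRANT SELECT ON TABLE sales.bronze.orders TO `data-analysts`",
]
for stmt in statements:
    spark.sql(stmt)  # `spark` is the session Databricks notebooks provide
```

Granting to account-level groups rather than individual users is what makes the identity-federation point above pay off: workspace membership changes do not require re-granting.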

Meshing About with Databricks

Large enterprises are increasingly de-centralizing their data teams to increase overall business agility. The cloud has been a big enabler for teams to become more autonomous in the data products they prioritize, the technology they choose, and the ability to attribute costs granularly.

In order for organizations to successfully realize such aspirations, it is in their best interest to shift from centralized teams and centralized technology to a more distributed ecosystem built around business domains.

The data mesh is an architecture paradigm that many enterprises are looking to adopt to realize this vision. It proposes that distributed autonomous domains leverage self-serve data infrastructure as a platform to enable their work of creating and maintaining sharable data products.

This session will explain how Databricks can be used to implement a Data Mesh across an enterprise.

We will demonstrate how:
- A new data team can be onboarded quickly
- Consumers can discover data products and their lineage
- Domains can publish data products and set governance policies
- Data can be accessed within and external to the enterprise
- Analysis can be shared

Migrate and Modernize your Data Platform with Confluent and Databricks

Moving and building in the cloud to accelerate analytics development requires enterprises to rethink their data infrastructure. Whether you are moving from an on-prem legacy system or you were born in the cloud, businesses are turning to Confluent and Databricks to help them unlock new real-time customer experiences and intelligence for their backend operations.

Join us to see how Confluent and Databricks enable companies to set data in motion across any system, at any scale, in near real-time. Connecting Confluent with Databricks allows companies to migrate and connect data from on-prem databases and data warehouses like Netezza, Oracle, and Cloudera to Databricks in the cloud to power real-time analytics.
