talk-data.com

Topic

AWS

Amazon Web Services (AWS)

cloud cloud provider infrastructure services

Activities

Filtering by: Databricks DATA + AI Summit 2023

Building a Lakehouse on AWS for Less with AWS Graviton and Photon

AWS Graviton processors are custom-designed by AWS to enable the best price performance for workloads in Amazon EC2. In this session we will review benchmarks that demonstrate how AWS Graviton-based instances run Databricks workloads at a lower price and with better performance than x86-based instances on AWS. When combined with Photon, the new Databricks engine, the price-performance gains are even greater. Learn how you can optimize your Databricks workloads on AWS and save more.

Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

Data is the key component of any analytics, AI, or ML platform. Organizations are unlikely to succeed without a platform that can source, transform, and quality-check data and present it in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build the data storage (Redshift) at a level that AI/ML programs can easily consume. It does so by combining AWS services with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). The presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for 6+ years, with 400+ nightly batch jobs, for $1000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.

Securing Databricks on AWS Using Private Link

Minimizing data transfers over the public internet is among the top priorities for organizations of any size, both for security and cost reasons. Modern cloud-native data analytics platforms need to support deployment architectures that meet this objective. For Databricks on AWS such an architecture is realized thanks to AWS PrivateLink, which allows computing resources deployed on different virtual private networks and different AWS accounts to communicate securely without ever crossing the public internet.

In this session, we will give a brief introduction to AWS PrivateLink and its main use cases in the context of a Databricks deployment: securing communications between the control plane and the data plane, and securely connecting to the Databricks Web UI. We will then provide a step-by-step walkthrough of setting up PrivateLink connections with a Databricks deployment and demonstrate how to automate that process using AWS CloudFormation or Terraform templates.

In this presentation we will cover the following topics:
- Brief introduction to AWS PrivateLink
- How you can use PrivateLink to secure your AWS Databricks deployment
- Step-by-step walkthrough of how to set up PrivateLink
- How to automate and scale the setup using AWS CloudFormation or Terraform

Streaming Data into Delta Lake with Rust and Kafka

Scribd's data architecture was originally batch-oriented, but in the last couple years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications.

Kafka and Delta Lake are the two key components of our streaming ingestion pipeline. Various applications and services write messages to Kafka as events are happening. We were tasked with getting these messages into Delta Lake quickly and efficiently.

Our first solution was to deploy Spark Structured Streaming jobs. This got us off the ground quickly, but had some downsides.

Since Delta Lake and the Delta transaction protocol are open source, we kicked off a project to implement our own Rust ingestion daemon. We were confident we could deliver a Rust implementation since our ingestion jobs are append only. Rust offers high performance with a focus on code safety and modern syntax.

In this talk I will describe Scribd's unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling ECS services. I will discuss foundational design aspects for achieving data integrity, such as distributed locking with DynamoDB to overcome S3's lack of "PutIfAbsent" semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I'll highlight the reliability and performance characteristics we've observed so far. I'll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.
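The locking idea mentioned above can be illustrated with a small sketch. This is not the kafka-delta-ingest or delta-rs implementation: `ObjectStore`, `LockTable`, and `commit` are hypothetical in-memory stand-ins, assuming only that the object store overwrites silently and that the lock table offers an atomic "insert only if absent" write. The point is that a writer checks and writes the next log entry only while holding the lock, so two concurrent writers cannot both claim the same version.

```python
import threading


class ObjectStore:
    """Hypothetical stand-in for a store whose writes overwrite silently."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body  # last writer wins, no put-if-absent

    def exists(self, key):
        return key in self.objects


class LockTable:
    """Hypothetical stand-in for a table with an atomic conditional insert."""

    def __init__(self):
        self._rows = {}
        self._mutex = threading.Lock()

    def acquire(self, key, owner):
        # Atomic "insert only if the key is absent".
        with self._mutex:
            if key in self._rows:
                return False
            self._rows[key] = owner
            return True

    def release(self, key, owner):
        # Only the current owner may release the lock.
        with self._mutex:
            if self._rows.get(key) == owner:
                del self._rows[key]


def commit(store, locks, table, version, owner, body):
    """Try to write log entry `version`; return True only for the single winner."""
    key = f"{table}/_delta_log/{version:020d}.json"
    if not locks.acquire(table, owner):
        return False  # another writer holds the lock; the caller retries later
    try:
        if store.exists(key):
            return False  # version already taken; the caller retries at a later version
        store.put(key, body)
        return True
    finally:
        locks.release(table, owner)
```

A caller that gets `False` would typically re-read the log, pick the next free version, and try again. A production lock would also need expiry so that a crashed writer does not block the table indefinitely; that is left out here.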

Turning Big Biology Data into Insights on Disease – The Power of Circulating Biomarkers

Profiling small molecules in human blood across global populations gives rise to a greater understanding of the varied biological pathways and processes that contribute to human health and diseases. Herein, we describe the development of a comprehensive Human Biology Database, derived from nontargeted molecular profiling of over 300,000 human blood samples from individuals across diverse backgrounds, demographics, geographical locations, lifestyles, diseases, and medication regimens, and its applications to inform drug development.

Approximately 11,000 circulating molecules have been captured and measured per sample using Sapient’s high-throughput, high-specificity rapid liquid chromatography-mass spectrometry (rLC-MS) platform. The samples come from cohorts with adjudicated clinical outcomes from prospective studies lasting 10-25 years, as well as data on individuals’ diet, nutrition, physical exercise, and mental health. Genetic information for a subset of subjects is also included and we have added microbiome sequencing data from over 150,000 human samples in diverse diseases.

An efficient data science environment is established to enable effective health insight mining across this vast database. Built on a customized AWS and Databricks “infrastructure-as-code” Terraform configuration, we employ streamlined data ETL and machine learning-based approaches for rapid rLC-MS data extraction. In mining the database, we have been able to identify circulating molecules potentially causal to disease; illuminate the impact of human exposures like diet and environment on disease development, aging, and mortality over decades of time; and support drug development efforts through identification of biomarkers of target engagement, pharmacodynamics, safety, efficacy, and more.

Cloud Fetch: High-bandwidth Connectivity With BI Tools

Business Intelligence (BI) tools such as Tableau and Microsoft Power BI are notoriously slow at extracting large query results from traditional data warehouses because they typically fetch the data in a single thread through a SQL endpoint, which becomes a data transfer bottleneck. Data analysts can connect their BI tools to Databricks SQL endpoints to query data in tables through an ODBC/JDBC protocol integrated in our Simba drivers. With Cloud Fetch, which we released in Databricks Runtime 8.3 and the Simba ODBC 2.6.17 driver, we introduce a new mechanism for fetching data in parallel via cloud storage, such as AWS S3 and Azure Data Lake Storage, to bring the data to BI tools faster. In our experiments using Cloud Fetch, we observed a 10x speed-up in extract performance due to parallelism.
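The contrast between single-threaded and parallel extraction can be sketched in a few lines. This is only an illustration of the general idea, not the Cloud Fetch protocol or the Simba driver: `FAKE_STORAGE`, `fetch_chunk`, `fetch_serial`, and `fetch_parallel` are hypothetical names, and cloud storage is simulated with a dictionary. It assumes the result set has already been split into chunks that can be downloaded independently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for cloud storage: chunk name -> rows in that chunk.
FAKE_STORAGE = {
    f"chunk-{i}": [(i * 3 + j, f"row-{i * 3 + j}") for j in range(3)]
    for i in range(4)
}


def fetch_chunk(name):
    """Download one result chunk (simulated here by a dictionary lookup)."""
    return FAKE_STORAGE[name]


def fetch_serial(names):
    """Baseline: one thread pulls every chunk in turn."""
    rows = []
    for name in names:
        rows.extend(fetch_chunk(name))
    return rows


def fetch_parallel(names, max_workers=4):
    """Download chunks concurrently, then stitch them together in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = []
        # pool.map yields results in input order, so row order is preserved.
        for chunk in pool.map(fetch_chunk, names):
            rows.extend(chunk)
        return rows


names = sorted(FAKE_STORAGE)
```

Both functions return the same rows in the same order; with real network downloads, the parallel version overlaps the waiting time of the individual chunk requests instead of paying for each one in sequence.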

How McAfee Leverages Databricks on AWS at Scale

McAfee, a global leader in online protection security, enables home users and businesses to stay ahead of fileless attacks, viruses, malware, and other online threats. Learn how McAfee leverages Databricks on AWS to create a centralized data platform as a single source of truth to power customer insights. We will also describe how McAfee uses additional AWS services, specifically Kinesis and CloudWatch, to provide real-time data streaming and to monitor and optimize its Databricks on AWS deployment. Finally, we'll discuss business benefits and lessons learned during McAfee's petabyte-scale migration to Databricks on AWS using Databricks Delta clone technology coupled with network, compute, and storage optimizations on AWS.

Presto 101: An Introduction to Open Source Presto

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Philip and Rohan will introduce the Presto technology and share why it's becoming so popular. Companies like Facebook, Uber, Twitter, Alibaba, and many more use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. We'll also show a quick demo on getting Presto running in AWS.

ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

Self-Serve, Automated and Robust CDC pipeline using AWS DMS, DynamoDB Streams and Databricks Delta

Many companies are trying to solve the challenges of ingesting transactional data into a data lake and dealing with late-arriving updates and deletes.

To address this at Swiggy, we have built a CDC (Change Data Capture) system, an incremental processing framework that powers all business-critical data pipelines at low latency and high efficiency.

It offers:
* Freshness: It operates in near real time with configurable latency requirements.
* Performance: Read and write performance is optimized through tuned compaction parameters, partitioning, and Delta table optimization.
* Consistency: It supports reconciliation based on transaction types, that is, applying inserts, updates, and deletes to existing data.

To implement this system, AWS DMS helped us with initial bootstrapping and CDC replication for MySQL sources. AWS Lambda and DynamoDB Streams helped us solve bootstrapping and CDC replication for DynamoDB sources.

After setting up the bootstrap and CDC replication process, we used Databricks Delta merge to reconcile the data based on the transaction types.

To support the merge, we have implemented the following supporting features:
* Deduplicating multiple mutations of the same record using log offset and timestamp.
* Choosing an optimal partitioning of the data set.
* Inferring the schema and applying proper schema evolutions (backward-compatible schema).
* Extending the Delta table snapshot generation technique to create a consistent partition for partitioned Delta tables.
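The deduplication and reconciliation steps can be sketched in plain Python. This is a simplified illustration, not Swiggy's framework or the Delta merge API: the event fields (`id`, `op`, `offset`, `ts`, `row`) and the functions `latest_per_key` and `apply_changes` are assumptions made for the example, and the target table is modelled as a dictionary keyed by primary key.

```python
def latest_per_key(events):
    """Keep only the newest mutation per primary key, ordered by (offset, ts)."""
    latest = {}
    for ev in events:
        current = latest.get(ev["id"])
        if current is None or (ev["offset"], ev["ts"]) > (current["offset"], current["ts"]):
            latest[ev["id"]] = ev
    return latest


def apply_changes(table, events):
    """Reconcile a batch of CDC events into `table` by transaction type."""
    merged = dict(table)  # leave the input table untouched
    for key, ev in latest_per_key(events).items():
        if ev["op"] == "delete":
            merged.pop(key, None)
        else:  # "insert" and "update" both upsert the row
            merged[key] = ev["row"]
    return merged


table = {1: {"status": "placed"}, 2: {"status": "placed"}}
events = [
    {"id": 1, "op": "update", "offset": 10, "ts": 100, "row": {"status": "packed"}},
    {"id": 1, "op": "update", "offset": 12, "ts": 105, "row": {"status": "delivered"}},
    {"id": 2, "op": "delete", "offset": 11, "ts": 101, "row": None},
    {"id": 3, "op": "insert", "offset": 13, "ts": 106, "row": {"status": "placed"}},
    # Late-arriving, older mutation of record 1: it must not win.
    {"id": 1, "op": "update", "offset": 11, "ts": 102, "row": {"status": "shipped"}},
]
result = apply_changes(table, events)
```

Here record 1 ends up as `delivered` even though an older `shipped` mutation arrives last in the batch, record 2 is removed, and record 3 is added. Deduplicating before merging matters because a merge generally needs at most one source row per target key.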

Finally, to read the data we use Spark SQL with the Hive metastore, and Snowflake. Delta tables read with Spark SQL have implicit support for the Hive metastore. We have built our own implementation of the Snowflake sync process to create external tables, internal tables, and materialized views on Snowflake.

Stats: 500M CDC logs/day, 600+ tables

Simplify Global DataOps and MLOps Using Okta’s FIG Automation Library

Think for a moment about an ML pipeline that you have created. Was it tedious to write? Did you have to familiarize yourself with technology outside your normal domain? Did you find many bugs? Did you give up with a "good enough" solution? Even simple ML pipelines are tedious. With complex ML pipelines, even teams that include data engineers and ML engineers still end up with delays and bugs. Okta's FIG (Feature Infrastructure Generator) simplifies this with a configuration language for data scientists that produces scalable and correct ML pipelines, even highly complex ones.

FIG is "just a library" in the sense that you can pip install it. Once installed, FIG will configure your AWS account, creating ETL jobs, workflows, and ML training and scoring jobs. Data scientists then use FIG's configuration language to specify features and model integrations. With a single function call, FIG will run an ML pipeline to generate feature data, train models, and create scoring data. Feature generation is performed in a scalable, efficient, and temporally correct manner. Model training artifacts and scoring are automatically labeled and traced. This greatly simplifies the ML prototyping experience. Once it is time to productionize a model, FIG is able to use the same configuration to coordinate with Okta's deployment infrastructure to configure production AWS accounts, register build and model artifacts, and set up monitoring.

This talk will show a demo of using FIG in the development of Okta's next-generation security infrastructure. The demo includes a walkthrough of the configuration language and how it is translated into AWS resources during a prototyping session. The demo will also briefly cover how FIG interacts with Okta's deployment system to make productionization seamless.
