talk-data.com

Topic: Amazon EMR

Tags: big_data, hadoop, aws

20 tagged activities

Activity Trend: peak of 6 activities per quarter (2020-Q1 to 2026-Q1)

Activities

20 activities · Newest first

AWS re:Invent 2025 - Data Processing architectures for building AI solutions (ANT328)

Prepare to revolutionize your data infrastructure for the AI era with Amazon EMR, AWS Glue, and Amazon Athena. This session will guide you through leveraging these powerful AWS services to construct robust, scalable data architectures that empower AI solutions at scale. Gain insights into effective architectural strategies for data processing to build AI applications, optimizing for cost-efficiency and security. Explore architectural frameworks that underpin successful AI-driven data initiatives, and learn from field lessons on how to navigate modernization projects. Whether you’re starting your modernization journey or refining current setups, this session offers practical strategies to fast-track your organization towards achieving excellence in AI-powered data management.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Agentic data engineering with AWS Analytics MCP Servers (ANT335)

In this session, we will introduce AWS Analytics Model Context Protocol (MCP) Servers, including the Data Processing MCP Server and Amazon Redshift MCP Server, which enable agentic workflows across AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift. You will learn how these open-source tools simplify complex analytics operations through natural language interactions with AI agents. We'll cover MCP server implementation strategies, real-world use cases, architectural patterns for deployment, and production best practices for building intelligent data engineering workflows that understand and orchestrate your analytics environment.
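The agentic workflows described here ride on the Model Context Protocol, which is JSON-RPC 2.0 under the hood. A minimal sketch of the request an agent-side client might send to invoke a server tool; the tool name `run_athena_query` and its arguments are hypothetical illustrations, not the actual tool names exposed by the AWS MCP servers:

```python
import json

def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request as used by the Model
    Context Protocol. The server executes the named tool and returns the
    result in a matching JSON-RPC response."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload)

# Hypothetical example: an agent asking an analytics MCP server to run a query.
request = build_tool_call(
    "run_athena_query",
    {"query": "SELECT COUNT(*) FROM sales", "workgroup": "primary"},
)
print(request)
```

In practice an MCP client library handles this framing for you; the sketch only shows the message shape that lets an AI agent drive analytics services through natural-language-selected tool calls.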

AWS re:Invent 2025 - iTTi's Cross-Company Data Mesh Blueprint with Amazon SageMaker (ANT342)

This session shares the journey of implementing a hybrid Data Mesh architecture within a multi-company holding, balancing centralization and decentralization needs. We will cover how our hybrid approach leverages Amazon EMR on EKS for data lake ingestion and Amazon SageMaker to enable self-service data discovery, data product subscription, and consumption, allowing companies within the group to autonomously explore, access, and utilize data products while maintaining centralized governance. Through ITTI - Grupo Vázquez's real-world experience, attendees will learn how this hybrid data mesh architecture successfully addresses diverse data domains, varying governance requirements, and rapid value delivery needs.

AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of Iceberg Rest Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.
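The Iceberg REST Catalog integration mentioned above comes down to a handful of Spark configuration keys. A minimal sketch using the Iceberg Spark runtime's `spark.sql.catalog.*` convention; the catalog name, endpoint, and warehouse path are placeholders:

```python
def iceberg_rest_spark_conf(catalog: str, uri: str, warehouse: str) -> dict:
    """Spark conf entries that register an Apache Iceberg REST catalog
    under the given catalog name."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }

conf = iceberg_rest_spark_conf(
    "lakehouse",                       # placeholder catalog name
    "https://example.com/iceberg/v1",  # placeholder REST endpoint
    "s3://my-bucket/warehouse",        # placeholder warehouse location
)
# Each entry would be applied with SparkSession.builder.config(k, v)
# before querying, e.g. spark.sql("SELECT * FROM lakehouse.db.tbl").
```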

AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

Explore the latest advancements in AWS Analytics designed to transform your data processing landscape. This session unveils powerful new capabilities across key services, including Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for optimized querying, and Amazon Managed Workflows for Apache Airflow (MWAA) for workflow orchestration. Discover how these innovations can supercharge performance, optimize costs, and streamline your data ecosystem. Whether you're looking to enhance scalability, improve data integration, accelerate queries, or refine workflow management, join us to gain actionable insights that will position your organization at the forefront of data processing innovation.

AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

Apache Spark on AWS Glue, Amazon EMR, and Amazon SageMaker enhances the optimization of large-scale data processing workloads. These include faster read and write throughput, accelerated processing of common file formats, and expanded Amazon S3 support through the S3A protocol for greater flexibility in write operations. In this session, we'll explore recent enhancements in Spark for distributed computation and in-memory storage to enable efficient data aggregation and job optimization. We'll also demonstrate how these innovations, combined with Spark's native capabilities, strengthen governance and encryption to help you optimize performance while maintaining control and compliance. Join us to learn how to build unified, secure, and high-performance ETL pipelines on AWS using Spark.
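The S3A write-path improvements called out above are configured through Hadoop properties on the Spark session. An illustrative starting point; the exact keys and values should be verified against the S3A documentation for your EMR or Glue runtime:

```python
def s3a_write_conf() -> dict:
    """Illustrative Hadoop S3A settings for Spark jobs writing to Amazon
    S3. Values are starting points only, not tuned recommendations."""
    return {
        # Use an S3A committer instead of rename-based commits, which are
        # slow and non-atomic on object stores.
        "spark.hadoop.fs.s3a.committer.name": "magic",
        # Raise the HTTP connection pool size for highly parallel reads.
        "spark.hadoop.fs.s3a.connection.maximum": "200",
    }

conf = s3a_write_conf()
# Applied via SparkSession.builder.config(k, v), these let a job read and
# write s3a:// paths, e.g. df.write.parquet("s3a://my-bucket/out/").
```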

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

Direct integration between Redox and Databricks can streamline your interoperability workflows, from responding to preauthorization requests in record time to letting attending physicians know, in near real time from ADT feeds, about a change in a patient's risk of sepsis and readmission. Data engineers will learn how to create fully streaming ETL pipelines for ingesting, parsing, and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once the data is available in the Lakehouse, AI/BI Dashboards and agentic frameworks help write FHIR messages back to Redox for direct push down to EMR systems. Parsing FHIR bundle resources has never been easier with SQL, combining the new VARIANT data type in Delta with streaming table creation against Serverless DBSQL Warehouses. We'll also use the Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox-integrated EMRs, and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.
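Parsing FHIR bundles with SQL and the VARIANT type, as described above, might look like the following sketch. The table and column names (`bronze.redox_fhir_bundles`, `raw_bundle`) are placeholders, and the JSON paths are illustrative rather than taken from a specific Redox payload:

```python
def fhir_patient_sql(table: str) -> str:
    """Build an illustrative Databricks SQL statement that parses a raw
    FHIR bundle (stored as a JSON string) into VARIANT and extracts
    fields from the first bundle entry."""
    return f"""
        SELECT
          variant_get(v, '$.entry[0].resource.resourceType', 'string') AS resource_type,
          variant_get(v, '$.entry[0].resource.id', 'string')           AS resource_id
        FROM (
          SELECT parse_json(raw_bundle) AS v FROM {table}
        )
    """

sql = fhir_patient_sql("bronze.redox_fhir_bundles")
# In a Databricks notebook this string would be run with spark.sql(sql).
```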

Iceberg Table Format Adoption and Unified Metadata Catalog Implementation in Lakehouse Platform

The DoorDash Data organization is actively adopting the lakehouse paradigm. This presentation describes a methodology for migrating classic data warehouse and data lake platforms to a unified lakehouse solution. The objectives of this effort include:

• Eliminating excessive data movement
• Seamlessly integrating and consolidating the query engine layers, including Snowflake, Databricks, EMR, and Trino
• Optimizing query performance
• Abstracting away the complexity of underlying storage layers and table formats
• Making a strategic, justified decision on the unified metadata catalog used across various compute platforms

AWS re:Invent 2024 - Customer Keynote Autodesk

Design software pioneer Autodesk is transforming computer-aided design (CAD) by harnessing generative AI and Amazon Web Services (AWS). The company is developing advanced AI foundation models, like "Project Bernini," which can generate precise 2D and 3D geometric designs based on physical principles.

By utilizing AWS technologies such as Amazon DynamoDB, Elastic MapReduce (EMR), Amazon SageMaker, and Elastic Fabric Adapter, Autodesk has significantly enhanced its AI development process. These innovations have halved foundation model development time and increased AI productivity by 30%.

AWS AI and Data Conference Ireland 2024 | AWS Events

The AWS AI and Data Conference 2024 delivered practical insights on Generative AI, Machine Learning, and Data Analytics. Attendees learned how organizations are using these technologies to scale operations and meet customer needs. AWS experts and customers shared real-world applications across industries. The event covered the latest trends and best practices, including hands-on experience with AWS tools like Amazon Bedrock for AI development. Keynote speakers included Eddie Wilson (CEO, Ryanair), Martin Holste (CTO for Cloud and AI, Trellix), Rick Sears (GM, Amazon Athena, EMR, and Lake Formation, AWS), and Barry Morris (GM, Purpose Built Databases, AWS). Whether new to AI or seasoned professionals, participants gained actionable knowledge to drive innovation in their organizations.

Sign up now for the AWS AI and Data Conference 2025 and stay at the forefront of AI and data innovation: https://go.aws/4gNtNa6

AWS re:Invent 2024 - Cost-effective data processing with Amazon EMR (ANT344)

Unlock the full potential of your big data environment during this in-depth session on Amazon EMR and cost optimization strategies, tailored for data engineers, data architects, and cloud architects. Gain a comprehensive understanding of various cost optimization strategies, including cluster rightsizing, using Amazon EC2 Spot Instances, and implementing managed scaling. Learn about the key differences between Amazon EMR deployment models and how to choose the best option that aligns with your organization’s specific requirements, constraints, and technical capabilities. Leave with actionable insights and practical strategies to enhance your big data workflows and achieve significant cost savings.
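The cost levers named above, Spot Instances and managed scaling, can be combined in a single cluster request. An illustrative sketch of the parameters for boto3's `run_job_flow`; the instance types, counts, release label, and IAM role names are placeholders to adapt to your account and workload:

```python
def emr_cost_optimized_cluster(name: str, log_uri: str) -> dict:
    """Build an illustrative request body for boto3's emr.run_job_flow
    that runs core nodes on Spot and caps cost with managed scaling."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-7.0.0",  # placeholder release label
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                # Core nodes on Spot for the bulk of the cost savings.
                {"Name": "Core", "InstanceRole": "CORE",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient cluster: terminate when the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Managed scaling grows and shrinks the cluster within bounds.
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,
                "MaximumCapacityUnits": 10,
            }
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # placeholder roles
        "ServiceRole": "EMR_DefaultRole",
    }

params = emr_cost_optimized_cluster("nightly-etl", "s3://my-bucket/emr-logs/")
# boto3.client("emr").run_job_flow(**params) would launch the cluster.
```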

AWS re:Invent 2024 - Innovations in AWS analytics: Data processing (ANT346)

Join this session for an in-depth look at new capabilities to optimize data processing with AWS analytics services. Learn more about using Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for powerful querying, and Amazon MWAA for supporting complex workflows. Whether you’re looking to improve performance, reduce costs, or streamline your data pipeline, this session provides valuable insights into the latest functionality and tools needed to enhance your data processing capabilities.

Learn how Trellix scales security operations with Amazon Bedrock, AWS EMR | AWS Events

Dive into this AWS session with Rick Sears, General Manager of Amazon Athena, EMR, and Lake Formation at AWS, and Martin Holste, CTO Cloud and AI at Trellix. Explore how Trellix uses Amazon Bedrock and AWS EMR to revolutionize security operations. Learn how generative AI and comprehensive data strategies enhance threat detection and automate security processes, driving a new era of efficiency and protection. Discover practical AI applications and real-world examples, and get ready to accelerate your AI journey with AWS.

Speakers: Martin Holste, CTO Cloud and AI, Trellix Rick Sears, General Manager of Amazon Athena, EMR, and Lake Formation, Amazon Web Services

Building an end to end data strategy for analytics and generative AI | AWS Events

In this session, Rick Sears, General Manager of Amazon Athena, EMR, and Lake Formation at AWS, explores how generative AI is revolutionizing businesses and the critical role data plays in this transformation. He discusses the evolution of AI models and the importance of a comprehensive data management strategy encompassing availability, quality, and protection of data.

Mark Greville, Vice President of Architecture at Workhuman, shares insights from Workhuman's journey in building a robust cloud-based data strategy, emphasizing the significance of storytelling, demonstrating value, and gaining executive support.

Kamal Sampathkumar, Senior Manager of Data Architecture at Workhuman, delves into the technical aspects, detailing the architecture of Workhuman's data platform and showcasing solutions like Data API and self-service reporting that deliver substantial value to customers.

Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift

Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud storages. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud.

There are a lot of use cases for Delta tables on AWS. AWS has invested heavily in this technology, and Delta Lake is now available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from sources such as on-premises databases, Amazon RDS, DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without coding expertise.

This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and how to query them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.
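To make the Glue-plus-Delta combination concrete: enabling Delta Lake in a Glue Spark job is mostly a matter of job arguments. A sketch, assuming the `--datalake-formats` job parameter and the usual Delta Spark extension and catalog classes; chaining a second `--conf` inside the value is a commonly used Glue convention:

```python
def glue_delta_job_args() -> dict:
    """Build illustrative AWS Glue job arguments that enable Delta Lake.
    --datalake-formats tells Glue to load the Delta libraries; the --conf
    string registers Delta's SQL extension and catalog with Spark."""
    return {
        "--datalake-formats": "delta",
        "--conf": (
            "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension "
            "--conf spark.sql.catalog.spark_catalog="
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"
        ),
    }

args = glue_delta_job_args()
# These would be passed as DefaultArguments when creating the Glue job;
# the job script can then read/write "delta" format paths on S3.
```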

Talk by: Noritaka Sekiyama and Akira Ajisaka

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Deep Dive Into Grammarly's Data Platform

Grammarly helps 30 million people and 50,000 teams to communicate more effectively. Using the Databricks Lakehouse Platform, we can rapidly ingest, transform, aggregate, and query complex data sets from an ecosystem of sources, all governed by Unity Catalog. This session will overview Grammarly’s data platform and the decisions that shaped the implementation. We will dive deep into some architectural challenges the Grammarly Data Platform team overcame as we developed a self-service framework for incremental event processing.

Our investment in the lakehouse and Unity Catalog has dramatically improved the speed of our data value chain: making 5 billion events (ingested, aggregated, de-identified, and governed) available to stakeholders (data scientists, business analysts, sales, marketing) and downstream services (feature store, reporting/dashboards, customer support, operations) within 15. As a result, we have improved our query cost performance (110% faster at 10% of the cost) compared to our legacy system on AWS EMR.

I will share architecture diagrams, their implications at scale, code samples, and problems solved and to be solved in a technology-focused discussion about Grammarly’s iterative lakehouse data platform.
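As an illustration of the incremental event processing pattern, one common approach on Databricks is Auto Loader, which tracks already-ingested files so each run processes only new events. This sketch is not Grammarly's actual framework, and the paths are placeholders:

```python
def autoloader_options(source_format: str, schema_location: str) -> dict:
    """Build illustrative Databricks Auto Loader (cloudFiles) options for
    incremental event ingestion with schema tracking."""
    return {
        "cloudFiles.format": source_format,
        # Where Auto Loader persists the inferred/evolving schema.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
    }

opts = autoloader_options("json", "s3://my-bucket/_schemas/events")
# In a Databricks notebook this would feed a streaming read, e.g.:
# spark.readStream.format("cloudFiles").options(**opts) \
#      .load("s3://my-bucket/events/")
```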

Talk by: Faraz Yasrobi and Christopher Locklin

Lessons Learned from Deidentifying 700 Million Patient Notes

Providence embarked on an ambitious journey to de-identify all our clinical electronic medical record (EMR) data to support medical research and the development of novel treatments. This talk shares how this was done for patient notes and how you can achieve the same.

First, we built a deidentification pipeline using pre-trained deep learning models, fine-tuned to our own data. We then developed an innovative methodology to evaluate reidentification risk, as American healthcare laws (HIPAA) require that de-identified data have a "very low" risk of reidentification but do not specify a standard. Our next challenge was to annotate a dataset large enough to produce meaningful statistics and improve the fine-tuning of our model. Finally, through experimentation and iteration, we achieved a level of performance that would safeguard patient privacy while minimizing information loss. Our technology partner provided the computing power to efficiently process hundreds of millions of records of historical data and incremental daily loads.

Through this endeavor, we have learned many lessons that we will share:

• Evaluating risk of reidentification to meet HIPAA requirements
• Annotating samples of data to create labeled datasets
• Performing experiments and evaluating performance
• Fine-tuning pre-trained models with your own data
• Augmenting models with rules and other tricks
• Optimizing clusters to process very large volumes of text data

We will also present speed and throughput metrics from running our pipeline, which you can use to benchmark similar projects.
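The "augmenting models with rules" lesson above can be as simple as a deterministic regex pass layered on top of a model's predictions, to catch identifiers with rigid formats. A toy sketch; these three patterns are illustrative only and are nowhere near a complete or compliant PHI rule set:

```python
import re

# Illustrative rule-based redaction pass of the kind used to augment a
# model-based de-identification pipeline. Each rule is a (pattern, tag)
# pair; matches are replaced with the tag.
RULES = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bMRN[:\s]*\d+\b"), "[MRN]"),
]

def redact(note: str) -> str:
    """Apply each regex rule in turn, replacing matches with its tag."""
    for pattern, tag in RULES:
        note = pattern.sub(tag, note)
    return note

print(redact("Seen 01/02/2023, MRN: 12345, call 555-123-4567."))
# → Seen [DATE], [MRN], call [PHONE].
```

In a real pipeline the rule output would be merged with the model's entity predictions rather than applied alone.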

Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

Data is the key component of any analytics, AI, or ML platform. Organizations may not be successful without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build data storage (Redshift) that is easily consumed by AI/ML programs, using AWS services in combination with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines.

We have been running three types of pipelines for 6+ years, with 400+ nightly batch jobs for about $1,000/month: (1) Spark on EC2, (2) the UI-based ETL tool with a Spark backend (on the same EC2 instances), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.
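For the Spark-on-EMR pipeline variant above, nightly jobs are typically submitted as EMR steps. An illustrative sketch of a step definition for boto3's `add_job_flow_steps`, with spark-submit wrapped in command-runner.jar; the script path, arguments, and job flow ID are placeholders:

```python
def spark_submit_step(name: str, script_s3_path: str, args: list) -> dict:
    """Build an illustrative EMR step definition that runs spark-submit
    via command-runner.jar, for use with emr.add_job_flow_steps."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path, *args],
        },
    }

step = spark_submit_step(
    "nightly-load",
    "s3://my-bucket/jobs/etl.py",  # placeholder script location
    ["--env", "prod"],
)
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXX", Steps=[step])
# would queue the job on a running cluster (JobFlowId is a placeholder).
```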

Comprehensive Patient Data Self-Serve Environment and Executive Dashboards Leveraging Databricks

In this talk, we will outline our data pipelines and demo dashboards developed on top of the resulting Elasticsearch index. This tool enables queries for terms or phrases in the raw documents to be executed together with any associated EMR patient data filters within 1-2 seconds for a data set containing millions of records/documents. Finally, the dashboards are simple to use and enable Real World Evidence data stakeholders to gain real-time statistical insight into the comprehensive patient information available.
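A query against such an index, combining free-text phrase search with structured EMR filters, might be shaped like this sketch. The field names (`note_text`, `diagnosis_code`) are placeholders for whatever the index mapping actually defines:

```python
def note_search_query(phrase: str, filters: dict) -> dict:
    """Build an illustrative Elasticsearch bool query: a phrase match on
    raw note text plus exact-match term filters on structured fields.
    Filters run in the (non-scoring, cacheable) filter context."""
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"note_text": phrase}}],
                "filter": [{"term": {field: value}}
                           for field, value in filters.items()],
            }
        }
    }

query = note_search_query("shortness of breath", {"diagnosis_code": "J44.9"})
# An Elasticsearch client would send this as the body of a search request.
```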

ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs
