AWS re:Invent 2025 - Sustainable computing for climate solutions (AIM417)

2025-12-06 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML AWS CloudWatch Cloud Computing

High Performance Computing (HPC) is crucial for solving complex environmental challenges such as air pollution and extreme weather forecasting, yet HPC workloads are traditionally computationally intensive. This session explores patterns and key AWS services, such as AWS Batch, ParallelCluster, EC2 Spot Instances, Graviton processors, FSx for Lustre, and CloudWatch, that help balance computational performance against environmental sustainability. Attendees will explore a real-world example of how researchers from the University of Oxford are using sustainable HPC solutions to improve community air quality through large-scale geospatial data processing and machine learning.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

Architecting for success with AWS EC2 Graviton Instances

2025-10-14 · The Performance Engine - Architecting for Real-Time Scale

talk

by Nati Cohen (AWS)

AWS

graviton

AWS customers launch tens of billions of EC2 instances yearly, choosing from an expanding selection of compute, storage, memory, and networking options. This session reveals how innovations like the Nitro system and Graviton CPUs offload tasks to hardware, delivering improved performance and security. You'll discover how these technologies turn what was once impossible into reality for your workloads.

AWS re:Invent 2024 - Customer Keynote Canva

2024-12-10 · AWS re:Invent 2024 Watch

video

Agile/Scrum AWS Cloud Computing

Canva, an online visual communication and collaboration platform, shares its engineering principles and lessons learned as it scaled from a startup to more than 220 million monthly users worldwide. Canva will share its architectural evolution, starting with a monolith on Amazon EC2 and progressively transitioning to a microservices architecture built on various AWS services, as well as its focus on allowing anyone to build functionality on top of the Canva platform.

Learn more about AWS events: https://go.aws/events

#AWSreInvent #AWSreInvent2024 #AWS

AWS re:Invent 2024 - Cost-effective data processing with Amazon EMR (ANT344)

2024-12-08 · AWS re:Invent 2024 Watch

video

by Aaron Feng (Roblox) , Matthew Liem (AWS)

Agile/Scrum AWS Amazon EMR Big Data Cloud Computing

Unlock the full potential of your big data environment during this in-depth session on Amazon EMR and cost optimization strategies, tailored for data engineers, data architects, and cloud architects. Gain a comprehensive understanding of various cost optimization strategies, including cluster rightsizing, using Amazon EC2 Spot Instances, and implementing managed scaling. Learn about the key differences between Amazon EMR deployment models and how to choose the best option that aligns with your organization’s specific requirements, constraints, and technical capabilities. Leave with actionable insights and practical strategies to enhance your big data workflows and achieve significant cost savings.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

#AWSreInvent #AWSreInvent2024

Platform for Genomic Processing Using Airflow and ECS

2023-07-01 · Airflow Summit 2023

session

by Zohar Donenhirsh , Alina Aven

Airflow AWS Data Engineering Data Science Docker ELK

High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker. Genomic algorithms are highly demanding of CPU, RAM, and storage. Our data science team requires a platform to facilitate the development and validation of proprietary algorithms. The data engineering team develops a research data platform that enables data scientists to publish Docker images to AWS ECR and run them using Airflow DAGs that provision AWS ECS compute capacity on EC2 and Fargate. We will describe a research platform that allows our data science team to check their algorithms on ~1000 cases in parallel, using the Airflow UI and dynamic DAG generation to utilize EC2 machines, auto-scaling groups, and ECS clusters across multiple AWS regions.

Building a Lakehouse on AWS for Less with AWS Graviton and Photon

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AWS Data Lakehouse Databricks

AWS Graviton processors are custom-designed by AWS to enable the best price performance for workloads in Amazon EC2. In this session we will review benchmarks that demonstrate how AWS Graviton-based instances run Databricks workloads at a lower price and with better performance than x86-based instances on AWS, and how the price-performance gains are even greater when combined with Photon, the new Databricks engine. Learn how you can optimize your Databricks workloads on AWS and save more.

Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Analytics AWS Amazon EMR CI/CD Data Management Databricks ETL/ELT Redshift Spark

Data is the key component of any analytics, AI, or ML platform. Organizations are unlikely to succeed without a platform that can source, transform, and quality-check data and present it in a reportable format that drives actionable insights.

This session focuses on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build the data storage (Redshift) at a level that AI/ML programs can easily consume. It does so by combining AWS services with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as its backend). The presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when coding in Spark.

We have been running three types of pipelines for 6+ years, with 400+ nightly batch jobs, for $1000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.

ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Analytics API AWS Amazon EMR Big Data CI/CD Databricks ETL/ELT Redshift Spark

High Performant File System Workloads for AI and HPC on AWS using IBM Spectrum Scale

2021-03-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Sanjay Sudam

AI/ML AWS Cloud Computing ELK IBM Linux data data-engineering

This IBM® Redpaper® publication is intended to facilitate the deployment and configuration of the IBM Spectrum® Scale based high-performance storage solutions for the scalable data and AI solutions on Amazon Web Services (AWS). Configuration, testing results, and tuning guidelines for running the IBM Spectrum Scale based high-performance storage solutions for the data and AI workloads on AWS are the focus areas of the paper. The LAB Validation was conducted with the Red Hat Linux nodes to IBM Spectrum Scale by using the various Amazon Elastic Compute Cloud (EC2) instances. Simultaneous workloads are simulated across multiple Amazon EC2 nodes running with Red Hat Linux to determine scalability against the IBM Spectrum Scale clustered file system. Solution architecture, configuration details, and performance tuning demonstrate how to maximize data and AI application performance with IBM Spectrum Scale on AWS.

Airflow: A beast character in the gaming world

2020-07-01 · Airflow Summit 2020

session

by Naresh Yegireddi (PlayStation) , Patricio Garza (PlayStation)

Airflow Analytics AWS Big Data Data Analytics Docker ETL/ELT Python Spark

A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With more than 100 million monthly active users, more than 100 million PS4 console sales, and thousands of game development partners across the globe, a big data problem is inevitable. This presentation covers how we scaled Airflow horizontally, which has helped us build a stable, scalable, and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. To process large volumes of data and meet the organization's growing data analytics and usage demands, the data team at PlayStation took the initiative to build an open source big data processing infrastructure with Apache Spark in Python as the core ETL engine. Apache Airflow is the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance, with 1 scheduler and 1 worker supporting a parallelism of 16, and eventually scaled it to a bigger scheduler with 4 workers supporting a parallelism of 96, a DAG concurrency of 96, and a worker task concurrency of 24. We containerized all the services on AWS ECS, which gave us the ability to scale Airflow horizontally.

Expert Apache Cassandra Administration

2017-12-09 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Sam R. Alapati

Big Data Cassandra Data Modelling Docker ELK Spark data data-engineering nosql-databases

Follow this handbook to build, configure, tune, and secure Apache Cassandra databases. Start with the installation of Cassandra and move on to the creation of a single instance, and then a cluster of Cassandra databases. Cassandra is increasingly a key player in many big data environments, and this book shows you how to use Cassandra with Apache Spark, a popular big data processing framework. Also covered are day-to-day topics of importance such as the backup and recovery of Cassandra databases, using the right compression and compaction strategies, and loading and unloading data. Expert Apache Cassandra Administration provides numerous step-by-step examples starting with the basics of a Cassandra database, and going all the way through backup and recovery, performance optimization, and monitoring and securing the data. The book serves as an authoritative and comprehensive guide to the building and management of simple to complex Cassandra databases.

The book:
Takes you through building a Cassandra database from installation of the software and creation of a single database, through to complex clusters and data centers
Provides numerous examples of actual commands in a real-life Cassandra environment that show how to confidently configure, manage, troubleshoot, and tune Cassandra databases
Shows how to use the Cassandra configuration properties to build a highly stable, available, and secure Cassandra database that always operates at peak efficiency

What You'll Learn
Install the Cassandra software and create your first database
Understand the Cassandra data model, and the internal architecture of a Cassandra database
Create your own Cassandra cluster, step-by-step
Run a Cassandra cluster on Docker
Work with Apache Spark by connecting to a Cassandra database
Deploy Cassandra clusters in your data center, or on Amazon EC2 instances
Back up and restore mission-critical Cassandra databases
Monitor, troubleshoot, and tune production Cassandra databases, and cut your spending on resources such as memory, servers, and storage

Who This Book Is For
Database administrators, developers, and architects who are looking for an authoritative and comprehensive single volume for all their Cassandra administration needs. Also for administrators who are tasked with setting up and maintaining highly reliable and high-performing Cassandra databases. An excellent choice for big data administrators, database administrators, architects, and developers who use Cassandra as their key data store, to support high volume online transactions, or as a decentralized, elastic data store.

The Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB, Second Edition

2013-11-06 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by David Hows , Eelco Plugge , Peter Membrey , Tim Hawkins

Azure Big Data Cloud Computing Data Management Linux MongoDB NoSQL Python SQL data data-engineering nosql-databases

The Definitive Guide to MongoDB, Second Edition, is updated for the latest version and includes all of the latest MongoDB features, including the aggregation framework introduced in version 2.2 and hashed indexes in version 2.4. MongoDB is the most popular of the "Big Data" NoSQL database technologies, and it's still growing. David Hows from 10gen, along with experienced MongoDB authors Peter Membrey and Eelco Plugge, provide their expertise and experience in teaching you everything you need to know to become a MongoDB pro. The Definitive Guide to MongoDB, Second Edition, starts with the basics, including how to install on Windows, Linux, and OS X, and how MongoDB handles your data. Then you'll learn how to develop with MongoDB with both PHP and Python, including an example application using a PHP driver to create a blog application. Finally, you'll dig into more advanced but extremely important MongoDB features, including optimization, replication, and sharding -- load-balancing that makes MongoDB ideal for dealing with Big Data. If you're dealing with data, MongoDB should be on your must-learn list. The Definitive Guide to MongoDB, Second Edition, is just the book you need.

What you'll learn
Set up MongoDB on all major server platforms, including Windows, Linux, OS X, and cloud platforms like Rackspace, Azure, and Amazon EC2
Work with GridFS and the new aggregation framework
Work with your data using non-SQL commands
Write applications using either PHP or Python
Optimize MongoDB
Master MongoDB administration, including replication, replication tagging, and tag-aware sharding

Who this book is for
Database admins and developers who need to get up to speed on MongoDB and its Big Data, NoSQL approach to dealing with data management.

Seven Databases in Seven Weeks

2012-05-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Eric Redmond , Jim R. Wilson

Big Data Cloud Computing Data Management DynamoDB ELK Apache HBase Java Linux MongoDB Neo4j NoSQL RDBMS +5 more

Data is getting bigger and more complex by the day, and so are the choices in handling that data. As a modern application developer you need to understand the emerging field of data management, both RDBMS and NoSQL. Seven Databases in Seven Weeks takes you on a tour of some of the hottest open source databases today: Redis, Neo4J, CouchDB, MongoDB, HBase, Riak, and Postgres. In the tradition of Bruce A. Tate's Seven Languages in Seven Weeks, this book goes beyond your basic tutorial to explore the essential concepts at the core of each technology. With each database, you'll tackle a real-world data problem that highlights the concepts and features that make it shine. You'll explore the five data models employed by these databases (relational, key/value, columnar, document, and graph) and which kinds of problems are best suited to each. You'll learn how MongoDB and CouchDB are strikingly different, and discover the Dynamo heritage at the heart of Riak. Make your applications faster with Redis and more connected with Neo4J. Use MapReduce to solve Big Data problems. Build clusters of servers using scalable services like Amazon's Elastic Compute Cloud (EC2). Discover the CAP theorem and its implications for your distributed data. Understand the tradeoffs between consistency and availability, and when you can use them to your advantage. Use multiple databases in concert to create a platform that's more than the sum of its parts, or find one that meets all your needs at once. Seven Databases in Seven Weeks will take you on a deep dive into each of the databases, their strengths and weaknesses, and how to choose the ones that fit your needs.

What You Need: To get the most out of this book you'll have to follow along, and that means you'll need a *nix shell (Mac OS X or Linux preferred; Windows users will need Cygwin), Java 6 (or greater), and Ruby 1.8.7 (or greater). Each chapter will list the downloads required for that database.

talk-data.com

Amazon EC2
