talk-data.com

Topic

S3

Amazon S3

object_storage cloud_storage aws

104 tagged

Activity trend: peak of 11 per quarter, 2020-Q1 to 2026-Q1

Activities

104 activities · Newest first

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:

- Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.
- Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, SageMaker), integrating their outputs seamlessly into Airflow workflows.
- Real-World Impact: A case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states.

This talk not only provides a deep dive into the Political Tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
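The ingest, transform, and score stages described above form a dependency graph that an orchestrator walks in order. The following is only a minimal sketch of that ordering: the task names are hypothetical, and Python's standard-library `graphlib` stands in for Airflow, so none of this is Airflow's API or INTRVL's actual pipeline.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest_raw_to_lake": set(),
    "transform_serverless": {"ingest_raw_to_lake"},
    "score_with_ml_model": {"transform_serverless"},
    "publish_insights": {"score_with_ml_model"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def run(dag, handlers):
    """Call each task's handler in dependency order and collect the results."""
    results = {}
    for task in execution_order(dag):
        # Each handler receives the results of the tasks that ran before it.
        results[task] = handlers[task](results)
    return results
```

In a real deployment each handler would presumably trigger a serverless function or a warehouse job; here they are plain Python callables so the ordering logic can be read on its own.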

In response to the growing demand for integrating new data into our data platform, the Data Engineering Team at Okta has developed a solution utilizing Snowpark for Python to automate construction of data pipelines. Discover how Okta's Zero Touch Platform creates end-to-end pipelines that ingest events arriving on S3 and transform data in Snowflake using Streams and Tasks. The platform features integrated capabilities to detect schema changes in data streams, facilitating automatic evolution of Snowflake table schemas. Crafted with privacy in mind, it also allows for data classification through tags and systematic data masking using tag-based masking policies.
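One way to picture the schema-change detection mentioned above is to compare the fields of an incoming event with the columns a table already has, then emit `ALTER TABLE` statements for anything new. This is a rough sketch under stated assumptions, not Okta's implementation: the type mapping, function names, and table name are hypothetical, and the code only builds SQL strings rather than calling Snowpark.

```python
# Hypothetical mapping from Python value types to Snowflake column types.
TYPE_MAP = {bool: "BOOLEAN", int: "NUMBER", float: "FLOAT", str: "VARCHAR"}


def detect_new_columns(known_columns, event):
    """Return {column: sql_type} for event fields missing from the table."""
    known = {name.upper() for name in known_columns}
    new_columns = {}
    for field, value in event.items():
        if field.upper() not in known:
            # Fall back to VARIANT for nested or unrecognized values.
            new_columns[field.upper()] = TYPE_MAP.get(type(value), "VARIANT")
    return new_columns


def alter_statements(table, new_columns):
    """Build one ALTER TABLE ... ADD COLUMN statement per new column."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}"
        for name, sql_type in sorted(new_columns.items())
    ]
```

For example, an event carrying a new boolean field and a new nested object against a table with only `USER_ID` and `EMAIL` would yield two statements, one `BOOLEAN` column and one `VARIANT` column.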

Franchise IP and Data Governance at Krafton: Driving Cost Efficiency and Scalability

Join us as we explore how KRAFTON optimized data governance for PUBG IP, enhancing cost efficiency and scalability. KRAFTON operates a massive data ecosystem, processing tens of terabytes daily. As real-time analytics demands increased, traditional batch-based processing faced scalability challenges. To address this, we redesigned data pipelines and governance models, improving performance while reducing costs.

- Transitioned to real-time pipelines (batch to streaming)
- Optimized workload management (reducing all-purpose clusters, increasing Jobs usage)
- Cut costs by tens of thousands monthly (up to 75%)
- Enhanced data storage efficiency (lower S3 costs, Delta Tables)
- Improved pipeline stability (Medallion Architecture)

Gain insights into how KRAFTON scaled data operations, leveraging real-time analytics and cost optimization for high-traffic games. Learn more: https://www.databricks.com/customers/krafton

Sponsored by: RowZero | Spreadsheets in the modern data stack: security, governance, AI, and self-serve analytics

Despite the proliferation of cloud data warehousing, BI tools, and AI, spreadsheets are still the most ubiquitous data tool. Business teams in finance, operations, sales, and marketing often need to analyze data in the cloud data warehouse but don't know SQL and don't want to learn BI tools. AI tools offer a new paradigm but still haven't broadly replaced the spreadsheet. With new AI tools and legacy BI tools providing business teams access to data inside Databricks, security and governance are put at risk. In this session, Row Zero CEO, Breck Fresen, will share examples and strategies data teams are using to support secure spreadsheet analysis at Fortune 500 companies and the future of spreadsheets in the world of AI. Breck is a former Principal Engineer from AWS S3 and was part of the team that wrote the S3 file system. He is an expert in storage, data infrastructure, cloud computing, and spreadsheets.

Sovereign Data for AI with Python

The only certainty in life is that the pendulum will always swing. Recently, the pendulum has been swinging towards repatriation. However, the infrastructure needed to build and operate AI systems using Python in a sovereign (even air-gapped) environment has changed since the shift towards the cloud. This talk will introduce the infrastructure you need to build and deploy Python applications for AI, from data processing, to model training and LLM fine-tuning at scale, to inference at scale. We will focus on open-source infrastructure including:

- a Python library server (PyPI, Conda, etc.) and avoiding supply chain attacks
- a container registry that works at scale
- an S3 storage layer
- a database server with a vector index

Hands-on with Apache Iceberg

You've probably heard the name Apache Iceberg by now. If it wasn't when Databricks reportedly spent 2 billion USD buying Tabular, it might have been when AWS announced S3 Tables built on Iceberg. But do you know what Apache Iceberg actually is? Or how you could start using it today?

In this tutorial, we will walk through an end-to-end example of writing and reading Iceberg data, while taking a few pitstops to demonstrate Iceberg's selling points.

Summary

In this episode of the Data Engineering Podcast Mai-Lan Tomsen Bukovec, Vice President of Technology at AWS, talks about the evolution of Amazon S3 and its profound impact on data architecture. From her work on compute systems to leading the development and operations of S3, Mai-Lan shares insights on how S3 has become a foundational element in modern data systems, enabling scalable and cost-effective data lakes since its launch alongside Hadoop in 2006. She discusses the architectural patterns enabled by S3, the importance of metadata in data management, and how S3's evolution has been driven by customer needs, leading to innovations like strong consistency and S3 Tables.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like, so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
- Your host is Tobias Macey and today I'm interviewing Mai-Lan Tomsen Bukovec about the evolutions of S3 and how it has transformed data architecture

Interview

- Introduction
- How did you get involved in the area of data management?
- Most everyone listening knows what S3 is, but can you start by giving a quick summary of what roles it plays in the data ecosystem?
- What are the major generational epochs in S3, with a particular focus on analytical/ML data systems?
- The first major driver of analytical usage for S3 was the Hadoop ecosystem. What are the other elements of the data ecosystem that helped shape the product direction of S3?
- Data storage and retrieval have been core primitives in computing since its inception. What are the characteristics of S3 and all of its copycats that led to such a difference in architectural patterns vs. other shared data technologies? (e.g. NFS, Gluster, Ceph, Samba, etc.)
- How does the unified pool of storage that is exemplified by S3 help to blur the boundaries between application data, analytical data, and ML/AI data?
- What are some of the default patterns for storage and retrieval across those three buckets that can lead to anti-patterns which add friction when trying to unify those use cases?
- The age of AI is leading to a massive potential for unlocking unstructured data, for which S3 has been a massive dumping ground over the years. How is that changing the ways that your customers think about the value of the assets that they have been hoarding for so long?
- What new architectural patterns is that generating?
- What are the most interesting, innovative, or unexpected ways that you have seen S3 used for analytical/ML/Ai applications?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3?
- When is S3 the wrong choice?
- What do you have planned for the future of S3?

Contact Info

- LinkedIn

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

- AWS S3
- Kinesis
- Kafka
- SQS
- EMR
- Drupal
- Wordpress
- Netflix Blog on S3 as a Source of Truth
- Hadoop
- MapReduce
- Nasa JPL
- FINRA == Financial Industry Regulatory Authority
- S3 Object Versioning
- S3 Cross Region
- S3 Tables
- Iceberg
- Parquet
- AWS KMS
- Iceberg REST
- DuckDB
- NFS == Network File System
- Samba
- GlusterFS
- Ceph
- MinIO
- S3 Metadata
- Photoshop Generative Fill
- Adobe Firefly
- Turbotax AI Assistant
- AWS Access Analyzer
- Data Products
- S3 Access Point
- AWS Nova Models
- LexisNexis Protege
- S3 Intelligent Tiering
- S3 Principal Engineering Tenets

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In today's episode of the Data Engineering Central Podcast we talk about a few hot topics: AWS S3 Tables, Databricks raising money, whether Data Contracts are dead, and the Lake House storage format battle! It's a good one, buckle up!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

AWS re:Invent 2024 - [NEW LAUNCH] Amazon SageMaker Lakehouse: Accelerate analytics & AI (ANT354-NEW)

Data warehouses, data lakes, or both? Explore how Amazon SageMaker Lakehouse, a unified, open, and secure data lake house simplifies analytics and AI. This session unveils how SageMaker Lakehouse provides unified access to data across Amazon S3 data lakes, Amazon Redshift data warehouses, and third-party sources without altering your existing architecture. Learn how it breaks down data silos and opens your data estate with Apache Iceberg compatibility, offering flexibility to use preferred query engines and tools that accelerate your time to insights. Discover robust security features, including consistent fine-grained access controls, that help democratize data without compromises.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2024

PostgreSQL Skills Development on Cloud: A Practical Guide to Database Management with AWS and Azure

This book provides a comprehensive approach to managing PostgreSQL cluster databases on Amazon Web Services and Azure Web Services on the cloud, as well as in Docker and container environments on a Red Hat operating system. Furthermore, detailed references for managing PostgreSQL on both Windows and Mac are provided. This book condenses all the fundamental and essential concepts you need to manage a PostgreSQL cluster into a one-stop guide that is perfect for newcomers to Postgres database administration. Each chapter of the book provides historical context and documents version changes of the PostgreSQL cluster, elucidates practical "how-to" methods, and includes illustrations and key word definitions, practices for application, a summary of key learnings, and questions to reinforce understanding. The book also outlines a clear study objective with a weekly learning schedule and hundreds of practice exercises, along with questions and answers. With its comprehensive and practical approach, this book will help you gain the confidence to manage all aspects of a PostgreSQL cluster in critical production environments so you can better support your organization's database infrastructure on the cloud and in containers.

What You Will Learn

- Install and configure Postgres clusters on the cloud and in containers, monitor database logs, start and stop databases, troubleshoot, tune performance, backup and recover, and integrate with Amazon S3 and Azure Data Blob
- Manage Postgres databases on Amazon Web Services and Azure Web Services on the cloud, as well as in Docker and container environments on a Red Hat operating system
- Access sample references to scripting solutions and database management tools for working with Postgres, Redshift (based on Postgres 8.2), and Docker
- Create Amazon Machine Images (AMI) and Azure Images for managing a fleet of Postgres clusters on the cloud
- Reinforce knowledge with a weekly learning schedule and hundreds of practice exercises, along with questions and answers
- Progress from simple concepts, such as how to choose the correct instance type, to creating complex machine images
- Gain access to an Amazon AMI with a DBA admin tool, allowing you to learn Postgres, Redshift, and Docker in a cloud environment
- Refer to a comprehensive summary of documentation for Postgres, Amazon Web Services, Azure Web Services, and Red Hat Linux for managing all aspects of Postgres cluster management on the cloud

Who This Book Is For

Newcomers to PostgreSQL database administration and cross-platform support DBAs looking to master PostgreSQL on the cloud.

AWS re:Invent 2024 - Build Amazon Q apps to scale and drive community engagement (DEV201)

In this session, discover how AI services, including Amazon Q Business, can help to scale and improve community engagement, streamline events planning, and handle everyday tasks as an event organizer. Dive into technical insights with a demo as we share practical ideas and offer guidance on getting started. Learn how to use AI and services like Amazon S3 and AWS Lambda. Building and growing communities is crucial. Whether you’re organizing monthly meetups or full-scale events, learn how to use these tools to work smarter and more efficiently.

AWS re:Invent 2024 - Maximize efficiency and reduce costs with Amazon OpenSearch Service (ANT347)

Optimize your Amazon OpenSearch Service deployments for enhanced user experience and cost efficiency. This session covers key cost drivers, including operational management, licensing, networking, and data distribution. Learn to choose the best storage options, such as hot, UltraWarm, and cold storage and integration with Amazon S3. Discover how to optimize infrastructure with AWS Graviton processors or OpenSearch Service OR1 instances. Also, explore On-Demand pricing, Reserved Instance savings, and how Amazon OpenSearch Serverless with automatic scaling can ensure you pay only for the resources you use.

This talk discusses the recent improvements that the Amazon S3 team has been making in Iceberg FileIO and LocationProvider to improve the Iceberg user experience on S3. These include better retry and fault-tolerant execution (#10433 & #11052), a better hashing scheme to reduce throttling (#11112), and integration with the S3 Data Acceleration Toolkit and the AWS CRT client to improve read performance.
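The general idea behind a hashing scheme for object paths is that S3 applies request-rate limits per key prefix, so placing a hash-derived directory in each key spreads writes across many prefixes. The sketch below illustrates that idea only: the function name, the use of SHA-256, and the 20-bit, four-bits-per-level layout are assumptions for illustration and should not be read as the scheme implemented in the Iceberg changes cited above.

```python
import hashlib


def hashed_object_key(table_prefix, file_name, hash_bits=20):
    """Insert a deterministic binary hash directory between prefix and file name."""
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    # Keep the top `hash_bits` bits of the first 32 bits of the digest.
    value = int.from_bytes(digest[:4], "big") >> (32 - hash_bits)
    bits = format(value, f"0{hash_bits}b")
    # Split into short directory levels so keys fan out across many prefixes.
    levels = [bits[i:i + 4] for i in range(0, hash_bits, 4)]
    return f"{table_prefix}/{'/'.join(levels)}/{file_name}"
```

Because the hash depends only on the file name, the same file always maps to the same key, while files written together land under different prefixes instead of one hot prefix.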

Coalesce 2024: Building DEFCON 1 data pipelines (aka payments pipelines)

SpotOn works with FIS (formerly WorldPay) to handle payment processing, allowing for more detailed transaction management than other processors. Our data team took on the challenge of transitioning to FIS to gain better control over transaction details.

The legacy data pipelines we inherited were problematic and unreliable. They consisted of an SFTP file server, cron jobs, and Python/Shell scripts that moved data from SFTP to S3 and then processed it into Postgres. These systems were fragile, often breaking when new or different data arrived, requiring manual intervention and frequent restarts.

We recognized the need for a better solution. Our team decided to use Snowpipe and dbt to streamline our data processing. This approach allowed us to manage and parse complex data formats efficiently. We used dbt to create models that could handle the varied and detailed specifications provided by FIS, ensuring that as updates came in, they could be easily integrated.

With this new setup, we have significantly reduced the fragility of our pipelines. Using dbt Cloud, we've improved collaboration and error detection, ensuring data integrity and better insights into usage patterns. This new system supports not only payment processing but also other critical functions like customer loyalty and marketing, aggregating and cleaning data from various sources.

As we continue migrating from older systems like TSYS, we see the clear benefits of this modernization. Our experience with dbt has proven invaluable in supporting our business-critical data operations and ensuring smooth transitions and reliable data handling.

Speakers:

- Kevin Hu, CEO, Metaplane
- Daniel Corley, Senior Analytics Engineer, SpotOn

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

AWS re:Inforce 2024 - Building a secure end-to-end generative AI application in the cloud (NIS321)

The security and privacy of data during the training, fine-tuning, and inferencing phases of generative AI are paramount. This lightning talk introduces a reference architecture designed to use the security of AWS PrivateLink with generative AI applications. Explore the importance of protecting proprietary data in applications that leverage both AWS native LLMs and ISV-supplied external data stores. Learn about the secure movement and usage of data, particularly for RAG processes, across various data sources like Amazon S3, vector databases, and Snowflake. Learn how this reference architecture not only meets today’s security demands but also sets the stage for the future of secure generative AI development.

Learn more about AWS re:Inforce at https://go.aws/reinforce.

#reInforce2024 #CloudSecurity #AWS #AmazonWebServices #CloudComputing

AWS re:Inforce 2024 - Accelerating auditing and compliance for generative AI on AWS (GRC302)

Generative AI brings exciting new innovations, but it also presents challenges regarding responsible usage and compliance with governance requirements. This session guides you through the journey of a generative AI application and how AWS can help you ensure that your use of Amazon Bedrock and other related services, such as Amazon S3, AWS Lambda, and Amazon VPC, follows best practices for compliance and governance. Explore compliance services that AWS offers, like AWS Audit Manager and AWS CloudTrail, that can assist you in continuously auditing your generative AI infrastructure. Learn how these services automate audit evidence collection and provide audit-ready reports to meet your compliance and audit needs.

AWS re:Inforce 2024 - Harnessing conversational AI for streamlined security operations (COM222)

Tired of chasing security threats by looking in many different places? Imagine a chatbot that understands security findings, prioritizes risks, and suggests solutions all through natural language. This session unveils how to create a conversational AI to get faster answers about your security posture. Learn how to build this interactive ChatSecOps tool using Amazon Q, AWS Lambda, Amazon S3, and AWS Security Hub.
