talk-data.com

Topic

Big Data

data_processing analytics large_datasets

300 tagged

Activity Trend: peak of 28 per quarter, 2020-Q1 to 2026-Q1

Activities

300 activities · Newest first

AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

Explore the latest advancements in AWS Analytics designed to transform your data processing landscape. This session unveils powerful new capabilities across key services, including Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for optimized querying, and Amazon Managed Workflows for Apache Airflow (MWAA) for workflow orchestration. Discover how these innovations can supercharge performance, optimize costs, and streamline your data ecosystem. Whether you're looking to enhance scalability, improve data integration, accelerate queries, or refine workflow management, join us to gain actionable insights that will position your organization at the forefront of data processing innovation.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

A Journey Through a Geospatial Data Pipeline: From Raw Coordinates to Actionable Insights

Every dataset has a story — and when it comes to geospatial data, it’s a story deeply rooted in space and scale. But working with geospatial information is often a hidden challenge: massive file sizes, strange formats, projections, and pipelines that don't scale easily.

In this talk, we'll follow the life of a real-world geospatial dataset, from its raw collection in the field to its transformation into meaningful insights. Along the way, we’ll uncover the key steps of building a robust, scalable open-source geospatial pipeline.

Drawing on years of experience at Camptocamp, we’ll explore:

  • How raw spatial data is ingested and cleaned
  • How vector and raster data are efficiently stored and indexed (PostGIS, Cloud Optimized GeoTIFFs, Zarr)
  • How modern tools like Dask, GeoServer, and STAC (SpatioTemporal Asset Catalogs) help process and serve geospatial data
  • How to design pipelines that handle both "small data" (local shapefiles) and "big data" (terabytes of satellite imagery)
  • Common pitfalls and how to avoid them when moving from prototypes to production

This journey will show how the open-source ecosystem has matured to make geospatial big data accessible — and how spatial thinking can enrich almost any data project, whether you are building dashboards, doing analytics, or setting the stage for machine learning later on.
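Projections are one of the hurdles the abstract names. As a minimal sketch of what a reprojection step does, the function below converts WGS84 longitude/latitude to Web Mercator (EPSG:3857) using the standard spherical formula. The function name is hypothetical and not taken from the talk; a production pipeline would more likely rely on a dedicated library such as pyproj.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by EPSG:3857


def lonlat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Project WGS84 lon/lat (degrees) to Web Mercator x/y (metres)."""
    # Web Mercator is undefined at the poles; tiles usually stop near 85.06 degrees.
    if not -85.06 <= lat_deg <= 85.06:
        raise ValueError("latitude outside the usable Web Mercator range")
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y
```

The same raw coordinates can yield very different numbers depending on the target projection, which is why pipelines typically record the coordinate reference system alongside the data.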

CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

Built on top of Software Heritage - the largest public archive of source code - the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we will present the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.

Daft and Unity Catalog: A Multimodal/AI-Native Lakehouse

Modern data organizations have moved beyond big data analytics to also incorporate advanced AI/ML data workloads. These workflows often involve multimodal datasets containing documents, images, long-form text, embeddings, URLs and more. Unity Catalog is an ideal solution for organizing and governing this data at scale. When paired with the Daft open source data engine, you can build a truly multimodal, AI-ready data lakehouse. In this session, we’ll explore how Daft integrates with Unity Catalog’s core features (such as volumes and functions) to enable efficient, AI-driven data lakehouses. You will learn how to ingest and process multimodal data (images, text and videos), run AI/ML transformations and feature extractions at scale, and maintain full control and visibility over your data with Unity Catalog’s fine-grained governance.

Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning

We discuss two real-world use cases in big data engineering, focusing on constructing stable pipelines and managing storage at a petabyte scale. The first use case highlights the implementation of Delta Lake to optimize data pipelines, resulting in an 80% reduction in query time and a 70% reduction in storage space. The second use case demonstrates the effectiveness of the Workflows ‘ForEach’ operator in executing compute-intensive pipelines across multiple clusters, significantly reducing processing time from months to days. This approach involves a reusable design pattern that isolates notebooks into units of work, enabling data scientists to independently test and develop.
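The "units of work" pattern described above can be illustrated locally. The sketch below is only an analogy built on Python's standard thread pool, not the Databricks Workflows API; `unit_of_work` and `for_each` are hypothetical names standing in for a parameterised notebook and the fan-out step.

```python
from concurrent.futures import ThreadPoolExecutor


def unit_of_work(month: str) -> dict:
    """Hypothetical stand-in for one isolated notebook run with its own parameter."""
    # A real unit would read that month's transactions and write derived features.
    return {"month": month, "status": "ok"}


def for_each(inputs: list[str], max_workers: int = 4) -> list[dict]:
    """Run the same unit of work once per input, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(unit_of_work, inputs))


results = for_each(["2024-01", "2024-02", "2024-03"])
```

Because each unit depends only on its own parameter, it can be tested on its own by calling `unit_of_work("2024-01")` directly, which is the property that lets data scientists develop independently.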

How FedEx Achieved Self-Serve Analytics and Data Democratization on Databricks

FedEx, a global leader in transportation and logistics, faced a common challenge in the era of big data: how to democratize data and foster data-driven decision-making when thousands of data practitioners want to build models, get real-time insights, explore enterprise data, and build enterprise-grade solutions to run the business. This breakout session will highlight how FedEx overcame challenges in data governance and security using Unity Catalog, ensuring that sensitive information remains protected while still allowing appropriate access across the organization. We'll share their approach to building intuitive self-service interfaces, including the use of natural language processing to enable non-technical users to query data effortlessly. The tangible outcomes of this initiative are numerous, but chiefly: increased data literacy across the company, faster time-to-insight for business decisions, and significant cost savings through improved operational efficiency.

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multimodal Lance format. Lance offers zero-copy schema evolution and robust support for large records (e.g., images, tensors and embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics at the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.

Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data

Testing Spark jobs in local environments is often difficult due to the lack of suitable datasets, especially under tight timelines. This creates challenges when jobs work in development clusters but fail in production, or when they run locally but encounter issues in staging clusters due to inadequate documentation or checks. In this session, we’ll discuss how these challenges can be overcome by leveraging Generative AI to create custom synthetic datasets for local testing. By incorporating variations and sampling, a testing framework can be introduced to solve some of these challenges, allowing for the generation of realistic data to aid in performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance and recognize scalability issues while keeping costs low. This methodology fosters better deployment practices and enhances the reliability of Spark jobs across environments.
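To make the idea of synthetic test data with deliberate variations concrete, here is a minimal sketch. It swaps the generative-AI step described in the session for a plain seeded random generator, and the schema (`order_id`, `customer_id`, `amount`) is a hypothetical example rather than anything from the talk.

```python
import random


def synthetic_orders(n: int, null_rate: float = 0.05, seed: int = 0) -> list[dict]:
    """Generate n reproducible fake order rows with skewed keys and some nulls."""
    rng = random.Random(seed)  # fixed seed keeps local test runs repeatable
    rows = []
    for order_id in range(n):
        # Pareto-distributed keys: a few customers get most rows,
        # which helps surface join and shuffle skew before production.
        customer_id = int(rng.paretovariate(1.5))
        amount = round(rng.lognormvariate(3.0, 1.0), 2)
        rows.append(
            {
                "order_id": order_id,
                "customer_id": customer_id,
                # Inject missing values so null handling is exercised too.
                "amount": None if rng.random() < null_rate else amount,
            }
        )
    return rows
```

Rows in this shape could then be loaded into a local Spark session for testing; raising `n` turns the same generator into a rough load-test input.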

Sponsored by: Deloitte | Advancing AI in Cybersecurity with Databricks & Deloitte: Data Management & Analytics

Deloitte is observing a growing trend among cybersecurity organizations to develop big data management and analytics solutions beyond traditional Security Information and Event Management (SIEM) systems. Leveraging Databricks to extend these SIEM capabilities, Deloitte can help clients lower the cost of cyber data management while enabling scalable, cloud-native architectures. Deloitte helps clients design and implement cybersecurity data meshes, using Databricks as a foundational data lake platform to unify and govern security data at scale. Additionally, Deloitte extends clients’ cybersecurity capabilities by integrating advanced AI and machine learning solutions on Databricks, driving more proactive and automated cybersecurity solutions. Attendees will gain insight into how Deloitte is utilizing Databricks to manage enterprise cyber risks and deliver performant and innovative analytics and AI insights that traditional security tools and data platforms aren’t able to deliver.

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality | Abhi...

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality | Abhi Ghosh | Shift Left Data Conference 2025

Good data, not big data, is becoming more important in today's ecosystem. Machine learning models rely on good-quality data to make training more efficient and effective. We have traditionally applied data quality checks and balances in a manual, centralized way, putting much of the onus on our customers. Shifting data quality left brings the checks closer to where data is created, preventing bad data from flowing downstream. Auto-detecting, recommending and auto-enforcing data quality rules will also make our customers' jobs easier while creating a more mature and robust data ecosystem.
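A minimal sketch of enforcing quality rules at the point of creation might look like the following. The rule names, record fields and `split_at_source` helper are hypothetical illustrations, not the speaker's implementation.

```python
from typing import Callable

Rule = Callable[[dict], bool]

# Hypothetical rules, declared next to the code that produces the records.
RULES: dict[str, Rule] = {
    "account_id is present": lambda r: bool(r.get("account_id")),
    "amount is a non-negative number": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}


def failed_rules(record: dict) -> list[str]:
    """Return the names of every rule the record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]


def split_at_source(records: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Pass clean records on; quarantine the rest with the reasons attached."""
    good, quarantined = [], []
    for record in records:
        failures = failed_rules(record)
        if failures:
            quarantined.append((record, failures))
        else:
            good.append(record)
    return good, quarantined
```

Only the `good` list would be published downstream, so a model training job never sees the quarantined rows.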

Where Data Science Meets Shrek: How BuzzFeed uses AI

By introducing a range of AI-enhanced products that amplify creativity and interactivity across our platforms, BuzzFeed has been able to connect with the largest global audience of young people online to cement its role as the defining digital media company of the AI era. Notably, some of BuzzFeed's most successful tools and content experiences thrive on the power of small, focused datasets. Still wondering how Shrek fits into the picture? You'll have to watch!

Video from: https://smalldatasf.com/

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-...
Small Data Manifesto: https://motherduck.com/blog/small-dat...
Why Small Data?: https://benn.substack.com/p/is-excel-...
Small Data SF: https://www.smalldatasf.com/

➡️ Follow Us
LinkedIn: / motherduck
X/Twitter: / motherduck
Bluesky: motherduck.com
Blog: https://motherduck.com/blog/


Discover how BuzzFeed's Data team, led by Gilad Cohen, harnesses AI for creative purposes, leveraging large language models (LLMs) and generative image capabilities to enhance content creation. This video explores how machine learning teams build tools to create new interactive media experiences, focusing on augmenting creative workflows rather than replacing jobs, allowing readers to participate more deeply in the content they consume.

We dive into the core data science problem of understanding what a piece of content is about, a crucial step for improving content recommendation systems. Learn why traditional methods fall short and how the team is constantly seeking smaller, faster, and more performant models. This exploration covers the evolution from earlier architectures like DistilBERT to modern, more efficient approaches for better content representation, clustering, and user personalization.

A key technique explored is the use of text embeddings, which are dense, low-dimensional vector representations of data. This video provides an accessible explanation of embeddings as a form of compressed knowledge, showing how BuzzFeed creates a unique vector for each article. This allows for simple vector math to find semantically similar content, forming a foundational infrastructure for powerful ranking and recommender systems.
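The "simple vector math" mentioned here is usually cosine similarity. The sketch below assumes each article already has an embedding; the three-dimensional toy vectors and article names are invented for illustration, whereas real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id: str, embeddings: dict[str, list[float]], k: int = 2) -> list[str]:
    """Rank every other article by similarity to the query article."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), article_id)
        for article_id, vector in embeddings.items()
        if article_id != query_id
    ]
    return [article_id for _, article_id in sorted(scored, reverse=True)[:k]]


# Toy vectors, made up for this example.
articles = {
    "shrek-quiz": [0.9, 0.1, 0.0],
    "ogre-recipes": [0.8, 0.2, 0.1],
    "tax-tips": [0.0, 0.1, 0.9],
}
nearest = most_similar("shrek-quiz", articles, k=1)
```

At larger scale the brute-force loop would typically be replaced by a vector index, but the ranking criterion stays the same.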

Explore how BuzzFeed leverages generative image capabilities to create new interactive formats. The journey began with Midjourney experiments and evolved to building custom tools by fine-tuning a Stable Diffusion XL model using LoRA (Low-Rank Adaptation). This advanced technique provides greater control over image output, enabling the rapid creation of viral AI generators that respond to trending topics and allow for massive user engagement.
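Part of why LoRA (low-rank adaptation) makes fine-tuning practical is simple arithmetic: instead of updating a full weight matrix, it trains two thin matrices whose product has the same shape. The sketch below counts parameters for one layer; the layer size and rank are assumed values for illustration, not figures from the talk.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Compare trainable parameters: full update vs. a rank-r LoRA update.

    The full weight matrix is d_out x d_in. LoRA trains B (d_out x rank)
    and A (rank x d_in), and the learned update is the product B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed example: one 1280 x 1280 projection layer adapted at rank 8.
full, lora = lora_param_counts(1280, 1280, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints: full: 1,638,400  lora: 20,480  ratio: 1.25%
```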

Finally, see a practical application of machine learning for content optimization. BuzzFeed uses its vast historical dataset from Bayesian A/B testing to train a model that predicts headline performance. By generating multiple headline candidates with an LLM like Claude and running them through this predictive model, they can identify the winning headline. This showcases how to use unique, in-house data to build powerful tools that improve click-through rates and drive engagement, pointing to a significant transformation in how media is created and consumed.
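For readers unfamiliar with Bayesian A/B testing, the sketch below shows one common formulation: model each headline's click-through rate with a Beta posterior and estimate the probability that one beats the other by sampling. It illustrates only the testing step, not the predictive model trained on the results, and the click and view counts are invented.

```python
import random


def prob_b_beats_a(
    clicks_a: int, views_a: int, clicks_b: int, views_b: int,
    draws: int = 20000, seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(rate_b > rate_a) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a click rate: Beta(1 + clicks, 1 + non-clicks).
        rate_a = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        rate_b = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        wins += rate_b > rate_a
    return wins / draws


# Invented counts: headline B has more clicks on the same number of views.
p = prob_b_beats_a(clicks_a=50, views_a=1000, clicks_b=80, views_b=1000)
```

A team could declare a winner once this probability crosses a chosen threshold; the accumulated outcomes of many such tests are the kind of historical data a headline-performance model could be trained on.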

Jiri Moravcik: Automating Web Workflows with LLMs

🌟 Session Overview 🌟

Session Name: Automating Web Workflows with LLMs
Speaker: Jiri Moravcik
Session Description: This talk will delve into Apify's approach to automation and its workflow with Large Language Models (LLMs), highlighting the seamless integration and strategic use of AI in data extraction from the web. Participants will gain insight into how Apify serves clients like Intercom and Rocket Money by employing cutting-edge techniques to scrape and structure online data. The presentation will showcase specific case studies involving ChatGPT, illustrating the methodologies and tools utilized to transform raw data into valuable insights for clients.

🚀 About Big Data and RPA 2024 🚀

Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨

📅 Yearly Conferences: Curious about the evolution of QA? Check out our archive of past Big Data & RPA sessions. Watch the strategies and technologies evolve in our videos! 🚀
🔗 Find Other Years' Videos:
2023 Big Data Conference Europe https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g
2022 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT
2021 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP

💡 Stay Connected & Updated 💡

Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!

🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/
👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/
🐦 Twitter: @BigDataConfEU, @europe_rpa
🔗 LinkedIn: https://www.linkedin.com/company/73234449/admin/dashboard/, https://www.linkedin.com/company/75464753/admin/dashboard/
🎥 YouTube: http://www.youtube.com/@DATAMINERLT

Alius Petraska: Can AI Face Your (Potential) Customers?

🌟 Session Overview 🌟

Session Name: Can AI Face Your (Potential) Customers? Lessons Learned with Multilingual Enterprises
Speaker: Alius Petraska
Session Description: Vytenis will share their experience on when AI can directly interact with customers or when human intervention is still necessary. Their solution helps sales and customer service (CS) agents be more effective on calls. This provides a unique perspective on understanding when AI excels and when it falls short, still requiring clients to call or send emails.

AWS re:Invent 2024 - Cost-effective data processing with Amazon EMR (ANT344)

Unlock the full potential of your big data environment during this in-depth session on Amazon EMR and cost optimization strategies, tailored for data engineers, data architects, and cloud architects. Gain a comprehensive understanding of various cost optimization strategies, including cluster rightsizing, using Amazon EC2 Spot Instances, and implementing managed scaling. Learn about the key differences between Amazon EMR deployment models and how to choose the best option that aligns with your organization’s specific requirements, constraints, and technical capabilities. Leave with actionable insights and practical strategies to enhance your big data workflows and achieve significant cost savings.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

#AWSreInvent #AWSreInvent2024

Panel Discussion | Integrating AI with RPA

🌟 Session Overview 🌟

Session Name: Panel Discussion | Integrating AI with RPA: Streamlining Operations and Innovating Business Processes
Speakers: Ana Marija Barisic, Andrzej Kinatowski, Pedram Birounvand, Swanand Rao, Alius Petraska
Session Description: This panel discussion will explore the powerful synergy between Artificial Intelligence (AI) and Robotic Process Automation (RPA). Panelists will discuss how combining these technologies can transform and streamline business operations, driving efficiency, accuracy, and innovation. The session will cover real-world use cases, strategies for successful integration, and the potential challenges organizations might face.

Anna Semjen: From Quick Wins to Revolutionising Productivity & CX with GenAI

🌟 Session Overview 🌟

Session Name: From Quick Wins to Revolutionising Productivity & CX with GenAI: Utilising Real-time and Open Source AI with Semantic Search
Speaker: Anna Semjen
Session Description: Join this session to discover how DataStax Astra DB can boost productivity, enable rapid deployment of GenAI applications, and transform customer experience. We’ll showcase an advanced semantic search use case, demonstrating how to vectorize entire videos with specific timestamps and use natural language processing to find precise moments from events like the Olympics. Learn about an open-source model that runs locally, making this powerful tool accessible and cost-effective. Additionally, explore hybrid search capabilities that integrate multiple videos into a single collection, streamlining processes by loading only embeddings and metadata. Perfect for enhancing content management and delivering exceptional user experiences.
