talk-data.com

Topic: Big Data

Tags: data_processing, analytics, large_datasets

1217 activities tagged

Activity Trend: peak of 28 activities/quarter, 2020-Q1 to 2026-Q1

Activities

1217 activities · Newest first

Engineering Lakehouses with Open Table Formats

Engineering Lakehouses with Open Table Formats introduces the architecture and capabilities of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake. The book guides you through the design, implementation, and optimization of lakehouses that can handle modern data processing requirements, grounded in practical, real-world insights.

What this book will help me do

  • Understand the fundamentals of open table formats and their benefits in lakehouse architecture.
  • Learn how to implement performant data processing using tools like Apache Spark and Flink.
  • Master advanced topics like indexing, partitioning, and interoperability between data formats.
  • Explore data lifecycle management and integration with frameworks like Apache Airflow and dbt.
  • Build secure lakehouses with regulatory compliance using the best practices detailed in the book.

Author(s)

Dipankar Mazumdar and Vinoth Govindarajan are seasoned professionals with extensive experience in big data processing and software architecture. They bring their expertise from working with data lakehouses and are known for their ability to explain complex technical concepts clearly. Their collaborative approach brings valuable insights into the latest trends in data management.

Who is it for?

This book is ideal for data engineers, architects, and software professionals aiming to master modern lakehouse architectures. If you are familiar with data lakes or warehouses and wish to transition to an open data architecture, this book is suited for you. Readers should have basic knowledge of databases, Python, and Apache Spark for the best experience.
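Given the book's focus on Spark-based work with open table formats, here is a minimal, hedged sketch (not taken from the book) of creating and querying a partitioned Apache Iceberg table from PySpark. The catalog name, warehouse path, and table name are assumptions made for illustration.

```python
# A hedged PySpark + Apache Iceberg sketch; the 'demo' catalog, the local
# warehouse path, and the table name are invented for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    # The Iceberg runtime package must match your Spark/Scala versions.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create a namespace and a partitioned Iceberg table, then append a row.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT, category STRING, ts TIMESTAMP
    ) USING iceberg PARTITIONED BY (category)
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'click', current_timestamp())")

# Iceberg's snapshot metadata is what enables time travel and lifecycle
# management; every table exposes it as a metadata table.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```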

The rise of AI, automation, and big data is transforming careers across industries. In this talk, we’ll explore the evolving job market, the skills that will set you apart, and how you can position yourself for long-term success. Whether you’re looking to transition into AI and data or advance in your current role, this session will equip you with the insights and strategies you need:

  • What new roles are emerging?
  • How will AI impact existing jobs?
  • Why aren’t you getting interviews, and how can you fix it?
  • What should you be learning right now?

AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

Explore the latest advancements in AWS Analytics designed to transform your data processing landscape. This session unveils powerful new capabilities across key services, including Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for optimized querying, and Amazon Managed Workflows for Apache Airflow (MWAA) for workflow orchestration. Discover how these innovations can supercharge performance, optimize costs, and streamline your data ecosystem. Whether you're looking to enhance scalability, improve data integration, accelerate queries, or refine workflow management, join us to gain actionable insights that will position your organization at the forefront of data processing innovation.


Data Engineering for Beginners

A hands-on technical and industry roadmap for aspiring data engineers.

In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering. Whether you're starting a rewarding new career as a data analyst, data engineer, or data scientist, or seeking to expand your skillset in an existing engineering role, Nwokwu offers the technical and industry knowledge you need to succeed. The book explains:

  • Database fundamentals, including relational and NoSQL databases
  • Data warehouses and data lakes
  • Data pipelines, including batch and stream processing
  • Data quality dimensions
  • Data security principles, including data encryption
  • Data governance principles and frameworks
  • Big data and distributed systems concepts
  • Data engineering on the cloud
  • Essential skills and tools for data engineering interviews and jobs

Data Engineering for Beginners offers an easy-to-read roadmap to a seemingly complicated and intimidating subject. It addresses the topics most likely to cause a beginning data engineer to stumble, clearly explaining key concepts in an accessible way. You'll also find:

  • A comprehensive glossary of data engineering terms
  • Common and practical career paths in the data engineering industry
  • An introduction to key cloud technologies and services you may encounter early in your data engineering career

Perfect for practicing and aspiring data analysts, data scientists, and data engineers, Data Engineering for Beginners is an effective and reliable starting point for learning an in-demand skill. It's a powerful resource for everyone hoping to expand their data engineering skillset and upskill in the big data era.

We were told to scale compute. But what if the real problem was never about big data, but about bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques, projection pushdown and predicate pushdown, and why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic; they’re the difference between querying a terabyte in seconds vs. minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design, especially in formats like Iceberg and Arrow, can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
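To make the two techniques concrete, here is a minimal, hedged sketch using DuckDB's Python API over a local Parquet file; the file name, column names, and row counts are invented for illustration, not from the talk.

```python
# A minimal sketch of projection and predicate pushdown with DuckDB.
import duckdb

con = duckdb.connect()

# Write a sample Parquet file standing in for a large lake table.
con.sql("""
    COPY (SELECT range AS id, range % 100 AS store_id, random() AS amount
          FROM range(1000000))
    TO 'sales.parquet' (FORMAT PARQUET)
""")

# Projection pushdown: only 'amount' and 'store_id' are read from disk.
# Predicate pushdown: the store_id filter is evaluated inside the Parquet
# scan, so row groups whose statistics exclude 42 are skipped entirely.
print(con.sql("""
    SELECT sum(amount)
    FROM read_parquet('sales.parquet')
    WHERE store_id = 42
""").fetchall())

# EXPLAIN shows the filter and the projected columns attached to the scan
# node rather than applied after a full read.
print(con.sql("""
    EXPLAIN SELECT sum(amount)
    FROM read_parquet('sales.parquet')
    WHERE store_id = 42
"""))
```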

The lakehouse promised to unify our data, but popular formats can feel bloated and hard to use for most real-world workloads. If you've ever felt that the complexity and operational overhead of "Big Data" tools are overkill, you're not alone. What if your lakehouse could be simple, fast, and maybe even a little fun? Enter DuckLake, the native lakehouse format, managed on MotherDuck. It delivers the powerful features you need, like ACID transactions, time travel, and schema evolution, without the heavyweight baggage. This approach truly makes massive datasets feel like Small Data. This workshop is a practical, step-by-step walkthrough for the data practitioner. We'll get straight to the point and show you how to build a fully functional, serverless lakehouse from scratch. You will learn:

  • The Architecture: We’ll explore how DuckLake's design choices make it fundamentally simpler and faster for analytical queries compared to its JVM-based cousins.
  • The Workflow: Through hands-on examples, you'll create a DuckLake table, perform atomic updates, and use time travel, all with the simple SQL you already know.
  • The MotherDuck Advantage: Discover how the serverless platform makes it easy to manage, share, and query your DuckLake tables, enabling a seamless hybrid workflow between your laptop and the cloud.
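A hedged sketch of the workflow described above, using DuckDB's ducklake extension locally; the catalog file, data path, table name, and snapshot number are assumptions, and the MotherDuck-managed variant would attach a remote catalog instead.

```python
# Hedged DuckLake sketch; names and the snapshot version are invented.
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Attach a DuckLake catalog: table data lands as Parquet under DATA_PATH,
# while metadata lives in the small catalog database file.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

# Create a table and perform an atomic (ACID) update, plain SQL throughout.
con.sql("CREATE TABLE lake.orders (id INTEGER, amount DOUBLE)")
con.sql("INSERT INTO lake.orders VALUES (1, 9.99), (2, 19.99)")
con.sql("UPDATE lake.orders SET amount = 24.99 WHERE id = 2")

# Time travel: read the table as of an earlier snapshot version.
print(con.sql("SELECT * FROM lake.orders AT (VERSION => 2)"))
```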

Does size matter? When it comes to datasets, the conventional wisdom seems to be a resounding, "Yes!" But what about small datasets? Small- and mid-sized businesses and nonprofits, especially, often have limited web traffic, small email lists, CRM systems that can comfortably operate under the free tier, and lead and order counts that don't lend themselves to "big data" descriptors. Even large enterprises have scenarios where some datasets easily fit into Google Sheets with limited scrolling required. Should this data be dismissed out of hand, or should it be treated as what it is: potentially useful? Joe Domaleski from Country Fried Creative works with a lot of businesses that are operating in the small data world, and he was so intrigued by the potential of putting data to use on behalf of his clients that he's mid-way through getting a Master's degree in Analytics from Georgia Tech! He wrote a really useful article about the ins and outs of small data, so we brought him on for a discussion on the topic! This episode's Measurement Bite from show sponsor Recast is an explanation of synthetic controls and how they can be used as counterfactuals from Michael Kaminsky! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

In this episode, we talked with Aishwarya Jadhav, a machine learning engineer whose career has spanned Morgan Stanley, Tesla, and now Waymo. Aishwarya shares her journey from big data in finance to applied AI in self-driving, gesture understanding, and computer vision. She discusses building an AI guide dog for the visually impaired, contributing to malaria mapping in Africa, and the challenges of deploying safe autonomous systems. We also explore the intersection of computer vision, NLP, and LLMs, and what it takes to break into the self-driving AI industry.

TIMECODES

  • 00:51 Aishwarya’s career journey from finance to self-driving AI
  • 05:45 Building AI guide dog for the visually impaired
  • 12:03 Exploring LiDAR, radar, and Tesla’s camera-based approach
  • 16:24 Trust, regulation, and challenges in self-driving adoption
  • 19:39 Waymo, ride-hailing, and gesture recognition for traffic control
  • 24:18 Malaria mapping in Africa and AI for social good
  • 29:40 Deployment, safety, and testing in self-driving systems
  • 37:00 Transition from NLP to computer vision and deep learning
  • 43:37 Reinforcement learning, robotics, and self-driving constraints
  • 51:28 Testing processes, evaluations, and staged rollouts for autonomous driving
  • 52:53 Can multimodal LLMs be applied to self-driving?
  • 55:33 How to get started in self-driving AI careers

Connect with Aishwarya

  • Linkedin - https://www.linkedin.com/in/aishwaryajadhav8/

Connect with DataTalks.Club

  • Join the community - https://datatalks.club/slack.html
  • Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
  • Check other upcoming events - https://lu.ma/dtc-events
  • GitHub: https://github.com/DataTalksClub
  • LinkedIn - https://www.linkedin.com/company/datatalks-club/
  • Twitter - https://twitter.com/DataTalksClub
  • Website - https://datatalks.club/

What if your information system became truly data-centric?

Rather than building around applications, discover how to put data at the heart of your information system and turn it into data products usable across the entire company.

This Big Data & AI Paris session will immerse you in the data-driven approach: how to structure, govern, and get value from your data to gain agility, performance, and innovation.

Make your data the strategic engine of your organization.

How do you turn heterogeneous public data into assets that are usable, traceable, and integrable at scale? Through a joint retrospective from North Data and Malt, discover the inner workings of a Big Data project: structuring massive data, AI-assisted extraction, integrating that data into operational workflows, and concrete business impact.

  • Big Data brought the systematic analysis of data. AI adds more intelligence to those analyses. Now agentic AI is changing the game, moving from intelligence to action.
  • Drawing on his experience at Palantir and his role as CEO of H, Gautier Cloix will show how agentic AI can automate data integration, act directly on databases, and deploy agents to production.
  • The result: faster, smarter, and autonomous use of data, at scale.

The Egis group designs and operates complex infrastructure worldwide: highways, airports, rail, buildings, mobility services, energy, urban development, and the environment. The diversity and volume of the data generated pose major challenges in governance, industrialization, and scalability.

To meet them, Egis deployed a Data Mesh infrastructure on Azure, built and operated by a dedicated team. This team handles the design, governance, and provisioning of the architecture for all Business Lines. The infrastructure relies on:

• Distributed storage with ADLS Gen2,

• ETL and big data processing with Azure Data Factory and Databricks,

• Visualization and secure sharing via Power BI Service and Delta Sharing (a consumer-side sketch follows this list),

• Advanced governance mechanisms to guarantee interoperability and reliability.
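To make the sharing layer concrete, here is a hedged sketch of how a downstream consumer might read one of these governed data products through Delta Sharing's open Python client. The profile file and the share, schema, and table names are hypothetical; a real profile would be issued by the platform team.

```python
# Hypothetical consumer-side read of a Delta Sharing data product.
# 'config.share' is a JSON profile (endpoint + bearer token) issued by the
# data-providing team; the share/schema/table names below are invented.
import delta_sharing

profile_file = "config.share"
table_url = profile_file + "#analytics_share.mobility.trips"

# Load the shared Delta table into a pandas DataFrame; the client talks to
# the sharing server, so the consumer needs no direct access to ADLS Gen2.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```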

This session will present:

• The architecture choices and technical patterns for building a distributed Data Mesh at the scale of a large international group

• The role and organization of the dedicated team in provisioning the platform and supporting business projects

• Practical lessons drawn from concrete use cases already in production

 

A deep dive into a real-world Data Mesh implementation, designed to turn data into an asset that is accessible, reliable, and usable at scale by business and technical teams.

This session, presented by Polar Analytics, dives into the heart of the Big Data challenges facing e-commerce brands. Far from theoretical concepts, we will walk through a concrete use case showing how to turn fragmented data into a unified, actionable business intelligence platform.

Data governance is not just a matter of compliance; it is a cultural revolution at the heart of enterprise transformation. In a world where Big Data and AI are redefining how data is used, combining control, adoption, and innovation becomes essential to guarantee its strategic use. This conference explores a holistic approach that integrates governance, investment steering, and data literacy to turn data into a genuine lever for performance and impact. What strategy should you adopt to succeed in a data-driven transformation? How do you rally teams around a shared vision? These are among the key questions addressed for achieving a deep and lasting transformation.