talk-data.com


People (3 results)


Activities & events

Gilad Tal – Co-founder & CTO @ Dualbird
Spark
Meni Shmueli – Co-founder & CEO @ DataFlint
Spark Big Data

One of the big challenges in big data is interacting with the storage layer, especially in a data lake, where we are the ones who manage the files and partitions. One of the most common performance problems in data lakes is working with small files. In this lecture we will learn about:
* Why it's important to read and write files at a best-practice size
* How Apache Spark interacts with files under the hood, and how that relates to Spark tasks
* How we can easily detect and fix small-files problems (using the open source library DataFlint)
* How to handle small-files problems when using storage formats such as Delta Lake and Iceberg
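As a rough illustration of the file-size point above, here is a minimal Python sketch that flags small files and estimates how many output files a dataset should be compacted into. The 128 MiB target, the 25% "small" threshold, and both function names are assumptions made for this example; they are not taken from the talk or from DataFlint.

```python
import math
from typing import Iterable

# Assumed target file size (128 MiB). A common rule of thumb,
# not a figure stated in the talk abstract.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def count_small_files(
    file_sizes: Iterable[int],
    target_bytes: int = TARGET_FILE_BYTES,
    small_ratio: float = 0.25,
) -> int:
    """Count files smaller than small_ratio * target_bytes.

    The ratio is a hypothetical heuristic chosen for illustration.
    """
    threshold = target_bytes * small_ratio
    return sum(1 for size in file_sizes if size < threshold)


def target_file_count(total_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Estimate how many files to write so each is close to the target size."""
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_bytes must be > 0")
    # Always write at least one file, even for an empty or tiny dataset.
    return max(1, math.ceil(total_bytes / target_bytes))


# Example: a partition made of 2000 files of 1 MiB each.
sizes = [1 * 1024 * 1024] * 2000
small = count_small_files(sizes)
needed = target_file_count(sum(sizes))
print(f"{small} small files; compact into about {needed} files")
```

In a Spark job, a count like `needed` could be passed to `DataFrame.repartition(n)` before writing, though the right value depends on the format, compression, and cluster, so treat it as a starting point only.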

Lecturer: Meni Shmueli, founder and author of DataFlint (https://github.com/dataflint/spark). Formerly at Unit 81, ZipRecruiter, and Granulate. Passionate about everything related to big data and about working with data teams to solve their day-to-day challenges. Over the years he has helped dozens of companies improve performance, debug issues, and increase development velocity in the big data world, and he is currently working to solve performance observability in big data with DataFlint.

Fixing small files performance issues in Apache Spark, using DataFlint
