Topic

HDFS

Hadoop Distributed File System (HDFS)

distributed_storage big_data hadoop

Activities

2

tagged

Activity Trend

1 peak/qtr

2020-Q1 2026-Q2

Top Events

O'Reilly Data Engineering Books (46) · Data Engineering Podcast (6) · O'Reilly Data Science Books (2) · Databricks DATA + AI Summit 2023 (2) · O'Reilly SQL Books (1)

Top Speakers

Tobias Macey (6) · Tom White (4) · Sandeep R Patil (3) · Sandeep Karanth (2) · Enrico van de Laar (2) · Douglas Eadline (2) · Deepak Vohra (2) · Donald Miner (2) · Benjamin Weissman (2) · Alan Gates (2) · Shiva Achari (1) · Muthu Muthiah (1)

Activities

2 activities · Newest first


Optimizing Speed and Scale of User-Facing Analytics Using Apache Kafka and Pinot

2022-07-19 · Databricks DATA + AI Summit 2023

video

by Karin Wolok (StarTree), Neha Power (StarTree)

Analytics · Azure · Big Data · Cloud Computing · Cloud Storage · Data Lake · Databricks · GCP · Kafka · Oracle · S3 · Data Streaming

Apache Kafka is the de facto standard for real-time event streaming, but what do you do if you want to perform user-facing, ad-hoc, real-time analytics too? That's where Apache Pinot comes in.

Apache Pinot is a real-time distributed OLAP datastore that delivers scalable, low-latency analytics. It can ingest data from batch sources (S3, HDFS, Azure Data Lake, Google Cloud Storage) as well as from streaming sources such as Kafka. Pinot is used extensively at LinkedIn and Uber to power analytical applications such as Who Viewed My Profile, Ad Analytics, Talent Analytics, Uber Eats, and many more, serving 100k+ queries per second while ingesting 1 million+ events per second.

Apache Kafka's highly performant, distributed, fault-tolerant, real-time publish-subscribe messaging platform powers big data solutions at Airbnb, LinkedIn, MailChimp, Netflix, the New York Times, Oracle, PayPal, Pinterest, Spotify, Twitter, Uber, Wikimedia Foundation, and countless other businesses.

Come hear from Neha Power, Founding Engineer at StarTree and a PMC member and committer of Apache Pinot, and Karin Wolok, Head of Developer Community at StarTree, for an introduction to both systems and a view of how they work together.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks

2022-07-19 · Databricks DATA + AI Summit 2023

video

Databricks · S3 · Cyber Security · Spark · Data Streaming

Working with S3 is different from working with HDFS: the architecture of an object store makes the standard Spark file connector inefficient on S3.

One way to tackle this problem is to use a message queue that listens for changes in a bucket. But what if an additional message queue is not an option and you need to use Spark Streaming? You can use the standard file connector, but performance quickly degrades as the number of files in the source path grows.

We have seen this happen at Hunters, a security operations platform that works with a wide range of data sources.

We want to share a description of the problem and of the solution, which we will open-source. The audience will learn how to configure it and make the best use of it. We will also discuss how to use metadata to speed up the discovery of new files in the stream, and show how the time metadata of CloudTrail can be used to efficiently collect logs for hunting cyber attacks.
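
To make the file-discovery problem in this abstract concrete, here is a minimal Python sketch. It is not the connector described in the talk. It assumes a hypothetical in-memory bucket (`FakeBucket`), a hypothetical date-partitioned key layout (`logs/YYYY/MM/DD/...`), and a cost model that simply counts the keys each listing returns. It compares listing the whole source path on every poll with listing only the date prefixes that can contain new files.

```python
from datetime import date, timedelta


class FakeBucket:
    """Hypothetical in-memory stand-in for an object store bucket."""

    def __init__(self):
        self.keys = set()
        self.keys_scanned = 0  # cost model: number of keys returned by listings

    def put(self, key):
        self.keys.add(key)

    def list(self, prefix=""):
        matched = sorted(k for k in self.keys if k.startswith(prefix))
        self.keys_scanned += len(matched)
        return matched


def discover_full_listing(bucket, seen):
    """Naive discovery: list every key, then diff against keys already seen."""
    new = [k for k in bucket.list() if k not in seen]
    seen.update(new)
    return new


def discover_by_date_prefix(bucket, seen, since, today):
    """Metadata-aware discovery: list only date prefixes from `since` to `today`."""
    new = []
    day = since
    while day <= today:
        prefix = f"logs/{day:%Y/%m/%d}/"  # assumed key layout, not a real schema
        new.extend(k for k in bucket.list(prefix) if k not in seen)
        day += timedelta(days=1)
    seen.update(new)
    return new


# Fill the bucket with 100 days of history, 10 files per day (1000 keys).
bucket = FakeBucket()
start = date(2022, 1, 1)
for offset in range(100):
    d = start + timedelta(days=offset)
    for n in range(10):
        bucket.put(f"logs/{d:%Y/%m/%d}/event-{n}.json")

# Both strategies have already processed the history.
seen_full = set(bucket.keys)
seen_prefix = set(bucket.keys)

# One new file arrives on the most recent day.
last_day = start + timedelta(days=99)
bucket.put(f"logs/{last_day:%Y/%m/%d}/event-new.json")

bucket.keys_scanned = 0
new_full = discover_full_listing(bucket, seen_full)
full_cost = bucket.keys_scanned

bucket.keys_scanned = 0
new_prefix = discover_by_date_prefix(bucket, seen_prefix, last_day, last_day)
prefix_cost = bucket.keys_scanned

print(new_full == new_prefix, full_cost, prefix_cost)
```

Both strategies find the same new file, but the full listing touches every key in the path while the prefix listing touches only the keys for the current day. The gap widens as history accumulates, which is the degradation the abstract describes.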
