Activities & events
How Coinbase Built and Optimized SOON, a Streaming Ingestion Framework
2023-07-26 · 21:11
Low-latency data is important for real-time incident analysis and metrics. Although OLTP databases hold up-to-date data, they cannot serve analytical queries that use GroupBy and Join across multiple tables from different systems, so the data must be replicated to a data warehouse. At Coinbase, we designed SOON (Spark cOntinuOus iNgestion), built on Kafka, Kafka Connect, and Apache Spark™, as an incremental table replication solution that replicates tables of any size from any database to Delta Lake in a timely manner. It also naturally supports ingestion of Kafka events.

SOON incrementally ingests Kafka events as appends, updates, and deletes to an existing table on Delta Lake. The events fall into two categories: CDC (change data capture) events generated by Kafka Connect source connectors, and non-CDC events produced by frontend or backend services. Both types can be appended or merged into Delta Lake. Non-CDC events can be in any format, but CDC events must conform to the standard SOON CDC schema; we implemented Kafka Connect SMTs (Single Message Transforms) to convert raw CDC events into this standardized format. SOON unifies all streaming ingestion scenarios, so users only need to learn one onboarding experience and the team only needs to maintain one framework.

Ingestion performance matters to us. The largest append-only table onboarded has ingress traffic of hundreds of thousands of events per second; the largest CDC-merge table onboarded has a snapshot size of a few TBs and CDC update traffic of hundreds of thousands of events per second. SOON incorporates a number of optimizations to reach this scale, such as min-max range merge optimization, KMeans merge optimization, no-update merge for deduplication, and generated columns as partitions.
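The abstract names a "min-max range merge optimization" but does not publish code. The underlying idea — use the incoming CDC batch's key range, together with the per-file min/max column statistics that Delta Lake keeps in its transaction log, to skip target files the merge cannot touch — can be sketched in plain Python. All names below (`key_range`, `prune_files`, the file-stat fields) are illustrative assumptions, not SOON's actual API:

```python
# Illustrative sketch of min-max range merge pruning (not SOON's real code).
# Before merging a small CDC micro-batch into a large table, compute the
# min/max of the merge key in the batch and keep only the target files whose
# key range can overlap it; the merge then scans far fewer files.

def key_range(events, key="id"):
    """Return (min, max) of the merge key across a batch of CDC events."""
    keys = [e[key] for e in events]
    return min(keys), max(keys)

def prune_files(file_stats, lo, hi):
    """Keep files whose [min_key, max_key] interval overlaps [lo, hi].

    file_stats: list of dicts like {"path": ..., "min_key": ..., "max_key": ...},
    standing in for the per-file column statistics Delta Lake records.
    """
    return [f for f in file_stats if f["max_key"] >= lo and f["min_key"] <= hi]

# Example: a four-file table and a CDC batch that only touches keys 105-130.
files = [
    {"path": "part-0", "min_key": 0,   "max_key": 99},
    {"path": "part-1", "min_key": 100, "max_key": 199},
    {"path": "part-2", "min_key": 200, "max_key": 299},
    {"path": "part-3", "min_key": 300, "max_key": 399},
]
batch = [{"id": 105, "op": "update"}, {"id": 130, "op": "delete"}]

lo, hi = key_range(batch)
candidates = prune_files(files, lo, hi)  # only part-1 survives pruning
```

In a real Spark merge, the same bounds would typically be folded into the MERGE condition (e.g. `target.id BETWEEN lo AND hi`) so the engine's data skipping prunes the untouched files automatically.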
Talk by: Chen Guo
Event: Databricks DATA + AI Summit 2023
Solving a Trillion-Dollar Problem with AI featuring Min Chen
2022-01-06 · 23:18
Min Chen – CEO @ Wisy Inc, Jonas Christensen – host
“Out of stock”. Three words with a great deal of significance for retailers and their customers. It is estimated that retail products are out of stock 8% of the time in physical stores and more than 14% of the time in e-commerce stores, leading to frustration for retailers and customers alike. Retailers miss out on revenue from the forgone sales. Customers leave unfulfilled and are less likely to return to the same retailer or recommend it to others in their network. Supply chains feel the ripples of the gaps between demand and supply. Globally, this is a trillion-dollar problem.

The solution is not just about demand forecasting; it also requires knowing what you actually have in stock, which is a huge challenge in itself. To understand how to solve this challenge, I recently spoke to Min Chen, co-founder and CEO of Wisy Inc. The company's technology focuses on reducing retail stockouts and waste with artificial intelligence and data analytics. Min is a seasoned entrepreneur and an all-round interesting person. Having migrated from China to Panama at age 4, Min now lives in Silicon Valley after moving Wisy from Panama to the US in 2020.

In this episode of Leaders of Analytics, you will learn:
- How AI can help solve a global, trillion-dollar supply chain problem
- How to develop product-market fit for AI solutions
- How to bootstrap a start-up in a difficult environment
- Why Wisy decided to move the company from Panama to Silicon Valley
Event: Leaders of Analytics