
Topic: Vertica

Tags: big_data, on_premise, analytics, database

Activity Trend: 2020-Q1 through 2026-Q1 (peaking at 1 activity per quarter)

Activities

8 activities · Newest first

Databases are ubiquitous, and you don’t need to be a data practitioner to know that all data everywhere is stored in a database—or is it? While the majority of data around the world lives in a database, the data that helps run the heart of our operating systems—the core functions of our computers—is not stored in the same place as everything else. This is because database storage sits ‘above’ the operating system, which must be running before any database can be used. But what if the OS were built ‘on top’ of a database? What difference could this fundamental change make to how we use computers? Mike Stonebraker is a distinguished computer scientist known for his foundational work in database systems; he is also currently CTO and co-founder at DBOS. His extensive career includes significant contributions through academic prototypes and commercial startups, leading to the creation of several pivotal relational database companies such as Ingres Corporation, Illustra, Paradigm4, StreamBase Systems, Tamr, Vertica, and VoltDB. Stonebraker’s role as chief technical officer at Informix and his influential research earned him the prestigious 2014 Turing Award. Stonebraker’s professional journey spans two major phases: initially at the University of California, Berkeley, focusing on relational database management systems like Ingres and Postgres, and later, from 2001, at the Massachusetts Institute of Technology (MIT), where he pioneered advanced data management techniques including C-Store, H-Store, SciDB, and DBOS. He remains a professor emeritus at UC Berkeley and continues to influence the field as an adjunct professor at MIT’s Computer Science and Artificial Intelligence Laboratory. Stonebraker is also recognized for his editorial work on the book “Readings in Database Systems.” In the episode, Richie and Mike explore the success of PostgreSQL, the evolution of SQL databases, the shift towards cloud computing and what that means in practice when migrating to the cloud, the impact of disaggregated storage, software and serverless trends, the role of databases in facilitating new data and AI trends, DBOS and its advantages for security, and much more.

Links Mentioned in the Show:
DBOS
Paper: What Goes Around Comes Around
[Course] Understanding Cloud Computing
Related Episode: Scaling Enterprise Analytics with Libby Duane Adams, Chief Advocacy Officer and Co-Founder of Alteryx
Rewatch sessions from RADAR: The Analytics Edition

New to DataCamp?
Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business

Vertical Growth

Learn the secrets to self-awareness, life-changing growth and happy, high-performing teams—from the bestselling author of The Mindful Leader. Great leaders and teams don’t know everything, and they don’t get it right every time. What sets them apart is their commitment to continual learning and vertical growth. Vertical growth is about cultivating the self-awareness to see our self-defeating thoughts, assumptions and behaviours, and then consciously creating new behaviours that are aligned with our best intentions and aspirations. By embracing the deliberate practices and processes for vertical growth laid out in this book, you’ll not only radically improve your leadership and personal wellbeing—you’ll also foster the highest levels of trust, psychological safety, motivation, and creativity in the teams and groups you work with. You’ll discover how to:

Identify when, where and how to develop new leadership behaviours to get better results
Regulate your emotional responses in real time and handle the most difficult challenges with balance, wisdom and accountability
Cultivate practices for self-awareness that foster lifelong internal growth and personal happiness
Uncover and change the limiting assumptions and beliefs that keep you, your team and organisation locked in unproductive habits and behaviours
Create practices and rituals that enable the highest levels of psychological safety, innovation and growth

Filled with fascinating real-life case studies as well as practical tools and strategies, this is your handbook for mastering vertical growth in yourself, your team and your organisation.

Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large-scale historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and by supporting arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture supports building analytical views that are updated far more frequently, and the work being done to add a more polished experience to the data lake paradigm.
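To make the incremental-update idea concrete, here is a minimal sketch (not taken from the episode) of upserting a handful of records into a Hudi table with PySpark. The table name, path, and columns are illustrative assumptions, and it presumes a Spark session launched with a matching Hudi Spark bundle on the classpath.

```python
# Minimal sketch: upserting records into an Apache Hudi table with PySpark.
# Assumes Spark was started with a Hudi Spark bundle, e.g.
#   spark-submit --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>
# Table name, path, and fields below are illustrative, not from the episode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

updates = spark.createDataFrame(
    [("ride-001", "2021-07-01 10:15:00", 23.40, "2021-07-01"),
     ("ride-002", "2021-07-01 10:20:00", 11.75, "2021-07-01")],
    ["ride_id", "event_ts", "fare", "ride_date"],
)

hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",      # primary key used to match rows
    "hoodie.datasource.write.precombine.field": "event_ts",    # newest record wins on conflict
    "hoodie.datasource.write.partitionpath.field": "ride_date",
    "hoodie.datasource.write.operation": "upsert",             # merge into existing files
}

# Append mode with the upsert operation merges these rows into the existing
# table instead of rewriting the whole dataset.
updates.write.format("hudi").options(**hudi_options).mode("append").save("/data/lake/rides")

# Queries then see the merged, up-to-date view of the table.
spark.read.format("hudi").load("/data/lake/rides").show()
```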

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Hudi is and the story behind it? What are the use cases that it is focused on supporting? There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.? Can you describe how Hudi is architected?

How have the goals and design of Hudi changed or evolved since you first began working on it? If you were to start the whole project over today, what would you do differently?

Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment? One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?

How does Hudi make that a tractable problem?

What are the data platform components that are needed to support an installation of Hudi? What is involved in migrating an existing data lake to use Hudi?

How would someone approach supporting heterogeneous table formats in their lake?

As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem? What are the most interesting, innovative, or unexpected ways that you have seen Hudi used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi? When is Hudi the wrong choice? What do you have planned for the future of Hudi?

Contact Info

LinkedIn Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show, then tell us about it! Email [email protected] with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Links

Hudi Docs Hudi Design & Architecture Incremental Processing CDC == Change Data Capture

Podcast Episodes

Oracle GoldenGate Voldemort Kafka Hadoop Spark HBase Parquet Iceberg Table Format

Data Engineering Episode

Hive ACID Apache Kudu

Podcast Episode

Vertica Delta Lake

Podcast Episode

Optimistic Concurrency Control MVCC == Multi-Version Concurrency Control Presto Flink

Podcast Episode

Trino

Podcast Episode

Gobblin LakeFS

Podcast Episode

Nessie

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary When you build a machine learning model, the first step is always to load your data. Typically, this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode, Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML, then this conversation is for you.
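As a concrete illustration of the in-database approach Paige describes, here is a minimal sketch of driving Vertica's built-in ML functions from Python with the vertica-python client. The connection details, table, and column names are invented for illustration, and the exact ML function names and parameters should be verified against the documentation for your Vertica version.

```python
# Sketch of the "train where the data lives" idea from this episode: calling
# Vertica's in-database ML functions from Python instead of exporting data to a
# separate training environment. Connection info, table, and columns are
# illustrative; check function signatures against your Vertica version.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Train a linear regression model inside the database; no data leaves Vertica.
    cur.execute("""
        SELECT LINEAR_REG('fare_model', 'public.rides', 'fare', 'distance_km, duration_min')
    """)

    # Score new rows with the stored model, again entirely in the database engine.
    cur.execute("""
        SELECT ride_id,
               PREDICT_LINEAR_REG(distance_km, duration_min
                                  USING PARAMETERS model_name='fare_model') AS predicted_fare
        FROM public.new_rides
    """)
    for ride_id, predicted_fare in cur.fetchall():
        print(ride_id, predicted_fare)
```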

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?

What are the motivating factors for running a machine learning workflow inside the database?

What styles of ML are feasible to do inside the database? (e.g. Bayesian inference, deep learning, etc.) What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts) Can you describe the architecture of how the machine learning process is managed by the database engine? How do you manage interacting with Python/R/Jupyter/etc. when working within the database? What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow? What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database? When is in-database ML the wrong choice? What are the recent trends/

Summary

As it has become easier to gain access to servers in data centers across the world, the need for globally distributed data storage has grown. In the first wave of cloud-era databases, the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings, the engineers at Cockroach Labs have built CockroachDB, a globally distributed SQL database with full ACID semantics. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
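Because CockroachDB speaks the PostgreSQL wire protocol, a standard Postgres driver is enough to experiment with the ACID semantics discussed here. Below is a minimal, illustrative sketch using psycopg2 with a simple retry loop for serialization conflicts (SQLSTATE 40001); the host, table, and credentials are assumptions, not details from the episode.

```python
# Sketch of Postgres wire compatibility: talking to CockroachDB with an ordinary
# PostgreSQL driver (psycopg2) and retrying transactions that lose a
# serialization conflict. Host, table, and credentials are illustrative.
import psycopg2
import psycopg2.errors

conn = psycopg2.connect(host="cockroach.example.com", port=26257,
                        dbname="bank", user="app", sslmode="require")

def transfer(conn, src, dst, amount, max_retries=3):
    """Move funds between accounts in one ACID transaction, retrying on conflict."""
    for attempt in range(max_retries):
        try:
            with conn:                      # commit on success, roll back on error
                with conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                (amount, src))
                    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                (amount, dst))
            return
        except psycopg2.errors.SerializationFailure:
            # Another transaction won the conflict; safe to retry from the start.
            continue
    raise RuntimeError("transfer did not commit after retries")

transfer(conn, src=1, dst=2, amount=100)
conn.close()
```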

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services.

Interview

Introduction How did you get involved in the area of data management? What was the motivation for creating CockroachDB and building a business around it? Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?

What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions? What are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism?

Go is an unconventional language for building a database. What are the pros and cons of that choice? What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?

What are the edge cases and failure modes that users should be aware of?

I know that your SQL syntax is PostgreSQL-compatible, so is it possible to use existing ORMs unmodified with CockroachDB?

What are some examples of extensions that are specific to CockroachDB?

What are some of the most interesting uses of CockroachDB that you have seen? When is CockroachDB the wrong choice? What do you have planned for the future of CockroachDB?

Contact Info

Peter

LinkedIn petermattis on GitHub @petermattis on Twitter

Cockroach Labs

@CockroachDB on Twitter Website cockroachdb on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

CockroachDB Cockroach Labs SQL Google Bigtable Spanner NoSQL RDBMS (Relational Database Management System) “Big Iron” (colloquial term for mainframe computers) RAFT Consensus Algorithm Consensus MVCC (Multiversion Concurrency Control) Isolation Etcd GDPR Golang C++ Garbage Collection Metaprogramming Rust Static Linking Docker Kubernetes CAP Theorem PostgreSQL ORM (Object Relational Mapping) Information Schema PG Catalog Interleaved Tables Vertica Spark Change Data Capture

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary

The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management.

Interview

Introduction How did you get involved in the area of data management? What is OctopAI and what was your motivation for founding it? What are some of the types of information that you classify and collect as metadata? Can you talk through the architecture of your platform? What are some of the challenges that are typically faced by metadata management systems? What is involved in deploying your metadata collection agents? Once the metadata has been collected what are some of the ways in which it can be used? What mechanisms do you use to ensure that customer data is segregated?

How do you identify and handle sensitive information during the collection step?

What are some of the most challenging aspects of your technical and business platforms that you have faced? What are some of the plans that you have for OctopAI going forward?

Contact Info

Amnon

LinkedIn @octopai_amnon on Twitter

OctopAI

@OctopaiBI on Twitter Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

OctopAI Metadata Metadata Management Data Integrity CRM (Customer Relationship Management) ERP (Enterprise Resource Planning) Business Intelligence ETL (Extract, Transform, Load) Informatica SAP Data Governance SSIS (SQL Server Integration Services) Vertica Airflow Luigi Oozie GDPR (General Data Protection Regulation) Root Cause Analysis

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary With the wealth of formats for sending and storing data, it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
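To ground the row-versus-column distinction the guests discuss, here is a small illustrative sketch that writes the same records once as Avro and once as Parquet. The fastavro and pyarrow libraries and the example schema are illustrative choices for demonstration, not tools prescribed in the episode; Table.from_pylist assumes a reasonably recent pyarrow release.

```python
# Sketch of the row- vs column-oriented tradeoff: the same records serialized
# once as Avro (row oriented, friendly to streaming writes) and once as Parquet
# (column oriented, friendly to analytical scans). Libraries and schema are
# illustrative assumptions.
from fastavro import writer, parse_schema
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"user_id": 1, "event": "click", "ts": 1625140800},
    {"user_id": 2, "event": "view",  "ts": 1625140815},
]

# Avro: records are written one after another, each with all of its fields together.
avro_schema = parse_schema({
    "name": "Event", "type": "record",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "event", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})
with open("events.avro", "wb") as out:
    writer(out, avro_schema, records)

# Parquet: values are grouped by column, so a query touching only `event`
# can skip reading the other columns entirely.
pq.write_table(pa.Table.from_pylist(records), "events.parquet")
```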

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show, you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Interview

Introduction How did you first get involved in the area of data management? What are the main serialization formats used for data storage and analysis? What are the tradeoffs that are offered by the different formats? How have the different storage and analysis tools influenced the types of storage formats that are available? You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort? Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?

What are the switching costs involved in moving from one format to another after you have started using it in a production system?

What are some of the new or upcoming formats that you are each excited about? How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?

Contact Information

Doug:

cutting on GitHub Blog @cutting on Twitter

Julien

Email @J_ on Twitter Blog julienledem on GitHub

Links

Apache Avro Apache Parquet Apache Arrow Hadoop Apache Pig Xerox Parc Excite Nutch Vertica Dremel White Paper

Twitter Blog on Release of Parquet

CSV XML Hive Impala Presto Spark SQL Brotli ZStandard Apache Drill Trevni Apache Calcite

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

The Big Data Transformation

Business executives today are well aware of the power of data, especially for gaining actionable insight into products and services. But how do you jump into the big data analytics game without spending millions on data warehouse solutions you don’t need? This 40-page report focuses on massively parallel processing (MPP) analytical databases that enable you to run queries and dashboards on a variety of business metrics at extreme speed and exabyte scale. Because they leverage the full computational power of a cluster, MPP analytical databases can analyze massive volumes of data—both structured and semi-structured—at unprecedented speeds. This report presents five real-world case studies from Etsy, Cerner Corporation, Criteo and other global enterprises to focus on one big data analytics platform in particular, HPE Vertica. You’ll discover:

How one prominent data storage company convinced both business and tech stakeholders to adopt an MPP analytical database
Why performance marketing technology company Criteo used a Center of Excellence (CoE) model to ensure the success of its big data analytics endeavors
How YPSM uses Vertica to speed up its Hadoop-based data processing environment
Why Cerner adopted an analytical database to scale its highly successful health information technology platform
How Etsy drives success with the company’s big data initiative by avoiding common technical and organizational mistakes