Cloud Computing

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

2024-02-25 · Data Engineering Podcast Listen

podcast_episode

by Paul Dix (InfluxData) , Tobias Macey

AI/ML Analytics Arrow Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Delta Hudi Iceberg +4 more

Summary

Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?

This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?

Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components? One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform? Can you describe the

[AI and the Modern Data Stack] #183 Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake

2024-02-21 · DataFramed Listen

podcast_episode

by Richie (DataCamp) , Sridhar Ramaswamy (Snowflake)

AI/ML Analytics Data Management Data Quality Databricks DWH GenAI Marketing Modern Data Stack NLP Snowflake Thoughtspot

Snowflake has been foundational in the data space for years. In the mid-2010s, the platform was a major driver of moving data to the cloud. More recently, it's become apparent that combining data and AI in the cloud is key to accelerating innovation. Snowflake has been rapidly adding AI features to provide value to the modern data stack, but what’s really been going on under the hood? At the time of recording, Sridhar Ramaswamy was the SVP of AI at Snowflake, being appointed CEO at Snowflake in February 2024. Sridhar was formerly Co-Founder of Neeva, acquired in 2023 by Snowflake. Before founding Neeva, Ramaswamy oversaw Google's advertising products, including search, display, video advertising, analytics, shopping, payments, and travel. He joined Google in 2003 and was part of the growth of AdWords and Google's overall advertising business. He spent more than 15 years at Google, where he started as a software engineer and rose to SVP of Ads & Commerce. In the episode, Richie and Sridhar explore Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, how NLP and AI have impacted enterprise business operations as well as new applications of AI in an enterprise environment, the challenges of enterprise search, the importance of data quality, management and the role of semantic layers in the effective use of AI, a look into Snowflakes products including Snowpilot and Cortex, the collaboration required for successful data and AI projects, advice for organizations looking to improve their data management and much more. About the AI and the Modern Data Stack DataFramed Series This week we’re releasing 4 episodes focused on how AI is changing the modern data stack and the analytics profession at large. The modern data stack is often an ambiguous and all-encompassing term, so we intentionally wanted to cover the impact of AI on the modern data stack from different angles. Here’s what you can expect: Why the Future of AI in Data will be Weird with Benn Stancil, CTO at Mode & Field CTO at ThoughtSpot — Covering how AI will change analytics workflows and tools How Databricks is Transforming Data Warehousing and AI with Ari Kaplan, Head Evangelist & Robin Sutara, Field CTO at Databricks — Covering Databricks, data intelligence and how AI tools are changing data democratizationAdding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake — Covering Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, and how to improve your data managementAccelerating AI Workflows with Nuri Cankaya, VP of AI Marketing & La Tiffaney Santucci, AI Marketing Director at Intel — Covering AI’s impact on marketing analytics, how AI is being integrated into existing products, and the democratization of AI Links Mentioned in the Show: SnowflakeSnowflake acquires Neeva to accelerate search in the Data Cloud through generative AIUse AI in Seconds with Snowflake Cortex[Course] Introduction to SnowflakeRelated Episode: Why AI will Change Everything—with Former Snowflake CEO, Bob MugliaSign up to a...

Mastering Microsoft Fabric: SAASification of Analytics

2024-02-21 · O'Reilly Data Science Books O'Reilly Amazon

book

by Debananda Ghosh

AI/ML Analytics AWS Azure ADF BI Data Engineering Data Lakehouse Data Management Data Science DWH LLM +9 more

Learn and explore the capabilities of Microsoft Fabric, the latest evolution in cloud analytics suites. This book will help you understand how users can leverage Microsoft Office equivalent experience for performing data management and advanced analytics activity. The book starts with an overview of the analytics evolution from on premises to cloud infrastructure as a service (IaaS), platform as a service (PaaS), and now software as a service (SaaS version) and provides an introduction to Microsoft Fabric. You will learn how to provision Microsoft Fabric in your tenant along with the key capabilities of SaaS analytics products and the advantage of using Fabric in the enterprise analytics platform. OneLake and Lakehouse for data engineering is discussed as well as OneLake for data science. Author Ghosh teaches you about data warehouse offerings inside Microsoft Fabric and the new data integration experience which brings Azure Data Factory and Power Query Editor of Power BI together in a single platform. Also demonstrated is Real-Time Analytics in Fabric, including capabilities such as Kusto query and database. You will understand how the new event stream feature integrates with OneLake and other computations. You also will know how to configure the real-time alert capability in a zero code manner and go through the Power BI experience in the Fabric workspace. Fabric pricing and its licensing is also covered. After reading this book, you will understand the capabilities of Microsoft Fabric and its Integration with current and upcoming Azure OpenAI capabilities. What You Will Learn Build OneLake for all data like OneDrive for Microsoft Office Leverage shortcuts for cross-cloud data virtualization in Azure and AWS Understand upcoming OpenAI integration Discover new event streaming and Kusto query inside Fabric real-time analytics Utilize seamless tooling for machine learning and data science Who This Book Is For Citizen users and experts in the data engineering and data science fields, along with chief AI officers

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

2024-02-18 · Data Engineering Podcast Listen

podcast_episode

by Dain Sundstrom (Starburst) , Tobias Macey

AI/ML Analytics Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Delta Hudi Iceberg Cyber Security +2 more

Summary

A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working togethr to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today. Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg

Interview

Introduction How did you get involved in the area of data management? To start, can you share your definition of what constitutes a "Data Lakehouse"?

What are the technical/architectural/UX challenges that have hindered the progression of lakehouses? What are the notable advancements in recent months/years that make them a more viable platform choice?

There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?

What are the key points of comparison for that combination in relation to other possible selections?

What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?

What progress is being made (within or across the ecosystem) to address those sharp edges?

For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements? What are the differences in terms of pipeline design/access and usage patterns when using a Trino

5 Minute Friday - Everything Ends...Moving on From the Modern Data Stack

2024-02-16 · The Joe Reis Show Listen

podcast_episode

by Joe Reis (DeepLearning.AI)

Analytics Modern Data Stack

My voice is sort of working, and I chat about Tristan Handy's article that raised quite a ruckus this week, "Is the "Modern Data Stack" Still a Useful Idea?"

In the end, the Modern Data Stack won - people use the cloud for analytics. And everything ends, so I'm excited for what's next.

Article: https://roundup.getdbt.com/p/is-the-modern-data-stack-still-a?r=oc02

#180 How AI is Changing Cybersecurity with Brian Murphy, CEO of ReliaQuest

2024-02-12 · DataFramed Listen

podcast_episode

by Brian Murphy (ReliaQuest)

AI/ML GenAI LLM Cyber Security

Just as many of us have been using generative AI tools to make us more productive at work, so have bad actors. Generative AI makes it much easier to create fake yet convincing text and images that can be used to deceive and harm. We’ve already seen lots of high-profile attempts to leverage AI in phishing campaigns, and this is putting more pressure on cybersecurity teams to get ahead of the curve and combat these new forms of threats. However, AI is also helping those that work in cybersec to be more productive and better equip themselves to create new forms of defense and offense. Brian Murphy is a founder, CEO, entrepreneur and investor. He founded and leads ReliaQuest, the force multiplier of security operations and one of the largest and fastest-growing companies in the global cybersecurity market. ReliaQuest increases visibility, reduces complexity, and manages risk with its cloud-native security operations platform, GreyMatter. Murphy grew ReliaQuest from a boot-strapped startup to a high-growth unicorn with a valuation of over $1 billion, more than 1,000 team members, and more than $350 million in growth equity with firms such as FTV Capital and KKR Growth. In the full episode, Adel and Brian cover the evolution of cybersecurity tools, the challenges faced by cybersecurity teams, types of cyber threats, how generative AI can be used both defensively and offensively in cybersecurity, how generative AI tools are making cybersecurity professionals more productive, the evolving role of cybersecurity professionals, the security implications of deploying AI models, the regulatory landscape for AI in cybersecurity and much more. Links Mentioned in the Show: ReliaQuestReliaQuest BlogIBM finds that ChatGPT can generate phishing emails nearly as convincing as a humanInformation Sharing and Analysis Centers (ISACs)[Course] Introduction to Data SecurityRelated episode: Data Security in the Age of AI with Bart Vandekerckhove, Co-founder at Raito New to DataCamp? Learn on the go using the DataCamp mobile app Empower your business with world-class data and AI skills with DataCamp for business

Data Sharing Across Business And Platform Boundaries

2024-02-11 · Data Engineering Podcast Listen

podcast_episode

by Andy Jefferson , Tobias Macey

AI/ML Analytics Dagster Data Engineering Data Lake Data Lakehouse Data Management Delta Hudi Iceberg Python Cyber Security +2 more

Summary

Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing

Interview

Introduction How did you get involved in the area of data management? Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation? What is the current state of the ecosystem for data sharing protocols/practices/platforms?

What are some of the main challenges/shortcomings that teams/organizations experience with these options?

What are the technical capabilities that need to be present for an effective data sharing solution?

How does that change as a function of the type of data? (e.g. tabular, image, etc.)

What are the requirements around governance and auditability of data access that need to be addressed when sharing data? What are the typical boundaries along which data access requires special consideration for how the sharing is managed? Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform? What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing? When is Bobsled the wrong choice? What do you have planned for the future of data sharing?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine

IBM and CMTG Cyber Resiliency: Building an Automated, VMware Aware Safeguarded Copy Solution to Provide Data Resilience

2024-02-09 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Stephen Doney , Barry Whyte , Neil Morris

Data Management IBM VMware data data-engineering

This IBM Blueprint outlines how CMTG and IBM have partnered to provide cyber resilient services to their clients. CMTG is one of Australia's leading private cloud providers based in Perth, Western Australia. The solution is based on IBM Storage FlashSystem, IBM Safeguarded Copy and IBM Storage Copy Data Management. The target audience for this Blueprint is IBM Storage technical specialists and storage admins.

Secure Cloud, Edge, and Content Security

2024-02-08 · Cisco SCOR: Building a Strong Cybersecurity Foundation

talk

Cyber Security

IBM Storage Virtualize, IBM Storage FlashSystem, and IBM SAN Volume Controller Security Feature Checklist - For IBM Storage Virtualize 8.6

2024-02-07 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by James Whitaker , Bill Scales , Barry Whyte

IBM Cyber Security data data-engineering

IBM® Storage Virtualize based storage systems are secure storage platforms that implement various security-related features, in terms of system-level access controls and data-level security features. This document outlines the available security features and options of IBM Storage Virtualize based storage systems. It is not intended as a "how to" or best practice document. Instead, it is a checklist of features that can be reviewed by a user security team to aid in the definition of a policy to be followed when implementing IBM FlashSystem®, IBM SAN Volume Controller, and IBM Storage Virtualize for Public Cloud. IBM Storage Virtualize features the following levels of security to protect against threats and to keep the attack surface as small as possible: The first line of defense is to offer strict verification features that stop unauthorized users from using login interfaces and gaining access to the system and its configuration. The second line of defense is to offer least privilege features that restrict the environment and limit any effect if a malicious actor does access the system configuration. The third line of defense is to run in a minimal, locked down, mode to prevent damage spreading to the kernel and rest of the operating system. The fourth line of defense is to protect the data at rest that is stored on the system from theft, loss, or corruption (malicious or accidental). The topics that are discussed in this paper can be broadly split into two categories: System security: This type of security encompasses the first three lines of defense that prevent unauthorized access to the system, protect the logical configuration of the storage system, and restrict what actions users can perform. It also ensures visibility and reporting of system level events that can be used by a Security Information and Event Management (SIEM) solution, such as IBM QRadar®. Data security: This type of security encompasses the fourth line of defense. It protects the data that is stored on the system against theft, loss, or attack. These data security features include Encryption of Data At Rest (EDAR) or IBM Safeguarded Copy (SGC). This document is correct as of IBM Storage Virtualize 8.6.

Tackling Real Time Streaming Data With SQL Using RisingWave

2024-02-04 · Data Engineering Podcast Listen

podcast_episode

by Yingjun Wu (RisingWave Labs) , Tobias Macey

AI/ML Analytics Dagster Data Engineering Data Lake Data Lakehouse Data Management Delta DWH GitHub Hudi Iceberg +5 more

Summary

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3

Interview

Introduction How did you get involved in the area of data management? Can you describe what RisingWave is and the story behind it? There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?

What are some of the platforms/architectures that teams are replacing with RisingWave?

What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem? Can you describe how RisingWave is architected and implemented?

How have the design and goals/scope changed since you first started working on it? What are the core design philosophies that you rely on to prioritize the ongoing development of the project?

What are the most complex engineering challenges that you have had to address in the creation of RisingWave? Can you describe a typical workflow for teams that are building on top of RisingWave?

What are the user/developer experience elements that you have prioritized most highly?

What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine? What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave? When is RisingWave the wrong choice? What do you have planned for the future of RisingWave?

Contact Info

yingjunwu on GitHub Personal Website LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows.

Hands-On Entity Resolution

2024-02-02 · O'Reilly Data Science Books O'Reilly Amazon

book

by Michael Shearer

AI/ML API Marketing Python data data-science data-science-tasks entity-resolution-record-linkage entity resolution / record linkage

Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs. Author Michael Shearer shows you how to scale up your data matching processes and improve the accuracy of your reconciliations. You'll be able to remove duplicate entries within a single source and join disparate data sources together when common keys aren't available. Using real-world data examples, this book helps you gain practical understanding to accelerate the delivery of real business value. With entity resolution, you'll build rich and comprehensive data assets that reveal relationships for marketing and risk management purposes, key to harnessing the full potential of ML and AI. This book covers: Challenges in deduplicating and joining datasets Extracting, cleansing, and preparing datasets for matching Text matching algorithms to identify equivalent entities Techniques for deduplicating and joining datasets at scale Matching datasets containing persons and organizations Evaluating data matches Optimizing and tuning data matching algorithms Entity resolution using cloud APIs Matching using privacy-enhancing technologies

Cloud infrastructure for LLM

2024-02-01 · AI meetup (February): Generative AI and LLMs in Action

talk

LLM

In this talk, I will detail what kind of infrastructures are necessary to support Large Language Model training and inference, and how they are built.

IBM Storage Fusion Multicloud Object Gateway

2024-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Eyal Abraham , Shawn Houston

IBM data data-engineering

This Redpaper provides an overview of IBM Storage Fusion Multicloud Object Gateway (MCG) and can be used as a quick reference guide for the most common use cases. The intended audience is cloud and application administrators, as well as other technical staff members who wish to learn how MCG works, how to set it up, and usage of a Backing Store or Namespace Store, as well as object caching.

Tendências para Dados e AI em 2024 — Data Hackers Podcast 80

2024-01-26 · Data Hackers Listen

podcast_episode

by Monique Femme (PUCRS) , Paulo Vasconcellos , Gabriel Lages , Pietro Oliveira (EBANX) , Marlesson Santana (Data Hackers)

AI/ML Analytics Databricks

Pelo quarto ano consecutivo, ainda acreditamos no IPO da Databricks e o Paulo Vasconcellos, já apostou todas as suas forças na venda da Stability AI, em 2024. E se você ouviu o episódio de Tendências, que gravamos no ano passado, acertamos quase todas previsões !!

Agora, chegou aquele momento do ano em que vamos tentar prever o que será tendência em Dados e AI para o ano de 2024! Vem com a gente pra esse papo com nossos community managers: Marlesson Santana , Pietro Oliveira e a bancada Data Hackers.

Façam suas apostas !!

Lembrando que você pode encontrar todos os podcasts da comunidade Data Hackers no Spotify, iTunes, Google Podcast, Castbox e muitas outras plataformas. Caso queira, você também pode ouvir o episódio aqui no post mesmo!

[EMBEDAR_EPISODIO]

Conheça nosso convidado:

Pietro Oliveira Marlesson Santana

Nossa Bancada Data Hackers:

Monique Femme — Head of Community Management na Data Hackers Paulo Vasconcellos — Co-founder da Data Hackers e Principal Data Scientist na Hotmart. Gabriel Lages — Co-founder da Data Hackers e Data & Analytics Sr. Director na Hotmart.

Falamos no episódioLinks de referências:

Ouça o episódio de Tendências de 2023: https://medium.com/data-hackers/as-tend%C3%AAncias-para-dados-e-ai-em-2023-data-hackers-podcast-62-2dff6fdddb6e O Brasileiro com a IA mais baixada do Mundo — Data Hackers Podcast 70:https://medium.com/data-hackers/o-brasileiro-com-a-ia-mais-baixada-do-mundo-data-hackers-podcast-70-e13a8c66fbcd Matéria:É verdade? Museu do Louvre pega fogo e vídeo viraliza": https://www.folhavitoria.com.br/geral/noticia/01/2024/e-verdade-museu-do-louvre-pega-fogo-e-video-viraliza-assustador-viral Pika (AI Video): https://pika.art/login Eleições na Argentina: IA vira arma de campanha: https://olhardigital.com.br/2023/11/17/pro/ia-vira-arma-de-campanha-durante-eleicoes-na-argentina/ Cloud da Magalu: https://medium.com/data-hackers/magalu-cloud-por-dentro-da-primeira-cloud-brasileira-em-hiperescala-data-hackers-epis%C3%B3dio-79-3ca324ddf66e

Modern Customer Data Platform Principles

2024-01-22 · Data Engineering Podcast Listen

podcast_episode

by Tasso Argyros (ActionIQ) , Tobias Macey

AI/ML Analytics CDP Data Engineering Data Lake Data Lakehouse Data Management Delta ETL/ELT Hudi Iceberg Modern Data Stack +3 more

Summary

Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro. Your host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack

Interview

Introduction How did you get involved in the area of data management? Can you describe what the role of the CDP is in the context of a businesses data ecosystem?

What are the core technical challenges associated with building and maintaining a CDP? What are the organizational/business factors that contribute to the complexity of these systems?

The early days of CDPs came with the promise of "Customer 360". Can you unpack that concept and how it has changed over the past ~5 years? Recent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs?

How have the architectural shifts changed the ways that organizations interact with their customer data?

How have the responsibilities shifted across different roles?

What are the governance policy and enforcement challenges that are added with the expansion of access and responsibility?

What are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs? When is a CDP the wrong choice? What do you have planned for the future of ActionIQ?

Contact Info

LinkedIn @Tasso on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being us

Magalu Cloud: Por dentro da primeira Cloud Brasileira em Hiperescala - Data Hackers Episódio #79

2024-01-19 · Data Hackers Listen

podcast_episode

by Vaner Vendramini (Magalu Cloud) , Monique Femme (PUCRS) , Paulo Vasconcellos , Gabriel Lages , Allan Senne (Data Hackers; Dadosfera)

AI/ML Analytics AWS Azure Data Science

Você já deve ter ouvido, sobre o lançamento da nova Cloud Publica e Brasileira, que movimentou muitos rumores no mercado de tecnologia. E atendo a pedidos da comunidade, agora você tem a chance de conhecer as estratégias, e um pouco mais, sobre a Magalu Cloud.

Neste episódio do Data Hackers — a maior comunidade de AI e Data Science do Brasil-, chamamos o Vaner Vendramini — Field CTO na Magalu Cloud, para desmitificar tudo que está por de trás deste lançamento da primeira Cloud Brasileira em Hiperscala, da Magalu.

Lembrando que você pode encontrar todos os podcasts da comunidade Data Hackers no Spotify, iTunes, Google Podcast, Castbox e muitas outras plataformas. Caso queira, você também pode ouvir o episódio aqui no post mesmo!

Conheça nosso convidado:

Vaner Vendramini — Field CTO na Magalu Cloud

Nossa Bancada Data Hackers:

Monique Femme — Head of Community Management na Data Hackers Allan Senne — Co-founder da Data Hackers e Co-Founder & CTO at Dadosfera.

Paulo Vasconcellos — Co-founder da Data Hackers e Principal Data Scientist na Hotmart. Gabriel Lages — Co-founder da Data Hackers e Data & Analytics Sr. Director na Hotmart.

Falamos no episódioLinks de referências:

Sobre o evento de lançamento da Magalu Cloud: https://www.magazineluiza.com.br/blog-da-lu/c/dl/dldc/magalu-cloud-a-nuvem-do-magazine-luiza/12434/ Cloud Alema citada pelo Vaner: https://www.stackit.de/en/ Estudo da McKinsey sobre o mercado de cloud Computing em 2030: https://www.mckinsey.com/br/our-insights/all-insights/computacao-em-nuvem-2030 Progressão do market sharing de Cloud, de 2018 até 2021, da digital cloud training: https://digitalcloud.training/comparison-of-aws-vs-azure-vs-google/ Página de parceiros da Magalu Cloud: https://magalu.cloud/solucoes/

Cloud computing concepts

2024-01-18 · CEH v12 - Ethical Hacking Free Course

talk

IBM SAN Volume Controller Model SV3 Product Guide (for IBM Storage Virtualize V8.6)

2024-01-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jon Herd , Hartmut Lonzer , Gucer Vasfi

IBM data data-engineering ibm-system-storage ibm-system-storage-san-volume-controller

This IBM® Redpaper® Product Guide describes the IBM SAN Volume Controller model SV3 solution, which is a next-generation IBM SAN Volume Controller. Built with IBM Storage Virtualize software and part of the IBM Storage family, IBM SAN Volume Controller is an enterprise-class storage system. It helps organizations achieve better data economics by supporting the large-scale workloads that are critical to success. Data centers often contain a mix of storage systems. This situation can arise as a result of company mergers or as a deliberate acquisition strategy. Regardless of how they arise, mixed configurations add complexity to the data center. Different systems have different data services, which make it difficult to move data from one to another without updating automation. Different user interfaces increase the need for training and can make errors more likely. Different approaches to hybrid cloud complicate modernization strategies. Also, many different systems mean more silos of capacity, which can lead to inefficiency. To simplify the data center and to improve flexibility and efficiency in deploying storage, enterprises of all types and sizes turn to IBM SAN Volume Controller, which is built with IBM Spectrum Virtualize software. This software simplifies infrastructure and eliminates differences in management, function, and even hybrid cloud support. IBM SAN Volume Controller introduces a common approach to storage management, function, replication, and hybrid cloud that is independent of storage type. It is the key to modernizing and revitalizing your storage, but is as easy to understand. IBM SAN Volume Controller provides a rich set of software-defined storage (SDS) features that are delivered by IBM Storage Virtualize, including the following examples: Data reduction and deduplication Dynamic tiering Thin-provisioning Snapshots Cloning Replication and data copy services Data-at-rest encryption Cyber resilience Transparent Cloud Tiering IBM HyperSwap® including three-site replication for high availability (HA) This Redpaper applies to IBM Storage Virtualize V8.6.

Designing Data Platforms For Fintech Companies

2024-01-01 · Data Engineering Podcast Listen

podcast_episode

by Andrey Korchak (Monite) , Tobias Macey

AI/ML Analytics Data Engineering Data Governance Data Lake Data Lakehouse Data Management Dataflow Delta Hudi Iceberg Microsoft +6 more

Summary

Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Your host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment

Interview

Introduction How did you get involved in the area of data management? Can you start by summarizing the data challenges that are particular to the fintech ecosystem? What are the primary sources and types of data that fintech organizations are working with?

What are the business-level capabilities that are dependent on this data?

How do the regulatory and business requirements influence the technology landscape in fintech organizations?

What does a typical build vs. buy decision process look like?

Fraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech?

How does that influence the architectural design/capabilities for data platforms in those organizations?

Data governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens? What are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech? What do you have planned for the future of your data capabilities at Monite?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Monite ISO 270001 Tesseract GitOps SWIFT Protocol

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Starburst: Starburst Logo

This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics.

Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. dataengineeringpodcast.com/starburstRudderstack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstackMaterialize:

You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.

That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.

Go to materialize.com today and get 2 weeks free!Support Data Engineering Podcast

talk-data.com

Activity Trend

Top Events

Top Speakers

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

[AI and the Modern Data Stack] #183 Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake

Mastering Microsoft Fabric: SAASification of Analytics

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

5 Minute Friday - Everything Ends...Moving on From the Modern Data Stack

#180 How AI is Changing Cybersecurity with Brian Murphy, CEO of ReliaQuest

Data Sharing Across Business And Platform Boundaries

IBM and CMTG Cyber Resiliency: Building an Automated, VMware Aware Safeguarded Copy Solution to Provide Data Resilience

Secure Cloud, Edge, and Content Security

IBM Storage Virtualize, IBM Storage FlashSystem, and IBM SAN Volume Controller Security Feature Checklist - For IBM Storage Virtualize 8.6

Tackling Real Time Streaming Data With SQL Using RisingWave

Hands-On Entity Resolution

Cloud infrastructure for LLM

IBM Storage Fusion Multicloud Object Gateway

Tendências para Dados e AI em 2024 — Data Hackers Podcast 80

Modern Customer Data Platform Principles

Magalu Cloud: Por dentro da primeira Cloud Brasileira em Hiperescala - Data Hackers Episódio #79

Cloud computing concepts

IBM SAN Volume Controller Model SV3 Product Guide (for IBM Storage Virtualize V8.6)

Designing Data Platforms For Fintech Companies