talk-data.com talk-data.com

Topic

Cyber Security

cybersecurity information_security data_security privacy

2078

tagged

Activity Trend

297 peak/qtr
2020-Q1 2026-Q1

Activities

2078 activities · Newest first

Serverless Analytics with Amazon Athena

Delve into the serverless world of Amazon Athena with the comprehensive book 'Serverless Analytics with Amazon Athena'. This guide introduces you to the power of Athena, showing you how to efficiently query data in Amazon S3 using SQL without the hassle of managing infrastructure. With clear instructions and practical examples, you'll master querying structured, unstructured, and semi-structured data seamlessly. What this Book will help me do Effectively query and analyze both structured and unstructured data stored in S3 using Amazon Athena. Integrate Athena with other AWS services to create powerful, secure, and cost-efficient data workflows. Develop ETL pipelines and machine learning workflows leveraging Athena's compatibility with AWS Glue. Monitor and troubleshoot Athena queries for consistent performance and build scalable serverless data solutions. Implement security best practices and optimize costs when managing your Athena-driven data solutions. Author(s) None Virtuoso, along with co-authors Mert Turkay Hocanin None and None Wishnick, brings a wealth of experience in cloud solutions, serverless technologies, and data engineering. They excel in demystifying complex technical topics and have a passion for empowering readers with practical skills and knowledge. Who is it for? This book is tailored for business intelligence analysts, application developers, and system administrators who want to harness Amazon Athena for seamless, cost-efficient data analytics. It suits individuals with basic SQL knowledge looking to expand their capabilities in querying and processing data. Whether you're managing growing datasets or building data-driven applications, this book provides the know-how to get it right.

High Performance MySQL, 4th Edition

How can you realize MySQL's full power? With High Performance MySQL, you'll learn advanced techniques for everything from setting service-level objectives to designing schemas, indexes, and queries to tuning your server, operating system, and hardware to achieve your platform's full potential. This guide also teaches database administrators safe and practical ways to scale applications through replication, load balancing, high availability, and failover. Updated to reflect recent advances in cloud- and self-hosted MySQL, InnoDB performance, and new features and tools, this revised edition helps you design a relational data platform that will scale with your business. You'll learn best practices for database security along with hard-earned lessons in both performance and database stability. Dive into MySQL's architecture, including key facts about its storage engines Learn how server configuration works with your hardware and deployment choices Make query performance part of your software delivery process Examine enhancements to MySQL's replication and high availability Compare different MySQL offerings in managed cloud environments Explore MySQL's full stack optimization from application-side configuration to server tuning Turn traditional database management tasks into automated processes

Kafka: The Definitive Guide, 2nd Edition

Every enterprise application creates data, whether it consists of log messages, metrics, user activity, or outgoing messages. Moving all this data is just as important as the data itself. With this updated edition, application architects, developers, and production engineers new to the Kafka streaming platform will learn how to handle data in motion. Additional chapters cover Kafka's AdminClient API, transactions, new security features, and tooling changes. Engineers from Confluent and LinkedIn responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream processing applications with this platform. Through detailed examples, you'll learn Kafka's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer. You'll examine: Best practices for deploying and configuring Kafka Kafka producers and consumers for writing and reading messages Patterns and use-case requirements to ensure reliable data delivery Best practices for building data pipelines and applications with Kafka How to perform monitoring, tuning, and maintenance tasks with Kafka in production The most critical metrics among Kafka's operational measurements Kafka's delivery capabilities for stream processing systems

Summary Business intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an organization. What is overlooked in that characterization is the level of complexity and effort that are required to collect and present that information, and the opportunities for providing those insights in other contexts. In this episode Telmo Silva explains how he co-founded ClicData to bring full featured business intelligence and reporting to every organization without having to build and maintain that capability on their own. This is a great conversation about the technical and organizational operations involved in building a comprehensive business intelligence system and the current state of the market.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Telmo Silva about ClicData,

Interview

Introduction How did you get involved in the area of data management? Can you describe what ClicData is and the story behind it? How would you characterize the current state of the market for business intelligence?

What are the systems/capabilities that are required to run a full-featured BI system?

What are the challenges that businesses face in developing in-house capacity for business intelligence? Can you describe how the ClicData platform is architected?

How has it changed or evolved since you first began working on it?

How are you approaching schema design and evolution in the storage layer? How do you handle questions of data security/privacy/regulations given that you are storing the information on behalf of the business? In your work with clients what are some of the challenges that businesses are facing when attempting to answer questions and gain insights from their data in a rep

In this episode, Bryce and Conor interview Dave Abrahams about how he went from programming BASIC to APL to C++! About the Guest: Dave Abrahams is a contributor to the C++ standard, a founding contributor of the Boost C++ Libraries project and of the BoostCon/C++Now conference, and was a principal designer of the Swift programming language. He recently spent seven years at Apple, culminating in the creation of the declarative SwiftUI framework, worked at Google on Swift for TensorFlow, and is now a principal scientist at Adobe, where he and Sean Parent are rebooting the Software Technology Lab. Date Recorded: 2021-10-03 Date Released: 2021-10-29 ADSP Episode 48: Special Guest Dave Abrahams!Algorithms + Data Structures = ProgramsNiklaus WirthCombinatory LogicStepanov’s “Notes on Higher Order Programming in Scheme”PDP-8BASIC Computer Games by David AhlRutgers UniversityPDP-10TECOAPLPrinceton UniversityAaron Hsu’s Co-dfns GPU CompilerSwift Programming LanguageConor’s Galaxy Brain Programming LanguagesBen Deane’s Six languages worth knowingLisp MachineEmacsComposer’s MosaicTHINK CException handling: a false sense of security - Tom GargillIntro Song Info Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Snowflake Security: Securing Your Snowflake Data Cloud

This book is your complete guide to Snowflake security, covering account security, authentication, data access control, logging and monitoring, and more. It will help you make sure that you are using the security controls in a right way, are on top of access control, and making the most of the security features in Snowflake. Snowflake is the fastest growing cloud data warehouse in the world, and having the right methodology to protect the data is important both to data engineers and security teams. It allows for faster data enablement for organizations, as well as reducing security risks, meeting compliance requirements, and solving data privacy challenges. There are currently tens of thousands of people who are either data engineers/data ops in Snowflake-using organizations, or security people in such organizations. This book provides guidance when you want to apply certain capabilities, such as data masking, row-level security, column-level security, tackling rolehierarchy, building monitoring dashboards, etc., to your organizations. What You Will Learn Implement security best practices for Snowflake Set up user provisioning, MFA, OAuth, and SSO Set up a Snowflake security model Design roles architecture Use advanced access control such as row-based security and dynamic masking Audit and monitor your Snowflake Data Cloud Who This Book Is For Data engineers, data privacy professionals, and security teams either with security knowledge (preferably some data security knowledge) or with data engineering knowledge; in other words, either “Snowflake people” or “data people” who want to get security right, or “security people” who want to make sure that Snowflake gets handled right in terms of security

SQL in 24 Hours, Sams Teach Yourself, 7th Edition

In just 24 lessons of one hour or less, Sams Teach Yourself SQL in 24 Hours helps you use SQL to build effective databases, efficiently retrieve data, and manage everything from performance to security. This Seventh Edition is thoroughly revised and reorganized for faster learning and a deeper understanding of modern SQL development. Based on standardized SQL throughout, it teaches using new sample code based on the free, easy-to-use Oracle Database Express (XE). You’ll find more hands-on examples than ever, culminating in an all-new Bonus Exercises Workshop with even more real-world practice. This guide’s straightforward, step-by-step approach shows you how to work with database structures, objects, queries, tables, and more. In just hours, you will be applying advanced techniques, from transactions and joins to complex data retrieval using views and subqueries. Step-by-step instructions carefully walk you through the most common SQL tasks. Practical, hands-on examples show you how to apply what you learn. Quizzes and exercises help you test your knowledge and stretch your skills. Notes and tips point out shortcuts and solutions. Learn how to... * Master core relational database concepts, SQL concepts, and language components * Clearly understand your data * Set up databases and plan efficient database designs * Define entities and relationships, establish data structures, and create database objects * “Normalize” raw databases into logically organized tables * Edit relational data and tables and manage transactions * Write effective, well-performing queries * Categorize, summarize, sort, group, and restructure data * Work with dates and times * Join tables in queries, use subqueries, and combine multiple queries * Optimize performance with indexes and other techniques * Administer databases and manage users * Secure databases and data

Azure Databricks Cookbook

Azure Databricks is a robust analytics platform that leverages Apache Spark and seamlessly integrates with Azure services. In the Azure Databricks Cookbook, you'll find hands-on recipes to ingest data, build modern data pipelines, and perform real-time analytics while learning to optimize and secure your solutions. What this Book will help me do Design advanced data workflows integrating Azure Synapse, Cosmos DB, and streaming sources with Databricks. Gain proficiency in using Delta Tables and Spark for efficient data storage and analysis. Learn to create, deploy, and manage real-time dashboards with Databricks SQL. Master CI/CD pipelines for automating deployments of Databricks solutions. Understand security best practices for restricting access and monitoring Azure Databricks. Author(s) None Raj and None Jaiswal are experienced professionals in the field of big data and analytics. They are well-versed in implementing Azure Databricks solutions for real-world problems. Their collaborative writing approach ensures clarity and practical focus. Who is it for? This book is tailored for data engineers, scientists, and big data professionals who want to apply Azure Databricks and Apache Spark to their analytics workflows. A basic familiarity with Spark and Azure is recommended to make the best use of the recipes provided. If you're looking to scale and optimize your analytics pipelines, this book is for you.

Securing Data on Threat Detection by Using IBM Spectrum Scale and IBM QRadar: An Enhanced Cyber Resiliency Solution

Having appropriate storage for hosting business-critical data and advanced Security Information and Event Management (SIEM) software for deep inspection, detection, and prioritization of threats has become a necessity for any business. This IBM® Redpaper publication explains how the storage features of IBM Spectrum® Scale, when combined with the log analysis, deep inspection, and detection of threats that are provided by IBM QRadar®, help reduce the impact of incidents on business data. Such integration provides an excellent platform for hosting unstructured business data that is subject to regulatory compliance requirements. This paper describes how IBM Spectrum Scale File Audit Logging can be integrated with IBM QRadar. Using IBM QRadar, an administrator can monitor, inspect, detect, and derive insights for identifying potential threats to the data that is stored on IBM Spectrum Scale. When the threats are identified, you can quickly act on them to mitigate or reduce the impact of incidents. We further demonstrate how the threat detection by IBM QRadar can proactively trigger data snapshots or cyber resiliency workflow in IBM Spectrum Scale to protect the data during threat. This third edition has added the section "Ransomware threat detection", where we describe a ransomware attack scenario within an environment to leverage IBM Spectrum Scale File Audit logs integration with IBM QRadar. This paper is intended for chief technology officers, solution engineers, security architects, and systems administrators. This paper assumes a basic understanding of IBM Spectrum Scale and IBM QRadar and their administration.

Summary Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses that are working in the data management and analytics space they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription. Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.

Interview

Introduction How did you get involved in the area of data management? Can you each describe what you view as the biggest challenge facing data professionals? Who are you building your solutions for and what are the most common data management problems are you all solving? What are different components of Data Management and why is it so complex? What will simplify this process, if any? The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers? How has the data management space changed in recent times? Describe the current data management landscape and any key developments. From your perspective, what are the biggest challenges in the data management space today? What modern data management features are lacking in existing databases? Gartner imagines a future where data and analytics leaders need to be prepared to rely on data manage

Learning MySQL, 2nd Edition

Get a comprehensive overview on how to set up and design an effective database with MySQL. This thoroughly updated edition covers MySQL's latest version, including its most important aspects. Whether you're deploying an environment, troubleshooting an issue, or engaging in disaster recovery, this practical guide provides the insights and tools necessary to take full advantage of this powerful RDBMS. Authors Vinicius Grippa and Sergey Kuzmichev from Percona show developers and DBAs methods for minimizing costs and maximizing availability and performance. You'll learn how to perform basic and advanced querying, monitoring and troubleshooting, database management and security, backup and recovery, and tuning for improved efficiency. This edition includes new chapters on high availability, load balancing, and using MySQL in the cloud. Get started with MySQL and learn how to use it in production Deploy MySQL databases on bare metal, on virtual machines, and in the cloud Design database infrastructures Code highly efficient queries Monitor and troubleshoot MySQL databases Execute efficient backup and restore operations Optimize database costs in the cloud Understand database concepts, especially those pertaining to MySQL

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Elo Umeh, from Terragon Africa’s fastest-growing enterprise marketing technology company. Terragon uses its on-demand marketing cloud platform, attribution software, and deep analytics capability to enable thoughtful, targeted omni-channel access to 100m+ mobile-first African consumers. Elo is the Founder and CEO at Terragon Group. Elo career has spanned over 15 years where he has worked in the mobile and digital media across East and West Africa. He was part of the founding team at Mtech Communications. Elo holds a global executive MBA from IESE business of school where he graduated at the top of his class. Elo also has a Bachelor’s degree in Business Administration from Lagos State University. Show Notes 4:02 – What keeps you going? 6:15 – Lets dive into Terragon 8:40 – Who are your customers? 11:06 – Define pre-paid 14:40 – What kind of incites and security are you providing? 20:37- What kind of technology is Terragon using? 23:16 – What was it about the smart phone that made you want to go out on your own? 26:10 – Who’s your biggest competitor?  28:20 – What’s next for Terragon? 31:01 – What are the biggest mistakes entrepreneurs make? Terragon  Elo Umeh - LinkedIn

Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary The reason that so much time and energy is spent on data integration is because of how our applications are designed. By making the software be the owner of the data that it generates, we have to go through the trouble of extracting the information to then be used elsewhere. The team at Cinchy are working to bring about a new paradigm of software architecture that puts the data as the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription. Your host is Tobias Macey and today I’m interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL/ELT

Interview

Introduction How did you get involved in the area of data management? Can you describe what Cinchy is and the story behind it? In your experience working in data and building complex enterprise-grade systems, what are the shortcomings and negative externalities of an ETL/ELT approach to data integration? How is a Dataware platform from a data lake or data warehouses? What is it used for? What is Zero-Copy Integration? How does that work? Can you describe how customers start their Cinchy journey? What are the main use case patterns that you’re seeing with Dataware? Your platform offers unlimited users, including business users. What are some of the challenges that you face in building a user experience that doesn’t become overwhelming as an organization scales the number of data sources and processing flows? Wh

Developing Modern Applications with a Converged Database

Single-purpose databases were designed to address specific problems and use cases. Given this narrow focus, there are inherent tradeoffs required when trying to accommodate multiple datatypes or workloads in your enterprise environment. The result is data fragmentation that spills over into application development, IT operations, data security, system scalability, and availability. In this report, author Alice LaPlante explains why developing modern, data-driven applications may be easier and more synergistic when using a converged database. Senior developers, architects, and technical decision-makers will learn cloud-native application development techniques for working with both structured and unstructured data. You'll discover ways to run transactional and analytical workloads on a single, unified data platform. This report covers: Benefits and challenges of using a converged database to develop data-driven applications How to use one platform to work with both structured and unstructured data that includes JSON, XML, text and files, spatial and graph, Blockchain, IoT, time series, and relational data Modern development practices on a converged database, including API-driven development, containers, microservices, and event streaming Use case examples including online food delivery, real-time fraud detection, and marketing based on real-time analytics and geospatial targeting

Summary Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription. Your host is Tobias Macey and today I’m interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL

Interview

Introduction How did you get involved in the area of data management? Can you describe what Cuelake is and the story behind it? There are a number of platforms and projects for running SQL workloads and transformations on a data lake. What was lacking in those systems that you are addressing with Cuelake? Who are the target users of Cuelake and how has that influenced the features and design of the system? Can you describe how Cuelake is implemented?

What was your selection process for the various components?

What are some of the sharp edges that you have had to work around when integrating these components? What involved in getting Cuelake deployed? How are you using Cuelake in your work at Cuebook? Given your focus on machine learning for anomaly detection of business metrics, what are the challenges that you faced in using a data warehouse for those workloads?

What are the advantages that a data lake/lakehouse architecture maintains over a warehouse? What are the shortcomings of the lake/lakehouse approach that are solved by using a warehouse?

What are the most interesting, in

Summary The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription. Your host is Tobias Macey and today I’m interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning

Interview

Introduction How did you get involved in the area of data management? Can you describe what Activeloop is and the story behind it? How does the form and function of data storage introduce friction in the development and deployment of machine learning projects? How does the work that you are doing at Activeloop compare to vector databases such as Pinecone? You have a focus on image oriented data and computer vision projects. How does the specific applications of ML/DL influence the format and interactions with the data? Can you describe how the Activeloop platform is architected?

How have the design and goals of the system changed or evolved since you began working on it?

What are the feature and performance tradeoffs between self-managed storage locations (e.g. S3, GCS) and the Activeloop platform? What is the process for sourcing, processing, and storing

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Alex Watson. Alex was previously a GM at AWS and is currently a Co-Founder at Gretel.ai. Gretel is a privacy startup that enables developers, researchers, and scientists to quickly create safe versions of data for use in pre-production environments and machine learning workloads, which are shareable across teams and organizations. These tools address head-on the massive data privacy bottleneck--which has stifled innovation across multiple industries for years—by equipping builders everywhere with the ability to create quality datasets that scale. In short, synthetic data levels the playing field for everyone. This democratization of data will foster competition, scientific discoveries, and the inventions that will drive the next revolution of our data economy.  The company recently closed their series-A funding, led by Greylock, for another $12 million and brought Jason Warner, the current CTO for GitHub, on as an investor. Gretel also launched its latest public beta, Beta2, which offers privacy engineering as a service for everyone, not just developers. Show Notes 2:03 – Alex’s background 4:36 – What time frame was Harvest AI? 7:14 – How does NLP play into Harvest AI? 10:50 – How can we not have enough knowledge? 14:08 – Does the tech exist today for security? 18:14 – Privacy issues 20:42 – What does Gretel stand for? 27:42 – Do you increase the opportunity for bias? 31:18 – Where is the sweet spot for Gretel? 33:30 – When do synthetic not work? 37:42 – What is practical privacy? Gretel Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter.  Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription. Your host is Tobias Macey and today I’m interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery

Interview

Introduction How did you get involved in the area of data management? Can you describe what Castor is and the story behind it? The market for data catalogues is nascent but growing fast. What are the broad categories for the different products and projects in the space? What do you see as the core features that are required to be competitive?

In what ways has that changed in