talk-data.com

Topic: Cloud Computing

Tags: infrastructure, saas, iaas

4055 activities tagged

Activity Trend: 471 peak/qtr, 2020-Q1 to 2026-Q1

Activities

4055 activities · Newest first

Summary

Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming, and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify the integration of secure enclaves and trusted computing environments into analytical workflows, and how you can start using it without re-engineering your existing systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Build Data Pipelines. Not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts, and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple: you pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart's content without worrying about their cloud bill. For data engineering podcast listeners, we're offering a 30-day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.

Your host is Tobias Macey and today I'm interviewing Rishabh Poddar about Opaque Systems.

Snowflake SnowPro Core Certification Study Guide

Prepare smarter, faster, and better with the premier study guide for the Snowflake SnowPro Core certification. Snowflake, a cloud-based data warehousing platform, has steadily gained popularity since its 2014 launch. Snowflake offers several certification exams, of which the SnowPro Core certification is the foundational exam. The SnowPro Core Certification validates an individual's grasp of Snowflake as a cloud data warehouse, its architectural fundamentals, and the ability to design, implement, and maintain secure, scalable Snowflake systems. The Snowflake SnowPro Core Certification Study Guide delivers comprehensive coverage of every relevant exam topic on the Snowflake SnowPro Core Certification test. Prepare efficiently and effectively for the exam with online practice tests and flashcards, a digital glossary, and concise and easy-to-follow instruction from the subject-matter experts at Sybex. You'll gain the necessary knowledge to help you succeed in the exam and will be able to apply the acquired practical skills to real-world Snowflake solutions.

This Study Guide includes:

  • Comprehensive understanding of Snowflake's unique shared data, multi-cluster architecture
  • Guidance on loading structured and semi-structured data into Snowflake
  • Utilizing data sharing, cloning, and time travel features
  • Managing performance through clustering keys, and scaling compute up, down, and across
  • Steps to account management and security configuration, including RBAC and MFA
  • All the info you need to obtain a highly valued credential for a rapidly growing new database software solution
  • Access to the Sybex online learning center, with chapter review questions, full-length practice exams, hundreds of electronic flashcards, and a glossary of key terms

Perfect for anyone considering a new career in cloud-based data warehouse solutions and related fields, the Snowflake SnowPro Core Certification Study Guide is also a must-read for veteran database professionals seeking an understanding of one of the newest and fastest-growing niches in data.
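To make the compute-scaling topics above concrete, here is a minimal sketch using the Snowflake Python connector to resize a virtual warehouse and widen its multi-cluster range. The account details and warehouse name are placeholders, not examples from the book.

```python
import snowflake.connector

# Placeholder credentials; in practice, load these from a secrets manager.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)
cur = conn.cursor()

# Scale compute "up": resize the virtual warehouse.
cur.execute("ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE'")

# Scale "across": widen the multi-cluster range for concurrency
# (multi-cluster warehouses require the Enterprise edition or higher).
cur.execute("ALTER WAREHOUSE my_wh SET MAX_CLUSTER_COUNT = 3")

conn.close()
```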

Summary

The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed, it can be easy to be overwhelmed by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Build Data Pipelines. Not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts, and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple: you pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart's content without worrying about their cloud bill. For data engineering podcast listeners, we're offering a 30-day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.

Your host is Tobias Macey and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving re-assessment of the foundational principles of data systems.

Oracle Autonomous Database in Enterprise Architecture

Explore the capabilities of Oracle Autonomous Database (ADB) to improve enterprise-level data management. Through this book, you will dive deep into deploying, managing, and securing ADBs using Oracle Cloud Infrastructure (OCI). Gain hands-on experience with high-availability setups, data migration methods, and advanced security measures to elevate your enterprise architecture.

What this Book will help me do:

  • Understand the key considerations for planning, migrating, and maintaining Oracle Autonomous Databases.
  • Learn to implement high-availability solutions using Autonomous Data Guard in ADB environments.
  • Master the configuration of backup, restore, and disaster recovery strategies within OCI.
  • Implement advanced security practices, including encryption and IAM policy management.
  • Gain proficiency in leveraging ADB features like APEX, SQL Developer Web, and REST APIs for rapid application development.

Author(s): Authors Sharma, Krishnakumar KM, and Panda are experts in database systems, particularly in Oracle technologies. With years of hands-on experience implementing enterprise solutions and training professionals, they have pooled their knowledge to craft a resource-rich guide filled with practical advice.

Who is it for? This book is ideal for cloud architects, database administrators, and implementation consultants seeking to leverage Oracle's Autonomous Database for enhanced automation, security, and scalability. It is well suited for professionals with foundational knowledge of Linux, OCI, and databases. Aspiring cloud engineers and students aiming to understand modern database management will also benefit greatly.

Summary

One of the most critical aspects of a software project is managing its data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. Planetscale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fighting with differences between development and production. In this episode Nick van Wiggeren explains how the Planetscale platform is implemented, their strategies for balancing maintenance and improvements of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Build Data Pipelines. Not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts, and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple: you pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart's content without worrying about their cloud bill. For data engineering podcast listeners, we're offering a 30-day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.

Your host is Tobias Macey and today I'm interviewing Nick van Wiggeren about Planetscale.

The Cloud Data Lake

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights. This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

  • Learn the benefits of a cloud-based big data strategy for your organization
  • Get guidance and best practices for designing performant and scalable data lakes
  • Examine architecture and design choices, and data governance principles and strategies
  • Build a data strategy that scales as your organizational and business needs increase
  • Implement a scalable data lake in the cloud
  • Use cloud-based advanced analytics to gain more value from your data

Summary

The term "real-time data" brings with it a combination of excitement, uncertainty, and skepticism. The promise of insights that are always accurate and up to date is appealing to organizations, but the technical realities to make it possible have been complex and expensive. In this episode Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Build Data Pipelines. Not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts, and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple: you pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart's content without worrying about their cloud bill. For data engineering podcast listeners, we're offering a 30-day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.

Your host is Tobias Macey and today I'm interviewing Arjun Narayan about the benefits of real-time data for teams of all sizes.

Interview

Introduction How did you get involved in the area of data management?

We talked about:

  • Irina’s background
  • Irina as a mentor
  • Designing curriculum and program management at AI Guild
  • Other things Irina taught at AI Guild
  • Why Irina likes teaching
  • Students’ reluctance to learn cloud
  • Irina as a manager
  • Cohort analysis in a nutshell (see the sketch after this list)
  • How Irina started teaching formally
  • Irina’s diversity project in the works
  • How DataTalks.Club can attract more female students to the Zoomcamps
  • How to get technical feedback at work
  • Antipatterns and overrated/overhyped topics in data analytics
  • Advice for young women who want to get into data science/engineering
  • Finding Irina online
  • Fundamentals for data analysts
  • Suggestions for DataTalks.club collaborations
  • Conclusions
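Since the episode touches on cohort analysis, here is a minimal pandas sketch of the idea: bucket users by their signup month (the cohort) and count how many are active in each later month. The column names and data are illustrative, not from the episode.

```python
import pandas as pd

# Illustrative events: one row per (user, month-of-activity), with the
# user's signup month carried along as the cohort label.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "signup_month": ["2023-01", "2023-01", "2023-02", "2023-02", "2023-01"],
    "activity_month": ["2023-01", "2023-02", "2023-02", "2023-03", "2023-01"],
})

# Rows = cohorts, columns = activity months, values = distinct active users.
cohorts = (
    events.groupby(["signup_month", "activity_month"])["user_id"]
    .nunique()
    .unstack(fill_value=0)
)
print(cohorts)
```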

Links:

LinkedIn Account: https://www.linkedin.com/in/irinabrudaru/

ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Optimized Inferencing and Integration with AI on IBM zSystems: Introduction, Methodology, and Use Cases

In today's fast-paced, ever-growing digital world, you face various new and complex business problems. To help resolve these problems, enterprises are embedding artificial intelligence (AI) into their mission-critical business processes and applications to help improve operations, optimize performance, personalize the user experience, and differentiate themselves from the competition. Furthermore, the use of AI on the IBM® zSystems platform, where your mission-critical transactions, data, and applications are installed, is a key aspect of modernizing business-critical applications while maintaining strict service-level agreements (SLAs) and security requirements. This colocation of data and AI empowers your enterprise to optimally and easily deploy and infuse AI capabilities into your enterprise workloads with the most recent and relevant data available in real time, which enables a more transparent, accurate, and dependable AI experience.

This IBM Redpaper publication introduces and explains AI technologies and hardware optimizations, such as the IBM zSystems Integrated Accelerator for AI, and demonstrates how to leverage certain capabilities and components to enable solutions in business-critical use cases, such as fraud detection and credit risk scoring, on the platform. Real-time inferencing with AI models, a capability that is critical to certain industries and use cases such as fraud detection, can now be implemented with optimized performance thanks to innovations like the IBM zSystems Integrated Accelerator for AI embedded in the Telum chip within IBM z16™.

This publication also describes and demonstrates the implementation and integration of the two end-to-end solutions (fraud detection and credit risk), from developing and training the AI models to deploying the models in an IBM z/OS® V2R5 environment on IBM z16 hardware, and to integrating AI functions into an application, for example an IBM z/OS Customer Information Control System (IBM CICS®) application. We describe performance optimization recommendations and considerations when leveraging AI technology on the IBM zSystems platform, including optimizations for micro-batching in IBM Watson® Machine Learning for z/OS (WMLz).

The benefits that are derived from the solutions also are described in detail, including how the open-source AI framework portability of the IBM zSystems platform enables model development and training to be done anywhere, including on IBM zSystems, and the ability to easily integrate and deploy on IBM zSystems for optimal inferencing. You can uncover insights at the transaction level while taking advantage of the speed, depth, and securability of the platform.

This publication is intended for technical specialists, site reliability engineers, architects, system programmers, and systems engineers. Technologies that are covered include TensorFlow Serving, WMLz, IBM Cloud Pak® for Data (CP4D), IBM z/OS Container Extensions (zCX), IBM Customer Information Control System (IBM CICS), Open Neural Network Exchange (ONNX), and IBM Deep Learning Compiler (zDLC).
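As a rough illustration of the real-time inferencing flow described above, here is a minimal ONNX Runtime sketch for scoring a single transaction. The model file, input name, and feature vector are hypothetical placeholders, and the solutions in the publication deploy through WMLz, zCX, and CICS integration rather than a standalone script like this one.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical fraud-scoring model exported to ONNX; the file name and
# feature layout are placeholders, not artifacts from the Redpaper.
session = ort.InferenceSession("fraud_model.onnx")
input_name = session.get_inputs()[0].name

# One transaction's feature vector, as a batch of size 1.
features = np.array([[120.50, 3.0, 0.0, 1.0]], dtype=np.float32)

# run() returns a list of outputs; take the first (the scores).
scores = session.run(None, {input_name: features})[0]
print("fraud score:", scores[0])
```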

As we prepare for the AWS re:Invent Conference for 2022, Ludovic Francois, CEO/CTO of TrackIt, joins us on this episode of Data Unchained to talk about AWS, the role TrackIt plays as a cloud integrator, and the building blocks we are seeing that will take data storage and global access to the next level. Join us on this informative episode of Data Unchained as we look toward the future of AWS and data as a global resource.

#reinvent #datascience #data #aws #global #amazon #conference #datastorage

Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.

Summary

The most expensive part of working with massive data sets is the work of retrieving and processing the files that contain the raw information. FeatureBase (formerly Pilosa) avoids that overhead by converting the data into bitmaps. In this episode Matt Jaffee explains how to model your data as bitmaps and the benefits that this representation provides for fast aggregate computation. He also discusses the improvements that have been incorporated into FeatureBase to simplify integration with the rest of your data stack, and the SQL interface that was added to make working with the product easier.
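As a toy illustration of the bitmap-modeling idea (not FeatureBase's actual storage format), the sketch below uses Python integers as bitsets: each attribute value gets a bitmap with bit i set when record i holds that value, so aggregates reduce to bitwise operations and popcounts.

```python
# Toy bitmap index: one Python int per attribute value acts as a bitset,
# with bit i set when record i holds that value.
records = ["red", "blue", "red", "green", "red", "blue"]

index: dict[str, int] = {}
for i, color in enumerate(records):
    index[color] = index.get(color, 0) | (1 << i)

# Aggregates become bit operations: counting is a popcount, and
# boolean predicates are AND/OR over the bitmaps.
print(bin(index["red"]))                           # 0b10101 -> records 0, 2, 4
print(index["red"].bit_count())                    # 3 red records (Python 3.10+)
print((index["red"] | index["blue"]).bit_count())  # 5 records are red OR blue
```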

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Build Data Pipelines. Not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts, and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple: you pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart's content without worrying about their cloud bill. For data engineering podcast listeners, we're offering a 30-day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.

Your host is Tobias Macey and today I'm interviewing Matt Jaffee about FeatureBase.

Pro SQL Server 2022 Administration: A Guide for the Modern DBA

Get your daily work done efficiently using this comprehensive guide for SQL Server DBAs that covers all that a practicing database administrator needs to know. Updated for SQL Server 2022, this edition includes coverage of new features, such as Ledger, which provides an immutable record of table history to protect you against malicious data tampering, and integration with cloud providers to support hybrid cloud scenarios. You'll also find new content on performance optimizations, such as query plan feedback, and security controls, such as new database roles, which are restructured for modern ways of working. Coverage also includes Query Store, installation on Linux, and the use of containerized SQL.

Pro SQL Server 2022 Administration takes DBAs on a journey that begins with planning their SQL Server deployment and runs through installing and configuring the instance, administering and optimizing database objects, and ensuring that data is secure and highly available. Readers will learn how to perform advanced maintenance and tuning techniques, and discover SQL Server's hybrid cloud functionality. This book teaches you how to make the most of new SQL Server 2022 functionality, including integration for hybrid cloud scenarios. The book promotes best-practice installation, shows how to configure for scalability and high availability, and demonstrates the gamut of database-level maintenance tasks, such as index maintenance, database consistency checks, and table optimizations.

What You Will Learn:

  • Integrate SQL Server with Azure for hybrid cloud scenarios
  • Audit changes and prevent malicious data changes with SQL Server's Ledger
  • Secure and encrypt data to protect against embarrassing data breaches
  • Ensure 24 x 7 x 365 access through high availability and disaster recovery features in today's hybrid world
  • Use Azure tooling, including Arc, to gain insight into and manage your SQL Server enterprise
  • Install and configure SQL Server on Windows, Linux, and in containers
  • Perform routine maintenance tasks, such as backups and database consistency checks
  • Optimize performance and undertake troubleshooting in the Database Engine

Who This Book Is For: SQL Server DBAs who manage on-premises installations of SQL Server. This book is also useful for DBAs who wish to learn advanced features, such as integration with Azure, Query Store, Extended Events, and Policy-Based Management, or those who need to install SQL Server in a variety of environments.
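As a small taste of the Ledger feature mentioned above, here is a hedged sketch that creates an append-only ledger table from Python via pyodbc. The connection string and table definition are placeholders, assuming a reachable SQL Server 2022 instance.

```python
import pyodbc

# Placeholder connection string; assumes a reachable SQL Server 2022 instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()

# An append-only ledger table: inserts are allowed, updates and deletes are
# blocked, and table history is protected by cryptographic verification.
cursor.execute("""
    CREATE TABLE dbo.AuditLog (
        EventId INT NOT NULL,
        Detail  NVARCHAR(200) NOT NULL
    ) WITH (LEDGER = ON (APPEND_ONLY = ON));
""")
conn.commit()
conn.close()
```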

Microsoft Power BI Quick Start Guide - Third Edition

Discover the power of transforming raw data into actionable insights with "Microsoft Power BI Quick Start Guide." This comprehensive guide introduces you to the core functionalities of Power BI, emphasizing practical demonstrations of building data models, visualizations, and streamlined business intelligence processes. By following this book, you'll elevate your data analysis and storytelling skills.

What this Book will help me do:

  • Connect and import data from various sources into Power BI.
  • Master the usage of Power Query Editor for efficient data cleansing.
  • Create effective and visually appealing Power BI dashboards.
  • Understand and implement data security features, such as row-level and column-level security.
  • Administer a Power BI environment effectively, including tenant management and cloud deployments.

Author(s): Devin Knight, Erin Ostrowsky, Mitchell Pearson, and Bradley Schacht are seasoned experts in the field of data analysis and business intelligence. With years of practical experience, they bring a wealth of knowledge in Power BI and data visualization. Their passion for educating others is evident in their clear, approachable, and structured writing style.

Who is it for? This book is designed for professionals seeking to delve into Microsoft Power BI's functionalities. Ideal readers include business analysts, data professionals, or enthusiasts aiming to transition from Excel-based solutions to BI platforms. Both beginners wanting to learn BI concepts and intermediate users looking to solidify their Power BI skills will benefit greatly.

Offloading storage volumes from Safeguarded Copy to AWS S3 Object Storage with IBM FlashSystem Transparent Cloud Tiering

The focus of this IBM® Blueprint is to showcase a method for storing volumes that are created by using Safeguarded Copy off premises in Amazon S3 object storage, using the IBM FlashSystem Transparent Cloud Tiering (TCT) feature. TCT enables volume data to be copied and transferred to object storage. The TCT feature supports creating connections to cloud service providers to store copies of volume data in private or public clouds. This feature is useful for organizations of all sizes when planning for disaster recovery operations or storing a copy of data as an extra backup. TCT provides seamless integration between the storage system and public or private clouds for Safeguarded Copy volumes and non-Safeguarded Copy volumes.

Summary The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I'm interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with Sifflet.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Sifflet is and the story behind it?

IBM Elastic Storage System Introduction Guide

This IBM® Redpaper publication provides an overview of the IBM Elastic Storage® Server (IBM ESS) and the IBM Elastic Storage System (also IBM ESS). These scalable, high-performance data and file management solutions are built on IBM Spectrum® Scale technology. Providing reliability, performance, and scalability, IBM ESS can be implemented for a range of diverse requirements.

The latest IBM ESS 3500 is the most innovative system that provides investment protection to expand or build a new Global Data Platform and use current storage. The system allows enhanced, non-disruptive upgrades to grow from flash to hybrid or from hard disk drives (HDDs) to hybrid. IBM ESS can scale up or out with two different storage mediums in the environment, and it is ready for technologies like 200 Gb Ethernet or InfiniBand NDR-200 connectivity.

This publication helps you to understand the solution and its architecture. It describes ordering the best solution for your environment, planning the installation and integration of the solution into your environment, and correctly maintaining your solution. The solution is created from the following combination of physical and logical components:

  • Hardware
  • Operating system
  • Storage
  • Network
  • Applications

Knowledge of the IBM Elastic Storage Server and IBM Elastic Storage System components is key for planning an environment. This paper is targeted toward technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective cloud services and big data solutions. The content of this paper can help you to uncover insights in a client's data so that you can take appropriate actions to optimize business results, product development, and scientific discoveries.

In today’s episode, we’re talking to Lenley Hensarling, Chief Product Officer at Aerospike, Inc. Aerospike is a real-time data platform that allows users to act in real time across billions of transactions while reducing their server footprint.
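For a sense of what acting on data in Aerospike looks like from application code, here is a minimal sketch using the official Python client. The host, namespace, set, and record contents are placeholder assumptions, not details from the episode.

```python
import aerospike

# Placeholder cluster address; adjust for your deployment.
config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

# Keys are (namespace, set, user_key) tuples.
key = ("test", "transactions", "txn-0001")

# Write a record (a dict of bins), then read it back.
client.put(key, {"amount": 42.50, "status": "approved"})
_, meta, bins = client.get(key)
print(meta, bins)

client.close()
```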

We talk about:

  • Lenley’s background and the problems Aerospike solves.
  • The particular domains and industries that benefit from this kind of technology.
  • How the cloud has impacted what Aerospike does.
  • Why some people might choose on-premise over the cloud.
  • Finding the balance between customer-centric and market-centric.
  • Balancing product management with tasks like customer interaction and engineering.

Lenley Hensarling - https://www.linkedin.com/in/lenleyhensarling/ Aerospike - https://www.linkedin.com/company/aerospike-inc-/

This episode is brought to you by Qrvey

The tools you need to take action with your data, on a platform built for maximum scalability, security, and cost efficiencies. If you’re ready to reduce complexity and dramatically lower costs, contact us today at qrvey.com.

Qrvey, the modern no-code analytics solution for SaaS companies on AWS.

#saas #analytics #AWS #BI

Summary

Building data products is an undertaking that has historically required substantial investments of time and talent. With the rise of cloud platforms and self-serve data technologies, the barrier to entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.

Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.

Your host is Tobias Macey and today I'm interviewing Shane Gibson about AgileData.

Summary

CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization's stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I'm interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design.

Interview

Introduction How did you get involved in the area of data management? Can you describe what CreditKarma is and the role

Learning Google Analytics

Why is Google Analytics 4 the most modern data model available for digital marketing analytics? Rather than simply reporting what has happened, GA4's new cloud integrations enable more data activation, linking online and offline data across all your streams to provide end-to-end marketing data. This practical book prepares you for the future of digital marketing by demonstrating how GA4 supports these additional cloud integrations. Author Mark Edmondson, Google developer expert for Google Analytics and Google Cloud, provides a concise yet comprehensive overview of GA4 and its cloud integrations. Data, business, and marketing analysts will learn major facets of GA4's powerful new analytics model, with topics including data architecture and strategy, and data ingestion, storage, and modeling. You'll explore common data activation use cases and get the guidance you need to implement them.

You'll learn:

  • How Google Cloud integrates with GA4
  • The potential use cases that GA4 integrations can enable
  • Skills and resources needed to create GA4 integrations
  • How much GA4 data capture is necessary to enable use cases
  • The process of designing dataflows from strategy through data storage, modeling, and activation
  • How to adapt the use cases to fit your business needs
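As one concrete example of the GA4-to-Google Cloud integration the book covers, here is a minimal sketch that queries the GA4 BigQuery export with the google-cloud-bigquery client. The project and dataset names are placeholders, and it assumes the export and application-default credentials are already set up.

```python
from google.cloud import bigquery

# Assumes application-default credentials and an existing GA4 BigQuery export;
# the project and dataset names below are placeholders.
client = bigquery.Client()

sql = """
SELECT event_date, event_name, COUNT(*) AS events
FROM `my-project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240107'
GROUP BY event_date, event_name
ORDER BY event_date, events DESC
"""

for row in client.query(sql).result():
    print(row.event_date, row.event_name, row.events)
```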