talk-data.com

Topic

MongoDB

nosql_database document_database big_data

192 tagged

Activity Trend

Peak of 27 activities per quarter, 2020-Q1 through 2026-Q1

Activities

192 activities · Newest first

IBM Spectrum Protect Plus Protecting Database Applications

IBM® Spectrum Protect Plus is a data protection solution that provides near-instant recovery, replication, retention management, and reuse for virtual machines, databases, and application backups in hybrid multicloud environments. This IBM Redpaper publication focuses on protecting database applications. IBM Spectrum® Protect Plus supports backup, restore, and data reuse for multiple databases, such as Oracle, IBM Db2®, MongoDB, Microsoft Exchange, and Microsoft SQL Server. Although other IBM Spectrum Protect Plus features focus on virtual environments, the database and application support of IBM Spectrum Protect Plus includes databases on both virtual and physical servers.

David Daly, Performance Engineer at MongoDB, joins us today to discuss "The Use of Change Point Detection to Identify Software Performance Regressions in a Continuous Integration System". Works mentioned: The Use of Change Point Detection to Identify Software Performance Regressions in a Continuous Integration System by David Daly, William Brown, Henrik Ingo, Jim O'Leary, and David Bradford. Social media: David's website, David's Twitter, MongoDB.
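
The paper's approach looks for points where the distribution of benchmark results shifts. As a toy illustration of the underlying idea (not the E-Divisive method the authors actually use), here is a minimal single-change-point detector; the data and function names are invented for the example.

```python
# A minimal, illustrative mean-shift detector: for each candidate split
# point, score the difference in means between the two sides, normalized
# by the standard deviation of the whole series.
import statistics

def most_likely_change_point(series: list[float]) -> tuple[int, float]:
    """Return (index, score) of the strongest single mean shift."""
    best_idx, best_score = -1, 0.0
    for i in range(2, len(series) - 2):
        left, right = series[:i], series[i:]
        spread = statistics.pstdev(series) or 1e-9
        score = abs(statistics.fmean(left) - statistics.fmean(right)) / spread
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score

# Example: a benchmark whose latency regressed starting at index 6.
latencies = [101, 99, 100, 102, 98, 100, 131, 129, 130, 132, 128]
print(most_likely_change_point(latencies))  # -> (6, ...)
```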

MongoDB Performance Tuning: Optimizing MongoDB Databases and their Applications

Use this fast and complete guide to optimize the performance of MongoDB databases and the applications that depend on them. You will be able to turbo-charge the performance of your MongoDB applications to provide a better experience for your users, reduce your running costs, and avoid application growing pains. MongoDB is the world’s most popular document database and the foundation for thousands of mission-critical applications. This book helps you get the best possible performance from MongoDB. MongoDB Performance Tuning takes a methodical and comprehensive approach to performance tuning that begins with application and schema design and goes on to cover optimization of code at all levels of an application. The book also explains how to configure MongoDB hardware and cluster configuration for optimal performance. The systematic approach in the book helps you treat the true causes of performance issues and get the best return on your tuning investment. Even when you’re under pressure and don’t know where to begin, simply follow the method in this book to set things right and get your MongoDB performance back on track. What You Will Learn Apply a methodical approach to MongoDB performance tuning Understand how to design an efficient MongoDB application Optimize MongoDB document design and indexing strategies Tune MongoDB queries, aggregation pipelines, and transactions Optimize MongoDB server resources: CPU, memory, disk Configure MongoDB replica sets and sharded clusters for optimal performance Who This Book Is For Developers and administrators of high-performance MongoDB applications who want to be sure they are getting the best possible performance from their MongoDB system. For developers who wish to create applications that are fast, scalable, and cost-effective. For administrators who want to optimize their MongoDB server and hardware configuration.
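
As a hedged taste of the tuning workflow the book describes, the sketch below uses the pymongo driver to add a compound index and then checks with explain() that the query planner picked it up. The database, collection, and field names are illustrative assumptions, not examples from the book.

```python
# Create a compound index, then confirm the query uses it via explain().
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Without an index this query performs a full collection scan (COLLSCAN).
orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

plan = orders.find({"customer_id": 42}).sort("created_at", 1).explain()
print(plan["queryPlanner"]["winningPlan"])  # expect an IXSCAN stage
```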

Summary

A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way. After tasking some of his top engineers with considering the problem in a new light, they created the Pilosa engine. In this episode H.O. explains how, using Pilosa as the core, he built the Molecula platform to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I’m interviewing H.O. Maycotte about Molecula, a cloud-based feature store built on the open source Pilosa project.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Molecula and the story behind it?

What are the additional capabilities that Molecula offers on top of the open source Pilosa project?

What are the problems/use cases that Molecula solves for?
What are some of the technologies or architectural patterns that Molecula might replace in a company's data platform?
One of the use cases that is mentioned on the Molecula site is as a feature store for ML and AI. This is a category that has been seeing a lot of growth recently. Can you provide some context on how Molecula fits into that market and how it compares to options such as Tecton, Iguazio, Feast, etc.?

What are the benefits of using a bitmap index for identifying and computing features? (A toy illustration follows this question list.)

Can you describe how the Molecula platform is architected?

How has the design and goal of Molecula changed or evolved since you first began working on it?

For someone who is using Molecula, can you describe the process of integrating it with their existing data sources?
Can you describe the internal data model of Pilosa/Molecula?

How should users think about data modeling and architecture as they are loading information into the platform?

Once a user has data in Pilosa, what are the available mechanisms for performing analyses or feature engineering?
What are some of the most underutilized or misunderstood capabilities of Molecula?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Molecula platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned from building and scaling Molecula?
When is Molecula the wrong choice?
What do you have planned for the future of the platform and business?
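
As a companion to the bitmap index question above, here is a toy sketch of the core idea: each feature value maps to a bitmap over record IDs, so combining features reduces to bitwise operations. Pilosa's real engine uses compressed roaring bitmaps distributed across a cluster; nothing below reflects its actual API.

```python
# Illustrative only: each feature maps to a bitmap over record IDs,
# where bit i set means record i has the feature (LSB is record 0).
index = {
    "country:US":    0b10110101,
    "device:mobile": 0b11010100,
    "churned:true":  0b10010110,
}

# "US mobile users who churned" is a single AND across three bitmaps.
result = index["country:US"] & index["device:mobile"] & index["churned:true"]
matching_ids = [i for i in range(8) if (result >> i) & 1]
print(matching_ids)  # -> [2, 4, 7]
```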

Contact Info

LinkedIn @maycotte on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Molecula Pilosa

Podcast Episode

The Social Dilemma Feature Store Cassandra Elasticsearch

Podcast Episode

Druid MongoDB SwimOS

Podcast Episode

Kafka Kafka Schema Registry

Podcast Episode

Homomorphic Encryption Lucene Solr

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

MongoDB Fundamentals

This book, "MongoDB Fundamentals", is the ideal hands-on guide to learning MongoDB. By starting from the basics of NoSQL databases and progressing to cloud integration using MongoDB Atlas, you will gain practical experience managing, querying, and visualizing data effectively for real-world applications. What this Book will help me do Set up and manage a MongoDB database with both local and cloud environments. Master querying and modifying data using the aggregation framework for complex operations. Implement effective database architecture with replication and sharding techniques. Ensure data security and resilience through user management and efficient backup/restore methods. Visualize data insights through dynamic reports and charts using MongoDB Charts. Author(s) Amit Phaltankar, Juned Ahsan, Michael Harrison, and Liviu Nedov are seasoned professionals in the field of database management systems, each bringing extensive experience working with MongoDB and cloud technologies. They excel at translating technical concepts into accessible, actionable insights, and have a passion for enabling IT professionals to create high-performance database solutions. Who is it for? "MongoDB Fundamentals" is tailored for developers, database administrators, system administrators, and cloud architects who are new to MongoDB but are looking to integrate it into their data processing workflows. It's perfect for those who aim to enhance their skills in handling data within cloud computing environments and have some basic programming or database experience.

Learn MongoDB 4.x

Explore the capabilities of MongoDB 4.x with this comprehensive guide designed for developers and administrators working with NoSQL databases. Dive into topics such as database design, advanced query handling, and security configuration, and gain hands-on experience through practical examples and insights. What this Book will help me do Learn to configure and install MongoDB 4.x for development and administration. Understand the principles of NoSQL schema design for optimal performance. Perform complex queries and operations to manage your MongoDB databases. Secure your MongoDB setup with role-based access control and encryption techniques. Monitor and optimize database performance for production environments. Author(s) Doug Bierer, the author of 'Learn MongoDB 4.x,' is a seasoned database expert with extensive experience in NoSQL technologies. With a focus on practicality and clear explanations, he brings deep insights into MongoDB development and administration. Who is it for? This book is ideal for early-career developers, system administrators, and database enthusiasts eager to break into NoSQL technologies. If you are familiar with Python and basic database concepts, this book will guide you through mastering MongoDB. It's perfect for those building dynamic backend systems.
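
As one hedged illustration of the role-based access control the book covers, the sketch below creates a user restricted to read/write on a single database using the createUser command through pymongo. It assumes a deployment with authentication enabled; all names and credentials are placeholders.

```python
# Create a least-privilege user scoped to one database.
from pymongo import MongoClient

# Placeholder admin credentials; substitute your own.
admin = MongoClient("mongodb://root:rootpw@localhost:27017").admin
admin.command(
    "createUser", "appUser",
    pwd="s3cret",
    roles=[{"role": "readWrite", "db": "app"}],
)
```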

MongoDB Topology Design: Scalability, Security, and Compliance on a Global Scale

Create a world-class MongoDB cluster that is scalable, reliable, and secure. Comply with mission-critical regulatory regimes such as the European Union’s General Data Protection Regulation (GDPR). Whether you are thinking of migrating to MongoDB or need to meet legal requirements for an existing self-managed cluster, this book has you covered. It begins with the basics of replication and sharding, and quickly scales up to cover everything you need to know to control your data and keep it safe from unexpected data loss or downtime. This book covers best practices for stable MongoDB deployments. For example, a well-designed MongoDB cluster should have no single point of failure. The book covers common use cases when only one or two data centers are available. It goes into detail about creating geopolitical sharding configurations to cover the most stringent data protection regulation compliance. The book also covers different tools and approaches for automating and monitoring a cluster with Kubernetes, Docker, and popular cloud provider containers. What You Will Learn Get started with the basics of MongoDB clusters Protect and monitor a MongoDB deployment Deepen your expertise around replication and sharding Keep effective backups and plan ahead for disaster recovery Recognize and avoid problems that can occur in distributed databases Build optimal MongoDB deployments within hardware and data center limitations Who This Book Is For Solutions architects, DevOps architects and engineers, automation and cloud engineers, and database administrators who are new to MongoDB and distributed databases or who need to scale up simple deployments. This book is a complete guide to planning a deployment for optimal resilience, performance, and scaling, and covers all the details required to meet the new set of data protection regulations such as the GDPR. This book is particularly relevant for large global organizations such as financial and medical institutions, as well as government departments that need to control data in the whole stack and are prohibited from using managed cloud services.
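
As a hedged sketch of the geopolitical sharding the book discusses, the commands below pin a shard key range to a zone so that EU documents stay on EU hardware. They assume a sharded cluster reached through a mongos, a shard key on a region field, and invented shard, zone, and namespace names.

```python
# Zone sharding: tie a shard to a zone, then pin a key range to that zone.
from pymongo import MongoClient

admin = MongoClient("mongodb://mongos.example.com:27017").admin

admin.command({"addShardToZone": "shard-eu-01", "zone": "EU"})
admin.command({
    "updateZoneKeyRange": "app.users",
    "min": {"region": "EU"},
    "max": {"region": "EV"},  # exclusive upper bound for the "EU" prefix
    "zone": "EU",
})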

IBM Spectrum Scale CSI Driver for Container Persistent Storage

IBM® Spectrum Scale is a proven, scalable, high-performance data and file management solution. It provides world-class storage management with extreme scalability, flash-accelerated performance, and automatic policy-based storage tiering from flash through disk to tape. It also provides support for various protocols, such as NFS, SMB, Object, HDFS, and iSCSI. Containers can leverage its performance, information lifecycle management (ILM), scalability, and multisite data management to get the same flexibility in storage that they enjoy in the runtime. Container adoption is increasing in all industries, and containers sprawl across multiple nodes in a cluster. Effective management of containers is necessary because their numbers will likely far exceed the number of virtual machines in use today. Kubernetes is the standard container management platform currently being used. Data management is of ultimate importance, and it is often forgotten because the first workloads to be containerized are ephemeral. For data management, many drivers with different specifications were once available; a specification named Container Storage Interface (CSI) was created and is now adopted by all major container orchestration systems. Although other container orchestration systems exist, Kubernetes became the standard framework for container management. It is a very flexible open source platform used as the base for most cloud providers' and software companies' container orchestration systems. Red Hat OpenShift is one of the most reliable enterprise-grade container orchestration systems based on Kubernetes, designed and optimized to easily deploy web applications and services. OpenShift enables developers to focus on the code, while the platform takes care of all of the complex IT operations and processes. This IBM Redbooks® publication describes how the CSI Driver for IBM file storage enables IBM Spectrum® Scale to be used as persistent storage for stateful applications running in Kubernetes clusters. Through the Container Storage Interface Driver for IBM file storage, Kubernetes persistent volumes (PVs) can be provisioned from IBM Spectrum Scale. Therefore, the containers can be used with stateful microservices, such as database applications (MongoDB, PostgreSQL, and so on).
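
To make the provisioning flow concrete, here is a hedged sketch using the official Kubernetes Python client to request a persistent volume claim from a CSI-backed storage class. The storage class name is an assumption; use whatever class your IBM Spectrum Scale CSI installation defines.

```python
# Request a PVC that a CSI driver will dynamically provision.
from kubernetes import client, config

config.load_kube_config()
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="mongo-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="ibm-spectrum-scale-csi",  # assumed class name
        resources=client.V1ResourceRequirements(
            requests={"storage": "10Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc)
```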

MongoDB Recipes: With Data Modeling and Query Building Strategies

Get the most out of MongoDB using a problem-solution approach. This book starts with recipes on the MongoDB query language, including how to query various data structures stored within documents. These self-contained code examples allow you to solve your MongoDB problems without fuss. MongoDB Recipes describes how to use advanced querying in MongoDB, such as indexing and the aggregation framework. It demonstrates how to use MongoDB Compass, a GUI client for interacting with MongoDB, and how to apply data modeling to your MongoDB application. You’ll see recipes on the latest features of MongoDB 4, allowing you to manage data in an efficient manner. What You Will Learn Work with the MongoDB document model Design MongoDB schemas Use the MongoDB query language Harness the aggregation framework Create replica sets and sharding in MongoDB Who This Book Is For Developers and professionals who work with MongoDB.
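
In the book's recipe spirit, here is one hedged aggregation example with pymongo: unwind an array field and rank tags by frequency. The schema and names are invented for the illustration.

```python
# Aggregation recipe: top five tags across published posts.
from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017").blog.posts
pipeline = [
    {"$match": {"published": True}},
    {"$unwind": "$tags"},
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5},
]
for doc in posts.aggregate(pipeline):
    print(doc["_id"], doc["count"])
```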

MongoDB: The Definitive Guide, 3rd Edition

Manage your data with a system designed to support modern application development. Updated for MongoDB 4.2, the third edition of this authoritative and accessible guide shows you the advantages of using document-oriented databases. You’ll learn how this secure, high-performance system enables flexible data models, high availability, and horizontal scalability. Authors Shannon Bradshaw, Eoin Brazil, and Kristina Chodorow provide guidance for database developers, advanced configuration for system administrators, and use cases for a variety of projects. NoSQL newcomers and experienced MongoDB users will find updates on querying, indexing, aggregation, transactions, replica sets, ops management, sharding and data administration, durability, monitoring, and security. In six parts, this book shows you how to: Work with MongoDB, perform write operations, find documents, and create complex queries Index collections, aggregate data, and use transactions for your application Configure a local replica set and learn how replication interacts with your application Set up cluster components and choose a shard key for a variety of applications Explore aspects of application administration and configure authentication and authorization Use stats when monitoring, back up and restore deployments, and use system settings when deploying MongoDB
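
As a small, hedged companion to the book's coverage of transactions (multi-document transactions require MongoDB 4.0+ and a replica set), the sketch below moves funds between two documents atomically using a session. Names and the connection string are illustrative.

```python
# Multi-document transaction via a client session.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.bank

with client.start_session() as session:
    with session.start_transaction():
        db.accounts.update_one(
            {"_id": "alice"}, {"$inc": {"balance": -100}}, session=session)
        db.accounts.update_one(
            {"_id": "bob"}, {"$inc": {"balance": 100}}, session=session)
# Commits automatically on clean exit; aborts if an exception is raised.
```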

SQL Server 2019 Revealed: Including Big Data Clusters and Machine Learning

Get up to speed on the game-changing developments in SQL Server 2019. No longer just a database engine, SQL Server 2019 is cutting edge with support for machine learning (ML), big data analytics, Linux, containers, Kubernetes, Java, and data virtualization to Azure. This is not a book on traditional database administration for SQL Server. It focuses on all that is new for one of the most successful modernized data platforms in the industry. It is a book for data professionals who already know the fundamentals of SQL Server and want to up their game by building their skills in some of the hottest new areas in technology. SQL Server 2019 Revealed begins with a look at the project team's goal to integrate the world of big data with SQL Server into a major product release. The book then dives into the details of key new capabilities in SQL Server 2019 using a “learn by example” approach for Intelligent Performance, security, mission-critical availability, and features for the modern developer. Also covered are enhancements to SQL Server 2019 for Linux, along with a comprehensive look at SQL Server using containers and Kubernetes clusters. The book concludes by showing you how to virtualize your data access with Polybase to Oracle, MongoDB, Hadoop, and Azure, allowing you to reduce the need for expensive extract, transform, and load (ETL) applications. You will then learn how to take your knowledge of containers, Kubernetes, and Polybase to build a comprehensive solution called Big Data Clusters, which is a marquee feature of the 2019 release. You will also learn how to gain access to Spark, SQL Server, and HDFS to build intelligence over your own data lake and deploy end-to-end machine learning applications. What You Will Learn Implement Big Data Clusters with SQL Server, Spark, and HDFS Create a Data Hub with connections to Oracle, Azure, Hadoop, and other sources Combine SQL and Spark to build a machine learning platform for AI applications Boost your performance with no application changes using Intelligent Performance Increase security of your SQL Server through Secure Enclaves and Data Classification Maximize database uptime through online indexing and Accelerated Database Recovery Build new modern applications with Graph, ML Services, and T-SQL Extensibility with Java Improve your ability to deploy SQL Server on Linux Gain in-depth knowledge to run SQL Server with containers and Kubernetes Know all the new database engine features for performance, usability, and diagnostics Use the latest tools and methods to migrate your database to SQL Server 2019 Apply your knowledge of SQL Server 2019 to Azure Who This Book Is For IT professionals and developers who understand the fundamentals of SQL Server and wish to focus on learning about the new, modern capabilities of SQL Server 2019. The book is for those who want to learn about SQL Server 2019 and the new Big Data Clusters and AI feature set, support for machine learning and Java, how to run SQL Server with containers and Kubernetes, and increased capabilities around Intelligent Performance, advanced security, and high availability.

Big Data Simplified

Big Data Simplified blends technology with strategy and delves into applications of big data in specialized areas, such as recommendation engines, data science, and the Internet of Things (IoT), enabling a practitioner to make the right technology choice. The steps to strategize a big data implementation are also discussed in detail. This book presents a holistic approach to the topic, covering a wide landscape of big data technologies like Hadoop 2.0 and package implementations, such as Cloudera. In-depth discussion of associated technologies, such as MapReduce, Hive, Pig, Oozie, Apache Zookeeper, Flume, Kafka, Spark, Python and NoSQL databases like Cassandra, MongoDB, GraphDB, etc., is also included.

Mastering MongoDB 4.x - Second Edition

This book, Mastering MongoDB 4.x, provides an in-depth exploration of MongoDB's features and capabilities, empowering readers to create high-performance and fault-tolerant database solutions. Through practical examples and clear explanations, you will learn how to implement complex queries, optimize database performance, manage large-scale clusters, and ensure robust failover and backup strategies. What this Book will help me do Understand advanced querying techniques and best practices in data indexing and management. Effectively configure and monitor MongoDB instances for scalability and optimized performance. Master techniques for replication and sharding to support high-availability systems. Deploy MongoDB-based applications seamlessly across on-premise and cloud environments. Learn to integrate MongoDB with modern technologies like big data platforms, containers, and IoT applications. Author(s) Alex Giamas is a seasoned database administrator and developer with significant experience in working with both relational and non-relational databases. Having authored numerous articles and given lectures on MongoDB and other data management technologies, Alex brings practical insights to his writing. He emphasizes real-world applications with examples drawn from his extensive career. Who is it for? This book is designed for developers and database administrators already familiar with MongoDB and basic database concepts, who are looking to enhance their expertise for implementing advanced MongoDB solutions. It is also suitable for professionals aspiring to earn MongoDB certifications and expand their skills to manage large, high-performance database systems efficiently.

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

Introduction
How did you get involved in the area of data management?
Can you refresh our memory about what TimescaleDB is? (A brief sketch follows this question list.)
How has the market for timeseries databases changed since we last spoke?
What has changed in the focus and features of the TimescaleDB project and company?
Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?

What were the most challenging aspects of reaching that goal?

In terms of timeseries workloads, what are some of the factors that differ across varying use cases?

How do those differences impact the ways in which Timescale is used by the end user, and built by your team?

What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?

Have you been able to leverage some of the native improvements to simplify your implementation?
Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?

What is in store for the future of the Timescale product and organization?
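
As a brief, hedged sketch of the TimescaleDB basics referenced in the questions above: a regular PostgreSQL table becomes a time-partitioned hypertable through a single function call. This assumes the timescaledb extension is installed; the table and connection details are invented.

```python
# Create a plain Postgres table, then promote it to a hypertable.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE conditions (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT,
            temperature DOUBLE PRECISION
        );
    """)
    # TimescaleDB's core call: partition the table on the time column.
    cur.execute("SELECT create_hypertable('conditions', 'time');")
```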

Contact Info

Ajay

@acoustik on Twitter LinkedIn

Mike

LinkedIn Website @michaelfreedman on Twitter

Timescale

Website Documentation Careers timescaledb on GitHub @timescaledb on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

TimescaleDB Original Appearance on the Data Engineering Podcast 1.0 Release Blog Post PostgreSQL

Podcast Interview

RDS DB-Engines MongoDB IOT (Internet Of Things) AWS Timestream Kafka Pulsar

Podcast Episode

Spark

Podcast Episode

Flink

Podcast Episode

Hadoop DevOps PipelineDB

Podcast Interview

Grafana Tableau Prometheus OLTP (Online Transaction Processing) Oracle DB Data Lake

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Learning Apache Drill

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster. In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight. Use Drill to clean, prepare, and summarize delimited data for further analysis Query file types including logfiles, Parquet, JSON, and other complex formats Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL Connect to Drill programmatically using a variety of languages Use Drill even with challenging or ambiguous file formats Perform sophisticated analysis by extending Drill’s functionality with user-defined functions Facilitate data analysis for network security, image metadata, and machine learning
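
As a hedged taste of querying Drill programmatically, the sketch below uses Drill's REST endpoint (it listens on port 8047 by default) to run SQL against a CSV file. The file path is an illustrative assumption.

```python
# Query a raw CSV file through Drill's REST API.
import requests

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL",
          "query": "SELECT * FROM dfs.`/tmp/logs/app.csv` LIMIT 5"},
)
for row in resp.json()["rows"]:
    print(row)
```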

MongoDB 4 Quick Start Guide

"MongoDB 4 Quick Start Guide" is your gateway into understanding and utilizing MongoDB, the world's leading NoSQL database alternative. Through this approachable guide, you will quickly learn how to install, secure, and effectively perform database operations using MongoDB Version 4. What this Book will help me do Master the installation and configuration of MongoDB to prepare for secure database setups. Execute CRUD operations seamlessly to manage your data through the MongoDB shell. Construct queries using the aggregation pipeline for robust data analysis. Implement replication and sharding to ensure data safety and scaleability. Use the PHP MongoDB driver to integrate MongoDB effectively with web applications. Author(s) None Bierer is an expert in database technologies with extensive experience in NoSQL solutions, particularly MongoDB. Their passion for teaching developers new and efficient ways to work with databases shines through in this practical and hands-on guide. Who is it for? This book is perfect for web developers looking to enhance their understanding of modern databases, IT professionals interested in NoSQL solutions, and DBAs transitioning from relational databases to document-oriented databases. Prior experience with databases can be helpful, but this guide is accessible even for enthusiastic beginners seeking to learn MongoDB.

Cosmos DB for MongoDB Developers: Migrating to Azure Cosmos DB and Using the MongoDB API

Learn Azure Cosmos DB and its MongoDB API with hands-on samples and advanced features such as the multi-homing API, geo-replication, custom indexing, TTL, request units (RU), consistency levels, partitioning, and much more. Each chapter explains Azure Cosmos DB’s features and functionalities by comparing it to MongoDB with coding samples. Cosmos DB for MongoDB Developers starts with an overview of NoSQL and Azure Cosmos DB and moves on to demonstrate the difference between geo-replication in Azure Cosmos DB and in MongoDB. Along the way you’ll cover subjects including indexing, partitioning, consistency, and sizing, all of which will help you understand the concept of request units and how their calculation is derived from an existing MongoDB deployment’s usage. The next part of the book shows you the process and strategies for migrating to Azure Cosmos DB. You will learn the day-to-day scenarios of using Azure Cosmos DB, its sizing strategies, and optimizing techniques for the MongoDB API. This information will help you when planning to migrate from MongoDB or if you would like to compare MongoDB to the Azure Cosmos DB MongoDB API before considering the switch. What You Will Learn Migrate from MongoDB to Azure Cosmos DB and understand the strategies involved Develop a sample application using MongoDB’s client driver Make use of sizing best practices and performance optimization scenarios Optimize MongoDB’s partition mechanism and indexing Who This Book Is For MongoDB developers who wish to learn Azure Cosmos DB. It specifically caters to a technical audience working on MongoDB.
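
Because Azure Cosmos DB exposes a MongoDB-compatible wire protocol, the standard pymongo driver can talk to it. The sketch below is a hedged illustration; the connection string follows the documented Cosmos DB format but uses placeholder account values.

```python
# Connect to the Cosmos DB MongoDB API with the ordinary pymongo driver.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/"
    "?ssl=true&replicaSet=globaldb&retrywrites=false"  # placeholder values
)
db = client.inventory
db.items.insert_one({"sku": "widget-1", "qty": 10})
print(db.items.find_one({"sku": "widget-1"}))
```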

Summary

Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-shirt.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.

Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service

Interview

Introduction
How did you get involved in the area of data management?
What is Alooma and what is the origin story?
How is the Alooma platform architected?

I want to go into stream vs. batch here.
What are the most challenging components to scale?

How do you manage the underlying infrastructure to support your SLA of 5 nines?
What are some of the complexities introduced by processing data from multiple customers with various compliance requirements?

How do you sandbox users’ processing code to avoid security exploits?

What are some of the potential pitfalls for automatic schema management in the target database?
Given the large number of integrations, how do you maintain the

What are some of the challenges when creating integrations? Isn’t it simply a matter of conforming to an external API?

For someone getting started with Alooma what does the workflow look like?
What are some of the most challenging aspects of building and maintaining Alooma?
What are your plans for the future of Alooma?

Contact Info

LinkedIn @yairwein on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Alooma Convert Media Data Integration ESB (Enterprise Service Bus) Tibco Mulesoft ETL (Extract, Transform, Load) Informatica Microsoft SSIS OLAP Cube S3 Azure Cloud Storage Snowflake DB Redshift BigQuery Salesforce Hubspot Zendesk Spark The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps RDBMS (Relational Database Management System) SaaS (Software as a Service) Change Data Capture Kafka Storm Google Cloud PubSub Amazon Kinesis Alooma Code Engine Zookeeper Idempotence Kafka Streams Kubernetes SOC2 Jython Docker Python Javascript Ruby Scala PII (Personally Identifiable Information) GDPR (General Data Protection Regulation) Amazon EMR (Elastic Map Reduce) Sequoia Capital Lightspeed Investors Redis Aerospike Cassandra MongoDB

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Data Science Fundamentals for Python and MongoDB

Build the foundational data science skills necessary to work with and better understand complex data science algorithms. This example-driven book provides complete Python coding examples to complement and clarify data science concepts, and enrich the learning experience. Coding examples include visualizations whenever appropriate. The book is a necessary precursor to applying and implementing machine learning algorithms. The book is self-contained. All of the math, statistics, stochastic, and programming skills required to master the content are covered. In-depth knowledge of object-oriented programming isn’t required because complete examples are provided and explained. Data Science Fundamentals with Python and MongoDB is an excellent starting point for those interested in pursuing a career in data science. Like any science, the fundamentals of data science are a prerequisite to competency. Without proficiency in mathematics, statistics, data manipulation, and coding, the path to success is “rocky” at best. The coding examples in this book are concise, accurate, and complete, and perfectly complement the data science concepts introduced. What You'll Learn Prepare for a career in data science Work with complex data structures in Python Simulate with Monte Carlo and Stochastic algorithms Apply linear algebra using vectors and matrices Utilize complex algorithms such as gradient descent and principal component analysis Wrangle, cleanse, visualize, and problem solve with data Use MongoDB and JSON to work with data Who This Book Is For The novice yearning to break into the data science world, and the enthusiast looking to enrich, deepen, and develop data science skills through mastering the underlying fundamentals that are sometimes skipped over in the rush to be productive. Some knowledge of object-oriented programming will make learning easier.
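
In the book's example-driven spirit, here is a small, hedged sketch combining two of its themes: estimate pi with Monte Carlo sampling, then persist the result as a JSON-style document in MongoDB. All names are invented for the illustration.

```python
# Monte Carlo estimate of pi, stored as a document in MongoDB.
import random
from pymongo import MongoClient

def estimate_pi(n: int) -> float:
    # Fraction of random points in the unit square that land inside
    # the quarter circle, scaled by 4.
    hits = sum(random.random() ** 2 + random.random() ** 2 <= 1.0
               for _ in range(n))
    return 4 * hits / n

result = {"experiment": "monte_carlo_pi", "n": 1_000_000,
          "estimate": estimate_pi(1_000_000)}
MongoClient("mongodb://localhost:27017").science.results.insert_one(result)
print(result["estimate"])  # ~3.14
```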

Seven NoSQL Databases in a Week

Learn the fundamentals of seven essential NoSQL databases in just one week with this book. Covering MongoDB, DynamoDB, Redis, Cassandra, Neo4j, InfluxDB, and HBase, you'll explore their functionalities and practical applications. Designed to give you a working understanding of NoSQL database types, this guide helps aspiring DBAs and developers comprehend and utilize modern data solutions. What this Book will help me do Master the fundamentals of MongoDB, including high-performance, high-availability, and scaling features. Gain hands-on experience with Neo4j to perform database queries and integrate with Python and Java applications. Learn efficient querying with Redis for storage and retrieval tasks. Understand Cassandra's powerful solution for scalable and fault-tolerant systems. Get well-versed with HBase for creating tables, and reading and writing data efficiently. Author(s) Sudarshan Kadambi and Xun (Brian) Wu bring a wealth of experience in database technologies. They have worked extensively in the software development and database management fields. With their practical and concise teaching approach, the authors make complex topics accessible for readers. Who is it for? This book is ideal for budding DBAs and developers looking to understand NoSQL databases. It is particularly useful for those transitioning from relational databases who want to learn about modern database technologies. Suitable for both beginners and those with some database knowledge, it aims to bridge skill gaps and expand the reader's technical expertise.
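
As a hedged taste of the Redis portion, the sketch below stores and retrieves a key with the redis-py client and sets an expiry. Key names are invented for the example.

```python
# Simple key/value storage and retrieval with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("session:42", "alice")
r.expire("session:42", 3600)   # expire after one hour
print(r.get("session:42"))     # -> "alice"
```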